Skip to content

JabRef/jabref#15565

Closed
Guru6446 wants to merge 5 commits intoJabRef:mainfrom
Guru6446:feature/url-identifier-parser
Closed

JabRef/jabref#15565
Guru6446 wants to merge 5 commits intoJabRef:mainfrom
Guru6446:feature/url-identifier-parser

Conversation

@Guru6446
Copy link
Copy Markdown

@Guru6446 Guru6446 commented Apr 16, 2026

Description

Closes #15411

Implements feature request - Allow users to add entries using full URLs instead of just plain identifiers.

Currently, when a user tries to add a new entry using a URL like https://doi.org/10.1145/3544548.3580995, JabRef shows an error. This PR adds support for parsing common URL formats automatically.

Changes

New Files

  • jablib/src/main/java/org/jabref/logic/importer/util/UrlIdentifierParser.java - URL parser utility
  • jablib/src/test/java/org/jabref/logic/importer/util/UrlIdentifierParserTest.java - 16 unit tests

Modified Files

  • jablib/src/main/java/org/jabref/logic/importer/fetcher/DoiFetcher.java - Use new parser
  • jablib/src/main/java/org/jabref/logic/importer/fetcher/ArXivFetcher.java - Use new parser

Supported URL Formats

DOI

  • https://doi.org/10.1145/3544548.3580995
  • https://dx.doi.org/10.1145/3544548.3580995
  • http://doi.org/10.1145/3544548.3580995
  • https://dl.acm.org/doi/10.1145/3544548.3580995
  • https://dl.acm.org/doi/abs/10.1145/3544548.3580995

arXiv

  • https://arxiv.org/abs/2203.02155
  • https://arxiv.org/pdf/2203.02155.pdf
  • http://arxiv.org/abs/2203.02155
  • Old format: https://arxiv.org/abs/math.GT/0309136

Backward Compatibility

✅ All existing functionality preserved:

  • Plain DOIs: 10.1145/3544548.3580995 still works
  • Plain arXiv IDs: 2203.02155 still works

Testing

  • ✅ 16 new unit tests created and all passing
  • ✅ Code compiles successfully (./gradlew jablib:compileJava)
  • ✅ Tests verified with ./gradlew jablib:test --tests UrlIdentifierParserTest

Implementation Details

The UrlIdentifierParser uses regex patterns to:

  1. Detect if input is a URL or plain identifier
  2. Extract the actual identifier from URLs
  3. Pass extracted identifier to existing DOI.parse() or ArXivIdentifier.parse()
  4. Fall back to plain parsing if no URL pattern matches

This minimizes changes to existing code while adding new functionality.

Fixes #15411

- Create UrlIdentifierParser utility class
- Extract DOI from various URL formats (doi.org, dx.doi.org, dl.acm.org)
- Extract arXiv ID from URLs (arxiv.org/abs/, arxiv.org/pdf/)
- Add 16 comprehensive unit tests (all passing)
- Maintains backward compatibility with plain IDs

Supports:
- DOI URLs: https://doi.org/10.1145/..., https://dx.doi.org/..., https://dl.acm.org/doi/...
- arXiv URLs: https://arxiv.org/abs/..., https://arxiv.org/pdf/....pdf
- Plain IDs: 10.1145/... (DOI), 2203.02155 (arXiv)

Fixes JabRef#15411
- Use UrlIdentifierParser.parseDOI() instead of DOI.parse()
- Now supports DOI URLs (doi.org, dx.doi.org, dl.acm.org)
- Maintains backward compatibility with plain DOIs

Part of JabRef#15411
- Use UrlIdentifierParser.parseArXiv() instead of ArXivIdentifier.parse()
- Now supports arXiv URLs (arxiv.org/abs/, arxiv.org/pdf/)
- Maintains backward compatibility with plain arXiv IDs

Part of JabRef#15411
@qodo-free-for-open-source-projects
Copy link
Copy Markdown
Contributor

Review Summary by Qodo

Add URL identifier parsing for DOI and arXiv fetchers

✨ Enhancement

Grey Divider

Walkthroughs

Description
• Add UrlIdentifierParser utility to extract identifiers from URLs
• Support DOI URLs (doi.org, dx.doi.org, dl.acm.org formats)
• Support arXiv URLs (arxiv.org/abs/, arxiv.org/pdf/ formats)
• Update DoiFetcher and ArXivFetcher to use new parser
• Maintain backward compatibility with plain identifiers
Diagram
flowchart LR
  Input["User Input<br/>URL or Plain ID"]
  Parser["UrlIdentifierParser"]
  DOIParser["parseDOI()"]
  ArXivParser["parseArXiv()"]
  DOIFetcher["DoiFetcher"]
  ArXivFetcher["ArXivFetcher"]
  
  Input --> Parser
  Parser --> DOIParser
  Parser --> ArXivParser
  DOIParser --> DOIFetcher
  ArXivParser --> ArXivFetcher
Loading

Grey Divider

File Changes

1. jablib/src/main/java/org/jabref/logic/importer/util/UrlIdentifierParser.java ✨ Enhancement +59/-0

New URL identifier parser utility class

• New utility class for parsing identifiers from URLs and plain text
• Implements parseDOI() method with regex patterns for doi.org, dx.doi.org, and dl.acm.org URLs
• Implements parseArXiv() method with regex pattern for arxiv.org URLs
• Falls back to plain identifier parsing if no URL pattern matches

jablib/src/main/java/org/jabref/logic/importer/util/UrlIdentifierParser.java


2. jablib/src/test/java/org/jabref/logic/importer/util/UrlIdentifierParserTest.java 🧪 Tests +103/-0

Unit tests for URL identifier parser

• 16 comprehensive unit tests for UrlIdentifierParser
• Tests cover DOI parsing from plain IDs and various URL formats
• Tests cover arXiv parsing from plain IDs and various URL formats
• Tests verify null/empty input handling and invalid URL rejection

jablib/src/test/java/org/jabref/logic/importer/util/UrlIdentifierParserTest.java


3. jablib/src/main/java/org/jabref/logic/importer/fetcher/DoiFetcher.java ✨ Enhancement +3/-2

Update DoiFetcher to use URL parser

• Add import for UrlIdentifierParser
• Replace DOI.parse() with UrlIdentifierParser.parseDOI() in doAPILimiting() method
• Replace DOI.parse() with UrlIdentifierParser.parseDOI() in performSearchById() method
• Enables DOI fetcher to accept full DOI URLs in addition to plain DOIs

jablib/src/main/java/org/jabref/logic/importer/fetcher/DoiFetcher.java


View more (1)
4. jablib/src/main/java/org/jabref/logic/importer/fetcher/ArXivFetcher.java ✨ Enhancement +2/-1

Update ArXivFetcher to use URL parser

• Add import for UrlIdentifierParser
• Replace ArXivIdentifier.parse() with UrlIdentifierParser.parseArXiv() in performSearchById()
 method
• Enables arXiv fetcher to accept full arXiv URLs in addition to plain IDs

jablib/src/main/java/org/jabref/logic/importer/fetcher/ArXivFetcher.java


Grey Divider

Qodo Logo

@qodo-free-for-open-source-projects
Copy link
Copy Markdown
Contributor

qodo-free-for-open-source-projects Bot commented Apr 16, 2026

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (3) 📎 Requirement gaps (0)

Grey Divider


Action required

1. Tests only assert presence 📘 Rule violation ≡ Correctness
Description
UrlIdentifierParserTest uses
assertTrue(optional.isPresent())/assertFalse(optional.isPresent()) instead of asserting the
exact parsed DOI/arXiv value. This weakens test precision and can allow incorrect-but-present
parsing results to pass.
Code

jablib/src/test/java/org/jabref/logic/importer/util/UrlIdentifierParserTest.java[R13-102]

+    @Test
+    void parseDOIFromPlainDOI() {
+        String input = "10.1145/3544548.3580995";
+        assertTrue(UrlIdentifierParser.parseDOI(input).isPresent());
+    }
+
+    @Test
+    void parseDOIFromDoiOrgURL() {
+        String input = "https://doi.org/10.1145/3544548.3580995";
+        assertTrue(UrlIdentifierParser.parseDOI(input).isPresent());
+    }
+
+    @Test
+    void parseDOIFromDxDoiOrgURL() {
+        String input = "https://dx.doi.org/10.1145/3544548.3580995";
+        assertTrue(UrlIdentifierParser.parseDOI(input).isPresent());
+    }
+
+    @Test
+    void parseDOIFromHTTPURL() {
+        String input = "http://doi.org/10.1145/3544548.3580995";
+        assertTrue(UrlIdentifierParser.parseDOI(input).isPresent());
+    }
+
+    @Test
+    void parseDOIFromACMDigitalLibrary() {
+        String input = "https://dl.acm.org/doi/10.1145/3544548.3580995";
+        assertTrue(UrlIdentifierParser.parseDOI(input).isPresent());
+    }
+
+    @Test
+    void parseDOIFromACMAbsURL() {
+        String input = "https://dl.acm.org/doi/abs/10.1145/3544548.3580995";
+        assertTrue(UrlIdentifierParser.parseDOI(input).isPresent());
+    }
+
+    @Test
+    void parseDOIReturnsEmptyForNull() {
+        assertFalse(UrlIdentifierParser.parseDOI(null).isPresent());
+    }
+
+    @Test
+    void parseDOIReturnsEmptyForEmptyString() {
+        assertFalse(UrlIdentifierParser.parseDOI("").isPresent());
+    }
+
+    @Test
+    void parseDOIReturnsEmptyForInvalidURL() {
+        assertFalse(UrlIdentifierParser.parseDOI("https://example.com").isPresent());
+    }
+
+    @Test
+    void parseArXivFromPlainID() {
+        String input = "2203.02155";
+        assertTrue(UrlIdentifierParser.parseArXiv(input).isPresent());
+    }
+
+    @Test
+    void parseArXivFromAbsURL() {
+        String input = "https://arxiv.org/abs/2203.02155";
+        assertTrue(UrlIdentifierParser.parseArXiv(input).isPresent());
+    }
+
+    @Test
+    void parseArXivFromPDFURL() {
+        String input = "https://arxiv.org/pdf/2203.02155.pdf";
+        assertTrue(UrlIdentifierParser.parseArXiv(input).isPresent());
+    }
+
+    @Test
+    void parseArXivFromHTTPURL() {
+        String input = "http://arxiv.org/abs/2203.02155";
+        assertTrue(UrlIdentifierParser.parseArXiv(input).isPresent());
+    }
+
+    @Test
+    void parseArXivReturnsEmptyForNull() {
+        assertFalse(UrlIdentifierParser.parseArXiv(null).isPresent());
+    }
+
+    @Test
+    void parseArXivReturnsEmptyForInvalidURL() {
+        assertFalse(UrlIdentifierParser.parseArXiv("https://example.com").isPresent());
+    }
+
+    @Test
+    void parseArXivHandlesOldIDFormat() {
+        String input = "https://arxiv.org/abs/math.GT/0309136";
+        assertTrue(UrlIdentifierParser.parseArXiv(input).isPresent());
+    }
Evidence
The compliance checklist requires unit tests to assert exact expected values rather than weak
predicate checks like isPresent(). The added tests repeatedly check only presence/absence (e.g.,
assertTrue(UrlIdentifierParser.parseDOI(input).isPresent())) and never assert the extracted
identifier content.

jablib/src/test/java/org/jabref/logic/importer/util/UrlIdentifierParserTest.java[13-102]
Best Practice: Learned patterns

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The new tests only assert `Optional.isPresent()` / `isPresent()`-negation, which is a weak predicate and does not verify that the parser extracted the correct DOI/arXiv identifier.
## Issue Context
Per test compliance, assertions should compare against the full expected value/structure (e.g., `assertEquals(expectedOptional, actualOptional)`).
## Fix Focus Areas
- jablib/src/test/java/org/jabref/logic/importer/util/UrlIdentifierParserTest.java[13-102]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. arXiv PDF URL fails 🐞 Bug ≡ Correctness
Description
ArXivFetcher.performSearchById still calls arXiv.asyncPerformSearchById(identifier) with the raw
input, so https://arxiv.org/pdf/.pdf fails because ArXivIdentifier.parse rejects the trailing
.pdf. The new UrlIdentifierParser.parseArXiv result is only used for the DOI-infusion
optimization, not for the actual arXiv lookup, so the feature doesn’t work end-to-end for PDF URLs.
Code

jablib/src/main/java/org/jabref/logic/importer/fetcher/ArXivFetcher.java[R340-344]

  public Optional<BibEntry> performSearchById(String identifier) throws FetcherException {
      CompletableFuture<Optional<BibEntry>> arXivBibEntryPromise = arXiv.asyncPerformSearchById(identifier);
      if (this.doiFetcher != null) {
-            inplaceAsyncInfuseArXivWithDoi(arXivBibEntryPromise, ArXivIdentifier.parse(identifier));
+            inplaceAsyncInfuseArXivWithDoi(arXivBibEntryPromise, UrlIdentifierParser.parseArXiv(identifier));
      }
Evidence
The main fetch path uses the raw identifier string (not the parsed/normalized arXiv ID). The
underlying arXiv lookup rejects inputs that don’t match ArXivIdentifier.parse (notably PDF URLs with
a .pdf suffix), returning Optional.empty().

jablib/src/main/java/org/jabref/logic/importer/fetcher/ArXivFetcher.java[339-346]
jablib/src/main/java/org/jabref/logic/importer/fetcher/ArXivFetcher.java[412-416]
jablib/src/main/java/org/jabref/model/entry/identifier/ArXivIdentifier.java[21-55]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`ArXivFetcher.performSearchById` still passes the raw user input into `arXiv.asyncPerformSearchById(...)`. For inputs like `https://arxiv.org/pdf/2203.02155.pdf`, `ArXivIdentifier.parse(...)` rejects the `.pdf` suffix, so the actual fetch returns empty even though `UrlIdentifierParser.parseArXiv` can normalize this URL.
### Issue Context
`UrlIdentifierParser.parseArXiv(identifier)` is currently only used to accelerate DOI infusion, not to normalize the identifier used in the actual arXiv fetch.
### Fix Focus Areas
- jablib/src/main/java/org/jabref/logic/importer/fetcher/ArXivFetcher.java[339-346]
### Suggested fix
1. Parse once at the start of `performSearchById`:
- `Optional<ArXivIdentifier> parsed = UrlIdentifierParser.parseArXiv(identifier);`
2. If `parsed.isEmpty()`, return `Optional.empty()` (or keep existing behavior, but avoid calling the API with a non-parseable URL).
3. Call `arXiv.asyncPerformSearchById(parsed.get().asString())` instead of using `identifier`.
4. Pass `parsed` into `inplaceAsyncInfuseArXivWithDoi(...)` so both the fetch and the optimization use the same normalized value.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. mEDRA called with URL🐞 Bug ≡ Correctness
Description
DoiFetcher.performSearchById now accepts DOI URLs via UrlIdentifierParser.parseDOI, but the mEDRA
special-case still calls Medra.performSearchById(identifier) with the original (possibly URL) input.
This builds an invalid mEDRA API URL and breaks lookups for mEDRA-registered DOIs when the user
pastes a DOI URL.
Code

jablib/src/main/java/org/jabref/logic/importer/fetcher/DoiFetcher.java[R124-126]

  public Optional<BibEntry> performSearchById(String identifier) throws FetcherException {
-        DOI doi = DOI.parse(identifier)
+        DOI doi = UrlIdentifierParser.parseDOI(identifier)
                   .orElseThrow(() -> new FetcherException(Localization.lang("Invalid DOI: '%0'.", identifier)));
Evidence
DoiFetcher parses the identifier into a DOI object but still forwards the unparsed identifier to
the mEDRA fetcher. Medra builds its request URL by concatenating API_URL + "/" + identifier, which
fails if identifier is itself a URL.

jablib/src/main/java/org/jabref/logic/importer/fetcher/DoiFetcher.java[124-146]
jablib/src/main/java/org/jabref/logic/importer/fetcher/Medra.java[103-106]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
When the DOI agency is mEDRA, `DoiFetcher.performSearchById` currently calls `new Medra().performSearchById(identifier)` using the original input string. With the new URL parsing support, `identifier` may be a full URL (e.g., `https://doi.org/...`), which causes `Medra.getUrlForIdentifier` to produce an invalid request URL.
### Issue Context
`DoiFetcher` already computes a parsed `DOI doi = UrlIdentifierParser.parseDOI(identifier)...` in this method.
### Fix Focus Areas
- jablib/src/main/java/org/jabref/logic/importer/fetcher/DoiFetcher.java[124-146]
- jablib/src/main/java/org/jabref/logic/importer/fetcher/Medra.java[103-106]
### Suggested fix
In the mEDRA branch, call Medra with the normalized DOI string (e.g., `doi.asString()`) rather than the original `identifier`.
Example:
- Replace `return new Medra().performSearchById(identifier);`
- With `return new Medra().performSearchById(doi.asString());`

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


View more (1)
4. ACM DOI regex rejects pdf 🐞 Bug ≡ Correctness
Description
UrlIdentifierParser.parseDOI short-circuits on DOI_ACM_PATTERN and captures everything after
/doi/, so URLs like https://dl.acm.org/doi/pdf/10.... are turned into pdf/10.... and then
rejected by DOI.parse. This is a regression because DOI.parse is already able to extract a DOI
embedded later in an arbitrary https URL.
Code

jablib/src/main/java/org/jabref/logic/importer/util/UrlIdentifierParser.java[R19-40]

+    private static final Pattern DOI_ACM_PATTERN =
+            Pattern.compile("https?://dl\\.acm\\.org/doi/(?:abs/)?(.+)");
+
+    private static final Pattern ARXIV_URL_PATTERN =
+            Pattern.compile("https?://arxiv\\.org/(?:abs|pdf)/([\\w.\\-]+?)(?:\\.pdf)?$");
+
+    public static Optional<DOI> parseDOI(String input) {
+        if (input == null || input.isBlank()) {
+            return Optional.empty();
+        }
+
+        String trimmedInput = input.trim();
+
+        Matcher doiUrlMatcher = DOI_URL_PATTERN.matcher(trimmedInput);
+        if (doiUrlMatcher.find()) {
+            return DOI.parse(doiUrlMatcher.group(1));
+        }
+
+        Matcher acmMatcher = DOI_ACM_PATTERN.matcher(trimmedInput);
+        if (acmMatcher.find()) {
+            return DOI.parse(acmMatcher.group(1));
+        }
Evidence
The ACM regex captures arbitrary suffixes (not necessarily starting with 10.) and the method
returns immediately on match, preventing fallback to parsing the full URL. DOI.parse’s exact DOI
pattern explicitly allows an arbitrary https?://... prefix before the 10.x/... DOI group, so
parsing the full ACM URL is expected to work where pdf/10... fails.

jablib/src/main/java/org/jabref/logic/importer/util/UrlIdentifierParser.java[19-42]
jablib/src/main/java/org/jabref/model/entry/identifier/DOI.java[81-90]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`UrlIdentifierParser.parseDOI` uses `DOI_ACM_PATTERN = https?://dl.acm.org/doi/(?:abs/)?(.+)` and returns `DOI.parse(acmMatcher.group(1))`. For common ACM URLs such as `/doi/pdf/10.1145/...`, this extracts `pdf/10.1145/...` and causes parsing to fail.
### Issue Context
`DOI.parse(...)` is already designed to handle many URL forms by allowing an arbitrary `https?://...` prefix before the DOI group.
### Fix Focus Areas
- jablib/src/main/java/org/jabref/logic/importer/util/UrlIdentifierParser.java[16-43]
### Suggested fix options
Option A (simplest/robust):
- Remove the special-case ACM/doi.org regexes and just `return DOI.parse(trimmedInput);` (or use `DOI.findInText(trimmedInput)` first if you want to safely ignore query/fragment junk).
Option B (keep special-cases):
- Tighten the ACM pattern to capture only a DOI starting with `10.` and stop at query/fragment:
- e.g., `https?://dl\\.acm\\.org/doi/(?:abs/|pdf/|full/)?(10\\.[^\\s?#]+)`
- Use `matches()` (or anchor with `^...$`) instead of `find()` so you don’t accidentally capture trailing unrelated text.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

5. Null parameters lack @nullable 📘 Rule violation ⚙ Maintainability
Description
parseDOI/parseArXiv accept null inputs via ad-hoc null checks, but the nullness contract is
not expressed with JSpecify annotations. This makes the API contract unclear and encourages passing
null rather than using explicit nullness annotations.
Code

jablib/src/main/java/org/jabref/logic/importer/util/UrlIdentifierParser.java[R25-48]

+    public static Optional<DOI> parseDOI(String input) {
+        if (input == null || input.isBlank()) {
+            return Optional.empty();
+        }
+
+        String trimmedInput = input.trim();
+
+        Matcher doiUrlMatcher = DOI_URL_PATTERN.matcher(trimmedInput);
+        if (doiUrlMatcher.find()) {
+            return DOI.parse(doiUrlMatcher.group(1));
+        }
+
+        Matcher acmMatcher = DOI_ACM_PATTERN.matcher(trimmedInput);
+        if (acmMatcher.find()) {
+            return DOI.parse(acmMatcher.group(1));
+        }
+
+        return DOI.parse(trimmedInput);
+    }
+
+    public static Optional<ArXivIdentifier> parseArXiv(String input) {
+        if (input == null || input.isBlank()) {
+            return Optional.empty();
+        }
Evidence
The checklist requires using Optional and JSpecify nullness annotations to clarify null-handling
contracts. The new public methods accept null (if (input == null || input.isBlank())) but do not
annotate the parameter as @Nullable (or otherwise define a non-null contract).

AGENTS.md
jablib/src/main/java/org/jabref/logic/importer/util/UrlIdentifierParser.java[25-48]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`UrlIdentifierParser.parseDOI` and `parseArXiv` explicitly handle `null` inputs but do not declare the parameter nullness contract using JSpecify annotations.
## Issue Context
The codebase uses `org.jspecify.annotations.Nullable` in many places; these methods should either (a) declare `@Nullable` for inputs they accept as null or (b) reject null by contract and remove null-passing tests.
## Fix Focus Areas
- jablib/src/main/java/org/jabref/logic/importer/util/UrlIdentifierParser.java[25-58]
- jablib/src/test/java/org/jabref/logic/importer/util/UrlIdentifierParserTest.java[49-52]
- jablib/src/test/java/org/jabref/logic/importer/util/UrlIdentifierParserTest.java[88-91]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


6. Trivial UrlIdentifierParser Javadoc 📘 Rule violation ⚙ Maintainability
Description
The added class Javadoc restates what the class does (parsing identifiers) without explaining design
intent or rationale. This adds noise rather than conveying the 'why'.
Code

jablib/src/main/java/org/jabref/logic/importer/util/UrlIdentifierParser.java[R10-13]

+/**
+ * Parses identifiers from URLs and plain text.
+ * Extracts DOI, arXiv ID, etc. from various URL formats.
+ */
Evidence
The checklist forbids trivial/paraphrasing comments and requires comments to explain
intent/rationale. The new Javadoc only paraphrases the class name/purpose (parsing identifiers from
URLs/plain text) without adding 'why' context.

AGENTS.md
jablib/src/main/java/org/jabref/logic/importer/util/UrlIdentifierParser.java[10-13]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The class-level comment is a paraphrase and does not explain the rationale/intent behind introducing this utility.
## Issue Context
Comments should explain *why* the code exists (e.g., centralizing URL normalization so existing identifier parsers can remain unchanged), or be removed if they add no value.
## Fix Focus Areas
- jablib/src/main/java/org/jabref/logic/importer/util/UrlIdentifierParser.java[10-13]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ The new review experience is currently in Beta. Learn more

Grey Divider

Qodo Logo

@github-actions github-actions Bot added good first issue An issue intended for project-newcomers. Varies in difficulty. component: fetcher labels Apr 16, 2026
Comment on lines +13 to +102
@Test
void parseDOIFromPlainDOI() {
String input = "10.1145/3544548.3580995";
assertTrue(UrlIdentifierParser.parseDOI(input).isPresent());
}

@Test
void parseDOIFromDoiOrgURL() {
String input = "https://doi.org/10.1145/3544548.3580995";
assertTrue(UrlIdentifierParser.parseDOI(input).isPresent());
}

@Test
void parseDOIFromDxDoiOrgURL() {
String input = "https://dx.doi.org/10.1145/3544548.3580995";
assertTrue(UrlIdentifierParser.parseDOI(input).isPresent());
}

@Test
void parseDOIFromHTTPURL() {
String input = "http://doi.org/10.1145/3544548.3580995";
assertTrue(UrlIdentifierParser.parseDOI(input).isPresent());
}

@Test
void parseDOIFromACMDigitalLibrary() {
String input = "https://dl.acm.org/doi/10.1145/3544548.3580995";
assertTrue(UrlIdentifierParser.parseDOI(input).isPresent());
}

@Test
void parseDOIFromACMAbsURL() {
String input = "https://dl.acm.org/doi/abs/10.1145/3544548.3580995";
assertTrue(UrlIdentifierParser.parseDOI(input).isPresent());
}

@Test
void parseDOIReturnsEmptyForNull() {
assertFalse(UrlIdentifierParser.parseDOI(null).isPresent());
}

@Test
void parseDOIReturnsEmptyForEmptyString() {
assertFalse(UrlIdentifierParser.parseDOI("").isPresent());
}

@Test
void parseDOIReturnsEmptyForInvalidURL() {
assertFalse(UrlIdentifierParser.parseDOI("https://example.com").isPresent());
}

@Test
void parseArXivFromPlainID() {
String input = "2203.02155";
assertTrue(UrlIdentifierParser.parseArXiv(input).isPresent());
}

@Test
void parseArXivFromAbsURL() {
String input = "https://arxiv.org/abs/2203.02155";
assertTrue(UrlIdentifierParser.parseArXiv(input).isPresent());
}

@Test
void parseArXivFromPDFURL() {
String input = "https://arxiv.org/pdf/2203.02155.pdf";
assertTrue(UrlIdentifierParser.parseArXiv(input).isPresent());
}

@Test
void parseArXivFromHTTPURL() {
String input = "http://arxiv.org/abs/2203.02155";
assertTrue(UrlIdentifierParser.parseArXiv(input).isPresent());
}

@Test
void parseArXivReturnsEmptyForNull() {
assertFalse(UrlIdentifierParser.parseArXiv(null).isPresent());
}

@Test
void parseArXivReturnsEmptyForInvalidURL() {
assertFalse(UrlIdentifierParser.parseArXiv("https://example.com").isPresent());
}

@Test
void parseArXivHandlesOldIDFormat() {
String input = "https://arxiv.org/abs/math.GT/0309136";
assertTrue(UrlIdentifierParser.parseArXiv(input).isPresent());
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

1. Tests only assert presence 📘 Rule violation ≡ Correctness

UrlIdentifierParserTest uses
assertTrue(optional.isPresent())/assertFalse(optional.isPresent()) instead of asserting the
exact parsed DOI/arXiv value. This weakens test precision and can allow incorrect-but-present
parsing results to pass.
Agent Prompt
## Issue description
The new tests only assert `Optional.isPresent()` / `isPresent()`-negation, which is a weak predicate and does not verify that the parser extracted the correct DOI/arXiv identifier.

## Issue Context
Per test compliance, assertions should compare against the full expected value/structure (e.g., `assertEquals(expectedOptional, actualOptional)`).

## Fix Focus Areas
- jablib/src/test/java/org/jabref/logic/importer/util/UrlIdentifierParserTest.java[13-102]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Comment on lines 340 to 344
public Optional<BibEntry> performSearchById(String identifier) throws FetcherException {
CompletableFuture<Optional<BibEntry>> arXivBibEntryPromise = arXiv.asyncPerformSearchById(identifier);
if (this.doiFetcher != null) {
inplaceAsyncInfuseArXivWithDoi(arXivBibEntryPromise, ArXivIdentifier.parse(identifier));
inplaceAsyncInfuseArXivWithDoi(arXivBibEntryPromise, UrlIdentifierParser.parseArXiv(identifier));
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

2. Arxiv pdf url fails 🐞 Bug ≡ Correctness

ArXivFetcher.performSearchById still calls arXiv.asyncPerformSearchById(identifier) with the raw
input, so https://arxiv.org/pdf/<id>.pdf fails because ArXivIdentifier.parse rejects the trailing
.pdf.
The new UrlIdentifierParser.parseArXiv result is only used for the DOI-infusion optimization, not
for the actual arXiv lookup, so the feature doesn’t work end-to-end for PDF URLs.
Agent Prompt
### Issue description
`ArXivFetcher.performSearchById` still passes the raw user input into `arXiv.asyncPerformSearchById(...)`. For inputs like `https://arxiv.org/pdf/2203.02155.pdf`, `ArXivIdentifier.parse(...)` rejects the `.pdf` suffix, so the actual fetch returns empty even though `UrlIdentifierParser.parseArXiv` can normalize this URL.

### Issue Context
`UrlIdentifierParser.parseArXiv(identifier)` is currently only used to accelerate DOI infusion, not to normalize the identifier used in the actual arXiv fetch.

### Fix Focus Areas
- jablib/src/main/java/org/jabref/logic/importer/fetcher/ArXivFetcher.java[339-346]

### Suggested fix
1. Parse once at the start of `performSearchById`:
   - `Optional<ArXivIdentifier> parsed = UrlIdentifierParser.parseArXiv(identifier);`
2. If `parsed.isEmpty()`, return `Optional.empty()` (or keep existing behavior, but avoid calling the API with a non-parseable URL).
3. Call `arXiv.asyncPerformSearchById(parsed.get().asString())` instead of using `identifier`.
4. Pass `parsed` into `inplaceAsyncInfuseArXivWithDoi(...)` so both the fetch and the optimization use the same normalized value.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Comment on lines +19 to +40
private static final Pattern DOI_ACM_PATTERN =
Pattern.compile("https?://dl\\.acm\\.org/doi/(?:abs/)?(.+)");

private static final Pattern ARXIV_URL_PATTERN =
Pattern.compile("https?://arxiv\\.org/(?:abs|pdf)/([\\w.\\-]+?)(?:\\.pdf)?$");

public static Optional<DOI> parseDOI(String input) {
if (input == null || input.isBlank()) {
return Optional.empty();
}

String trimmedInput = input.trim();

Matcher doiUrlMatcher = DOI_URL_PATTERN.matcher(trimmedInput);
if (doiUrlMatcher.find()) {
return DOI.parse(doiUrlMatcher.group(1));
}

Matcher acmMatcher = DOI_ACM_PATTERN.matcher(trimmedInput);
if (acmMatcher.find()) {
return DOI.parse(acmMatcher.group(1));
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

4. Acm doi regex rejects pdf 🐞 Bug ≡ Correctness

UrlIdentifierParser.parseDOI short-circuits on DOI_ACM_PATTERN and captures everything after
/doi/, so URLs like https://dl.acm.org/doi/pdf/10.... are turned into pdf/10.... and then
rejected by DOI.parse.
This is a regression because DOI.parse is already able to extract a DOI embedded later in an
arbitrary https URL.
Agent Prompt
### Issue description
`UrlIdentifierParser.parseDOI` uses `DOI_ACM_PATTERN = https?://dl.acm.org/doi/(?:abs/)?(.+)` and returns `DOI.parse(acmMatcher.group(1))`. For common ACM URLs such as `/doi/pdf/10.1145/...`, this extracts `pdf/10.1145/...` and causes parsing to fail.

### Issue Context
`DOI.parse(...)` is already designed to handle many URL forms by allowing an arbitrary `https?://...` prefix before the DOI group.

### Fix Focus Areas
- jablib/src/main/java/org/jabref/logic/importer/util/UrlIdentifierParser.java[16-43]

### Suggested fix options
Option A (simplest/robust):
- Remove the special-case ACM/doi.org regexes and just `return DOI.parse(trimmedInput);` (or use `DOI.findInText(trimmedInput)` first if you want to safely ignore query/fragment junk).

Option B (keep special-cases):
- Tighten the ACM pattern to capture only a DOI starting with `10.` and stop at query/fragment:
  - e.g., `https?://dl\\.acm\\.org/doi/(?:abs/|pdf/|full/)?(10\\.[^\\s?#]+)`
- Use `matches()` (or anchor with `^...$`) instead of `find()` so you don’t accidentally capture trailing unrelated text.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

@github-actions github-actions Bot added the status: changes-required Pull requests that are not yet complete label Apr 16, 2026
- Add value assertions to tests (verify actual extracted DOI values)
- Use extracted DOI in mEDRA call (fixes mEDRA lookups with URLs)
- Properly extract arXiv ID before passing to fetcher

Addresses review comments on PR
@jabref-machine
Copy link
Copy Markdown
Collaborator

You introduced non-Markdown JavaDoc. Please use Markdown JavaDoc (///).

@jabref-machine
Copy link
Copy Markdown
Collaborator

Your code currently does not meet JabRef's code guidelines. IntelliJ auto format covers some cases. There seem to be issues with your code style and autoformat configuration. Please reformat your code (Ctrl+Alt+L) and commit, then push.

@faneeshh
Copy link
Copy Markdown
Contributor

You need to disclose the use of AI in the PR. Contributing Guide

@jabref-machine
Copy link
Copy Markdown
Collaborator

You have removed the section "Checklist" from your pull request description. Please adhere to our pull request template.

@calixtus
Copy link
Copy Markdown
Member

calixtus commented Apr 17, 2026

Hello @Guru6446 welcome to JabRef community and thank you for your interest.
Please use a proper title for your Pull Request.

I noticed that you made a PR for an issue, for which already another PR exists. If we decide to finish and merge the other PR (which is not unlikely, since we already put some review work in it), all your work would be in vain. This would be very sad, since we have many other issues, that still needs someone to take a look on.

In the future, please make sure first, that there is no other PR already open for your PR. Our assignment system for github has its limits and it does not guarantee that something is overlooked. You are still responsible.

Please understand that until the other PR mentioned is merged or closed, there wont be any work put in this PR from our side about reviewing your PR, to save our time.

@calixtus
Copy link
Copy Markdown
Member

Please also fix your PR description

@pluto-han
Copy link
Copy Markdown
Collaborator

pluto-han commented Apr 17, 2026

Please do not use AI to generate PR discription, jabRef has its own PR discription format

You have only done the backend part, please also implement the ui, and before then please mark this PR as a draft.

Edit: After a quick look, your code is wrong.

  1. Why change the fetcher of ArXiv and Doi?
  2. In UrlIdentifierParser.java‎, you should fetch URL, not Doi or ArXiv.

@subhramit
Copy link
Copy Markdown
Member

Contributor not responsive, and PR description format is completely changed.
A guess might be this contribution was done by a bot, and not checked on later.
Closing.

@subhramit subhramit closed this Apr 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

component: fetcher good first issue An issue intended for project-newcomers. Varies in difficulty. status: changes-required Pull requests that are not yet complete

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add entry using URL

6 participants