JabRef/jabref#15565
Conversation
- Create UrlIdentifierParser utility class - Extract DOI from various URL formats (doi.org, dx.doi.org, dl.acm.org) - Extract arXiv ID from URLs (arxiv.org/abs/, arxiv.org/pdf/) - Add 16 comprehensive unit tests (all passing) - Maintains backward compatibility with plain IDs Supports: - DOI URLs: https://doi.org/10.1145/..., https://dx.doi.org/..., https://dl.acm.org/doi/... - arXiv URLs: https://arxiv.org/abs/..., https://arxiv.org/pdf/....pdf - Plain IDs: 10.1145/... (DOI), 2203.02155 (arXiv) Fixes JabRef#15411
- Use UrlIdentifierParser.parseDOI() instead of DOI.parse() - Now supports DOI URLs (doi.org, dx.doi.org, dl.acm.org) - Maintains backward compatibility with plain DOIs Part of JabRef#15411
- Use UrlIdentifierParser.parseArXiv() instead of ArXivIdentifier.parse() - Now supports arXiv URLs (arxiv.org/abs/, arxiv.org/pdf/) - Maintains backward compatibility with plain arXiv IDs Part of JabRef#15411
Review Summary by QodoAdd URL identifier parsing for DOI and arXiv fetchers
WalkthroughsDescription• Add UrlIdentifierParser utility to extract identifiers from URLs • Support DOI URLs (doi.org, dx.doi.org, dl.acm.org formats) • Support arXiv URLs (arxiv.org/abs/, arxiv.org/pdf/ formats) • Update DoiFetcher and ArXivFetcher to use new parser • Maintain backward compatibility with plain identifiers Diagramflowchart LR
Input["User Input<br/>URL or Plain ID"]
Parser["UrlIdentifierParser"]
DOIParser["parseDOI()"]
ArXivParser["parseArXiv()"]
DOIFetcher["DoiFetcher"]
ArXivFetcher["ArXivFetcher"]
Input --> Parser
Parser --> DOIParser
Parser --> ArXivParser
DOIParser --> DOIFetcher
ArXivParser --> ArXivFetcher
File Changes1. jablib/src/main/java/org/jabref/logic/importer/util/UrlIdentifierParser.java
|
Code Review by Qodo
1. Tests only assert presence
|
| @Test | ||
| void parseDOIFromPlainDOI() { | ||
| String input = "10.1145/3544548.3580995"; | ||
| assertTrue(UrlIdentifierParser.parseDOI(input).isPresent()); | ||
| } | ||
|
|
||
| @Test | ||
| void parseDOIFromDoiOrgURL() { | ||
| String input = "https://doi.org/10.1145/3544548.3580995"; | ||
| assertTrue(UrlIdentifierParser.parseDOI(input).isPresent()); | ||
| } | ||
|
|
||
| @Test | ||
| void parseDOIFromDxDoiOrgURL() { | ||
| String input = "https://dx.doi.org/10.1145/3544548.3580995"; | ||
| assertTrue(UrlIdentifierParser.parseDOI(input).isPresent()); | ||
| } | ||
|
|
||
| @Test | ||
| void parseDOIFromHTTPURL() { | ||
| String input = "http://doi.org/10.1145/3544548.3580995"; | ||
| assertTrue(UrlIdentifierParser.parseDOI(input).isPresent()); | ||
| } | ||
|
|
||
| @Test | ||
| void parseDOIFromACMDigitalLibrary() { | ||
| String input = "https://dl.acm.org/doi/10.1145/3544548.3580995"; | ||
| assertTrue(UrlIdentifierParser.parseDOI(input).isPresent()); | ||
| } | ||
|
|
||
| @Test | ||
| void parseDOIFromACMAbsURL() { | ||
| String input = "https://dl.acm.org/doi/abs/10.1145/3544548.3580995"; | ||
| assertTrue(UrlIdentifierParser.parseDOI(input).isPresent()); | ||
| } | ||
|
|
||
| @Test | ||
| void parseDOIReturnsEmptyForNull() { | ||
| assertFalse(UrlIdentifierParser.parseDOI(null).isPresent()); | ||
| } | ||
|
|
||
| @Test | ||
| void parseDOIReturnsEmptyForEmptyString() { | ||
| assertFalse(UrlIdentifierParser.parseDOI("").isPresent()); | ||
| } | ||
|
|
||
| @Test | ||
| void parseDOIReturnsEmptyForInvalidURL() { | ||
| assertFalse(UrlIdentifierParser.parseDOI("https://example.com").isPresent()); | ||
| } | ||
|
|
||
| @Test | ||
| void parseArXivFromPlainID() { | ||
| String input = "2203.02155"; | ||
| assertTrue(UrlIdentifierParser.parseArXiv(input).isPresent()); | ||
| } | ||
|
|
||
| @Test | ||
| void parseArXivFromAbsURL() { | ||
| String input = "https://arxiv.org/abs/2203.02155"; | ||
| assertTrue(UrlIdentifierParser.parseArXiv(input).isPresent()); | ||
| } | ||
|
|
||
| @Test | ||
| void parseArXivFromPDFURL() { | ||
| String input = "https://arxiv.org/pdf/2203.02155.pdf"; | ||
| assertTrue(UrlIdentifierParser.parseArXiv(input).isPresent()); | ||
| } | ||
|
|
||
| @Test | ||
| void parseArXivFromHTTPURL() { | ||
| String input = "http://arxiv.org/abs/2203.02155"; | ||
| assertTrue(UrlIdentifierParser.parseArXiv(input).isPresent()); | ||
| } | ||
|
|
||
| @Test | ||
| void parseArXivReturnsEmptyForNull() { | ||
| assertFalse(UrlIdentifierParser.parseArXiv(null).isPresent()); | ||
| } | ||
|
|
||
| @Test | ||
| void parseArXivReturnsEmptyForInvalidURL() { | ||
| assertFalse(UrlIdentifierParser.parseArXiv("https://example.com").isPresent()); | ||
| } | ||
|
|
||
| @Test | ||
| void parseArXivHandlesOldIDFormat() { | ||
| String input = "https://arxiv.org/abs/math.GT/0309136"; | ||
| assertTrue(UrlIdentifierParser.parseArXiv(input).isPresent()); | ||
| } |
There was a problem hiding this comment.
1. Tests only assert presence 📘 Rule violation ≡ Correctness
UrlIdentifierParserTest uses assertTrue(optional.isPresent())/assertFalse(optional.isPresent()) instead of asserting the exact parsed DOI/arXiv value. This weakens test precision and can allow incorrect-but-present parsing results to pass.
Agent Prompt
## Issue description
The new tests only assert `Optional.isPresent()` / `isPresent()`-negation, which is a weak predicate and does not verify that the parser extracted the correct DOI/arXiv identifier.
## Issue Context
Per test compliance, assertions should compare against the full expected value/structure (e.g., `assertEquals(expectedOptional, actualOptional)`).
## Fix Focus Areas
- jablib/src/test/java/org/jabref/logic/importer/util/UrlIdentifierParserTest.java[13-102]
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
| public Optional<BibEntry> performSearchById(String identifier) throws FetcherException { | ||
| CompletableFuture<Optional<BibEntry>> arXivBibEntryPromise = arXiv.asyncPerformSearchById(identifier); | ||
| if (this.doiFetcher != null) { | ||
| inplaceAsyncInfuseArXivWithDoi(arXivBibEntryPromise, ArXivIdentifier.parse(identifier)); | ||
| inplaceAsyncInfuseArXivWithDoi(arXivBibEntryPromise, UrlIdentifierParser.parseArXiv(identifier)); | ||
| } |
There was a problem hiding this comment.
2. Arxiv pdf url fails 🐞 Bug ≡ Correctness
ArXivFetcher.performSearchById still calls arXiv.asyncPerformSearchById(identifier) with the raw input, so https://arxiv.org/pdf/<id>.pdf fails because ArXivIdentifier.parse rejects the trailing .pdf. The new UrlIdentifierParser.parseArXiv result is only used for the DOI-infusion optimization, not for the actual arXiv lookup, so the feature doesn’t work end-to-end for PDF URLs.
Agent Prompt
### Issue description
`ArXivFetcher.performSearchById` still passes the raw user input into `arXiv.asyncPerformSearchById(...)`. For inputs like `https://arxiv.org/pdf/2203.02155.pdf`, `ArXivIdentifier.parse(...)` rejects the `.pdf` suffix, so the actual fetch returns empty even though `UrlIdentifierParser.parseArXiv` can normalize this URL.
### Issue Context
`UrlIdentifierParser.parseArXiv(identifier)` is currently only used to accelerate DOI infusion, not to normalize the identifier used in the actual arXiv fetch.
### Fix Focus Areas
- jablib/src/main/java/org/jabref/logic/importer/fetcher/ArXivFetcher.java[339-346]
### Suggested fix
1. Parse once at the start of `performSearchById`:
- `Optional<ArXivIdentifier> parsed = UrlIdentifierParser.parseArXiv(identifier);`
2. If `parsed.isEmpty()`, return `Optional.empty()` (or keep existing behavior, but avoid calling the API with a non-parseable URL).
3. Call `arXiv.asyncPerformSearchById(parsed.get().asString())` instead of using `identifier`.
4. Pass `parsed` into `inplaceAsyncInfuseArXivWithDoi(...)` so both the fetch and the optimization use the same normalized value.
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
| private static final Pattern DOI_ACM_PATTERN = | ||
| Pattern.compile("https?://dl\\.acm\\.org/doi/(?:abs/)?(.+)"); | ||
|
|
||
| private static final Pattern ARXIV_URL_PATTERN = | ||
| Pattern.compile("https?://arxiv\\.org/(?:abs|pdf)/([\\w.\\-]+?)(?:\\.pdf)?$"); | ||
|
|
||
| public static Optional<DOI> parseDOI(String input) { | ||
| if (input == null || input.isBlank()) { | ||
| return Optional.empty(); | ||
| } | ||
|
|
||
| String trimmedInput = input.trim(); | ||
|
|
||
| Matcher doiUrlMatcher = DOI_URL_PATTERN.matcher(trimmedInput); | ||
| if (doiUrlMatcher.find()) { | ||
| return DOI.parse(doiUrlMatcher.group(1)); | ||
| } | ||
|
|
||
| Matcher acmMatcher = DOI_ACM_PATTERN.matcher(trimmedInput); | ||
| if (acmMatcher.find()) { | ||
| return DOI.parse(acmMatcher.group(1)); | ||
| } |
There was a problem hiding this comment.
4. Acm doi regex rejects pdf 🐞 Bug ≡ Correctness
UrlIdentifierParser.parseDOI short-circuits on DOI_ACM_PATTERN and captures everything after /doi/, so URLs like https://dl.acm.org/doi/pdf/10.... are turned into pdf/10.... and then rejected by DOI.parse. This is a regression because DOI.parse is already able to extract a DOI embedded later in an arbitrary https URL.
Agent Prompt
### Issue description
`UrlIdentifierParser.parseDOI` uses `DOI_ACM_PATTERN = https?://dl.acm.org/doi/(?:abs/)?(.+)` and returns `DOI.parse(acmMatcher.group(1))`. For common ACM URLs such as `/doi/pdf/10.1145/...`, this extracts `pdf/10.1145/...` and causes parsing to fail.
### Issue Context
`DOI.parse(...)` is already designed to handle many URL forms by allowing an arbitrary `https?://...` prefix before the DOI group.
### Fix Focus Areas
- jablib/src/main/java/org/jabref/logic/importer/util/UrlIdentifierParser.java[16-43]
### Suggested fix options
Option A (simplest/robust):
- Remove the special-case ACM/doi.org regexes and just `return DOI.parse(trimmedInput);` (or use `DOI.findInText(trimmedInput)` first if you want to safely ignore query/fragment junk).
Option B (keep special-cases):
- Tighten the ACM pattern to capture only a DOI starting with `10.` and stop at query/fragment:
- e.g., `https?://dl\\.acm\\.org/doi/(?:abs/|pdf/|full/)?(10\\.[^\\s?#]+)`
- Use `matches()` (or anchor with `^...$`) instead of `find()` so you don’t accidentally capture trailing unrelated text.
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
- Add value assertions to tests (verify actual extracted DOI values) - Use extracted DOI in mEDRA call (fixes mEDRA lookups with URLs) - Properly extract arXiv ID before passing to fetcher Addresses review comments on PR
|
You introduced non-Markdown JavaDoc. Please use Markdown JavaDoc ( |
|
Your code currently does not meet JabRef's code guidelines. IntelliJ auto format covers some cases. There seem to be issues with your code style and autoformat configuration. Please reformat your code (Ctrl+Alt+L) and commit, then push. |
|
You need to disclose the use of AI in the PR. Contributing Guide |
|
You have removed the section "Checklist" from your pull request description. Please adhere to our pull request template. |
|
Hello @Guru6446 welcome to JabRef community and thank you for your interest. I noticed that you made a PR for an issue, for which already another PR exists. If we decide to finish and merge the other PR (which is not unlikely, since we already put some review work in it), all your work would be in vain. This would be very sad, since we have many other issues, that still needs someone to take a look on. In the future, please make sure first, that there is no other PR already open for your PR. Our assignment system for github has its limits and it does not guarantee that something is overlooked. You are still responsible. Please understand that until the other PR mentioned is merged or closed, there wont be any work put in this PR from our side about reviewing your PR, to save our time. |
|
Please also fix your PR description |
|
Please do not use AI to generate PR discription, jabRef has its own PR discription format You have only done the backend part, please also implement the ui, and before then please mark this PR as a draft. Edit: After a quick look, your code is wrong.
|
|
Contributor not responsive, and PR description format is completely changed. |
Description
Closes #15411
Implements feature request - Allow users to add entries using full URLs instead of just plain identifiers.
Currently, when a user tries to add a new entry using a URL like
https://doi.org/10.1145/3544548.3580995, JabRef shows an error. This PR adds support for parsing common URL formats automatically.Changes
New Files
jablib/src/main/java/org/jabref/logic/importer/util/UrlIdentifierParser.java- URL parser utilityjablib/src/test/java/org/jabref/logic/importer/util/UrlIdentifierParserTest.java- 16 unit testsModified Files
jablib/src/main/java/org/jabref/logic/importer/fetcher/DoiFetcher.java- Use new parserjablib/src/main/java/org/jabref/logic/importer/fetcher/ArXivFetcher.java- Use new parserSupported URL Formats
DOI
https://doi.org/10.1145/3544548.3580995https://dx.doi.org/10.1145/3544548.3580995http://doi.org/10.1145/3544548.3580995https://dl.acm.org/doi/10.1145/3544548.3580995https://dl.acm.org/doi/abs/10.1145/3544548.3580995arXiv
https://arxiv.org/abs/2203.02155https://arxiv.org/pdf/2203.02155.pdfhttp://arxiv.org/abs/2203.02155https://arxiv.org/abs/math.GT/0309136Backward Compatibility
✅ All existing functionality preserved:
10.1145/3544548.3580995still works2203.02155still worksTesting
./gradlew jablib:compileJava)./gradlew jablib:test --tests UrlIdentifierParserTestImplementation Details
The
UrlIdentifierParseruses regex patterns to:DOI.parse()orArXivIdentifier.parse()This minimizes changes to existing code while adding new functionality.
Fixes #15411