Add RefChecker logic for reference validation #15478
NishantDG-SST wants to merge 5 commits into JabRef:main from
Conversation
Review Summary by Qodo

Walkthrough

Description:
- Add RefChecker logic for validating bibliographic entries against online sources
- Implement entry comparison using DOI, CrossRef, and arXiv fetchers
- Add `compareEntries` method to `DuplicateCheck` for similarity scoring
- Introduce `RefValidity` enum with REAL, UNSURE, and FAKE classifications

Diagram:

```mermaid
flowchart LR
    Entry["BibEntry to validate"]
    DOI["DOI Lookup"]
    CrossRef["CrossRef Discovery"]
    ArXiv["ArXiv Lookup"]
    Compare["compareEntries Similarity"]
    Result["RefCheckResult with validity"]
    Entry --> DOI
    Entry --> CrossRef
    Entry --> ArXiv
    DOI --> Compare
    CrossRef --> Compare
    ArXiv --> Compare
    Compare --> Result
```
File Changes:
1. jablib/src/main/java/org/jabref/logic/database/DuplicateCheck.java
Code Review by Qodo
1.
```java
@Test
void entriesWithIdenticalTitles() {
    BibEntry one = new BibEntry().withField(StandardField.TITLE, "Reinforcement learning: An introduction");
    BibEntry two = new BibEntry().withField(StandardField.TITLE, "Reinforcement learning: An introduction");

    double score = DuplicateCheck.compareEntries(one, two);

    assertTrue(score >= DuplicateCheck.COMPARE_ENTRIES_THRESHOLD);
}

@Test
void entriesWithCompletelyDifferentFields() {
    BibEntry one = new BibEntry()
            .withField(StandardField.TITLE, "Performance on a Signal")
            .withField(StandardField.AUTHOR, "Richard Atkinson");
    BibEntry two = new BibEntry()
            .withField(StandardField.TITLE, "Rest in Treatment")
            .withField(StandardField.AUTHOR, "Elizabeth Ballard");

    double score = DuplicateCheck.compareEntries(one, two);

    assertTrue(score < DuplicateCheck.COMPARE_ENTRIES_THRESHOLD);
}
```
3. Weak threshold asserts in tests 📘 Rule violation ☼ Reliability
New tests use predicate assertions (assertTrue(score >= threshold) / < threshold) instead of asserting exact expected values, weakening regression detection. This violates the unit test requirement to assert exact values/outputs where possible.
Agent Prompt
## Issue description
The added `DuplicateCheck.compareEntries` tests use threshold-based `assertTrue` predicates, which are considered weak checks.
## Issue Context
Update the tests to assert exact expected values (or exact expected structures) to strengthen regression detection.
## Fix Focus Areas
- jablib/src/test/java/org/jabref/logic/database/DuplicateCheckTest.java[665-687]
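To make the distinction concrete, here is a minimal, self-contained sketch (plain Java with a hypothetical score value; in JUnit the exact check would be `assertEquals(expected, score, delta)`):

```java
public class ExactAssertionSketch {
    // A threshold predicate passes for a whole range of scores; an exact
    // comparison with a small delta pins the behavior down.
    static boolean thresholdCheck(double score, double threshold) {
        return score >= threshold;
    }

    static boolean exactCheck(double score, double expected, double delta) {
        return Math.abs(score - expected) <= delta;
    }

    public static void main(String[] args) {
        double score = 1.0; // hypothetical: identical titles in a fixed fixture
        System.out.println(thresholdCheck(score, 0.8)); // would also pass for 0.81, 0.9, ...
        System.out.println(exactCheck(score, 1.0, 1e-6)); // fails if scoring changes at all
    }
}
```

An exact assertion fails on any scoring regression, while the threshold predicate only fails once the score drifts below the cutoff.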
```java
@Test
void entryWithCorrectDoiButWrongMetadataIsNotClassifiedAsReal() {
    BibEntry entry = new BibEntry(StandardEntryType.Article)
            .withField(StandardField.TITLE, "Not a Real Paper")
            .withField(StandardField.AUTHOR, "Random Author")
            .withField(StandardField.YEAR, "2099")
            .withField(StandardField.DOI, "10.48550/arXiv.1706.03762");

    RefCheckResult result = refChecker.check(entry);

    assertNotEquals(RefValidity.REAL, result.validity());
}
```
4. assertNotEquals weakens RefChecker test 📘 Rule violation ☼ Reliability
The test only asserts the result is "not REAL" via assertNotEquals, which is a weak predicate and can pass for multiple unintended outcomes. The test should assert the exact expected RefValidity (or a complete expected result shape) to meet unit test strength requirements.
Agent Prompt
## Issue description
`RefCheckerTest.entryWithCorrectDoiButWrongMetadataIsNotClassifiedAsReal` uses `assertNotEquals(REAL, ...)`, which is a weak predicate check.
## Issue Context
Change the assertion to an exact expected validity (or assert the full expected `RefCheckResult` properties) so the test fails on near-miss behavior changes.
## Fix Focus Areas
- jablib/src/test/java/org/jabref/logic/refcheck/RefCheckerTest.java[47-58]
Pull request overview

This PR introduces initial "RefChecker" logic in jablib to validate bibliographic entries by resolving them via DOI/CrossRef/arXiv and classifying them based on similarity to fetched authoritative metadata.

Changes:
- Added new refcheck domain types (RefChecker, RefCheckResult, RefValidity) and the online-validation flow.
- Added DuplicateCheck.compareEntries(...) plus a shared threshold constant to support similarity-based validation.
- Added initial integration-style tests for RefChecker and extended DuplicateCheckTest; updated CHANGELOG.md.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
Summary per file:

| File | Description |
|---|---|
| jablib/src/main/java/org/jabref/logic/refcheck/RefChecker.java | Implements the online lookup + classification flow (DOI → CrossRef → arXiv) and picks the best result. |
| jablib/src/main/java/org/jabref/logic/refcheck/RefCheckResult.java | Adds a result record carrying validity, optional matched entry, and similarity score. |
| jablib/src/main/java/org/jabref/logic/refcheck/RefValidity.java | Defines the classification enum (REAL/UNSURE/FAKE). |
| jablib/src/main/java/org/jabref/logic/database/DuplicateCheck.java | Adds similarity scoring (compareEntries) and a threshold constant used by refcheck. |
| jablib/src/test/java/org/jabref/logic/refcheck/RefCheckerTest.java | Adds initial fetcher-backed tests covering "real", "not real", and "nonexistent" cases. |
| jablib/src/test/java/org/jabref/logic/database/DuplicateCheckTest.java | Adds unit tests for compareEntries behavior (self-compare, internal field ignore, etc.). |
| CHANGELOG.md | Documents the addition of RefChecker logic. |
```java
@Test
void realPaperWithCorrectDoiIsClassifiedAsReal() {
    BibEntry entry = new BibEntry(StandardEntryType.Article)
            .withField(StandardField.TITLE, "Attention Is All You Need")
            .withField(StandardField.AUTHOR, "Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and others")
            .withField(StandardField.YEAR, "2017")
            .withField(StandardField.DOI, "10.48550/arXiv.1706.03762");

    RefCheckResult result = refChecker.check(entry);

    assertEquals(RefValidity.REAL, result.validity());
}

@Test
void entryWithCorrectDoiButWrongMetadataIsNotClassifiedAsReal() {
    BibEntry entry = new BibEntry(StandardEntryType.Article)
            .withField(StandardField.TITLE, "Not a Real Paper")
            .withField(StandardField.AUTHOR, "Random Author")
            .withField(StandardField.YEAR, "2099")
            .withField(StandardField.DOI, "10.48550/arXiv.1706.03762");

    RefCheckResult result = refChecker.check(entry);

    assertNotEquals(RefValidity.REAL, result.validity());
}

@Test
void entryThatDoesNotExistAnywhereIsClassifiedAsFake() {
    BibEntry entry = new BibEntry(StandardEntryType.Article)
            .withField(StandardField.TITLE, "Nonexistent Paper with no Database")
            .withField(StandardField.AUTHOR, "No Author")
            .withField(StandardField.YEAR, "1800");

    RefCheckResult result = refChecker.check(entry);

    assertEquals(RefValidity.FAKE, result.validity());
}
```
These tests are integration-style and depend on live responses from external services (CrossRef/DOI/arXiv). That makes them prone to flakiness when metadata formatting or search results change (especially the “does not exist anywhere” case, where CrossRef could still return a fuzzy match). Prefer a deterministic unit test by injecting mocked DoiFetcher/ArXivFetcher/CrossRef via the 3-arg RefChecker constructor and asserting on controlled responses.
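A deterministic variant could look roughly like the following simplified, self-contained sketch. The `EntryFetcher` interface and `stub` helper are illustrative stand-ins, not JabRef's actual fetcher API; in the real test suite the mocked `DoiFetcher`/`ArXivFetcher`/`CrossRef` instances (e.g. via Mockito) would be passed to the 3-arg `RefChecker` constructor.

```java
import java.util.Optional;

// Illustrative stand-in for a fetcher interface; real JabRef fetchers return BibEntry.
interface EntryFetcher {
    Optional<String> fetchTitle(String id);
}

public class StubFetcherDemo {
    // A deterministic stub that never touches the network: it answers only for
    // the one identifier it was configured with.
    static EntryFetcher stub(String expectedId, String title) {
        return id -> expectedId.equals(id) ? Optional.of(title) : Optional.empty();
    }

    public static void main(String[] args) {
        EntryFetcher doiFetcher = stub("10.48550/arXiv.1706.03762", "Attention Is All You Need");
        // Assertions can now target a controlled response instead of a live API.
        System.out.println(doiFetcher.fetchTitle("10.48550/arXiv.1706.03762").orElse("not found"));
        System.out.println(doiFetcher.fetchTitle("10.1000/unknown").orElse("not found"));
    }
}
```

The "does not exist anywhere" case then becomes a stub that returns empty for every source, which cannot flake on a fuzzy CrossRef match.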
```java
String firstValue = one.getField(field).orElse("");
String secondValue = two.getField(field).orElse("");
```
compareEntries uses getField(...) (raw field content) which can penalize harmless formatting differences (LaTeX braces/escaping, whitespace, line breaks) and lead to false FAKE/UNSURE classifications. Since DuplicateCheck already normalizes via getFieldLatexFree(...) in its comparison logic, consider using latex-free/normalized values here as well to keep scoring consistent with the rest of the duplicate-checking implementation.
Suggested change:

```java
String firstValue = one.getFieldLatexFree(field).orElse("");
String secondValue = two.getFieldLatexFree(field).orElse("");
```
```java
    return bestOf(doiResult, crossRefResult, arXivResult);
}
```
If none of the sources yields any candidate (otherEntry == null / score 0.0), bestOf(...) currently returns FAKE (because each lookup returns FAKE on “not found”). That conflates “not found / could not verify” with “verified mismatch” and can mislabel obscure/older but real publications as fake. Consider returning UNSURE when no authoritative candidate was found from any source, and reserving FAKE for the case where a candidate exists but similarity is low.
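The suggested separation of "no candidate at all" from "candidate with low similarity" could be sketched as follows. This is a hypothetical simplification (the `Result` record and `candidateFound` flag are illustrative, not the PR's actual types):

```java
public class BestOfSketch {
    enum Validity { REAL, UNSURE, FAKE }

    // candidateFound distinguishes "nothing fetched" from "fetched but mismatched".
    record Result(Validity validity, boolean candidateFound, double score) { }

    // Prefer the highest-scoring result; if no source produced a candidate at
    // all, report UNSURE instead of letting per-source FAKE results win.
    static Result bestOf(Result... results) {
        Result best = null;
        boolean anyCandidate = false;
        for (Result r : results) {
            anyCandidate |= r.candidateFound();
            if (best == null || r.score() > best.score()) {
                best = r;
            }
        }
        if (!anyCandidate) {
            return new Result(Validity.UNSURE, false, 0.0);
        }
        return best;
    }

    public static void main(String[] args) {
        Result none = new Result(Validity.FAKE, false, 0.0);
        System.out.println(bestOf(none, none, none).validity());      // no candidate anywhere
        Result lowMatch = new Result(Validity.FAKE, true, 0.3);
        System.out.println(bestOf(none, lowMatch, none).validity());  // candidate found but mismatched
    }
}
```

Under this sketch, FAKE is reserved for the case where at least one authoritative candidate was retrieved and similarity stayed low.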
If the intended behavior is:
- no candidate found -> FAKE
- fetch failure -> FAKE
- low similarity candidate -> FAKE

then I think it's consistent as it is, and we probably don't need to change "not found / could not verify" to UNSURE. From the classify(...) comments, it seems like this is already the intended behavior.
@koppor do you think this is fine as-is, or do you see this as being too broad semantically?
Hi @NishantDG-SST, I tried this locally and the tests pass on my side. From a quick look, the logic now covers DOI lookup, CrossRef-based DOI discovery, and arXiv-based validation, which is a good first step 👍 A few scope / test questions after reading it:
Also, if more realistic samples are needed later, the RefChecker test suite might be a useful source of inspiration for real-world citation patterns? Happy to hear your thoughts on these.
Hey @wanling0000, thanks for testing!
Test for arXiv fallback: if a paper cannot be validated via its DOI or CrossRef, the checker attempts to resolve it using its arXiv identifiers.
Hi @NishantDG-SST, thanks for the detailed explanation and for adding the additional tests, this helps a lot 👍
The fallback plan makes sense to me.
From reading the code, I just wanted to confirm whether this matches the intended behavior, mainly so I can align on testing. Maybe it would help to document this a bit more explicitly (e.g. in docs or a small test matrix), so the expected classification is clearer.
Happy for you to continue with your current approach (no need to block on this), I'll focus on validation/testing on my side :)
Yes, you are right, that is correct, and I apologize for the confusion. I have documented this explicitly in the classify() JavaDoc and added
So, I added 3 new test cases.
Thanks for adding these, this looks good to me.
I'm fine with either keeping both for clarity or removing one if you prefer to avoid duplication.
My next steps are to
Happy to hear your thoughts and will proceed accordingly.
@NishantDG-SST Thank you for adding a new duplicate checker! The old
Sorry, I don't want my delayed response to slow down your development, so if you think you have a reasonably solid solution, just go ahead and implement it.

Regarding your question, here's what I think: it might be helpful to keep the "per-entry classification" logic and the "input orchestration / batch processing" a bit separate, so their behaviors can be tested more independently. So maybe, instead of adding this directly into RefChecker, it could stay focused as a "single BibEntry checker", and a small batch/use-case layer could handle processing a whole .bib file? (This is just a thought from my side, happy to follow whatever direction makes more sense here.)

Similarly, it might be cleaner if the grouping / assignment is handled by a separate assembler/applicator layer. But if the PR gets a bit large, splitting into a follow-up PR could also make review easier. From my side, I'll mainly focus on validating the behavior and testing as things evolve. So I was thinking: based on the algorithm described by @koppor, maybe we can first align on the expected behavior (via tests), and then the implementation can follow more freely.

As a rough draft from a testing perspective, I tried to map the current logic into a small test matrix:

Layer 1: Per-entry classification
- DOI path
- CrossRef path
- arXiv path
- Selection logic (bestOf)
- Classification semantics

One small observation: since the current tests rely on live API responses, some similarity-based cases (like UNSURE) might be hard to test deterministically.
@wanling0000 Yes, definitely, I'll keep the layers separate. So for the tests, would using Mockito to mock the fetchers in a new
Covered these tests in the unit test so far:
- DOI path
- CrossRef path
- arXiv path
- Selection logic (bestOf)
- Classification semantics

I'll cover the rest after this.
More Coverage
In D2, same-level -> higher score wins.
That is a good idea.
Just wanted to check: is this PR blocking any of your work? If so, maybe some parts could be split out into a smaller PR and merged earlier.
No no, all fine. Was just thinking of getting some more help here, since it's a large PR and you're the only one reviewing it. @pluto-han was also waiting for the new duplicate checker, but I don't think it is a blocker since his PR is now merged.
Yep, I will help review this PR. PS: I made some little changes to the existing duplicate checker and it works well now. The new duplicate checker here also looks very good.
So I have done some testing and kept the field weights, but this triggers other tests: the DOI and UNSURE tests are failing (I'll fix them). The main thing is that it's still failing the second author-mismatch test, so either I have to increase the author weight even more, or we can add something like this inside
Maybe you can try the existing weights in
@pluto-han ok, I'll try them out, thank you.
Ok, so to implement this functionality:
We can keep a map of StandardEntryTypes from
Just wanted to recheck again: if any one of the fields is non-fetchable (even if, say, 5 other fields are fetchable, assuming 6 fields provided in total), it should return UNSURE, right?
I think what @koppor meant in #13604 (comment) is: if the reference is not checkable, for example @misc entries, references created by the URL fetcher, etc., JabRef returns "unsure". But I am not sure about "REAL-LOW-QUALITY", whether JabRef should return "unsure" or "real", because in the existing duplicate checker, if two references have the same DOI, JabRef considers them duplicates even if other non-core fields differ; this PR seems to implement different logic.
I think we might be mixing two different questions here:

From this comment, my current understanding is:

Because of that, I'm wondering whether we should keep the first step smaller, since we are still not aligned on the exact meaning of
From what I see now, there might be two possible ways to move forward: (1) keep the scope minimal and just wire up the existing infrastructure so it runs end-to-end (e.g. use isDuplicate to classify checkable cases into REAL / FAKE), or (2) start from that LinkedIn test case, define the semantics more precisely, and possibly adjust the assumptions in DuplicateCheck. Personally, I lean towards (1) as a first step, and then treat (2) as a follow-up anchor case to refine the behavior. If you have already tried to adapt
@wanling0000 I agree we should keep the scope small, but I think using isDuplicate would actually decrease the quality here, as it bypasses metadata validation as soon as an identifier match is found. This means a hallucinated reference with a real DOI would be classified REAL. I'd prefer to keep compareEntries as the classifier, since it's already written and tested, and keep the UNSURE refinement ongoing.
Currently I'm working on comparing author fields positionally, by individual author rather than as raw strings, also using Levenshtein similarity on full names, so hallucinated first names are detected even when the family name matches exactly. Just wanted to be sure again whether this test case should return UNSURE or REAL, as the TITLE and YEAR match exactly, keeping the score high in the current implementation.
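The positional, Levenshtein-based comparison described above could be sketched roughly like this (a self-contained simplification; the `compareAuthors` signature and lowercase normalization are assumptions for illustration, not the PR's actual code, which parses names via `AuthorList`):

```java
public class AuthorSimilaritySketch {
    // Plain Levenshtein edit distance (two-row dynamic programming).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Normalized similarity in [0, 1]; 1.0 means identical strings.
    static double similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / max;
    }

    // Compare authors positionally on full names, so a hallucinated given name
    // lowers the score even when the family name matches exactly.
    static double compareAuthors(String[] local, String[] fetched) {
        int n = Math.min(local.length, fetched.length);
        if (n == 0) return 0.0;
        double sum = 0;
        for (int i = 0; i < n; i++) {
            sum += similarity(local[i].toLowerCase(), fetched[i].toLowerCase());
        }
        // Dividing by the longer list penalizes missing or extra authors.
        return sum / Math.max(local.length, fetched.length);
    }

    public static void main(String[] args) {
        System.out.println(compareAuthors(
                new String[] {"Mussgnug, Anna Maria"},
                new String[] {"Mussgnug, Alexander Martin"}));
    }
}
```

With raw-string comparison the two names above would simply be "not equal"; the positional per-name similarity instead yields a partial score that the classifier can weigh.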
Force-pushed: 447d615 to 1174222 (compare)
Please no force pushes.
@NishantDG-SST can you please clean up the git tree?
@subhramit should I create a new branch with a single squashed commit and open a new PR?
You can reset with upstream/main and push a single fresh commit with these changes at the end. Please be mindful of rebasing in future; read about the perils of rebasing as well as force-pushes.
Force-pushed: f246e80 to 4332569 (compare)
Force-pushed: 4332569 to 52be9eb (compare)
Thanks @subhramit, I'll try to be more mindful of rebasing from next time, sorry for the trouble.
LoayTarek5 left a comment
Thanks @NishantDG-SST, clean structure overall, good work.
I have a few concerns from my understanding that I will address:
```java
///
/// FAKE with a non-null matchedEntry means a candidate was found but did not match.
/// FAKE with null matchedEntry means nothing was found at all.
private static RefCheckResult classify(BibEntry local, BibEntry authoritative) {
```
The current mapping:
- score >= 0.8 → REAL
- score >= 0.5 → UNSURE
- score < 0.5 → FAKE

but @koppor clarified in #13604 that UNSURE means "JabRef has no way to verify this at all", not "partial match found". His example was a @misc entry pointing to a GitHub URL that no fetcher can check.
This was my understanding, is it correct?
I think this would also resolve the ongoing confusion about the LinkedIn/Mussgnug case, since "found but author doesn't match well" would be FAKE, not UNSURE.
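Under that reading, the classification would key on entry-type verifiability rather than a middle score band. A hypothetical sketch (the `VERIFIABLE_TYPES` contents and the 0.8 cutoff are illustrative assumptions, not settled project constants):

```java
import java.util.Set;

public class VerifiabilitySketch {
    enum Validity { REAL, UNSURE, FAKE }

    // Hypothetical set of entry types that online fetchers can verify.
    static final Set<String> VERIFIABLE_TYPES = Set.of("article", "inproceedings", "book");

    // UNSURE is reserved for entries JabRef cannot check at all;
    // verifiable entries are classified by similarity alone.
    static Validity classify(String entryType, double score) {
        if (!VERIFIABLE_TYPES.contains(entryType)) {
            return Validity.UNSURE;
        }
        return score >= 0.8 ? Validity.REAL : Validity.FAKE;
    }

    public static void main(String[] args) {
        System.out.println(classify("misc", 0.0));    // e.g. a GitHub URL in a @misc entry
        System.out.println(classify("article", 0.9)); // verified match
        System.out.println(classify("article", 0.4)); // found, but metadata mismatches
    }
}
```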
```java
if (doi.isEmpty()) {
    return new RefCheckResult(RefValidity.FAKE, null, 0.0);
}
```
I think that if (doi.isEmpty()) → FAKE is wrong; a missing DOI does not mean the paper is fake, it means this source can't help. The same applies to the CrossRef/arXiv "not found" cases.
As a result, if all three sources have nothing to check, bestOf returns FAKE.
As I understand from Dr. Oliver (koppor), this should be UNSURE.
My suggestion is to use Optional for the check methods and return UNSURE when all sources returned empty.
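The Optional-based suggestion could look roughly like this simplified sketch (the `Candidate` record and the 0.8 cutoff are illustrative assumptions; the real methods would return `Optional<BibEntry>` plus a score):

```java
import java.util.List;
import java.util.Optional;

public class OptionalLookupSketch {
    enum Validity { REAL, UNSURE, FAKE }

    record Candidate(String title, double score) { }

    // Each source returns Optional.empty() when it cannot help,
    // instead of eagerly reporting FAKE.
    static Validity check(List<Optional<Candidate>> sourceResults) {
        Optional<Candidate> best = sourceResults.stream()
                .flatMap(Optional::stream)
                .max((a, b) -> Double.compare(a.score(), b.score()));
        if (best.isEmpty()) {
            return Validity.UNSURE; // no source produced a candidate at all
        }
        return best.get().score() >= 0.8 ? Validity.REAL : Validity.FAKE;
    }

    public static void main(String[] args) {
        // All three sources empty -> UNSURE rather than FAKE.
        System.out.println(check(List.of(Optional.empty(), Optional.empty(), Optional.empty())));
        // One strong candidate -> REAL.
        System.out.println(check(List.of(Optional.empty(),
                Optional.of(new Candidate("Attention Is All You Need", 0.95)),
                Optional.empty())));
    }
}
```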
> I think that if (doi.isEmpty()) → FAKE is wrong; a missing DOI does not mean the paper is fake, it means this source can't help. The same applies to the CrossRef/arXiv "not found" cases.

Agreed.

> As I understand from Dr. Oliver (koppor), this should be UNSURE.

I'm not sure if this should be UNSURE: what if that entry only misses the DOI and all other fields are the same? Then maybe this should return REAL. I think what koppor meant applies to online URLs, not to literature?
After looking again into the comments, I think you are right @pluto-han, thanks.
Yeah, I think UNSURE is for non-verifiable types (@misc, @online), already handled by VERIFIABLE_TYPES; a verifiable @article with no results -> FAKE is correct.
The hallucinated-author case ("Mussgnug, Anna Maria" vs "Alexander Martin") was the motivation for adding weights, but has no locked-in unit test. Without it, threshold changes can silently regress this.
```java
/// @param one the local entry to check (drives which fields are scored)
/// @param two the authoritative entry fetched from an online source
/// @return weighted similarity score in [0.0, 1.0]
public static double compareEntries(BibEntry one, BibEntry two) {
```
compareEntries only iterates over local fields, so an entry with just {TITLE} matching an authoritative entry scores 1.0 -> REAL, regardless of mismatching author/year in the authoritative entry.
I think this means a hallucinated reference with just {title, year} faking a real DOI could land as REAL even if the authoritative entry has author/journal info that contradicts it (because those fields aren't compared), right?
Consider requiring a minimum number of comparable fields or penalizing missing core fields.
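A minimum-comparable-fields guard could be sketched as follows (a deliberately crude, self-contained illustration using plain maps and exact-match scoring in place of JabRef's `BibEntry` and string similarity; the constant value 3 is an assumption):

```java
import java.util.Map;

public class MinFieldsSketch {
    // Hypothetical guard: require a minimum number of overlapping fields
    // before trusting the similarity score.
    static final int MIN_COMPARABLE_FIELDS = 3;

    static double compareEntries(Map<String, String> local, Map<String, String> fetched) {
        int compared = 0;
        double sum = 0;
        for (var e : local.entrySet()) {
            String other = fetched.get(e.getKey());
            if (other == null) {
                continue;
            }
            compared++;
            sum += e.getValue().equalsIgnoreCase(other) ? 1.0 : 0.0; // crude stand-in similarity
        }
        if (compared < MIN_COMPARABLE_FIELDS) {
            return 0.0; // too little overlap to claim a confident match
        }
        return sum / compared;
    }

    public static void main(String[] args) {
        Map<String, String> local = Map.of("title", "Attention Is All You Need");
        Map<String, String> fetched = Map.of(
                "title", "Attention Is All You Need",
                "author", "Vaswani, Ashish and others",
                "year", "2017");
        // Only one comparable field, so the guard refuses to report a match.
        System.out.println(compareEntries(local, fetched));
    }
}
```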
```java
double similarity;
if (field.getProperties().contains(FieldProperty.PERSON_NAMES)) {
    List<Author> localAuthors = AuthorList.parse(firstValue).getAuthors().stream()
            .filter(a -> !a.getFamilyGiven(false).equalsIgnoreCase("others"))
```
getFamilyGiven(false).equalsIgnoreCase("others"): the Author.OTHERS sentinel may not stringify to "others" with this method; please verify and add a test.
Update: I traced through the source; Author.OTHERS.getFamilyGiven(false) returns "others" exactly, so the filter works.
A small unit test to lock this in would still be nice, though.
```java
double givenSimilarity = (localGiven.isEmpty() || authGiven.isEmpty())
        ? 1.0
        : stringSimilarity.similarity(localGiven, authGiven);
```
If local has "Smith" (no given name) and authoritative has "Smith, John", given similarity = 1.0. This is reasonable for abbreviation, but the three branches (local-empty vs authoritative-empty vs both-empty) have different semantic meanings; I think it's worth adding explicit tests for each branch so future tweaks don't shift behavior silently.
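The branch semantics worth pinning down could be made explicit in a small sketch like this (an illustrative simplification using exact comparison in place of the PR's string-similarity call; the branches are written out separately even though they currently return the same value, precisely so each can be tested on its own):

```java
public class GivenNameBranchSketch {
    // An empty given name on either side is treated as "no evidence of
    // mismatch" and scores 1.0; with both present, a real comparison runs.
    static double givenSimilarity(String localGiven, String authGiven) {
        if (localGiven.isEmpty() && authGiven.isEmpty()) {
            return 1.0; // neither side carries a given name
        }
        if (localGiven.isEmpty() || authGiven.isEmpty()) {
            return 1.0; // one side abbreviated, e.g. "Smith" vs "Smith, John"
        }
        return localGiven.equalsIgnoreCase(authGiven) ? 1.0 : 0.0; // crude stand-in similarity
    }

    public static void main(String[] args) {
        System.out.println(givenSimilarity("", ""));       // both empty
        System.out.println(givenSimilarity("", "John"));   // local abbreviated
        System.out.println(givenSimilarity("John", ""));   // authoritative abbreviated
        System.out.println(givenSimilarity("John", "Jane")); // both present, mismatch
    }
}
```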
Related issues and pull requests
Refs #13604
PR Description
Only logic for the RefCheck functionality.
Steps to test
Run
Checklist
- Change described in CHANGELOG.md in a way that can be understood by the average user (if change is visible to the user)