Description
medical_named_entity_recognition.find_diseases (medical-named-entity-recognition==0.4) apparently returns incorrect MeSH ID <-> heading mappings. For example, the text mention "breast cancer" is returned as name "Breast Neoplasms" with mesh_id D000072656, and "cancer" is returned as name "Neoplasms" with mesh_id D009362. These IDs do not match the official MeSH headings (see links below), so disease normalization is unreliable.
MeSH Browser evidence:
Environment
- Package: medical-named-entity-recognition==0.4
- OS: macOS (Apple Silicon)
- Python: 3.11.9
- Install method: pip
How to Reproduce
- Run the following code:
from medical_named_entity_recognition import find_diseases
import re
RE_TOKENISE = re.compile(r"((?:\w|'|’)+)")
text = "mouse models of human breast cancer"
tokens = RE_TOKENISE.findall(text.lower())
print(tokens)
for d, i, j in find_diseases(tokens):
    print(d, i, j)
- Observe the returned dicts for matching_string = “breast cancer” and “cancer”.
['mouse', 'models', 'of', 'human', 'breast', 'cancer']
{'mesh_id': 'D000072656', 'name': 'Breast Neoplasms', 'matching_string': 'breast cancer', ...}
{'mesh_id': 'D009362', 'name': 'Neoplasms', 'matching_string': 'cancer', ...}
Expected Behaviour
- The mention “breast cancer” / heading “Breast Neoplasms” should map to MeSH ID D001943 (not D000072656).
- The heading “Neoplasms” should map to MeSH ID D009369 (not D009362).
- More generally, returned mesh_id values should correspond to the MeSH Browser heading for that ID.
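The expectation above could be turned into a small regression check. The following is only a sketch: `EXPECTED_IDS` and `find_mismatches` are hypothetical names that are not part of the package, the expected IDs are the two stated in this report, and the `observed` list is copied from the output shown under How to Reproduce (with the elided keys left out) so the snippet runs without the package installed.

```python
# Heading -> MeSH ID pairs expected according to this report (not an exhaustive table).
EXPECTED_IDS = {
    "Breast Neoplasms": "D001943",
    "Neoplasms": "D009369",
}


def find_mismatches(results, expected=EXPECTED_IDS):
    """Return (name, returned_id, expected_id) for each result whose ID differs.

    `results` is a list of dicts with at least 'name' and 'mesh_id' keys.
    Headings that are not listed in `expected` are skipped.
    """
    mismatches = []
    for entry in results:
        name = entry["name"]
        want = expected.get(name)
        if want is not None and entry["mesh_id"] != want:
            mismatches.append((name, entry["mesh_id"], want))
    return mismatches


# The two dicts observed with version 0.4, other keys omitted.
observed = [
    {"mesh_id": "D000072656", "name": "Breast Neoplasms", "matching_string": "breast cancer"},
    {"mesh_id": "D009362", "name": "Neoplasms", "matching_string": "cancer"},
]

for name, got, want in find_mismatches(observed):
    print(f"{name}: got {got}, expected {want}")
```

With the real package, the dicts yielded by find_diseases(tokens) could be collected into a list and passed in place of `observed`; an empty result would indicate that the checked headings map to the expected IDs.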