GitHub - sciknoworg/tib-sid: TIB-SID: A bilingual (English/German) dataset of library catalog records with GND subject indexing for research on automated subject tagging and extreme multi-label classification.

The TIB Subject Indexing Dataset (TIB-SID) is a bilingual benchmark for extreme multi-label text classification (XMTC) over real library records, designed for domain classification and GND-based subject indexing. The dataset combines a large, structured, authority-controlled label space with long-tail sparsity, cross-lingual variation, and real-world domain imbalance, making it substantially closer to operational library cataloging than standard text classification benchmarks.

✨ At a glance

136,569 library records in JSON-LD with predefined train / dev / test benchmark splits
Languages: English and German
28 domains
Record types: article, book, conference, report, thesis
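As a rough illustration of working with the predefined splits, the sketch below walks a local copy of the dataset and parses each JSON-LD file with Python's standard `json` module. The directory layout (`<root>/<split>/...`), the `.jsonld` file extension, and the helper names `iter_records` and `count_files_per_split` are assumptions made for this example, not something the repository documents here. Check the actual folder structure after downloading and adjust the paths accordingly.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Iterator


def iter_records(root: Path, split: str) -> Iterator[dict]:
    """Yield one parsed JSON-LD document per file found under <root>/<split>/.

    ASSUMPTION: records are stored one per file with a .jsonld extension,
    somewhere below a directory named after the split. Adjust if the real
    layout differs.
    """
    for path in sorted((root / split).rglob("*.jsonld")):
        with path.open(encoding="utf-8") as handle:
            yield json.load(handle)


def count_files_per_split(root: Path, splits=("train", "dev", "test")) -> Counter:
    """Count how many JSON-LD files each split contains."""
    counts = Counter()
    for split in splits:
        counts[split] = sum(1 for _ in iter_records(root, split))
    return counts
```

Calling `count_files_per_split(Path("path/to/local/copy"))` gives a quick sanity check that a download is complete before training on it. Because JSON-LD is plain JSON, no extra dependency is needed for reading; a dedicated JSON-LD processor only becomes necessary if you want to expand or compact the linked-data context.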

⬇️ Download

Download the dataset here: data

🔗 Related Links

TIB-SID was introduced through the LLMs4Subjects shared tasks organized in 2025. More than 12 LLM-based systems were developed and evaluated on the dataset by participating teams worldwide. The shared task websites provide additional context, task details, and leaderboard results.

📖 Citation

If TIB-SID is useful for your research or project, please consider citing it.

The main dataset paper is listed below. It has been accepted to LREC 2026, and the official proceedings citation will be added here as soon as it is available.

@inproceedings{dsouza-etal-2026-extreme,
  title = {An Extreme Multi-label Text Classification (XMTC) Library Dataset: What If We Took "Use of Practical AI in Digital Libraries" Seriously?},
  author = {D'Souza, Jennifer and Sadruddin, Sameer and Kaehler, Maximilian and Salfinger, Andrea and Zaccagna, Luca and Incitti, Francesca and Snidaro, Lauro and Suominen, Osma},
  booktitle = {Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)},
  month = {May},
  year = {2026},
  pages = {169--184},
  address = {Palma, Mallorca, Spain},
  publisher = {European Language Resources Association (ELRA)},
  editor = {Piperidis, Stelios and Bel, Núria and van den Heuvel, Henk and Ide, Nancy and Krek, Simon and Toral, Antonio},
  doi = {10.63317/5kag6gjg636f},
}

If you would also like to cite the shared task that introduced the broader benchmark setting, please use:

@InProceedings{dsouza-EtAl:2025:SemEval2025,
author    = {D'Souza, Jennifer and Sadruddin, Sameer and Israel, Holger and Begoin, Mathias and Slawig, Diana},
title     = {SemEval-2025 Task 5: LLMs4Subjects - LLM-based Automated Subject Tagging for a National Technical Library's Open-Access Catalog},
booktitle = {Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)},
month     = {August},
year      = {2025},
address   = {Vienna, Austria},
publisher = {Association for Computational Linguistics},
pages     = {1082--1095},
url       = {https://aclanthology.org/2025.semeval2025-1.139}
}

⭐ Acknowledgements

This work was supported by the NFDI4DataScience initiative (DFG, German Research Foundation, Grant ID: 460234259) and the TIB – Leibniz Information Centre for Science and Technology. We also gratefully acknowledge the subject specialists at TIB who contributed to the curated human evaluation of this work.

⚖️ License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
