sciknoworg/tib-sid

The TIB Subject Indexing Dataset (TIB-SID) is a bilingual benchmark for extreme multi-label text classification (XMTC) over real library records, designed for domain classification and GND-based subject indexing. The dataset combines a large, structured, authority-controlled label space with long-tail sparsity, cross-lingual variation, and real-world domain imbalance, making it substantially closer to operational library cataloging than standard text classification benchmarks.

✨ At a glance

  • 136,569 library records in JSON-LD with predefined train / dev / test benchmark splits
  • Languages: English and German
  • 28 domains
  • Record types: article, book, conference, report, thesis

⬇️ Download

Download the dataset here: data
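Once the data is downloaded, a quick sanity check is to count how many JSON-LD files fall under each benchmark split. The sketch below is only illustrative and rests on assumptions not stated in this README: that the download unpacks into a local folder (here the hypothetical path `data`), that each record is stored as one `.jsonld` file, and that a folder named `train`, `dev`, or `test` appears somewhere in each file's path. Adjust these to match the actual layout.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Any, Optional

# Split names come from the README; using them as folder names is an assumption.
SPLITS = ("train", "dev", "test")


def load_record(path: Path) -> Any:
    """Parse one JSON-LD file as plain JSON (no RDF expansion is performed)."""
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def split_of(path: Path, root: Path) -> Optional[str]:
    """Return the first path component below root that names a split, if any."""
    for part in path.relative_to(root).parts:
        if part in SPLITS:
            return part
    return None


def count_records_by_split(root: Path) -> Counter:
    """Count *.jsonld files under root, grouped by the split folder they sit in."""
    counts: Counter = Counter()
    for path in sorted(root.rglob("*.jsonld")):
        split = split_of(path, root)
        if split is not None:
            counts[split] += 1
    return counts


if __name__ == "__main__":
    root = Path("data")  # hypothetical local path to the downloaded dataset
    for split, n in sorted(count_records_by_split(root).items()):
        print(f"{split}: {n} files")
```

If the layout matches these assumptions, the per-split counts should add up to the total number of records listed above. `load_record` returns the parsed JSON as-is, so the field names inside each record still need to be read from the files themselves.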

🔗 Related Links

TIB-SID was introduced through the LLMs4Subjects shared tasks organized in 2025. More than 12 LLM-based systems were developed and evaluated on the dataset by participating teams worldwide. The shared task websites provide additional context, task details, and leaderboard results.

📖 Citation

If TIB-SID is useful for your research or project, please consider citing it.

The main dataset paper is listed below. It has been accepted to LREC 2026, and the official proceedings citation will be added here as soon as it is available.

@inproceedings{dsouza-etal-2026-extreme,
  title = {An Extreme Multi-label Text Classification (XMTC) Library Dataset: What If We Took "Use of Practical AI in Digital Libraries" Seriously?},
  author = {D'Souza, Jennifer and Sadruddin, Sameer and Kaehler, Maximilian and Salfinger, Andrea and Zaccagna, Luca and Incitti, Francesca and Snidaro, Lauro and Suominen, Osma},
  booktitle = {Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)},
  month = {May},
  year = {2026},
  pages = {169--184},
  address = {Palma, Mallorca, Spain},
  publisher = {European Language Resources Association (ELRA)},
  editor = {Piperidis, Stelios and Bel, Núria and van den Heuvel, Henk and Ide, Nancy and Krek, Simon and Toral, Antonio},
  doi = {10.63317/5kag6gjg636f},
}

If you would also like to cite the shared task that introduced the broader benchmark setting, please use:

@InProceedings{dsouza-EtAl:2025:SemEval2025,
author    = {D'Souza, Jennifer and Sadruddin, Sameer and Israel, Holger and Begoin, Mathias and Slawig, Diana},
title     = {SemEval-2025 Task 5: LLMs4Subjects - LLM-based Automated Subject Tagging for a National Technical Library's Open-Access Catalog},
booktitle = {Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)},
month     = {August},
year      = {2025},
address   = {Vienna, Austria},
publisher = {Association for Computational Linguistics},
pages     = {1082--1095},
url       = {https://aclanthology.org/2025.semeval2025-1.139}
}

⭐ Acknowledgements

This work was supported by the NFDI4DataScience initiative (DFG, German Research Foundation, Grant ID: 460234259) and the TIB – Leibniz Information Centre for Science and Technology. We also gratefully acknowledge the subject specialists at TIB who contributed to the curated human evaluation of this work.

⚖️ License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
