Investigating Data Dependency Refactorings and Technical Debt in Machine Learning (ML) Systems

NYU GSTEM 2025 project at CUNY Hunter College.

Instructions

Run `python mining/1getHFdatasets.py` to extract datasets from Hugging Face. Filters such as modality and file format can be adjusted by editing, adding, or removing lines such as those below:
```
is_tabular = 'modality:tabular' in tags
is_csv = 'format:csv' in tags
# Saves the datasets in "filtered_datasets.json"
```
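The tag checks above can be generalized into a small filter. The sketch below is illustrative only and is not the code in `mining/1getHFdatasets.py`: the helper names `matches_filters` and `filter_datasets`, the record layout (a dict with `id` and `tags`), and the sample records are all assumptions.

```python
def matches_filters(tags, required=("modality:tabular", "format:csv")):
    """Return True when every required tag is present in the dataset's tags."""
    tag_set = set(tags or [])
    return all(tag in tag_set for tag in required)


def filter_datasets(datasets, required=("modality:tabular", "format:csv")):
    """Keep only the dataset records whose 'tags' list satisfies all filters."""
    return [d for d in datasets if matches_filters(d.get("tags"), required)]


# Hypothetical sample records, for illustration only
sample = [
    {"id": "example/a", "tags": ["modality:tabular", "format:csv"]},
    {"id": "example/b", "tags": ["modality:text", "format:json"]},
    {"id": "example/c", "tags": ["modality:tabular", "format:parquet"]},
]
kept = filter_datasets(sample)
print([d["id"] for d in kept])
```

Adding or removing a filter then amounts to changing the `required` tuple rather than editing individual boolean lines.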
Run `python mining/2getHFcommits.py` to extract further metadata, including commit logs, for every dataset listed in `filtered_datasets.json`. The script collects datasetId, tags, downloads, likes, lastModified, created_at, and commits, and saves them to `FilteredHFDatasets.csv`.
Run `python mining/3HFcommitFormatting.py FilteredHFDatasets.csv outputFilename.csv` to split the previously extracted commits into one row per commit. The output includes DatasetID, CommitId, Authors, Date, Log message, and message.
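To make the "one row per commit" step concrete, here is a minimal sketch of that kind of reshaping. It is not the repository's script: it assumes the `commits` column holds a JSON-encoded list of objects, and the keys `commit_id`, `authors`, `date`, and `message`, the helper name `explode_commits`, and the sample data are all hypothetical.

```python
import csv
import io
import json


def explode_commits(rows):
    """Yield one flat dict per commit from rows that carry a JSON 'commits' list."""
    for row in rows:
        commits = json.loads(row.get("commits") or "[]")
        for c in commits:
            yield {
                "DatasetID": row["datasetId"],
                "CommitId": c.get("commit_id", ""),
                "Authors": "; ".join(c.get("authors", [])),
                "Date": c.get("date", ""),
                "Message": c.get("message", ""),
            }


# Hypothetical in-memory input standing in for FilteredHFDatasets.csv
source = io.StringIO()
writer = csv.DictWriter(source, fieldnames=["datasetId", "commits"])
writer.writeheader()
writer.writerow({
    "datasetId": "example/a",
    "commits": json.dumps([
        {"commit_id": "abc123", "authors": ["alice"],
         "date": "2025-07-01", "message": "Add data"},
        {"commit_id": "def456", "authors": ["alice", "bob"],
         "date": "2025-07-02", "message": "Fix header"},
    ]),
})
source.seek(0)

flat_rows = list(explode_commits(csv.DictReader(source)))
for r in flat_rows:
    print(r)
```

A dataset with no recorded commits simply contributes no rows under this approach; whether the actual script behaves the same way is not stated here.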

Provenance, License, and Citation

This tool was developed by Ayla Zhang, a high-school student (Thomas Jefferson High School for Science and Technology) participating in NYU GSTEM (Summer 2025), under the mentorship of Raffi Khatchadourian (CUNY Hunter College), as a preliminary study of data-dependency refactorings and technical debt in machine learning systems.

The Hugging Face mining in this repository is original to this work.
The GitHub-side commit analysis reuses the dataset of Tang et al., "An Empirical Study of Refactorings and Technical Debt in Machine Learning Systems," ICSE 2021.
This material is based upon work supported by the National Science Foundation under Grant No. CCF-2343750. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Licensed under the MIT License (see LICENSE).
Please cite using CITATION.cff.

This is a preliminary research prototype; the mining methodology (keyword filtering plus manual inspection) is exploratory and not exhaustive.
