diff --git a/docs/hub/_toctree.yml b/docs/hub/_toctree.yml index 42e34136e..303986747 100644 --- a/docs/hub/_toctree.yml +++ b/docs/hub/_toctree.yml @@ -283,6 +283,10 @@ title: Performing data transformations - local: datasets-polars-optimizations title: Performance optimizations + - local: datasets-pyarrow + title: PyArrow + - local: datasets-pyiceberg + title: PyIceberg - local: datasets-spark title: Spark - local: datasets-webdataset diff --git a/docs/hub/datasets-duckdb.md b/docs/hub/datasets-duckdb.md index 3030fc6e4..72508f106 100644 --- a/docs/hub/datasets-duckdb.md +++ b/docs/hub/datasets-duckdb.md @@ -1,11 +1,7 @@ # DuckDB [DuckDB](https://github.com/duckdb/duckdb) is an in-process SQL [OLAP](https://en.wikipedia.org/wiki/Online_analytical_processing) database management system. -You can use the Hugging Face paths (`hf://`) to access data on the Hub: - -
- -
+You can use the Hugging Face paths (`hf://`) to access data on the Hub, or an Iceberg Datasets Catalog.
 
 The [DuckDB CLI](https://duckdb.org/docs/api/cli/overview.html) (Command Line Interface) is a single, dependency-free executable. There are also other APIs available for running DuckDB, including Python, C++, Go, Java, Rust, and more. For additional details, visit their [clients](https://duckdb.org/docs/api/overview.html) page.
@@ -20,16 +16,27 @@ Starting from version `v0.10.3`, the DuckDB CLI includes native support for acce
 - Combine datasets and export them to different formats
 - Conduct vector similarity search on embedding datasets
 - Implement full-text search on datasets
+- Use an Iceberg Datasets Catalog
 
 For a complete list of DuckDB features, visit the DuckDB [documentation](https://duckdb.org/docs/).
 
-To start the CLI, execute the following command in the installation folder:
+## Authentication
+
+To access gated and private datasets, log in to Hugging Face with:
 
 ```bash
-./duckdb
+hf auth login
+```
+
+Then in DuckDB, create an `hf_token` secret with this command:
+
+```sql
+CREATE SECRET hf_token (TYPE HUGGINGFACE, PROVIDER credential_chain);
 ```
 
-## Forging the Hugging Face URL
+See more details on authentication in the [DuckDB authentication documentation for Hugging Face](./datasets-duckdb-auth).
+
+## Querying files on Hugging Face
 
 To access Hugging Face datasets, use the following URL format:
 
@@ -37,38 +44,38 @@
 hf://datasets/{my-username}/{my-dataset}/{path_to_file}
 ```
 
-- **my-username**, the user or organization of the dataset, e.g. `ibm`
-- **my-dataset**, the dataset name, e.g: `duorc`
+- **my-username**, the user or organization of the dataset, e.g. 
`stanfordnlp`
+- **my-dataset**, the dataset name, e.g. `imdb`
+- **path_to_file**, the Parquet file path, which supports glob patterns, e.g. `**/*.parquet`, to query all Parquet files

-> [!TIP]
-> You can query auto-converted Parquet files using the @~parquet branch, which corresponds to the `refs/convert/parquet` revision. For more details, refer to the documentation at https://huggingface.co/docs/datasets-server/en/parquet#conversion-to-parquet.
->
-> To reference the `refs/convert/parquet` revision of a dataset, use the following syntax:
->
-> ```plaintext
-> hf://datasets/{my-username}/{my-dataset}@~parquet/{path_to_file}
-> ```
->
-> Here is a sample URL following the above syntax:
->
-> ```plaintext
-> hf://datasets/ibm/duorc@~parquet/ParaphraseRC/test/0000.parquet
-> ```
-
-Let's start with a quick demo to query all the rows of a dataset:
+For example, to query the train split of the [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb) dataset:
 
 ```sql
-FROM 'hf://datasets/ibm/duorc/ParaphraseRC/*.parquet' LIMIT 3;
+SELECT * FROM 'hf://datasets/stanfordnlp/imdb/**/train-*.parquet' LIMIT 10;
 ```
 
-Or using traditional SQL syntax:
+Which returns:
 
-```sql
-SELECT * FROM 'hf://datasets/ibm/duorc/ParaphraseRC/*.parquet' LIMIT 3;
 ```
 
-In the following sections, we will cover more complex operations you can perform with DuckDB on Hugging Face datasets.
+┌──────────────────────────────────────────────────────────────────────┬───────┐
+│ text │ label │
+│ varchar │ int64 │
+├──────────────────────────────────────────────────────────────────────┼───────┤
+│ I rented I AM CURIOUS-YELLOW from my video store because of all th… │ 0 │
+│ "I Am Curious: Yellow" is a risible and pretentious steaming pile.… │ 0 │
+│ If only to avoid making this type of film in the future. 
This film… │ 0 │
+│ This film was probably inspired by Godard's Masculin, féminin and … │ 0 │
+│ Oh, brother...after hearing about this ridiculous film for umpteen… │ 0 │
+│ I would put this at the top of my list of films in the category of… │ 0 │
+│ Whoever wrote the screenplay for this movie obviously never consul… │ 0 │
+│ When I first saw a glimpse of this movie, I quickly noticed the ac… │ 0 │
+│ Who are these "They"- the actors? the filmmakers? Certainly couldn… │ 0 │
+│ This is said to be a personal film for Peter Bogdonavitch. He base… │ 0 │
+├──────────────────────────────────────────────────────────────────────┴───────┤
+│ 10 rows 2 columns │
+└──────────────────────────────────────────────────────────────────────────────┘
+```

> [!TIP]
> **Querying Storage Buckets**: When using the DuckDB Python client, you can query data stored in [Storage Buckets](./storage-buckets) by registering the Hugging Face filesystem:
@@ -79,3 +86,136 @@ In the following sections, we will cover more complex operations you can perform
> duckdb.sql("SELECT * FROM 'hf://buckets/username/my-bucket/data.parquet' LIMIT 10")
> ```
 Native `hf://buckets/` support in DuckDB is expected in a future release.
+
+## Query an Iceberg Datasets Catalog
+
+Use the PyIceberg library `faceberg` to deploy an Iceberg catalog (see the next section) that you can use to query datasets on Hugging Face with a simpler syntax.
+
+In particular, you can query datasets as `faceberg.namespace.dataset_name` instead of having to pass a file pattern, and a `split` column is automatically added to differentiate between train/test/validation splits. 
+ +For example, here is the syntax to query the [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb) dataset: + +```sql +SELECT * FROM faceberg.stanfordnlp.imdb LIMIT 10; +``` + +``` +┌─────────┬────────────────────────────────────────────────────────────┬───────┐ +│ split │ text │ label │ +│ varchar │ varchar │ int64 │ +├─────────┼────────────────────────────────────────────────────────────┼───────┤ +│ train │ I rented I AM CURIOUS-YELLOW from my video store because… │ 0 │ +│ train │ "I Am Curious: Yellow" is a risible and pretentious stea… │ 0 │ +│ train │ If only to avoid making this type of film in the future.… │ 0 │ +│ train │ This film was probably inspired by Godard's Masculin, fé… │ 0 │ +│ train │ Oh, brother...after hearing about this ridiculous film f… │ 0 │ +│ train │ I would put this at the top of my list of films in the c… │ 0 │ +│ train │ Whoever wrote the screenplay for this movie obviously ne… │ 0 │ +│ train │ When I first saw a glimpse of this movie, I quickly noti… │ 0 │ +│ train │ Who are these "They"- the actors? the filmmakers? Certai… │ 0 │ +│ train │ This is said to be a personal film for Peter Bogdonavitc… │ 0 │ +├─────────┴────────────────────────────────────────────────────────────┴───────┤ +│ 10 rows 3 columns │ +└──────────────────────────────────────────────────────────────────────────────┘ +``` + +And you can simply filter by split like this: + + +```sql +SELECT * FROM faceberg.stanfordnlp.imdb WHERE split = 'test' LIMIT 10; +``` + +``` +┌─────────┬────────────────────────────────────────────────────────────┬───────┐ +│ split │ text │ label │ +│ varchar │ varchar │ int64 │ +├─────────┼────────────────────────────────────────────────────────────┼───────┤ +│ test │ I love sci-fi and am willing to put up with a lot. 
Sci-f… │ 0 │
+│ test │ Worth the entertainment value of a rental, especially if… │ 0 │
+│ test │ its a totally average film with a few semi-alright actio… │ 0 │
+│ test │ STAR RATING: ***** Saturday Night **** Friday Night *** … │ 0 │
+│ test │ First off let me say, If you haven't enjoyed a Van Damme… │ 0 │
+│ test │ I had high hopes for this one until they changed the nam… │ 0 │
+│ test │ Isaac Florentine has made some of the best western Marti… │ 0 │
+│ test │ It actually pains me to say it, but this movie was horri… │ 0 │
+│ test │ Technically I'am a Van Damme Fan, or I was. this movie i… │ 0 │
+│ test │ Honestly awful film, bad editing, awful lighting, dire d… │ 0 │
+├─────────┴────────────────────────────────────────────────────────────┴───────┤
+│ 10 rows 3 columns │
+└──────────────────────────────────────────────────────────────────────────────┘
+```
+
+### Deploy a catalog on the Hugging Face Hub
+
+To deploy an Iceberg Datasets Catalog, install the library with `pip install faceberg`, then run this command with your own Hugging Face username in place of "user":
+
+```bash
+faceberg user/mycatalog init
+```
+
+### Add datasets
+
+Once your catalog is ready, add datasets using the following commands:
+
+```bash
+faceberg user/mycatalog add stanfordnlp/imdb
+faceberg user/mycatalog add openai/gsm8k --config main
+```
+
+### Query with the interactive DuckDB shell
+
+`faceberg` comes with a built-in DuckDB shell you can run like this:
+
+```bash
+faceberg user/mycatalog quack
+```
+
+```sql
+SELECT label, substr(text, 1, 100) as preview
+FROM faceberg.stanfordnlp.imdb
+LIMIT 10;
+```
+
+Alternatively, the DuckDB shell is also available in the catalog's web interface:
+ +
+ +### More information + +Find more information on `faceberg` and the PyIceberg integration with the Hugging Face Hub in the [documentation](./datasets-pyiceberg). + + +## Auto-converted Parquet files + +You can query auto-converted Parquet files using the @~parquet branch, which corresponds to the `refs/convert/parquet` revision. For more details, refer to the documentation at https://huggingface.co/docs/datasets-server/en/parquet#conversion-to-parquet: + + +
+ +
+To reference the `refs/convert/parquet` revision of a dataset, use the following syntax:
+
+```plaintext
+hf://datasets/{my-username}/{my-dataset}@~parquet/{path_to_file}
+```
+
+Here is a sample URL following the above syntax for the [fka/prompts.chat](https://huggingface.co/datasets/fka/prompts.chat) dataset, pointing to a file in its [Parquet branch](https://huggingface.co/datasets/fka/prompts.chat/tree/refs%2Fconvert%2Fparquet):
+
+```plaintext
+hf://datasets/fka/prompts.chat@~parquet/default/train/0000.parquet
+```
+
+In the following sections, we will cover more complex operations you can perform with DuckDB on Hugging Face datasets.
+
+## Use-cases and examples
+
+Find more use-cases and examples with Hugging Face Datasets here:
+
+* [Query datasets](./datasets-duckdb-select)
+* [Perform SQL operations](./datasets-duckdb-sql)
+* [Combine datasets and export](./datasets-duckdb-combine-and-export)
+* [Perform vector similarity search](./datasets-duckdb-vector-similarity-search)
diff --git a/docs/hub/datasets-libraries.md b/docs/hub/datasets-libraries.md
index 9ecdc728f..13786047e 100644
--- a/docs/hub/datasets-libraries.md
+++ b/docs/hub/datasets-libraries.md
@@ -24,6 +24,7 @@ The table below summarizes the supported libraries and their level of integratio
 | [Pandas](./datasets-pandas) | Python data analysis toolkit. | ✅ | ❌ | ✅ | ❌ | ✅* |
 | [Polars](./datasets-polars) | A DataFrame library on top of an OLAP query engine. | ✅ | ✅ | ✅ | ❌ | ❌ |
 | [PyArrow](./datasets-pyarrow) | Apache Arrow is a columnar format and a toolbox for fast data interchange and in-memory analytics. | ✅ | ✅ | ✅ | ❌ | ✅* |
+| [PyIceberg](./datasets-pyiceberg) | Apache Iceberg is a high-performance open-source format for large analytic tables. | ✅ | ✅ | ❌ | ❌ | ❌ |
 | [Spark](./datasets-spark) | Real-time, large-scale data processing tool in a distributed environment. | ✅ | ✅ | ✅ | ✅ | ✅ |
 | [WebDataset](./datasets-webdataset) | Library to write I/O pipelines for large datasets. 
| ✅ | ✅ | ❌ | ❌ | ❌ |
diff --git a/docs/hub/datasets-pyiceberg.md b/docs/hub/datasets-pyiceberg.md
new file mode 100644
index 000000000..c66a5baf5
--- /dev/null
+++ b/docs/hub/datasets-pyiceberg.md
@@ -0,0 +1,201 @@
+# PyIceberg
+
+PyIceberg is a Python implementation for accessing Iceberg tables.
+
+You can use the PyIceberg library [faceberg](https://github.com/kszucs/faceberg) to deploy an Iceberg Datasets catalog and add datasets from the Hugging Face Hub.
+
+Once your catalog is ready, use your favorite Iceberg client to query datasets in your catalog.
+For example: run SQL queries to explore datasets, do analytics, mix datasets together, or run large processing jobs.
+
+## Setup
+
+### Installation
+
+To be able to add Hugging Face Datasets to a catalog, you need to install the `faceberg` library:
+
+```
+pip install faceberg
+```
+
+This will also install required dependencies like `huggingface_hub` for authentication, and `datasets` for metadata discovery.
+
+### Authentication
+
+You need to authenticate to Hugging Face to read private/gated dataset repositories or to write to your dataset repositories.
+
+For example, you can use the CLI:
+
+```
+hf auth login
+```
+
+It's also possible to provide your Hugging Face token with the `HF_TOKEN` environment variable or by passing the `token` option to the reader.
+For more details about authentication, check out [this guide](https://huggingface.co/docs/huggingface_hub/quick-start#authentication).
+
+## Deploy a Datasets Catalog
+
+Use `faceberg <username>/<catalog-name> init` to deploy an Iceberg Datasets Catalog under your account on Hugging Face Spaces (free!):
+
+```bash
+faceberg username/my-catalog init
+```
+
+This command will show you the created catalog information and some helpful commands:
+
+```
+🤗🧊 Catalog: hf://spaces/username/my-catalog
+
+Initializing remote catalog: hf://spaces/username/my-catalog
+✓ Catalog initialized successfully!
+
+Space URL: https://username-my-catalog.hf.space
+Repository: https://huggingface.co/spaces/username/my-catalog
+
+Next steps:
+  • Run faceberg add to add tables
+  • Run faceberg sync to sync tables from datasets
+  • Use faceberg scan to view sample data
+  • Run faceberg serve to start the REST catalog server
+  • Run faceberg quack to open DuckDB with the catalog
+```
+
+In particular, note the Space URL that ends with `.hf.space`.
+It is your catalog URI for Iceberg clients, and also the web interface you can open in your browser:
+ +
+Alternatively, deploy a catalog locally using `"/path/to/catalog"` or `"file:///path/to/catalog"` instead of a Space repository name.
+
+## Add a dataset
+
+The `faceberg` command line makes it easy to add datasets from Hugging Face to an Iceberg catalog.
+
+This is compatible with all the datasets in a [supported format](https://huggingface.co/docs/hub/datasets-adding#file-formats) on Hugging Face.
+Under the hood, the catalog points to the dataset's Parquet files.
+
+For example, here is how to add the [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb) dataset with the `faceberg add` command:
+
+```bash
+faceberg username/my-catalog add stanfordnlp/imdb
+```
+
+which shows:
+
+```
+🤗🧊 Catalog: hf://spaces/username/my-catalog
+
+Adding dataset: stanfordnlp/imdb
+Table identifier: stanfordnlp.imdb
+
+stanfordnlp.imdb ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • Complete
+
+✓ Added stanfordnlp.imdb to catalog
+  Dataset: stanfordnlp/imdb
+  Location: hf://spaces/username/my-catalog/stanfordnlp/imdb/metadata/v1.metadata.json
+
+Table schema:
+imdb(
+  1: split: optional string,
+  2: text: optional string,
+  3: label: optional long
+),
+partition by: [split],
+sort order: [],
+snapshot: Operation.APPEND: id=1, schema_id=0
+```
+ +
> [!TIP]
> On Hugging Face, datasets that are not in Parquet format are automatically converted to Parquet in a separate git branch `refs/convert/parquet`.
> Therefore it is possible to add a dataset to an Iceberg catalog even if it is not originally in Parquet.
+
+Here is another example with the [BAAI/Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct) dataset.
+It is a gated repository, so users have to accept the terms of use before accessing it.
+It also has multiple subsets, namely "3M" and "7M", so we need to specify which one to load:
+
+```bash
+faceberg username/my-catalog add BAAI/Infinity-Instruct --config 7M
+```
+ +
+## Load a dataset table
+
+### Using `faceberg`
+
+Use `faceberg` to get the PyIceberg catalog in Python, and `.load_table()` to load the dataset table (more precisely, the config or subset named "7M" containing 7M samples) and preview a few samples.
+
+After logging in to access the gated dataset, you can run:
+
+```python
+>>> import faceberg
+>>> catalog = faceberg.catalog("username/my-catalog")
+>>> table = catalog.load_table("BAAI.Infinity-Instruct")
+>>> table.scan(limit=5).to_pandas()
+   split id                                      conversations                                              label langdetect          source     reward
+0  train  0  [{'from': 'human', 'value': 'def extinguish_fi...  {'ability_en': ['programming ability'], 'abili...         en  code_exercises   3.718750
+1  train  1  [{'from': 'human', 'value': 'See the multi-cho...  {'ability_en': ['logical reasoning'], 'ability...         en            flan  -3.359375
+2  train  2  [{'from': 'human', 'value': 'This is some data...  {'ability_en': ['geographic knowledge', 'text ...         en            flan  -1.171875
+3  train  3  [{'from': 'human', 'value': 'If you don't want...  {'ability_en': ['logical reasoning'], 'ability...         en            flan -12.187500
+4  train  4  [{'from': 'human', 'value': 'In a United State...  {'ability_en': ['text understanding', 'informa...         en            flan  12.687500
+```
+
+### Using `pyiceberg`
+
+Here is how to instantiate the catalog using native PyIceberg.
+The catalog is a REST catalog, so we use `pyiceberg.catalog.rest.RestCatalog`.
+
+The URI of the catalog is the Space HTTP URL that ends with `.hf.space`, and the `warehouse` property should point to the Space repository on Hugging Face, which contains the metadata files of the Iceberg tables:
+
+```python
+>>> from pyiceberg.catalog.rest import RestCatalog
+>>> properties = {
+...     "uri": "https://username-my-catalog.hf.space",
+...     "warehouse": "hf://spaces/username/my-catalog",
+... }
+>>> catalog = RestCatalog("username/my-catalog", **properties)
+>>> table = catalog.load_table("BAAI.Infinity-Instruct")
+```
+
+## Run SQL queries
+
+Once you have your PyIceberg table ready, you can run SQL queries from the catalog Space:
+ +
+## More information on `faceberg`
+
+Find more information about the `faceberg` library in its official documentation:
+
+* [Getting Started](https://faceberg.kszucs.dev/), and in particular:
+  - [Query data with the CLI](https://faceberg.kszucs.dev/#query-data)
+  - [Query data with DuckDB](https://faceberg.kszucs.dev/#interactive-queries-with-duckdb)
+  - [Query data with PyIceberg](https://faceberg.kszucs.dev/#query-using-pyiceberg)
+  - [The Catalog API](https://faceberg.kszucs.dev/#catalog-api)
+  - [How it works](https://faceberg.kszucs.dev/#how-it-works)
+* [Local Catalogs](https://faceberg.kszucs.dev/local.html)
+* [Architecture](https://faceberg.kszucs.dev/design.html)
+* [DuckDB integration](https://faceberg.kszucs.dev/duckdb.html)
+* [Pandas integration](https://faceberg.kszucs.dev/pandas.html)
+
+## Use other Iceberg clients
+
+Access datasets in your Iceberg Datasets Catalog via other clients:
+
+* **DuckDB** to run SQL, see the [DuckDB integration with faceberg](https://faceberg.kszucs.dev/duckdb.html)
+* **Pandas** for easy dataframe processing, see the [Pandas integration with faceberg](https://faceberg.kszucs.dev/pandas.html)
+
+More generally, any client that supports REST catalogs and `hf://` URIs can access your Iceberg Datasets Catalog.
+In addition to native support in DuckDB, `hf://` URIs are also supported in any `fsspec`-based client in Python and in any `object_store_opendal`-based client in Rust.
+
+Find more examples and advanced usage in the main [faceberg documentation](https://faceberg.kszucs.dev/).
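As an illustration of the `fsspec` route in Python, `hf://` paths are served by `huggingface_hub`'s `HfFileSystem`. This is a sketch: the glob pattern is illustrative, and actually listing files requires network access to the Hub.

```python
from huggingface_hub import HfFileSystem

# HfFileSystem implements the fsspec interface for hf:// paths,
# so fsspec-aware clients can read Hub files like any other filesystem.
fs = HfFileSystem()
protocol = fs.protocol
print(protocol)

# Illustrative only; listing a dataset's Parquet files needs network access:
# fs.glob("datasets/stanfordnlp/imdb/**/*.parquet")
```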