diff --git a/docs/hub/_toctree.yml b/docs/hub/_toctree.yml
index 42e34136e..303986747 100644
--- a/docs/hub/_toctree.yml
+++ b/docs/hub/_toctree.yml
@@ -283,6 +283,10 @@
title: Performing data transformations
- local: datasets-polars-optimizations
title: Performance optimizations
+ - local: datasets-pyarrow
+ title: PyArrow
+ - local: datasets-pyiceberg
+ title: PyIceberg
- local: datasets-spark
title: Spark
- local: datasets-webdataset
diff --git a/docs/hub/datasets-duckdb.md b/docs/hub/datasets-duckdb.md
index 3030fc6e4..72508f106 100644
--- a/docs/hub/datasets-duckdb.md
+++ b/docs/hub/datasets-duckdb.md
@@ -1,11 +1,7 @@
# DuckDB
[DuckDB](https://github.com/duckdb/duckdb) is an in-process SQL [OLAP](https://en.wikipedia.org/wiki/Online_analytical_processing) database management system.
-You can use the Hugging Face paths (`hf://`) to access data on the Hub:
-
-
-

-
+You can use Hugging Face paths (`hf://`) to access data on the Hub, or query datasets through an Iceberg Datasets Catalog.
The [DuckDB CLI](https://duckdb.org/docs/api/cli/overview.html) (Command Line Interface) is a single, dependency-free executable.
There are also other APIs available for running DuckDB, including Python, C++, Go, Java, Rust, and more. For additional details, visit their [clients](https://duckdb.org/docs/api/overview.html) page.
@@ -20,16 +16,27 @@ Starting from version `v0.10.3`, the DuckDB CLI includes native support for acce
- Combine datasets and export it to different formats
- Conduct vector similarity search on embedding datasets
- Implement full-text search on datasets
+- Use an Iceberg Datasets Catalog
For a complete list of DuckDB features, visit the DuckDB [documentation](https://duckdb.org/docs/).
-To start the CLI, execute the following command in the installation folder:
+## Authentication
+
+To access gated and private datasets, log in to Hugging Face with:
```bash
-./duckdb
+hf auth login
+```
+
+Then, in DuckDB, load your Hugging Face token with this command:
+
+```sql
+CREATE SECRET hf_token (TYPE HUGGINGFACE, PROVIDER credential_chain);
```
-## Forging the Hugging Face URL
+See more details on authentication in the [DuckDB authentication documentation for Hugging Face](./datasets-duckdb-auth).
+
+## Querying files on Hugging Face
To access Hugging Face datasets, use the following URL format:
@@ -37,38 +44,38 @@ To access Hugging Face datasets, use the following URL format:
hf://datasets/{my-username}/{my-dataset}/{path_to_file}
```
-- **my-username**, the user or organization of the dataset, e.g. `ibm`
-- **my-dataset**, the dataset name, e.g: `duorc`
+- **my-username**, the user or organization of the dataset, e.g. `stanfordnlp`
+- **my-dataset**, the dataset name, e.g. `imdb`
- **path_to_file**, the file path, which supports glob patterns, e.g. `**/*.parquet`, to query all parquet files
-> [!TIP]
-> You can query auto-converted Parquet files using the @~parquet branch, which corresponds to the `refs/convert/parquet` revision. For more details, refer to the documentation at https://huggingface.co/docs/datasets-server/en/parquet#conversion-to-parquet.
->
-> To reference the `refs/convert/parquet` revision of a dataset, use the following syntax:
->
-> ```plaintext
-> hf://datasets/{my-username}/{my-dataset}@~parquet/{path_to_file}
-> ```
->
-> Here is a sample URL following the above syntax:
->
-> ```plaintext
-> hf://datasets/ibm/duorc@~parquet/ParaphraseRC/test/0000.parquet
-> ```
-
-Let's start with a quick demo to query all the rows of a dataset:
+For example, to query the train split of the [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb) dataset:
```sql
-FROM 'hf://datasets/ibm/duorc/ParaphraseRC/*.parquet' LIMIT 3;
+SELECT * FROM 'hf://datasets/stanfordnlp/imdb/**/train-*.parquet' LIMIT 10;
```
-Or using traditional SQL syntax:
+Which returns:
-```sql
-SELECT * FROM 'hf://datasets/ibm/duorc/ParaphraseRC/*.parquet' LIMIT 3;
```
-In the following sections, we will cover more complex operations you can perform with DuckDB on Hugging Face datasets.
+┌──────────────────────────────────────────────────────────────────────┬───────┐
+│ text │ label │
+│ varchar │ int64 │
+├──────────────────────────────────────────────────────────────────────┼───────┤
+│ I rented I AM CURIOUS-YELLOW from my video store because of all th… │ 0 │
+│ "I Am Curious: Yellow" is a risible and pretentious steaming pile.… │ 0 │
+│ If only to avoid making this type of film in the future. This film… │ 0 │
+│ This film was probably inspired by Godard's Masculin, féminin and … │ 0 │
+│ Oh, brother...after hearing about this ridiculous film for umpteen… │ 0 │
+│ I would put this at the top of my list of films in the category of… │ 0 │
+│ Whoever wrote the screenplay for this movie obviously never consul… │ 0 │
+│ When I first saw a glimpse of this movie, I quickly noticed the ac… │ 0 │
+│ Who are these "They"- the actors? the filmmakers? Certainly couldn… │ 0 │
+│ This is said to be a personal film for Peter Bogdonavitch. He base… │ 0 │
+├──────────────────────────────────────────────────────────────────────┴───────┤
+│ 10 rows 2 columns │
+└──────────────────────────────────────────────────────────────────────────────┘
+```
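The `hf://` URL pattern above can be sketched as a small Python helper (the function name `hf_dataset_path` and its optional `revision` argument are illustrative, not part of DuckDB's API):

```python
from typing import Optional

def hf_dataset_path(username: str, dataset: str, path_to_file: str,
                    revision: Optional[str] = None) -> str:
    """Build an hf:// URL for use in DuckDB queries.

    `revision` can be a branch or revision alias, e.g. "~parquet"
    for the auto-converted Parquet files.
    """
    repo = f"{username}/{dataset}"
    if revision is not None:
        repo += f"@{revision}"
    return f"hf://datasets/{repo}/{path_to_file}"

print(hf_dataset_path("stanfordnlp", "imdb", "**/train-*.parquet"))
# hf://datasets/stanfordnlp/imdb/**/train-*.parquet
```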
> [!TIP]
> **Querying Storage Buckets**: When using the DuckDB Python client, you can query data stored in [Storage Buckets](./storage-buckets) by registering the Hugging Face filesystem:
@@ -79,3 +86,136 @@ In the following sections, we will cover more complex operations you can perform
> duckdb.sql("SELECT * FROM 'hf://buckets/username/my-bucket/data.parquet' LIMIT 10")
> ```
Native `hf://buckets/` support in DuckDB is expected in a future release.
+
+## Query an Iceberg Datasets Catalog
+
+Use the PyIceberg library `faceberg` to deploy an Iceberg catalog (see the sections below) that lets you query datasets on Hugging Face with a simple syntax.
+
+In particular, you can query datasets as `faceberg.namespace.dataset_name` instead of passing a file pattern, and a `split` column is automatically added to differentiate between train/test/validation splits.
+
+For example, here is the syntax to query the [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb) dataset:
+
+```sql
+SELECT * FROM faceberg.stanfordnlp.imdb LIMIT 10;
+```
+
+```
+┌─────────┬────────────────────────────────────────────────────────────┬───────┐
+│ split │ text │ label │
+│ varchar │ varchar │ int64 │
+├─────────┼────────────────────────────────────────────────────────────┼───────┤
+│ train │ I rented I AM CURIOUS-YELLOW from my video store because… │ 0 │
+│ train │ "I Am Curious: Yellow" is a risible and pretentious stea… │ 0 │
+│ train │ If only to avoid making this type of film in the future.… │ 0 │
+│ train │ This film was probably inspired by Godard's Masculin, fé… │ 0 │
+│ train │ Oh, brother...after hearing about this ridiculous film f… │ 0 │
+│ train │ I would put this at the top of my list of films in the c… │ 0 │
+│ train │ Whoever wrote the screenplay for this movie obviously ne… │ 0 │
+│ train │ When I first saw a glimpse of this movie, I quickly noti… │ 0 │
+│ train │ Who are these "They"- the actors? the filmmakers? Certai… │ 0 │
+│ train │ This is said to be a personal film for Peter Bogdonavitc… │ 0 │
+├─────────┴────────────────────────────────────────────────────────────┴───────┤
+│ 10 rows 3 columns │
+└──────────────────────────────────────────────────────────────────────────────┘
+```
+
+And you can simply filter by split like this:
+
+```sql
+SELECT * FROM faceberg.stanfordnlp.imdb WHERE split = 'test' LIMIT 10;
+```
+
+```
+┌─────────┬────────────────────────────────────────────────────────────┬───────┐
+│ split │ text │ label │
+│ varchar │ varchar │ int64 │
+├─────────┼────────────────────────────────────────────────────────────┼───────┤
+│ test │ I love sci-fi and am willing to put up with a lot. Sci-f… │ 0 │
+│ test │ Worth the entertainment value of a rental, especially if… │ 0 │
+│ test │ its a totally average film with a few semi-alright actio… │ 0 │
+│ test │ STAR RATING: ***** Saturday Night **** Friday Night *** … │ 0 │
+│ test │ First off let me say, If you haven't enjoyed a Van Damme… │ 0 │
+│ test │ I had high hopes for this one until they changed the nam… │ 0 │
+│ test │ Isaac Florentine has made some of the best western Marti… │ 0 │
+│ test │ It actually pains me to say it, but this movie was horri… │ 0 │
+│ test │ Technically I'am a Van Damme Fan, or I was. this movie i… │ 0 │
+│ test │ Honestly awful film, bad editing, awful lighting, dire d… │ 0 │
+├─────────┴────────────────────────────────────────────────────────────┴───────┤
+│ 10 rows 3 columns │
+└──────────────────────────────────────────────────────────────────────────────┘
+```
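The naming scheme above maps a Hub dataset repo id to an Iceberg table identifier; a minimal sketch (the helper name is hypothetical):

```python
def to_table_identifier(repo_id: str) -> str:
    """Map a Hub dataset repo id like 'stanfordnlp/imdb' to the
    'namespace.table' identifier used in the catalog."""
    namespace, name = repo_id.split("/", 1)
    return f"{namespace}.{name}"

print(to_table_identifier("stanfordnlp/imdb"))
# stanfordnlp.imdb
```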
+
+### Deploy a catalog on the Hugging Face Hub
+
+To deploy an Iceberg Datasets Catalog, install the CLI with `pip install faceberg`, then run this command with your own Hugging Face username in place of "user":
+
+```bash
+faceberg user/mycatalog init
+```
+
+### Add datasets
+
+Once your catalog is ready, add datasets using the following command:
+
+```bash
+faceberg user/mycatalog add stanfordnlp/imdb
+faceberg user/mycatalog add openai/gsm8k --config main
+```
+
+### Query with interactive DuckDB shell
+
+`faceberg` comes with a built-in DuckDB shell you can run like this:
+
+```bash
+faceberg user/mycatalog quack
+```
+
+```sql
+SELECT label, substr(text, 1, 100) as preview
+FROM faceberg.stanfordnlp.imdb
+LIMIT 10;
+```
+
+Alternatively, a DuckDB shell is also available in the catalog's web interface:
+
+
+

+
+
+### More information
+
+Find more information on `faceberg` and the PyIceberg integration with the Hugging Face Hub in the [documentation](./datasets-pyiceberg).
+
+
+## Auto-converted Parquet files
+
+You can query auto-converted Parquet files using the `@~parquet` branch, which corresponds to the `refs/convert/parquet` revision. For more details, refer to the documentation at https://huggingface.co/docs/datasets-server/en/parquet#conversion-to-parquet:
+
+
+
+

+
+
+To reference the `refs/convert/parquet` revision of a dataset, use the following syntax:
+
+```plaintext
+hf://datasets/{my-username}/{my-dataset}@~parquet/{path_to_file}
+```
+
+Here is a sample URL following the above syntax, pointing to a file in the [Parquet branch](https://huggingface.co/datasets/fka/prompts.chat/tree/refs%2Fconvert%2Fparquet) of the [fka/prompts.chat](https://huggingface.co/datasets/fka/prompts.chat) dataset:
+
+```plaintext
+hf://datasets/fka/prompts.chat@~parquet/default/train/0000.parquet
+```
+
+The following guides cover more complex operations you can perform with DuckDB on Hugging Face datasets.
+
+## Use-cases and examples
+
+Find more use-cases and examples with Hugging Face Datasets here:
+
+* [Query datasets](./datasets-duckdb-select)
+* [Perform SQL operations](./datasets-duckdb-sql)
+* [Combine datasets and export](./datasets-duckdb-combine-and-export)
+* [Perform vector similarity search](./datasets-duckdb-vector-similarity-search)
diff --git a/docs/hub/datasets-libraries.md b/docs/hub/datasets-libraries.md
index 9ecdc728f..13786047e 100644
--- a/docs/hub/datasets-libraries.md
+++ b/docs/hub/datasets-libraries.md
@@ -24,6 +24,7 @@ The table below summarizes the supported libraries and their level of integratio
| [Pandas](./datasets-pandas) | Python data analysis toolkit. | ✅ | ❌ | ✅ | ❌ | ✅* |
| [Polars](./datasets-polars) | A DataFrame library on top of an OLAP query engine. | ✅ | ✅ | ✅ | ❌ | ❌ |
| [PyArrow](./datasets-pyarrow) | Apache Arrow is a columnar format and a toolbox for fast data interchange and in-memory analytics. | ✅ | ✅ | ✅ | ❌ | ✅* |
+| [PyIceberg](./datasets-pyiceberg) | Apache Iceberg is a high performance open-source format for large analytic tables. | ✅ | ✅ | ❌ | ❌ | ❌ |
| [Spark](./datasets-spark) | Real-time, large-scale data processing tool in a distributed environment. | ✅ | ✅ | ✅ | ✅ | ✅ |
| [WebDataset](./datasets-webdataset) | Library to write I/O pipelines for large datasets. | ✅ | ✅ | ❌ | ❌ | ❌ |
diff --git a/docs/hub/datasets-pyiceberg.md b/docs/hub/datasets-pyiceberg.md
new file mode 100644
index 000000000..c66a5baf5
--- /dev/null
+++ b/docs/hub/datasets-pyiceberg.md
@@ -0,0 +1,201 @@
+# PyIceberg
+
+PyIceberg is a Python implementation for accessing Iceberg tables.
+
+You can use the PyIceberg library [faceberg](https://github.com/kszucs/faceberg) to deploy an Iceberg Datasets catalog and add datasets from the Hugging Face Hub.
+
+Once your catalog is ready, use your favorite Iceberg client to query datasets in your catalog.
+For example: run SQL queries to explore datasets, do analytics, mix datasets together, or run large processing jobs.
+
+## Set up
+
+### Installation
+
+To be able to add Hugging Face Datasets to a catalog, you need to install the `faceberg` library:
+
+```bash
+pip install faceberg
+```
+
+This will also install required dependencies like `huggingface_hub` for authentication, and `datasets` for metadata discovery.
+
+### Authentication
+
+You need to authenticate to Hugging Face to read private/gated dataset repositories or to write to your dataset repositories.
+
+You can use the CLI, for example:
+
+```bash
+hf auth login
+```
+
+It's also possible to provide your Hugging Face token with the `HF_TOKEN` environment variable or by passing the `token` option to the reader.
+For more details about authentication, check out [this guide](https://huggingface.co/docs/huggingface_hub/quick-start#authentication).
+
+## Deploy a Datasets Catalog
+
+Use `faceberg <username>/<catalog-name> init` to deploy an Iceberg Datasets Catalog under your account on Hugging Face Spaces (free!):
+
+```bash
+faceberg username/my-catalog init
+```
+
+This command will show you the created catalog information and some helpful commands:
+
+```
+🤗🧊 Catalog: hf://spaces/username/my-catalog
+
+Initializing remote catalog: hf://spaces/username/my-catalog
+✓ Catalog initialized successfully!
+
+Space URL: https://username-my-catalog.hf.space
+Repository: https://huggingface.co/spaces/username/my-catalog
+
+Next steps:
+ • Run faceberg add to add tables
+ • Run faceberg sync to sync tables from datasets
+ • Use faceberg scan to view sample data
+ • Run faceberg serve to start the REST catalog server
+ • Run faceberg quack to open DuckDB with the catalog
+```
+
+In particular, note the Space URL that ends with `.hf.space`.
+It is your catalog URI for Iceberg clients, and also the web interface you can open in your browser:
+
+
+

+
+
+Alternatively, deploy a catalog locally using `"/path/to/catalog"` or `"file:///path/to/catalog"` instead of a Space repository name.
+
+## Add a dataset
+
+The `faceberg` command line makes it easy to add datasets from Hugging Face to an Iceberg catalog.
+
+This is compatible with all datasets in a [supported format](https://huggingface.co/docs/hub/datasets-adding#file-formats) on Hugging Face.
+Under the hood, the catalog points to the dataset's Parquet files.
+
+For example, here is how to add the [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb) dataset using the `faceberg add` command:
+
+```bash
+faceberg username/my-catalog add stanfordnlp/imdb
+```
+
+which shows:
+
+```
+🤗🧊 Catalog: hf://spaces/username/my-catalog
+
+Adding dataset: stanfordnlp/imdb
+Table identifier: stanfordnlp.imdb
+
+stanfordnlp.imdb ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • Complete
+
+✓ Added stanfordnlp.imdb to catalog
+ Dataset: stanfordnlp/imdb
+ Location: hf://spaces/username/my-catalog/stanfordnlp/imdb/metadata/v1.metadata.json
+
+Table schema:
+imdb(
+ 1: split: optional string,
+ 2: text: optional string,
+ 3: label: optional long
+),
+partition by: [split],
+sort order: [],
+snapshot: Operation.APPEND: id=1, schema_id=0
+```
+
+
+

+
+
+> [!TIP]
+> On Hugging Face, datasets that are not in Parquet format are automatically converted to Parquet in a separate git branch `refs/convert/parquet`.
+> Therefore it is possible to add to an Iceberg catalog a dataset that is not originally in Parquet.
+
+Here is another example with the [BAAI/Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct) dataset.
+It is a gated repository: users have to accept the terms of use before accessing it.
+It also has multiple subsets, namely "3M" and "7M", so we need to specify which one to load:
+
+```bash
+faceberg username/my-catalog add BAAI/Infinity-Instruct --config 7M
+```
+
+
+

+
+
+## Load a dataset table
+
+### Using `faceberg`
+
+Use `faceberg` to get the PyIceberg catalog in Python, then `.load_table()` to load the dataset table (more precisely, the config or subset named "7M", which contains 7M samples).
+
+After logging in to access the gated dataset, you can run:
+
+```python
+>>> import faceberg
+>>> catalog = faceberg.catalog("username/my-catalog")
+>>> table = catalog.load_table("BAAI.Infinity-Instruct")
+>>> table.scan(limit=5).to_pandas()
+ split id conversations label langdetect source reward
+0 train 0 [{'from': 'human', 'value': 'def extinguish_fi... {'ability_en': ['programming ability'], 'abili... en code_exercises 3.718750
+1 train 1 [{'from': 'human', 'value': 'See the multi-cho... {'ability_en': ['logical reasoning'], 'ability... en flan -3.359375
+2 train 2 [{'from': 'human', 'value': 'This is some data... {'ability_en': ['geographic knowledge', 'text ... en flan -1.171875
+3 train 3 [{'from': 'human', 'value': 'If you don't want... {'ability_en': ['logical reasoning'], 'ability... en flan -12.187500
+4 train 4 [{'from': 'human', 'value': 'In a United State... {'ability_en': ['text understanding', 'informa... en flan 12.687500
+```
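From there, standard pandas operations apply to the scanned DataFrame. For example, counting dialogues per detected language (sketched here on a small synthetic DataFrame standing in for the scan result):

```python
import pandas as pd

# Synthetic stand-in for table.scan(...).to_pandas()
df = pd.DataFrame({
    "langdetect": ["en", "en", "zh", "en"],
    "source": ["flan", "code_exercises", "flan", "flan"],
})

# Number of dialogues per detected language
counts = df.groupby("langdetect").size().to_dict()
print(counts)  # {'en': 3, 'zh': 1}
```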
+
+### Using `pyiceberg`
+
+Here is how to instantiate the catalog using native PyIceberg.
+The catalog is a REST catalog, so we use `pyiceberg.catalog.rest.RestCatalog`.
+
+The URI of the catalog is the Space HTTP URL that ends with `.hf.space`, and the warehouse property should point to the Space repository on Hugging Face, which contains the metadata files of the Iceberg tables:
+
+```python
+>>> from pyiceberg.catalog.rest import RestCatalog
+>>> properties = {
+... "uri": "https://username-my-catalog.hf.space",
+... "warehouse": "hf://spaces/username/my-catalog",
+... }
+>>> catalog = RestCatalog("username/my-catalog", **properties)
+>>> table = catalog.load_table("BAAI.Infinity-Instruct")
+```
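The properties above can be derived from the Space repo id; a minimal sketch (the helper is hypothetical, and it assumes the Space subdomain is simply `<user>-<name>.hf.space`, which may differ for repo names with special characters):

```python
def catalog_properties(space_id: str) -> dict:
    """Derive REST catalog properties from a Space repo id
    like 'username/my-catalog'."""
    user, name = space_id.split("/", 1)
    return {
        "uri": f"https://{user}-{name}.hf.space",
        "warehouse": f"hf://spaces/{space_id}",
    }

print(catalog_properties("username/my-catalog"))
```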
+
+## Run SQL queries
+
+Once you have your PyIceberg table ready, you can run SQL queries from the catalog Space:
+
+
+

+
+
+## More information on `faceberg`
+
+Find more information about the `faceberg` library in its official documentation:
+
+* [Getting Started](https://faceberg.kszucs.dev/), and in particular:
+ - [Query data with the CLI](https://faceberg.kszucs.dev/#query-data)
+ - [Query data with DuckDB](https://faceberg.kszucs.dev/#interactive-queries-with-duckdb)
+ - [Query data with PyIceberg](https://faceberg.kszucs.dev/#query-using-pyiceberg)
+ - [The Catalog API](https://faceberg.kszucs.dev/#catalog-api)
+ - [How it works](https://faceberg.kszucs.dev/#how-it-works)
+* [Local Catalogs](https://faceberg.kszucs.dev/local.html)
+* [Architecture](https://faceberg.kszucs.dev/design.html)
+* [DuckDB integration](https://faceberg.kszucs.dev/duckdb.html)
+* [Pandas integration](https://faceberg.kszucs.dev/pandas.html)
+
+## Use other Iceberg clients
+
+Access datasets in your Iceberg Datasets Catalog with other clients:
+
+* **DuckDB** to run SQL, see the [DuckDB integration with faceberg](https://faceberg.kszucs.dev/duckdb.html)
+* **Pandas** for easy dataframe processing, see the [Pandas integration with faceberg](https://faceberg.kszucs.dev/pandas.html)
+
+More generally, any client that supports REST catalogs and `hf://` URIs can access your Iceberg Datasets Catalog.
+In addition to native support in DuckDB, `hf://` URIs are also supported in any `fsspec`-based client in Python and in any `object_store_opendal`-based client in Rust.
+
+Find more examples and advanced usage in the main [faceberg documentation](https://faceberg.kszucs.dev/).