# Hugging Face Dataset Viewer

Use this skill to execute read-only Dataset Viewer API calls for dataset exploration and extraction.
## Core workflow
1. Optionally validate dataset availability with `/is-valid`.
2. Resolve `config` + `split` with `/splits`.
3. Preview with `/first-rows`.
4. Paginate content with `/rows` using `offset` and `length` (max 100).
5. Use `/search` for text matching and `/filter` for row predicates.
6. Retrieve parquet links via `/parquet` and totals/metadata via `/size` and `/statistics`.
## Defaults

- Base URL: `https://datasets-server.huggingface.co`
- Default API method: `GET`
- Query params should be URL-encoded.
- `offset` is 0-based.
- `length` max is usually `100` for row-like endpoints.
- Gated/private datasets require an `Authorization: Bearer <token>` header.
## Dataset Viewer

- Validate dataset: `/is-valid?dataset=`
- List subsets and splits: `/splits?dataset=`
- Preview first rows: `/first-rows?dataset=&config=&split=`
- Paginate rows: `/rows?dataset=&config=&split=&offset=&length=`
- Search text: `/search?dataset=&config=&split=&query=&offset=&length=`
- Filter with predicates: `/filter?dataset=&config=&split=&where=&orderby=&offset=&length=`
- List parquet shards: `/parquet?dataset=`
- Get size totals: `/size?dataset=`
- Get column statistics: `/statistics?dataset=&config=&split=`
- Get Croissant metadata (if available): `/croissant?dataset=`
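The endpoint catalog above can be exercised with a small URL builder. A minimal sketch using only Python's standard library (the `viewer_url` helper is illustrative, not part of any SDK); it also applies the URL-encoding the defaults call for:

```python
from urllib.parse import urlencode

BASE = "https://datasets-server.huggingface.co"

def viewer_url(endpoint: str, **params) -> str:
    """Build a Dataset Viewer URL with URL-encoded query parameters."""
    return f"{BASE}/{endpoint}?{urlencode(params)}"

# First page of rows (offset is 0-based, length capped at 100):
url = viewer_url("rows", dataset="stanfordnlp/imdb", config="plain_text",
                 split="train", offset=0, length=100)
print(url)
```

Note that `urlencode` percent-encodes the `/` in namespaced dataset ids (`stanfordnlp%2Fimdb`), which the API accepts.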
Pagination pattern:

```shell
curl "https://datasets-server.huggingface.co/rows?dataset=stanfordnlp/imdb&config=plain_text&split=train&offset=0&length=100"
curl "https://datasets-server.huggingface.co/rows?dataset=stanfordnlp/imdb&config=plain_text&split=train&offset=100&length=100"
```
When pagination is partial, use response fields such as `num_rows_total`, `num_rows_per_page`, and `partial` to drive continuation logic.
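A hedged sketch of that continuation logic, using the documented `rows` and `num_rows_total` response fields (the page dicts below are stand-ins for real `/rows` responses, and the stopping rule for `partial` datasets is an assumption):

```python
def next_offset(page: dict, offset: int):
    """Return the next 0-based offset to request, or None when no rows remain.

    Assumes that when `partial` is true, `num_rows_total` counts only the
    converted portion, so it is still a safe stopping point.
    """
    fetched = offset + len(page.get("rows", []))
    return fetched if fetched < page.get("num_rows_total", 0) else None

# Stubbed responses shaped like /rows pages:
page = {"rows": [None] * 100, "num_rows_total": 250, "partial": False}
print(next_offset(page, 0))    # next request starts at 100
last = {"rows": [None] * 50, "num_rows_total": 250, "partial": False}
print(next_offset(last, 200))  # exhausted -> None
```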
Search/filter notes:

- `/search` matches string columns (full-text style behavior is internal to the API).
- `/filter` requires predicate syntax in `where` and optional sort in `orderby`.
- Keep filtering and searches read-only and side-effect free.
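Because `where` and `orderby` contain spaces and quotes, they need the same URL-encoding as any other query param. A sketch with `urllib.parse` (the column names and predicate here are made up for illustration, not taken from a real dataset):

```python
from urllib.parse import urlencode

# Hypothetical predicate on a hypothetical "label" column:
params = {
    "dataset": "stanfordnlp/imdb",
    "config": "plain_text",
    "split": "train",
    "where": '"label" = 0',
    "orderby": '"text" ASC',
    "offset": 0,
    "length": 10,
}
url = "https://datasets-server.huggingface.co/filter?" + urlencode(params)
print(url)
```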
## Querying Datasets

Use `npx parquetlens` with Hub parquet alias paths for SQL querying.

Parquet alias shape: `hf://datasets/<dataset>@~parquet/<config>/<split>/<filename>.parquet`
Derive `<config>`, `<split>`, and `<filename>` from Dataset Viewer `/parquet`:

```shell
curl -s "https://datasets-server.huggingface.co/parquet?dataset=cfahlgren1/hub-stats" \
  | jq -r '.parquet_files[] | "hf://datasets/\(.dataset)@~parquet/\(.config)/\(.split)/\(.filename)"'
```
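The same mapping in Python, against a stubbed `/parquet` response (the stub mirrors the `parquet_files` fields the jq filter reads; a live call would fetch the same JSON over HTTP, and the config/split/filename values below are placeholders):

```python
def alias_paths(parquet_response: dict) -> list:
    """Turn /parquet entries into hf:// @~parquet alias paths."""
    return [
        "hf://datasets/{dataset}@~parquet/{config}/{split}/{filename}".format(**f)
        for f in parquet_response["parquet_files"]
    ]

# Stub shaped like one /parquet response entry:
resp = {"parquet_files": [{
    "dataset": "cfahlgren1/hub-stats",
    "config": "default",
    "split": "train",
    "filename": "0000.parquet",
}]}
print(alias_paths(resp)[0])
```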
Run a SQL query:

```shell
npx -y -p parquetlens -p @parquetlens/sql parquetlens \
  "hf://datasets/<dataset>@~parquet/<config>/<split>/<filename>.parquet" \
  --sql "SELECT * FROM data LIMIT 20"
```
## SQL export

- CSV: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.csv' (FORMAT CSV, HEADER, DELIMITER ',')"`
- JSON: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.json' (FORMAT JSON)"`
- Parquet: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.parquet' (FORMAT PARQUET)"`
## Creating and Uploading Datasets

Use one of these flows depending on dependency constraints.

Zero local dependencies (Hub UI):

1. Create a dataset repo in the browser: https://huggingface.co/new-dataset
2. Upload parquet files on the repo's "Files and versions" page.
3. Verify shards appear in the Dataset Viewer:

```shell
curl -s "https://datasets-server.huggingface.co/parquet?dataset=<namespace>/<repo>"
```
Low dependency CLI flow (`npx @huggingface/hub` / `hfjs`):

Set auth token:

```shell
export HF_TOKEN=<your_hf_token>
```

Upload a parquet folder to a dataset repo (auto-creates the repo if missing):

```shell
npx -y @huggingface/hub upload datasets/<namespace>/<repo> ./local/parquet-folder data
```

Upload as a private repo on creation:

```shell
npx -y @huggingface/hub upload datasets/<namespace>/<repo> ./local/parquet-folder data --private
```
After upload, call `/parquet` to discover `<config>`/`<split>` values for querying with `@~parquet`.