# hugging-face-dataset-viewer

Installs: 74
Rank: #10447

## Install

```shell
npx skills add https://github.com/huggingface/skills --skill hugging-face-dataset-viewer
```

## Hugging Face Dataset Viewer

Use this skill to execute read-only Dataset Viewer API calls for dataset exploration and extraction.

### Core workflow

1. Optionally validate dataset availability with `/is-valid`.
2. Resolve config + split with `/splits`.
3. Preview with `/first-rows`.
4. Paginate content with `/rows` using `offset` and `length` (max 100).
5. Use `/search` for text matching and `/filter` for row predicates.
6. Retrieve parquet links via `/parquet`, and totals/metadata via `/size` and `/statistics`.

### Defaults

- Base URL: `https://datasets-server.huggingface.co`
- Default API method: GET
- Query params should be URL-encoded.
- `offset` is 0-based; `length` max is usually 100 for row-like endpoints.
- Gated/private datasets require an `Authorization: Bearer` header.

### Endpoints

- Validate dataset: `/is-valid?dataset=`
- List subsets and splits: `/splits?dataset=`
- Preview first rows: `/first-rows?dataset=&config=&split=`
- Paginate rows: `/rows?dataset=&config=&split=&offset=&length=`
- Search text: `/search?dataset=&config=&split=&query=&offset=&length=`
- Filter with predicates: `/filter?dataset=&config=&split=&where=&orderby=&offset=&length=`
- List parquet shards: `/parquet?dataset=`
- Get size totals: `/size?dataset=`
- Get column statistics: `/statistics?dataset=&config=&split=`
- Get Croissant metadata (if available): `/croissant?dataset=`

### Pagination pattern

```shell
curl "https://datasets-server.huggingface.co/rows?dataset=stanfordnlp/imdb&config=plain_text&split=train&offset=0&length=100"
curl "https://datasets-server.huggingface.co/rows?dataset=stanfordnlp/imdb&config=plain_text&split=train&offset=100&length=100"
```

When pagination is partial, use response fields such as `num_rows_total`, `num_rows_per_page`, and `partial` to drive continuation logic.

### Search/filter notes

- `/search` matches string columns (full-text-style behavior is internal to the API).
- `/filter` requires predicate syntax in `where` and an optional sort in `orderby`.
- Keep searches and filters read-only and side-effect free.
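The continuation logic above can be sketched as a small shell helper. `page_offsets` is a hypothetical name introduced here for illustration; the 100-row default is the usual `length` cap for row-like endpoints, and in practice the `partial` field in the response is the authoritative continuation signal.

```shell
# Sketch: enumerate the offsets needed to page through a split, given
# num_rows_total from a /rows response. Hypothetical helper, not part of
# the Dataset Viewer API itself.
page_offsets() {
  total=$1
  length=${2:-100}   # usual max page size for row-like endpoints
  offset=0
  while [ "$offset" -lt "$total" ]; do
    echo "$offset"
    offset=$((offset + length))
  done
}

# Each emitted offset feeds one request, e.g.:
#   curl ".../rows?dataset=stanfordnlp/imdb&config=plain_text&split=train&offset=$offset&length=100"
page_offsets 250   # emits 0, 100, 200
```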
## Querying Datasets

Use `npx parquetlens` with Hub parquet alias paths for SQL querying.

Parquet alias shape:

```
hf://datasets/<namespace>/<repo>@~parquet/<config>/<split>/<file>.parquet
```

Derive `<config>`, `<split>`, and `<file>` from Dataset Viewer `/parquet`:

```shell
curl -s "https://datasets-server.huggingface.co/parquet?dataset=cfahlgren1/hub-stats" \
  | jq -r '.parquet_files[] | "hf://datasets/\(.dataset)@~parquet/\(.config)/\(.split)/\(.filename)"'
```

Run a SQL query:

```shell
npx -y -p parquetlens -p @parquetlens/sql parquetlens \
  "hf://datasets/<namespace>/<repo>@~parquet/<config>/<split>/<file>.parquet" \
  --sql "SELECT * FROM data LIMIT 20"
```

SQL export:

- CSV: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.csv' (FORMAT CSV, HEADER, DELIMITER ',')"`
- JSON: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.json' (FORMAT JSON)"`
- Parquet: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.parquet' (FORMAT PARQUET)"`

## Creating and Uploading Datasets

Use one of these flows depending on dependency constraints.

Zero local dependencies (Hub UI):

1. Create a dataset repo in the browser: https://huggingface.co/new-dataset
2. Upload parquet files on the repo's "Files and versions" page.
3. Verify that shards appear in the Dataset Viewer:

```shell
curl -s "https://datasets-server.huggingface.co/parquet?dataset=<namespace>/<repo>"
```

Low-dependency CLI flow (`npx @huggingface/hub` / hfjs):

Set the auth token:

```shell
export HF_TOKEN=<your_hf_token>
```

Upload a parquet folder to a dataset repo (auto-creates the repo if missing):

```shell
npx -y @huggingface/hub upload datasets/<namespace>/<repo> ./local/parquet-folder data
```

Upload as a private repo on creation:

```shell
npx -y @huggingface/hub upload datasets/<namespace>/<repo> ./local/parquet-folder data --private
```

After upload, call `/parquet` to discover `<config>/<split>/<file>` values for querying with `@~parquet`.
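The `/parquet`-to-alias derivation can be exercised offline. The JSON below is a fabricated sample that mimics the response shape (field names `dataset`, `config`, `split`, `filename` are taken from the jq template used earlier), not real API output.

```shell
# Build hf:// parquet alias paths from a /parquet-style response.
# `sample` is a stand-in for:
#   curl -s "https://datasets-server.huggingface.co/parquet?dataset=..."
sample='{"parquet_files":[
  {"dataset":"user/repo","config":"default","split":"train","filename":"0000.parquet"},
  {"dataset":"user/repo","config":"default","split":"test","filename":"0000.parquet"}]}'

printf '%s\n' "$sample" \
  | jq -r '.parquet_files[] | "hf://datasets/\(.dataset)@~parquet/\(.config)/\(.split)/\(.filename)"'
# First line: hf://datasets/user/repo@~parquet/default/train/0000.parquet
```

Each resulting path can be passed to `parquetlens` as shown above.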
