chdb DataStore — It's Just Faster Pandas The Key Insight

Change this:

import pandas as pd

To this:

import chdb . datastore as pd

Everything else stays the same.

DataStore is a lazy, ClickHouse-backed pandas replacement . Your existing pandas code works unchanged — but operations compile to optimized SQL and execute only when results are needed (e.g., print() , len() , iteration). pip install chdb Decision Tree: Pick the Right Approach 1. "I have a file/database and want to analyze it with pandas" → DataStore.from_file() / from_mysql() / from_s3() etc. → See references/connectors.md 2. "I need to join data from different sources" → Create DataStores from each source, use .join() → See examples/examples.md #3-5 3. "My pandas code is too slow" → import chdb.datastore as pd — change one line, keep the rest 4. "I need raw SQL queries" → Use the chdb-sql skill instead Connect to Any Data Source — One Pattern from datastore import DataStore

Local file (auto-detects .parquet, .csv, .json, .arrow, .orc, .avro, .tsv, .xml)

ds

DataStore . from_file ( "sales.parquet" )

Database

ds

DataStore . from_mysql ( host = "db:3306" , database = "shop" , table = "orders" , user = "root" , password = "pass" )

Cloud storage

ds

DataStore . from_s3 ( "s3://bucket/data.parquet" , nosign = True )

URI shorthand — auto-detects source type

ds

DataStore . uri ( "mysql://root:pass@db:3306/shop/orders" ) All 16+ sources and URI schemes → connectors.md After Connecting — Full Pandas API result = ds [ ds [ "age" ]

25 ]

filter

result

ds [ [ "name" , "city" ] ]

select columns

result

ds . sort_values ( "revenue" , ascending = False )

sort

result

ds . groupby ( "dept" ) [ "salary" ] . mean ( )

groupby

result

ds . assign ( margin = lambda x : x [ "profit" ] / x [ "revenue" ] )

computed column

ds [ "name" ] . str . upper ( )

string accessor

ds [ "date" ] . dt . year

datetime accessor

result

ds1 . join ( ds2 , on = "id" )

join

result

ds . head ( 10 )

preview

print ( ds . to_sql ( ) )

see generated SQL

209 DataFrame methods supported. Full API → api-reference.md Cross-Source Join — The Killer Feature from datastore import DataStore customers = DataStore . from_mysql ( host = "db:3306" , database = "crm" , table = "customers" , user = "root" , password = "pass" ) orders = DataStore . from_file ( "orders.parquet" ) result = ( orders . join ( customers , left_on = "customer_id" , right_on = "id" ) . groupby ( "country" ) . agg ( { "amount" : "sum" , "rating" : "mean" } ) . sort_values ( "sum" , ascending = False ) ) print ( result ) More join examples → examples.md Writing Data source = DataStore . from_mysql ( host = "db:3306" , database = "shop" , table = "orders" , user = "root" , password = "pass" ) target = DataStore ( "file" , path = "summary.parquet" , format = "Parquet" ) target . insert_into ( "category" , "total" , "count" ) . select_from ( source . groupby ( "category" ) . select ( "category" , "sum(amount) AS total" , "count() AS count" ) ) . execute ( ) Troubleshooting Problem Fix ImportError: No module named 'chdb' pip install chdb ImportError: cannot import 'DataStore' Use from datastore import DataStore or from chdb.datastore import DataStore Database connection timeout Include port in host: host="db:3306" not host="db" Join returns empty result Check key types match (both int or both string); use .to_sql() to inspect Unexpected results Call ds.to_sql() to see the generated SQL and debug Environment check Run python scripts/verify_install.py (from skill directory) References API Reference — Full DataStore method signatures Connectors — All 16+ data source connection methods Examples — 10+ runnable examples with expected output Verify Install — Environment verification script Official Docs Note: This skill teaches how to use chdb DataStore. For raw SQL queries, use the chdb-sql skill. For contributing to chdb source code, see CLAUDE.md in the project root.

chdb-datastore

安装

Change this:

To this:

Everything else stays the same.

Local file (auto-detects .parquet, .csv, .json, .arrow, .orc, .avro, .tsv, .xml)

ds

Database

ds

Cloud storage

ds

URI shorthand — auto-detects source type

ds

filter

result

select columns

result

sort

result

groupby

result

computed column

string accessor

datetime accessor

result

join

result

preview

see generated SQL