DuckDB
DuckDB is a database that supports reading and querying Parquet files really fast. Begin by creating a connection to DuckDB, and then install and load the httpfs
extension to read and write remote files:
Python
JavaScript
import duckdb
url = "https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet"
con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
Now you can write and execute your SQL query on the Parquet file:
Python
JavaScript
con.sql(f"SELECT horoscope, count(*), AVG(LENGTH(text)) AS avg_blog_length FROM '{url}' GROUP BY horoscope ORDER BY avg_blog_length DESC LIMIT(5)")
βββββββββββββ¬βββββββββββββββ¬βββββββββββββββββββββ
β horoscope β count_star() β avg_blog_length β
β varchar β int64 β double β
βββββββββββββΌβββββββββββββββΌβββββββββββββββββββββ€
β Aquarius β 34062 β 1129.218836239798 β
β Cancer β 41509 β 1098.366812016671 β
β Capricorn β 33961 β 1073.2002002296751 β
β Libra β 40302 β 1072.0718326633914 β
β Leo β 40587 β 1064.0536871412028 β
βββββββββββββ΄βββββββββββββββ΄βββββββββββββββββββββ
To query multiple files - for example, if the dataset is sharded:
Python
JavaScript
con.sql(f"SELECT horoscope, count(*), AVG(LENGTH(text)) AS avg_blog_length FROM read_parquet({urls[:2]}) GROUP BY horoscope ORDER BY avg_blog_length DESC LIMIT(5)")
βββββββββββββββ¬βββββββββββββββ¬βββββββββββββββββββββ
β horoscope β count_star() β avg_blog_length β
β varchar β int64 β double β
βββββββββββββββΌβββββββββββββββΌβββββββββββββββββββββ€
β Aquarius β 49568 β 1125.8306770497095 β
β Cancer β 63512 β 1097.95608703867 β
β Libra β 60304 β 1060.6110539931017 β
β Capricorn β 49402 β 1059.5552609206104 β
β Sagittarius β 50431 β 1057.4589835616982 β
βββββββββββββββ΄βββββββββββββββ΄βββββββββββββββββββββ
DuckDB-Wasm, a package powered by WebAssembly, is also available for running DuckDB in any browser. This could be useful, for instance, if you want to create a web app to query Parquet files from the browser!