Streaming Data into DuckDB with Arrow and Python Generators
When working with large datasets, I strongly believe it's almost always better to stream the data than to load everything into memory at once.
If you've used DuckDB with Python, you probably know that DuckDB can query Pandas DataFrames directly. What I wasn't sure about, however, was whether DuckDB could consume data produced by a Python generator function.
It turns out that it can, and in this post I'll show you how.
import duckdb
import pyarrow as pa

# Example schema
schema = pa.schema([
    pa.field("color", pa.string()),
    pa.field("count", pa.int64())
])

# An example record batch of 1024 rows
fixed_batch = pa.RecordBatch.from_pylist(
    [{"color": "green" * 10, "count": 1}] * 1024,
    schema=schema,
)

# Here is a generator function that will produce 50,000,000 batches,
# which is far more than what will fit into memory on my MacBook Air.
def batch_generator():
    for i in range(50000000):
        print("Getting batch for iteration:", i)
        yield fixed_batch
    print("Finished generation")

# Create a RecordBatchReader from the generator
reader = pa.RecordBatchReader.from_batches(schema, batch_generator())

# Register the reader as a DuckDB table
con = duckdb.connect()
con.register("example", reader)

print(con.execute("COPY (select * from example) to 'foo.parquet'").fetchall())
1. Streaming Generator: The batch_generator() function doesn't load the entire dataset into memory. Instead, it yields batches on demand. In this example, each batch contains 1,024 rows, but imagine we have 50 million batches, all handled efficiently without exhausting memory.
2. Arrow RecordBatchReader: By wrapping our generator in a RecordBatchReader, we make it fully compatible with DuckDB. This allows DuckDB to consume the data in a streaming, memory-efficient way.
3. Register as a Table: DuckDB treats the reader like any other SQL table. You can query it, join it with other sources, or stream it directly to Parquet using COPY. A small query sketch follows this list.
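To make the "query it like any other table" point concrete, here is a minimal sketch of running an ordinary aggregation over the stream. It assumes the schema, batch_generator, and con from the example above; the names reader_again and example_again are placeholders I've introduced, and a fresh reader is built because the COPY above already consumed the first one.

# Build a fresh reader, since the earlier COPY already consumed the original one.
reader_again = pa.RecordBatchReader.from_batches(schema, batch_generator())
con.register("example_again", reader_again)

# Ordinary SQL over the stream; DuckDB pulls batches only as it needs them.
print(con.execute(
    "SELECT color, count(*) AS row_count, sum(count) AS total FROM example_again GROUP BY color"
).fetchall())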
One caveat: this pattern won't work well if you need to scan the data more than once, because the reader cannot be rewound to the beginning of the generator. It also won't work if DuckDB tries to scan the generator in parallel.
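If you do need multiple scans, one workaround is to rebuild the generator and reader for each query. Here is a minimal sketch under that assumption; register_stream and the table names are hypothetical helpers of my own, not part of DuckDB's API.

def register_stream(con, name):
    # Every call creates a brand-new generator and reader, so each
    # registered name can be scanned once from the beginning.
    fresh_reader = pa.RecordBatchReader.from_batches(schema, batch_generator())
    con.register(name, fresh_reader)

# One registration per single-scan query.
register_stream(con, "scan_one")
con.execute("COPY (select * from scan_one) to 'first.parquet'")

register_stream(con, "scan_two")
print(con.execute("select count(*) from scan_two").fetchone())

Alternatively, materialize the stream to Parquet once, as the COPY above does, and run any repeated queries against the resulting file.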
Why This Matters
- Memory-friendly: Never load huge datasets all at once, just stream them.
- SQL on live data: Run full SQL queries on streams.
- Direct export: Writing to Parquet, even with partitioning applied, is straightforward (see the sketch after this list).
- Composable: Batches can come from anywhere, such as an API, a message queue, simulations, or another Arrow-based system.
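For the partitioned export mentioned in the list, a sketch might look like the following. It reuses schema, batch_generator, and con from earlier and relies on DuckDB's PARTITION_BY option for COPY; the reader, table, and output directory names are placeholders.

# A fresh reader for this single pass over the stream.
export_reader = pa.RecordBatchReader.from_batches(schema, batch_generator())
con.register("stream_for_export", export_reader)

# Write a Hive-partitioned Parquet dataset directly from the stream.
con.execute("""
    COPY (select * from stream_for_export)
    TO 'partitioned_output'
    (FORMAT PARQUET, PARTITION_BY (color))
""")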
This pattern turns DuckDB into a capable streaming query engine, not just a static file processor. At Query.Farm, this is how we handle massive, continuously generated datasets efficiently, elegantly, and reliably.
Streaming data into DuckDB using Arrow is one of those little tricks that opens up big possibilities. Once you start thinking in batches instead of monolithic tables, you can build pipelines that were previously impossible with pure in-memory workflows.