Streaming Data into DuckDB with Arrow and Python Generators
When working with large datasets, I strongly believe it's almost always better to stream the data than to load everything into memory at once.
If you've used DuckDB with Python, you probably know that DuckDB can query Pandas DataFrames directly. What I wasn't sure about, however, was whether DuckDB could consume data produced by a Python generator function.
It turns out that it can, and in this post I'll show you how.
import duckdb
import pyarrow as pa

# Example schema
schema = pa.schema([
    pa.field("color", pa.string()),
    pa.field("count", pa.int64())
])

# An example record batch of 1024 rows
fixed_batch = pa.RecordBatch.from_pylist(
    [{"color": "green" * 10, "count": 1}] * 1024,
    schema=schema,
)

# Here is a generator function that will produce 50,000,000 batches,
# which is far more than what will fit into memory on my MacBook Air.
def batch_generator():
    for i in range(50000000):
        print("Getting batch for iteration:", i)
        yield fixed_batch
    print("Finished generation")

# Create a RecordBatchReader from the generator
reader = pa.RecordBatchReader.from_batches(schema, batch_generator())

# Register the reader as a DuckDB table
con = duckdb.connect()
con.register("example", reader)

print(con.execute("COPY (select * from example) to 'foo.parquet'").fetchall())
1. Streaming Generator: The batch_generator() function doesn't load the entire dataset into memory. Instead, it yields batches on demand. In this example, each batch contains 1,024 rows, but imagine we have 50 million batches, all handled efficiently without exhausting memory.
2. Arrow RecordBatchReader: By wrapping our generator in a RecordBatchReader, we make it fully compatible with DuckDB. This allows DuckDB to consume the data in a streaming, memory-efficient way.
3. Register as a Table: DuckDB treats the reader like any other SQL table. You can query it, join it with other sources, or stream it directly to Parquet using COPY. A small query sketch follows this list.
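To make the "query it like any other table" point concrete, here is a minimal sketch of running an ordinary aggregation over the stream. It assumes the schema, batch_generator, and con from the example above; the names reader_again and example_again are placeholders I've introduced, and a fresh reader is built because the COPY above already consumed the first one.

# Build a fresh reader, since the earlier COPY already consumed the original one.
reader_again = pa.RecordBatchReader.from_batches(schema, batch_generator())
con.register("example_again", reader_again)

# Ordinary SQL over the stream; DuckDB pulls batches only as it needs them.
print(con.execute(
    "SELECT color, count(*) AS row_count, sum(count) AS total FROM example_again GROUP BY color"
).fetchall())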
One caveat: this pattern won't work well if you need to scan the data more than once, because the reader cannot be rewound to the beginning of the generator. It also won't work if DuckDB tries to scan the generator in parallel.
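If you do need multiple scans, one workaround is to rebuild the generator and reader for each query. Here is a minimal sketch under that assumption; register_stream and the table names are hypothetical helpers of my own, not part of DuckDB's API.

def register_stream(con, name):
    # Every call creates a brand-new generator and reader, so each
    # registered name can be scanned once from the beginning.
    fresh_reader = pa.RecordBatchReader.from_batches(schema, batch_generator())
    con.register(name, fresh_reader)

# One registration per single-scan query.
register_stream(con, "scan_one")
con.execute("COPY (select * from scan_one) to 'first.parquet'")

register_stream(con, "scan_two")
print(con.execute("select count(*) from scan_two").fetchone())

Alternatively, materialize the stream to Parquet once, as the COPY above does, and run any repeated queries against the resulting file.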
Why This Matters
- Memory-friendly: Never load huge datasets all at once, just stream them.
- SQL on live data: Run full SQL queries on streams.
- Direct export: Writing to Parquet, even with partitioning applied, is straightforward (see the sketch after this list).
- Composable: Batches can come from anywhere, such as an API, a message queue, simulations, or another Arrow-based system.
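For the partitioned export mentioned in the list, a sketch might look like the following. It reuses schema, batch_generator, and con from earlier and relies on DuckDB's PARTITION_BY option for COPY; the reader, table, and output directory names are placeholders.

# A fresh reader for this single pass over the stream.
export_reader = pa.RecordBatchReader.from_batches(schema, batch_generator())
con.register("stream_for_export", export_reader)

# Write a Hive-partitioned Parquet dataset directly from the stream.
con.execute("""
    COPY (select * from stream_for_export)
    TO 'partitioned_output'
    (FORMAT PARQUET, PARTITION_BY (color))
""")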
This pattern turns DuckDB into a capable streaming query engine, not just a static file processor. At Query.Farm, this is how we handle massive, continuously generated datasets efficiently, elegantly, and reliably.
Streaming data into DuckDB using Arrow is one of those little tricks that opens up big possibilities. Once you start thinking in batches instead of monolithic tables, you can build pipelines that were previously impossible with pure in-memory workflows.