ShellFS DuckDB Extension
The ShellFS extension, developed by Query.Farm, enables DuckDB to read from and write to shell commands as if they were files. This powerful integration allows you to stream data into DuckDB from the output of any shell command (such as curl, wget, or ls) and to send query results directly to command-line tools for further processing or storage.
DuckDB extensions are plugins that expand the core DuckDB engine with new capabilities.
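For example, streaming the output of curl straight into a query looks like this. A minimal sketch, assuming the extension is installed and loaded, and using a placeholder URL that returns JSON:

-- Stream an HTTP response into DuckDB via curl; the URL is hypothetical.
SELECT *
FROM read_json_auto('curl -s https://api.example.com/data.json |');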
Key Features
- Read input from the standard output of a shell command, treating it as a virtual file in DuckDB.
- Write output to the standard input of a shell command, enabling seamless integration with compression tools (gzip, xz), cloud storage utilities (aws s3 cp, gsutil), or custom scripts (see the sketch after this list).
- Compose pipelines that combine SQL queries with shell utilities, making it easy to build efficient, ad hoc ETL workflows.
- Avoid intermediate files, reducing disk I/O and simplifying automation.
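As a minimal sketch of the compression case, assuming the extension is loaded (the output file name is arbitrary):

-- Pipe CSV output through gzip instead of writing a plain file.
COPY (SELECT range AS n FROM range(1, 101))
TO '| gzip > numbers.csv.gz' (FORMAT 'CSV');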
Use Cases
- Querying live data streams from APIs or command-line tools.
- Exporting query results directly to cloud storage or remote servers.
- Integrating DuckDB with existing UNIX-style data processing workflows.
Getting Started
ShellFS is a DuckDB community extension maintained and supported by Query.Farm.
Install ShellFS in DuckDB by running:
INSTALL shellfs FROM community;
Then load it with:
LOAD shellfs;
Functionality
The ShellFS extension for DuckDB enables the use of Unix pipes for input and output.
By appending a pipe character | to a filename, DuckDB will treat it as a series of commands to execute and capture the output. Conversely, if you prefix a filename with |, DuckDB will treat it as an output pipe.
While the examples provided are simple, in practical scenarios, you might use this feature to run another program that generates CSV, JSON, or other formats to manage complexity that DuckDB cannot handle directly.
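A minimal sketch of both directions, assuming the extension is loaded (the commands are arbitrary):

-- Trailing '|': run seq and read its stdout as a headerless CSV.
SELECT * FROM read_csv('seq 1 5 |', header = false);

-- Leading '|': pipe query output into wc -l to count the lines written.
COPY (SELECT * FROM range(10)) TO '| wc -l' (FORMAT 'CSV');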
Reading input from a pipe
Create a program to generate CSV in Python:
#!/usr/bin/env python3
print("counter1,counter2")
for i in range(10000000):
    print(f"{i},{i}")
Run that program and determine the number of distinct values it produces:
select count(distinct counter1)
from read_csv('./test-csv.py |');
┌──────────────────────────┐
│ count(DISTINCT counter1) │
│          int64           │
├──────────────────────────┤
│                 10000000 │
└──────────────────────────┘
When a command is not found or cannot be executed, this is the result:
SELECT count(distinct column0) from read_csv('foo |');
sh: foo: command not found
┌─────────────────────────┐
│ count(DISTINCT column0) │
│          int64          │
├─────────────────────────┤
│                       0 │
└─────────────────────────┘
The reason why there isn't an exception raised in this case is that the popen() implementation starts a shell process, and that shell process itself starts successfully. The failure of the inner command is reported only on the shell's stderr and through its exit status, so from DuckDB's perspective the pipe simply produces no data.
Writing output to a pipe
-- Write all numbers from 1 to 30 out, but then filter via grep
-- for only lines that contain 6.
COPY (select * from unnest(generate_series(1, 30)))
TO '| grep 6 > numbers.csv' (FORMAT 'CSV');
6
16
26
-- Copy the result set to the clipboard on Mac OS X using pbcopy
COPY (select 'hello' as type, * from unnest(generate_series(1, 30)))
TO '| grep 3 | pbcopy' (FORMAT 'CSV');
type,"generate_series(1, 30)"
3
hello,13
hello,23
hello,30
hello,
-- Write an encrypted file out via openssl
COPY (select 'hello' as type, * from unnest(generate_series(1, 30)))
TO '| openssl enc -aes-256-cbc -salt -in - -out example.enc -pbkdf2 -iter 1000 -pass pass:testing12345' (FORMAT 'JSON');
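The same pipe mechanism reads the encrypted file back. A sketch assuming the parameters above, with openssl decrypting to stdout for DuckDB to consume as JSON:

SELECT *
FROM read_json('openssl enc -d -aes-256-cbc -in example.enc -pbkdf2 -iter 1000 -pass pass:testing12345 |');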
Configuration
This extension introduces a new configuration option:
ignore_sigpipe - a boolean option that, when set to true, ignores the SIGPIPE signal. This is useful when writing to a pipe that stops reading input. For example:
COPY (select 'hello' as type, * from unnest(generate_series(1, 300))) TO '| head -n 100';
In this scenario, DuckDB attempts to write 300 lines to the pipe, but the head command only reads the first 100 lines. After head reads the first 100 lines and exits, it closes the pipe. The next time DuckDB tries to write to the pipe, it receives a SIGPIPE signal. By default, this causes DuckDB to exit. However, if ignore_sigpipe is set to true, the SIGPIPE signal is ignored, allowing DuckDB to continue without error even if the pipe is closed.
You can enable this option by setting it with the following command:
set ignore_sigpipe = true;
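With the option enabled, a truncated write like the one above completes cleanly. A sketch (the output file name is arbitrary):

SET ignore_sigpipe = true;
COPY (SELECT * FROM range(300)) TO '| head -n 100 > first-100.csv' (FORMAT 'CSV');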
Caveats
When using read_text() or read_blob(), the contents of the data read from a pipe are limited to 2GB in size. This is the maximum length of a single row's value.
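For example, a sketch that materializes a command's entire output as a single value (the command is arbitrary; its output must stay under the 2GB cap):

-- read_text() returns one row whose content column holds the whole stream.
SELECT size, length(content) AS chars
FROM read_text('seq 1 100000 |');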
When using read_csv() or read_json(), the contents of the pipe can be unlimited in size, as the input is processed in a streaming fashion.
A demonstration of this would be:
#!/usr/bin/env python3
print("counter1,counter2")
for i in range(10000000):
    print(f"{i},{i}")
select count(distinct counter1) from read_csv('./test-csv.py |');
┌──────────────────────────┐
│ count(DISTINCT counter1) │
│          int64           │
├──────────────────────────┤
│                 10000000 │
└──────────────────────────┘
If a limit clause is used, you may see an error like this:
select * from read_csv('./test-csv.py |') limit 3;
┌──────────┬──────────┐
│ counter1 │ counter2 │
│  int64   │  int64   │
├──────────┼──────────┤
│        0 │        0 │
│        1 │        1 │
│        2 │        2 │
└──────────┴──────────┘
Traceback (most recent call last):
  File "/Development/shellfs/./test-csv.py", line 5, in <module>
    print(f"{i},{i}")
BrokenPipeError: [Errno 32] Broken pipe
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
BrokenPipeError: [Errno 32] Broken pipe
DuckDB continues to run, but the program that was producing output received a SIGPIPE signal because DuckDB closed the pipe after reading the necessary number of rows. It is up to the user of DuckDB to decide whether to suppress this behavior by setting the ignore_sigpipe configuration parameter.
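Following that advice, the suppression looks like this. A sketch reusing the earlier test-csv.py script:

SET ignore_sigpipe = true;
SELECT * FROM read_csv('./test-csv.py |') LIMIT 3;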