Getting Started with VGI

Two paths: connect SQL to a hosted worker, or build your own worker and expose it as SQL.

Path 1

Attach a hosted VGI worker

VGI lets Haybarn turn a remote service into a SQL catalog. In this example, a VGI worker at https://vgi-volcanos.fly.dev/ wraps volcano-related datasets, including Smithsonian Global Volcanism Program data. You do not download the dataset, write ETL, or install a custom local extension. Haybarn attaches to the worker, discovers the schemas and tables it exposes, and lets you query them with ordinary SQL. If you want to expose your own API, dataset, or service this way, see Make your own worker.

The key idea: the data access logic lives behind the VGI URL; Haybarn provides the SQL interface.

Run Haybarn

Haybarn is Query.Farm's DuckDB-compatible distribution. The commands below run the latest Haybarn release without pinning a version.

terminal
# Fastest path: run the latest Haybarn with npm / npx.
npx haybarn@latest

# Or run Haybarn with uvx / pipx.
uvx haybarn-cli
pipx run haybarn-cli

Need platform-specific binaries, provenance verification, or more install channels? See the full Haybarn install guide.

Install and load VGI

Once you are inside the Haybarn shell, install and load the vgi extension.

Haybarn shell
-- Install VGI from the community extension repository.
INSTALL vgi FROM community;

-- If you need to refresh an existing install, use:
-- FORCE INSTALL vgi FROM community;

-- Load VGI into this Haybarn session.
LOAD vgi;

-- Verify it is ready. installed and loaded should both be true.
SELECT extension_name, installed, loaded
FROM duckdb_extensions()
WHERE extension_name = 'vgi';

After the verification query, the result should show both installed and loaded as true:

expected result
┌────────────────┬───────────┬─────────┐
│ extension_name │ installed │ loaded  │
│    varchar     │  boolean  │ boolean │
├────────────────┼───────────┼─────────┤
│ vgi            │ true      │ true    │
└────────────────┴───────────┴─────────┘

Attach the hosted worker

With VGI loaded, attach the remote worker. LOCATION is the worker endpoint.

The alias after AS becomes the catalog name. Here, AS volcanos means later queries use names like volcanos.smithsonian.pleistocene_volcanoes.

Haybarn shell
-- Attach a hosted VGI worker.
ATTACH 'volcanos' AS volcanos (
  TYPE vgi,
  LOCATION 'https://vgi-volcanos.fly.dev/'
);

After attaching, Haybarn may open an auth screen. Log in there to continue; the check is there to limit automated traffic to the demo worker. The worker only receives the VGI requests needed to answer your SQL queries.

Query it with SQL

Once the worker is attached, volcanos behaves like an attached database. You can inspect its schemas and tables through information_schema, then query worker-backed tables with normal SQL.

The discovery queries show schemas such as earthquakes, hans, smithsonian, and vsc. From there, tables like volcanos.smithsonian.pleistocene_volcanoes are available directly in SQL.

Haybarn shell
-- See the schemas exposed by the worker.
SELECT catalog_name, schema_name
FROM information_schema.schemata
WHERE catalog_name = 'volcanos';

-- See the tables and views exposed by the worker.
SELECT table_schema, table_name, table_type
FROM information_schema.tables
WHERE table_catalog = 'volcanos'
ORDER BY table_schema, table_name;

-- Query the hosted worker like any other SQL catalog.
SELECT *
FROM volcanos.smithsonian.pleistocene_volcanoes
LIMIT 10;

Use it from a Python script

This script example assumes the Haybarn Python package is available from the public Python package index. Until that package is public, use the Haybarn shell example above.

Install Haybarn in the Python environment that will run your script.

terminal
# This will work once the Haybarn Python package is public.
pip install haybarn

Then use the same attach and query flow from application code through the Haybarn Python API.

volcanos.py
import haybarn

con = haybarn.connect()

con.sql("INSTALL vgi FROM community")
con.sql("LOAD vgi")

con.sql("""
  ATTACH 'volcanos' AS volcanos (
    TYPE vgi,
    LOCATION 'https://vgi-volcanos.fly.dev/'
  )
  """)

rows = con.sql("""
SELECT *
FROM volcanos.smithsonian.pleistocene_volcanoes
LIMIT 10
""").fetchall()

for row in rows:
    print(row)

Path 2

Make your own worker

Start with a dataset, API, model, or internal service you want to make queryable. In this example, we will expose the NYC Department of Finance Summary of Neighborhood Sales by Neighborhood dataset as a hosted VGI worker.

You will need an AI coding agent such as Claude Code or Codex, plus a Fly.io account. The agent reads the VGI Python examples, builds the worker, deploys it to Fly.io, and gives you a URL that Haybarn can attach as a SQL catalog.

Fly.io is only needed to host the worker as a public URL. You most likely will not be charged for this demo because the Fly Machines can scale to zero when idle. If you do not want to sign up for Fly.io, change the prompt and ask the agent to make the worker run locally with uv run instead.

Clone the VGI Python repo

For now, clone the repository and give your agent the local path. In the future, VGI Python will be installable as a package, so this setup step will be simpler.

terminal
git clone https://github.com/Query-farm/vgi-python.git
cd vgi-python

Ask an agent to build and deploy the worker

Use a prompt like this. Replace the repository path with the actual path on your machine if needed.

agent prompt
Look at this vgi-python repository and make me a new VGI worker that exposes
the NYC Department of Finance "Summary of Neighborhood Sales by Neighborhood"
dataset to DuckDB consumers:

https://data.cityofnewyork.us/City-Government/DOF-Summary-of-Neighborhood-Sales-by-Neighborhood-/5ebm-myj7/about_data

Follow the examples and test fixtures in vgi-python.

Requirements:
- Expose the data as a SQL table named nyc_dof_sales.main.neighborhood_sales.
- Add useful column names and types for neighborhood housing analysis.
- Include comments in the SQL/catalog metadata so a future agent understands the dataset.
- Add tests that prove Haybarn/DuckDB consumers can query the worker.
- Deploy the worker to Fly.io and give me the hosted VGI URL and ATTACH statement.
- Configure Fly Machines to scale to zero when idle.

If you want to skip Fly.io, add this follow-up instruction:

local prompt
Change this worker so I can run it locally with uv run instead of deploying it.
Give me the Haybarn ATTACH statement for the local process.

Attach the deployed worker

When the agent finishes, the generated project should give you the hosted Fly.io URL and the exact ATTACH statement. Open Haybarn, install and load VGI if you have not already, then run that attach statement.

Haybarn shell
-- Replace the LOCATION with the Fly.io URL your agent gives you.
ATTACH 'nyc_dof_sales' AS nyc_dof_sales (
  TYPE vgi,
  LOCATION 'https://your-nyc-dof-sales-worker.fly.dev/'
);

-- Confirm the worker exposes the expected table.
SELECT table_schema, table_name, table_type
FROM information_schema.tables
WHERE table_catalog = 'nyc_dof_sales'
ORDER BY table_schema, table_name;

Ask a useful question

Now the DOF dataset behaves like a SQL table. This query looks for neighborhoods that are still under $1M median sale price but have grown quickly over five years.

result
┌──────────────┬───────────────┬───────────────────────────┬───────────────────┬───────────────┐
│ borough_name │ neighborhood  │       type_of_home        │ median_sale_price │ growth_5y_pct │
│   varchar    │    varchar    │          varchar          │       int64       │    double     │
├──────────────┼───────────────┼───────────────────────────┼───────────────────┼───────────────┤
│ Queens       │ Kew Gardens   │ 03 THREE FAMILY DWELLINGS │            915000 │         357.5 │
│ Queens       │ Rockaway Park │ 03 THREE FAMILY DWELLINGS │            929000 │         298.3 │
│ Queens       │ Holliswood    │ 02 TWO FAMILY DWELLINGS   │            879400 │         265.0 │
└──────────────┴───────────────┴───────────────────────────┴───────────────────┴───────────────┘

In plain English: among homes under $1M in this result, Kew Gardens is growing the fastest, followed by Rockaway Park and Holliswood.

Show the SQL
Haybarn shell
-- Find neighborhoods that are still relatively cheap but growing fast.
-- In other words: candidates for the next "Williamsburg" style growth story.
WITH recent_growth AS (
  SELECT
    borough_name,
    neighborhood,
    type_of_home,
    year,
    median_sale_price,
    LAG(median_sale_price, 5) OVER (
      PARTITION BY borough_name, neighborhood, type_of_home
      ORDER BY year
    ) AS price_5y_ago
  FROM nyc_dof_sales.main.neighborhood_sales
)

SELECT
  borough_name,
  neighborhood,
  type_of_home,
  median_sale_price,
  ROUND(
    100.0 * (median_sale_price - price_5y_ago)
    / NULLIF(price_5y_ago, 0),
    1
  ) AS growth_5y_pct
FROM recent_growth
WHERE price_5y_ago IS NOT NULL
  AND median_sale_price < 1000000
ORDER BY growth_5y_pct DESC
LIMIT 3;

Join it with another VGI worker

The important part is that workers compose. You can attach another worker in the same session, whether you wrote it or someone else published it, and query across both catalogs with SQL.

Haybarn shell
-- Attach another hosted VGI worker in the same Haybarn session.
ATTACH 'volcanos' AS volcanos (
  TYPE vgi,
  LOCATION 'https://vgi-volcanos.fly.dev/'
);

This example joins NYC housing growth with global volcanic eruption counts by year. It is intentionally exploratory: the value is that both external services are now queryable together without building a bespoke data platform first.

result
┌───────┬───────────────┬──────────────────────────┬───────────────────────────┬───────────────────┬────────────────┬───────────────┐
│ year  │ borough_name  │       neighborhood       │       type_of_home        │ median_sale_price │ yoy_growth_pct │ volcano_count │
│ int64 │    varchar    │         varchar          │          varchar          │       int64       │     double     │     int64     │
├───────┼───────────────┼──────────────────────────┼───────────────────────────┼───────────────────┼────────────────┼───────────────┤
│  2023 │ Manhattan     │ Kips Bay                 │ 01 ONE FAMILY DWELLINGS   │           6850000 │          930.1 │            14 │
│  2023 │ Manhattan     │ Tribeca                  │ 01 ONE FAMILY DWELLINGS   │          23500000 │          180.7 │            14 │
│  2023 │ Staten Island │ Rossville-Port Mobil     │ 01 ONE FAMILY DWELLINGS   │           1999900 │          166.7 │            14 │
└───────┴───────────────┴──────────────────────────┴───────────────────────────┴───────────────────┴────────────────┴───────────────┘

In plain English: for this sample, 2023 had the highest volcano count in the joined range, and Kips Bay had the largest year-over-year housing jump among the returned rows.

Show the SQL
Haybarn shell
-- Black Swan Years vs. NYC housing:
-- Did NYC home prices behave differently in years with unusually large volcanic activity?
WITH eruption_years AS (
  SELECT
    last_eruption_year AS year,
    COUNT(*) AS volcano_count
  FROM volcanos.smithsonian.holocene_volcanoes
  WHERE last_eruption_year BETWEEN 2010 AND 2024
  GROUP BY 1
),

nyc_growth AS (
  SELECT
    borough_name,
    neighborhood,
    type_of_home,
    year,
    median_sale_price,
    LAG(median_sale_price) OVER (
      PARTITION BY borough_name, neighborhood, type_of_home
      ORDER BY year
    ) AS prev_price
  FROM nyc_dof_sales.main.neighborhood_sales
  WHERE median_sale_price > 0
)

SELECT
  n.year,
  n.borough_name,
  n.neighborhood,
  n.type_of_home,
  n.median_sale_price,
  ROUND(100.0 * (n.median_sale_price - n.prev_price) / NULLIF(n.prev_price, 0), 1) AS yoy_growth_pct,
  COALESCE(e.volcano_count, 0) AS volcano_count
FROM nyc_growth n
LEFT JOIN eruption_years e USING (year)
WHERE n.prev_price IS NOT NULL
ORDER BY volcano_count DESC, yoy_growth_pct DESC
LIMIT 3;

What is unique here?

VGI is not unique because it can fetch a public dataset or deploy code to Fly.io. The distinctive part is how quickly those pieces become usable SQL: a worker can wrap an API, dataset, service, or model; Haybarn can discover its schemas and attach it as a catalog; and DuckDB users can query and join multiple remote workers without hand-building connector plumbing, schema registration, or a full data platform first.

Stuck or want a hand?

We help teams design and ship VGI-based extensions.

Talk to us