DuckDB Extension Best Practices: A Developer Guide

Learn how to build, optimize, and deploy DuckDB extensions with this comprehensive guide covering architecture, performance, and testing.

Query.Farm Team
February 1, 2024
duckdb extensions tutorial best-practices

DuckDB Extension Best Practices

Building high-quality DuckDB extensions requires understanding both the DuckDB architecture and extension API. This guide shares best practices we’ve learned while developing over 20 extensions at Query.Farm.

Extension Architecture

1. Keep Extensions Focused

Each extension should solve one problem well. Don’t create monolithic extensions that try to do everything.

Good: json-transform extension focused on JSON operations Bad: data-utils extension that does JSON, CSV, and XML processing

2. Minimize Dependencies

External dependencies increase complexity and potential conflicts. Use DuckDB’s built-in capabilities whenever possible.

// Prefer DuckDB's string utilities
auto result = StringUtil::Format("Value: %s", value);

// Avoid external libraries for common operations
// #include <boost/algorithm/string.hpp> // Not ideal

3. Leverage DuckDB’s Type System

DuckDB’s rich type system handles complex data structures efficiently. Use LogicalType appropriately:

// Register function with proper types
CreateScalarFunctionInfo::Create(
    "my_function",
    {LogicalType::VARCHAR, LogicalType::INTEGER},
    LogicalType::BOOLEAN,
    my_function_impl
);

Performance Optimization

1. Use Vectorized Execution

DuckDB processes data in vectors (chunks of ~2000 rows). Write functions that operate on entire vectors:

// Good: Vectorized processing
void VectorFunction(DataChunk &args, ExpressionState &state, Vector &result) {
    auto &input = args.data[0];
    auto input_data = FlatVector::GetData<string_t>(input);
    auto result_data = FlatVector::GetData<int64_t>(result);

    for (idx_t i = 0; i < args.size(); i++) {
        result_data[i] = ProcessString(input_data[i]);
    }
}

// Bad: Row-by-row processing

2. Avoid Memory Allocations

Minimize allocations in hot paths. Reuse buffers when possible:

// Reuse vector instead of creating new ones
Vector intermediate(LogicalType::VARCHAR, args.size());
// Process data...

3. Profile Your Code

Use DuckDB’s profiling tools to identify bottlenecks:

PRAGMA enable_profiling='json';
PRAGMA profiling_output='/tmp/profile.json';

-- Your query here
SELECT my_extension_function(column) FROM table;

Testing Strategies

1. Unit Tests

Test individual functions thoroughly:

TEST_CASE("Test JSON extraction") {
    DuckDB db(nullptr);
    Connection con(db);

    auto result = con.Query("SELECT json_extract('{\"key\": 42}', 'key')");
    REQUIRE(result->success);
    REQUIRE(result->GetValue(0, 0).GetValue<int64_t>() == 42);
}

2. Edge Cases

Always test edge cases:

  • NULL values
  • Empty strings/arrays
  • Large datasets (>100k rows)
  • Special characters in strings
  • Type mismatches

3. Performance Regression Tests

Track performance over time:

def test_performance():
    # Baseline: Process 1M rows
    start = time.time()
    con.execute("SELECT transform_data(x) FROM range(1000000) t(x)")
    duration = time.time() - start

    # Should complete in under 1 second
    assert duration < 1.0

Error Handling

Provide Clear Error Messages

Users should understand what went wrong and how to fix it:

// Good
throw InvalidInputException(
    "JSON path '%s' not found in object. Available keys: %s",
    path, available_keys
);

// Bad
throw Exception("Invalid input");

Validate Input Early

Catch errors before expensive operations:

if (args.data[0].GetType() != LogicalType::VARCHAR) {
    throw InvalidTypeException(
        "Expected VARCHAR, got %s",
        args.data[0].GetType().ToString()
    );
}

Documentation

1. Function Descriptions

Document each function clearly:

-- Extract value from JSON path
-- Parameters:
--   json: JSON string to parse
--   path: JSONPath expression (e.g., '$.user.name')
-- Returns: Extracted value as VARCHAR or NULL if path not found
-- Example: SELECT json_extract('{"user": {"name": "Alice"}}', '$.user.name')
--          Returns: 'Alice'

2. Performance Characteristics

Explain performance implications:

Time Complexity: O(n) where n is JSON string length
Memory: O(m) where m is extracted value size
Best for: Small to medium JSON documents (<10MB)

3. Compatibility Notes

Document version requirements and limitations:

Requires: DuckDB >= 0.9.0
Thread-safe: Yes
NULL handling: Returns NULL on NULL input

Distribution

1. Build for Multiple Platforms

Support major platforms:

  • Linux (x86_64, ARM64)
  • macOS (Intel, Apple Silicon)
  • Windows (x86_64)

2. Version Your Extensions

Use semantic versioning:

  • MAJOR: Breaking changes
  • MINOR: New features
  • PATCH: Bug fixes

3. Provide Installation Instructions

Make it easy to install:

-- Install from Query.Farm repository
INSTALL my_extension FROM 'query.farm';
LOAD my_extension;

-- Or from GitHub
INSTALL my_extension FROM 'github://queryfarm/my-extension';

Security Considerations

1. Input Validation

Never trust user input:

// Validate string lengths
if (input.GetStringLength() > MAX_SAFE_LENGTH) {
    throw InvalidInputException("Input too large");
}

// Sanitize paths
if (path.contains("..")) {
    throw InvalidInputException("Path traversal not allowed");
}

2. Resource Limits

Prevent resource exhaustion:

const size_t MAX_ITERATIONS = 1000000;
size_t count = 0;

while (hasMore && count++ < MAX_ITERATIONS) {
    // Process...
}

if (count >= MAX_ITERATIONS) {
    throw Exception("Iteration limit exceeded");
}

Conclusion

Building great DuckDB extensions requires attention to:

  • Clean architecture and focused functionality
  • Vectorized, high-performance implementations
  • Comprehensive testing and error handling
  • Clear documentation and easy installation

Want to learn more? Check out our extension catalog or schedule a consultation to discuss your custom extension needs.

Happy building! 🚜