Stochastic DuckDB Extension

The stochastic developed by Query.Farm extension adds comprehensive statistical distribution functions to DuckDB, enabling advanced statistical analysis, probability calculations, and random sampling directly within SQL queries.

Installation

stochastic is a DuckDB Community Extension.

You can install and use it in DuckDB SQL:

INSTALL stochastic FROM community;
LOAD stochastic;

What are statistical distributions?

Statistical distributions are mathematical functions that describe the probability of different outcomes in a dataset. They are fundamental to statistics, data science, machine learning, and scientific computing. This extension provides functions to:

  • Calculate probability density and mass functions (PDF/PMF)
  • Compute cumulative distribution functions (CDF)
  • Generate quantiles (inverse CDF)
  • Sample random values from distributions
  • Access distribution properties (mean, variance, etc.)

Available Distributions

The extension supports a comprehensive set of probability distributions:

Continuous Distributions

Discrete Distributions

Continuous Distributions

Discrete Distributions

Each distribution provides the following function types:

Sampling Functions

  • dist_{distribution}_sample(params...) - Generate random samples

Density/Mass Functions

  • dist_{distribution}_pdf(params..., x) - Probability density function
  • dist_{distribution}_log_pdf(params..., x) - Log probability density function

Cumulative Functions

  • dist_{distribution}_cdf(params..., x) - Cumulative distribution function
  • dist_{distribution}_log_cdf(params..., x) - Log cumulative distribution function
  • dist_{distribution}_cdf_complement(params..., x) - Survival function (1 - CDF)
  • dist_{distribution}_log_cdf_complement(params..., x) - Log survival function

Quantile Functions

  • dist_{distribution}_quantile(params..., p) - Quantile function (inverse CDF)
  • dist_{distribution}_quantile_complement(params..., p) - Complementary quantile function

Hazard Functions

  • dist_{distribution}_hazard(params..., x) - Hazard function
  • dist_{distribution}_chf(params..., x) - Cumulative hazard function

Distribution Properties

  • dist_{distribution}_kurtosis_excess(params...) - Excess kurtosis
  • dist_{distribution}_kurtosis(params...) - Kurtosis
  • dist_{distribution}_mean(params...) - Expected value
  • dist_{distribution}_median(params...) - Median (50th percentile)
  • dist_{distribution}_mode(params...) - Mode (most likely value)
  • dist_{distribution}_range(params...) - Support range
  • dist_{distribution}_skewness(params...) - Skewness
  • dist_{distribution}_stddev(params...) - Standard deviation
  • dist_{distribution}_support(params...) - Distribution support
  • dist_{distribution}_variance(params...) - Variance

Usage Examples

Normal Distribution

-- Generate random samples from N(0, 1)
SELECT dist_normal_sample(0.0, 1.0) AS random_value;

-- Calculate PDF at x = 0.5 for N(0, 1)
SELECT dist_normal_pdf(0.0, 1.0, 0.5) AS density;

-- Calculate CDF (probability that X ≤ 1.96)
SELECT dist_normal_cdf(0.0, 1.0, 1.96) AS probability;

-- Find 95th percentile
SELECT dist_normal_quantile(0.0, 1.0, 0.95) AS percentile_95;

-- Get distribution properties
SELECT
    dist_normal_mean(0.0, 1.0) AS mean,
    dist_normal_variance(0.0, 1.0) AS variance,
    dist_normal_skewness(0.0, 1.0) AS skewness;

Binomial Distribution

-- Probability mass function for 10 trials, p=0.3
SELECT dist_binomial_pdf(10, 0.3, 7) AS prob_exactly_7;

-- Cumulative probability (≤ 5 successes)
SELECT dist_binomial_cdf(10, 0.3, 5) AS prob_at_most_5;

-- Generate random binomial samples
SELECT dist_binomial_sample(10, 0.3) AS random_successes;

Working with Data Tables

-- Generate synthetic dataset
CREATE TABLE synthetic_data AS
SELECT
    i,
    dist_normal_sample(100, 15) AS height_cm,
    dist_normal_sample(70, 10) AS weight_kg,
    dist_binomial_sample(1, 0.5) AS gender  -- 0 or 1
FROM range(1000) t(i);

-- Calculate z-scores
SELECT
    height_cm,
    (height_cm - dist_normal_mean(100, 15)) / dist_normal_stddev(100, 15) AS height_zscore
FROM synthetic_data;

-- Probability calculations
SELECT
    weight_kg,
    dist_normal_cdf(70, 10, weight_kg) AS percentile
FROM synthetic_data
LIMIT 10;

Real-World Applications

A/B Testing and Statistical Significance

Common Task: Determine if there’s a statistically significant difference between conversion rates.

Relevant Functions: dist_normal_cdf, dist_normal_cdf_complement, dist_normal_pdf

Financial Risk Assessment and VaR Calculation

Common Task: Calculate Value at Risk (VaR) for portfolio management.

Relevant Functions: dist_normal_sample, dist_normal_quantile, dist_normal_cdf

Quality Control and Process Monitoring

Common Task: Monitor manufacturing processes and detect out-of-control conditions.

Relevant Functions: dist_normal_sample, dist_normal_cdf, dist_normal_pdf

Predictive Analytics and Confidence Intervals

Common Task: Build prediction intervals for forecasting models.

Relevant Functions: dist_normal_quantile, dist_normal_cdf, dist_normal_sample

Customer Analytics and CLV Modeling

Common Task: Model customer lifetime value with uncertainty quantification.

Relevant Functions: dist_normal_sample, dist_exponential_sample, dist_normal_quantile, dist_normal_cdf

Anomaly Detection and Outlier Analysis

Common Task: Detect anomalies in time series data using statistical methods.

Relevant Functions: dist_normal_pdf, dist_normal_cdf, dist_normal_cdf_complement

Monte Carlo Simulations

Common Task: Run Monte Carlo simulations for risk analysis, optimization, or modeling.

Relevant Functions: dist_normal_sample, dist_uniform_real_sample, dist_gamma_sample, dist_beta_sample

Hypothesis Testing

Common Task: Perform statistical hypothesis tests (t-tests, chi-square tests, etc.).

Relevant Functions: dist_students_t_cdf, dist_chi_squared_cdf, dist_normal_cdf, dist_fisher_f_cdf

Bayesian Analysis

Common Task: Implement Bayesian statistical models and posterior analysis.

Relevant Functions: dist_beta_pdf, dist_gamma_pdf, dist_normal_pdf, dist_beta_sample

Survival Analysis

Common Task: Analyze time-to-event data in medical research or reliability engineering.

Relevant Functions: dist_exponential_pdf, dist_weibull_pdf, dist_gamma_pdf, dist_exponential_cdf

Why Use DuckDB + Stochastic vs Python/R?

Advantages

  • No Data Movement: Analysis happens where your data lives
  • SQL Familiarity: Use existing SQL skills instead of learning specialized libraries
  • Performance: Columnar processing with vectorized statistical operations
  • Integration: Works seamlessly with existing BI tools and SQL workflows
  • Real-time: Analyze streaming data without export/import cycles

📊 Performance Benefits

Statistical operations are vectorized and optimized for DuckDB’s columnar engine.

Parameter Validation

All distribution functions include comprehensive parameter validation:

-- This will throw an error: standard deviation must be > 0
SELECT dist_normal_pdf(0.0, -1.0, 0.5);
-- Error: normal: Standard deviation must be > 0 was: -1.000000

-- This will throw an error: probability must be between 0 and 1
SELECT dist_binomial_pdf(10, 1.5, 5);
-- Error: binomial: Probability must be between 0 and 1 was: 1.500000

Love ❤️ this DuckDB extension? You’ll Love This.

Get the best from Query.Farm — smart tips, powerful tools, and project updates sent directly to your inbox, but only when we’ve got something great to share.