
Learn how to optimize DataChonk-generated models for faster queries, lower costs, and better scalability.
DataChonk generates models that push predicates down to source tables:
```sql
-- Generated staging model with early filtering
SELECT
    order_id,
    customer_id,
    order_date,
    amount
FROM {{ source('shopify', 'orders') }}
WHERE
    _fivetran_deleted = false       -- Filter soft deletes early
    AND order_date >= '2023-01-01'  -- Date range filter
    AND status != 'cancelled'       -- Business logic filter
```

Never use `SELECT *`. DataChonk explicitly lists columns:
**Avoid:**

```sql
SELECT * FROM orders
```

**DataChonk generates:**

```sql
SELECT order_id, customer_id, amount FROM orders
```
Pre-aggregate in fact tables, not in BI tools:
```sql
-- fct_daily_revenue (pre-aggregated)
SELECT
    date_trunc('day', order_date) AS revenue_date,
    customer_segment,
    region,
    COUNT(DISTINCT order_id) AS order_count,
    SUM(amount) AS total_revenue,
    AVG(amount) AS avg_order_value
FROM {{ ref('fct_orders') }}
GROUP BY 1, 2, 3
```

**Snowflake**

| Technique | When to Use | Config |
|---|---|---|
| Clustering | Large tables with common filter columns | cluster_by: ['col1', 'col2'] |
| Search Optimization | Point lookups on high-cardinality columns | search_optimization: true |
| Transient Tables | Intermediate models, dev/test | transient: true |
```sql
{{
  config(
    materialized='table',
    cluster_by=['customer_id', 'order_date'],
    transient=false,
    post_hook="ALTER TABLE {{ this }} ADD SEARCH OPTIMIZATION ON EQUALITY(order_id)"
  )
}}
```

**BigQuery**

| Technique | When to Use | Config |
|---|---|---|
| Partitioning | Time-series data, date filters | partition_by: date field |
| Clustering | Common filter/join columns | cluster_by: ['col1', 'col2'] |
| Require Partition Filter | Prevent full table scans | require_partition_filter: true |
```sql
{{
  config(
    materialized='table',
    partition_by={
      'field': 'event_date',
      'data_type': 'date',
      'granularity': 'day'
    },
    cluster_by=['user_id', 'event_type'],
    require_partition_filter=true
  )
}}
```

**Redshift**

| Technique | When to Use | Config |
|---|---|---|
| Distribution Style | Optimize joins | dist: 'key_column' |
| Sort Keys | Range filters, ORDER BY | sort: ['col1', 'col2'] |
| Late Binding Views | Schema flexibility | bind: false |
```sql
{{
  config(
    materialized='table',
    dist='customer_id',
    sort=['order_date', 'order_id'],
    sort_type='compound'
  )
}}
```

**Databricks**

| Technique | When to Use | Config |
|---|---|---|
| Delta Lake | All tables (default) | file_format: 'delta' |
| Liquid Clustering | Replace static partitioning | liquid_clustered_by: ['col'] |
| Z-Ordering | Multi-column filters | zorder: ['col1', 'col2'] |
```sql
{{
  config(
    materialized='table',
    file_format='delta',
    liquid_clustered_by=['customer_id', 'order_date'],
    post_hook="OPTIMIZE {{ this }} ZORDER BY (product_category)"
  )
}}
```

Pick an incremental strategy that matches your data pattern:

| Data Pattern | Recommended Strategy |
|---|---|
| Append-only events | append |
| Updates to existing rows | merge |
| Daily aggregations | delete+insert by date |
| Partitioned BigQuery | insert_overwrite |
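
For example, the `merge` strategy might be configured like this (a sketch; `order_id` and `updated_at` are illustrative column names):

```sql
{{
  config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='order_id'  -- merge key: existing rows update instead of duplicating
  )
}}

SELECT order_id, customer_id, amount, updated_at
FROM {{ source('shopify', 'orders') }}

{% if is_incremental() %}
-- Only process rows newer than what the target already holds
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
```

Because `merge` deduplicates on `unique_key`, rows re-selected by a lookback window update in place rather than producing duplicates.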
Handle late-arriving data with lookback:
```sql
{% if is_incremental() %}
WHERE event_timestamp >= (
    SELECT DATEADD('day', -3, MAX(event_timestamp))
    FROM {{ this }}
)
{% endif %}
```

Even incremental models benefit from occasional full refreshes:
```yaml
# dbt_project.yml
models:
  your_project:
    marts:
      +full_refresh: "{{ var('monthly_refresh', false) }}"

# Run monthly: dbt run --vars '{"monthly_refresh": true}'
```

Size warehouses per environment, and override per model where needed:

```yaml
# datachonk.config.yml
environments:
  dev:
    snowflake:
      warehouse: COMPUTE_XS
  prod:
    snowflake:
      warehouse: COMPUTE_L

# Models can override
models:
  fct_events:
    snowflake_warehouse: COMPUTE_XL  # Large model needs bigger warehouse
```

Add profiling to understand performance:
```sql
-- Check query history in Snowflake
SELECT
    query_id,
    query_text,
    execution_time / 1000 AS seconds,
    bytes_scanned / 1e9 AS gb_scanned,
    rows_produced
FROM snowflake.account_usage.query_history
WHERE query_text ILIKE '%fct_orders%'
ORDER BY start_time DESC
LIMIT 10;
```
Analyze model timing from `dbt run`:
```shell
# Generate timing report
dbt run --profiles-dir . 2>&1 | grep "completed"

# Or use dbt artifacts
cat target/run_results.json | jq '.results[] |
  {model: .unique_id, seconds: .execution_time} |
  select(.seconds > 60)'
```

DataChonk applies these optimizations automatically, based on your target warehouse, data volumes, and query patterns.
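
The same slow-model filter can be scripted for CI. A minimal Python sketch, assuming dbt's standard `run_results.json` layout (`results[].unique_id`, `results[].execution_time`):

```python
import json
from pathlib import Path

def slow_models(run_results: dict, threshold_s: float = 60.0):
    """Return (unique_id, execution_time) pairs exceeding the threshold."""
    return [
        (r["unique_id"], r["execution_time"])
        for r in run_results.get("results", [])
        if r.get("execution_time", 0) > threshold_s
    ]

# Inline payload shaped like target/run_results.json for illustration;
# in practice: run_results = json.loads(Path("target/run_results.json").read_text())
sample = {
    "results": [
        {"unique_id": "model.proj.fct_orders", "execution_time": 95.2},
        {"unique_id": "model.proj.stg_orders", "execution_time": 4.1},
    ]
}
print(slow_models(sample))  # only fct_orders exceeds 60s
```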
Tell the Brain about performance requirements:
Prompt:
"Create fct_page_views optimized for queries that filter by user_id and date range. The table has 500M+ rows and grows by 10M daily."
**Pro tip:** Run `datachonk analyze` to get performance recommendations for your existing models based on actual query patterns and data volumes.