
Learn how to optimize DataChonk-generated models for faster queries, lower costs, and better scalability.
DataChonk generates models that push predicates down to source tables:
```sql
-- Generated staging model with early filtering
SELECT
    order_id,
    customer_id,
    order_date,
    amount
FROM {{ source('shopify', 'orders') }}
WHERE
    _fivetran_deleted = false       -- Filter soft deletes early
    AND order_date >= '2023-01-01'  -- Date range filter
    AND status != 'cancelled'       -- Business logic filter
```

Never use `SELECT *`. DataChonk explicitly lists columns:
**Avoid:**

```sql
SELECT * FROM orders
```

**DataChonk generates:**

```sql
SELECT order_id, customer_id, amount FROM orders
```
Pre-aggregate in fact tables, not in BI tools:
```sql
-- fct_daily_revenue (pre-aggregated)
SELECT
    date_trunc('day', order_date) AS revenue_date,
    customer_segment,
    region,
    COUNT(DISTINCT order_id) AS order_count,
    SUM(amount) AS total_revenue,
    AVG(amount) AS avg_order_value
FROM {{ ref('fct_orders') }}
GROUP BY 1, 2, 3
```

**Snowflake**

| Technique | When to Use | Config |
|---|---|---|
| Clustering | Large tables with common filter columns | cluster_by: ['col1', 'col2'] |
| Search Optimization | Point lookups on high-cardinality columns | search_optimization: true |
| Transient Tables | Intermediate models, dev/test | transient: true |
```sql
{{
  config(
    materialized='table',
    cluster_by=['customer_id', 'order_date'],
    transient=false,
    post_hook="ALTER TABLE {{ this }} ADD SEARCH OPTIMIZATION ON EQUALITY(order_id)"
  )
}}
```

**BigQuery**

| Technique | When to Use | Config |
|---|---|---|
| Partitioning | Time-series data, date filters | partition_by: date field |
| Clustering | Common filter/join columns | cluster_by: ['col1', 'col2'] |
| Require Partition Filter | Prevent full table scans | require_partition_filter: true |
```sql
{{
  config(
    materialized='table',
    partition_by={
      'field': 'event_date',
      'data_type': 'date',
      'granularity': 'day'
    },
    cluster_by=['user_id', 'event_type'],
    require_partition_filter=true
  )
}}
```

**Redshift**

| Technique | When to Use | Config |
|---|---|---|
| Distribution Style | Optimize joins | dist: 'key_column' |
| Sort Keys | Range filters, ORDER BY | sort: ['col1', 'col2'] |
| Late Binding Views | Schema flexibility | bind: false |
```sql
{{
  config(
    materialized='table',
    dist='customer_id',
    sort=['order_date', 'order_id'],
    sort_type='compound'
  )
}}
```

**Databricks**

| Technique | When to Use | Config |
|---|---|---|
| Delta Lake | All tables (default) | file_format: 'delta' |
| Liquid Clustering | Replace static partitioning | liquid_clustered_by: ['col'] |
| Z-Ordering | Multi-column filters | zorder: ['col1', 'col2'] |
```sql
{{
  config(
    materialized='table',
    file_format='delta',
    liquid_clustered_by=['customer_id', 'order_date'],
    post_hook="OPTIMIZE {{ this }} ZORDER BY (product_category)"
  )
}}
```

Pick an incremental strategy that matches your data pattern:

| Data Pattern | Recommended Strategy |
|---|---|
| Append-only events | append |
| Updates to existing rows | merge |
| Daily aggregations | delete+insert by date |
| Partitioned BigQuery | insert_overwrite |
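
For example, the `merge` strategy might be configured like this (a sketch; `order_id` and `updated_at` are illustrative column names):

```sql
{{
  config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='order_id'  -- merge key: existing rows update instead of duplicating
  )
}}

SELECT order_id, customer_id, amount, updated_at
FROM {{ source('shopify', 'orders') }}

{% if is_incremental() %}
-- Only process rows newer than what the target already holds
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
```

Because `merge` deduplicates on `unique_key`, rows re-selected by a lookback window update in place rather than producing duplicates.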
Handle late-arriving data with lookback:
```sql
{% if is_incremental() %}
WHERE event_timestamp >= (
    SELECT DATEADD('day', -3, MAX(event_timestamp))
    FROM {{ this }}
)
{% endif %}
```

Even incremental models benefit from occasional full refreshes:
```yaml
# dbt_project.yml
models:
  your_project:
    marts:
      +full_refresh: "{{ var('monthly_refresh', false) }}"

# Run monthly: dbt run --vars '{"monthly_refresh": true}'
```

Size warehouses per environment, and override per model where needed:

```yaml
# datachonk.config.yml
environments:
  dev:
    snowflake:
      warehouse: COMPUTE_XS
  prod:
    snowflake:
      warehouse: COMPUTE_L

# Models can override
models:
  fct_events:
    snowflake_warehouse: COMPUTE_XL  # Large model needs bigger warehouse
```

Add profiling to understand performance:
```sql
-- Check query history in Snowflake
SELECT
    query_id,
    query_text,
    execution_time / 1000 AS seconds,
    bytes_scanned / 1e9 AS gb_scanned,
    rows_produced
FROM snowflake.account_usage.query_history
WHERE query_text ILIKE '%fct_orders%'
ORDER BY start_time DESC
LIMIT 10;
```
Analyze model timing from `dbt run`:
```shell
# Generate timing report
dbt run --profiles-dir . 2>&1 | grep "completed"

# Or use dbt artifacts
cat target/run_results.json | jq '.results[] |
  {model: .unique_id, seconds: .execution_time} |
  select(.seconds > 60)'
```

DataChonk applies these optimizations automatically, based on your target warehouse, data volumes, and query patterns.
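
The same slow-model filter can be scripted for CI. A minimal Python sketch, assuming dbt's standard `run_results.json` layout (`results[].unique_id`, `results[].execution_time`):

```python
import json
from pathlib import Path

def slow_models(run_results: dict, threshold_s: float = 60.0):
    """Return (unique_id, execution_time) pairs exceeding the threshold."""
    return [
        (r["unique_id"], r["execution_time"])
        for r in run_results.get("results", [])
        if r.get("execution_time", 0) > threshold_s
    ]

# Inline payload shaped like target/run_results.json for illustration;
# in practice: run_results = json.loads(Path("target/run_results.json").read_text())
sample = {
    "results": [
        {"unique_id": "model.proj.fct_orders", "execution_time": 95.2},
        {"unique_id": "model.proj.stg_orders", "execution_time": 4.1},
    ]
}
print(slow_models(sample))  # only fct_orders exceeds 60s
```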
Tell the Brain about performance requirements:
Prompt:
"Create fct_page_views optimized for queries that filter by user_id and date range. The table has 500M+ rows and grows by 10M daily."
**Pro tip:** Run `datachonk analyze` to get performance recommendations for your existing models based on actual query patterns and data volumes.