
Leverage Delta Lake, Unity Catalog, and Databricks SQL with DataChonk for lakehouse analytics.
```yaml
version: 1
warehouse: databricks
connections:
  prod:
    type: databricks
    host: dbc-abc123-xyz.cloud.databricks.com
    http_path: /sql/1.0/warehouses/abcd1234
    catalog: main  # Unity Catalog
    schema: analytics
    # Token read from the DATABRICKS_TOKEN environment variable
  dev:
    type: databricks
    host: dbc-abc123-xyz.cloud.databricks.com
    http_path: /sql/1.0/warehouses/dev5678
    catalog: dev
    schema: analytics_dev
```

All tables are Delta by default. Use liquid clustering for automatic optimization:
```sql
{{ config(
    materialized='table',
    file_format='delta',
    liquid_clustered_by=['customer_id', 'order_date'],
    tblproperties={
        'delta.autoOptimize.optimizeWrite': 'true',
        'delta.autoOptimize.autoCompact': 'true'
    }
) }}

select
    order_id,
    customer_id,
    order_date,
    total_amount
from {{ source('raw', 'orders') }}
```

Delta Lake's MERGE is highly optimized, and DataChonk generates efficient merge patterns for incremental models:
```sql
{{ config(
    materialized='incremental',
    unique_key='event_id',
    incremental_strategy='merge',
    merge_update_columns=['status', 'updated_at'],
    file_format='delta'
) }}

select
    event_id,
    user_id,
    status,
    updated_at
from {{ source('raw', 'events') }}

{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```

Use Delta Live Tables or streaming tables for real-time pipelines:
```sql
{{ config(
    materialized='streaming_table',
    file_format='delta'
) }}

-- Streaming read via Auto Loader
select
    event_id,
    event_timestamp,
    event_data
from stream({{ source('streaming', 'raw_events') }})
```

Use three-level namespacing (catalog.schema.table) with Unity Catalog:
```sql
-- Model pinned to an explicit catalog
{{ config(
    database='prod_catalog',  -- maps to the Unity Catalog name
    schema='analytics'
) }}

-- Or set the catalog once via the generate_database_name macro
-- (profiles.yml: database: prod_catalog)
-- Resulting table: prod_catalog.analytics.dim_customers

select * from {{ ref('stg_customers') }}
```

Liquid clustering automatically reorganizes data as it changes, which makes it a better fit than Z-ordering for evolving workloads.
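Clustering can also be applied to an existing Delta table directly in SQL, outside of DataChonk's model config. A minimal sketch, assuming an illustrative table `main.analytics.orders`:

```sql
-- Switch an existing Delta table to liquid clustering
ALTER TABLE main.analytics.orders
CLUSTER BY (customer_id, order_date);

-- Trigger a re-clustering pass over recently written data
OPTIMIZE main.analytics.orders;
```

With liquid clustering enabled, `OPTIMIZE` needs no `ZORDER BY` clause; it clusters incrementally on the declared keys.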
Photon typically provides a 2-8x speedup for SQL queries. Use a SQL Warehouse Pro or Serverless warehouse to get it.
Set delta.autoOptimize.optimizeWrite to produce fewer, larger files during writes, and delta.autoOptimize.autoCompact to compact small files after writes.
Many small files hurt performance. Enable auto-compaction or run OPTIMIZE regularly.
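For tables not managed through a DataChonk model config, the same properties can be set with plain SQL. A sketch, with `main.analytics.events` as an illustrative table name:

```sql
-- Enable optimized writes and auto-compaction on an existing table
ALTER TABLE main.analytics.events SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact'   = 'true'
);

-- One-off compaction of small files already accumulated
OPTIMIZE main.analytics.events;
```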
- **Cause:** Too many small files have accumulated, or OPTIMIZE is run too frequently.
  **Solution:** Enable auto-compaction instead of manual OPTIMIZE, and use liquid clustering.
- **Cause:** Missing grants at the catalog, schema, or table level.
  **Solution:** Ensure your service principal has USE CATALOG, USE SCHEMA, and CREATE TABLE grants.
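The required grants can be issued with standard Unity Catalog SQL. A sketch, assuming a hypothetical service principal named `sp-datachonk` and the `main.analytics` schema from the profile above:

```sql
-- Minimum grants for creating tables in main.analytics
GRANT USE CATALOG ON CATALOG main TO `sp-datachonk`;
GRANT USE SCHEMA ON SCHEMA main.analytics TO `sp-datachonk`;
GRANT CREATE TABLE ON SCHEMA main.analytics TO `sp-datachonk`;

-- Reading and writing existing tables additionally needs:
GRANT SELECT, MODIFY ON SCHEMA main.analytics TO `sp-datachonk`;
```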
- **Cause:** The warehouse auto-stop timeout is shorter than the dbt run duration.
  **Solution:** Increase the auto-stop timeout, or use a Serverless warehouse, which starts and scales automatically.