
Databricks Best Practices

Professional patterns, optimization techniques, and anti-patterns to avoid when working with Databricks at scale.

12 min read



Table of Contents

  1. Cluster Configuration & Management
  2. Delta Lake Optimization
  3. Performance Tuning
  4. Cost Optimization
  5. Code Organization
  6. Security & Governance
  7. Streaming Best Practices
  8. ML & MLflow

1. Cluster Configuration & Management

✅ DO: Use Job Clusters for Production Workloads

Why: Lower DBU costs, isolated environments, automatic termination.

Benefits:

  • ~50% lower DBU rate vs all-purpose clusters
  • Automatic start/stop with job
  • Reproducible configurations
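As a sketch, a job cluster is declared inline in the job definition (the "new_cluster" block of the Jobs API) rather than pointing at a long-running cluster. The runtime version and node type below are placeholders; use values available in your workspace:

```python
# Hypothetical job cluster spec for the Databricks Jobs API ("new_cluster" block).
# spark_version and node_type_id are placeholders; list valid values with the
# Clusters API in your own workspace.
job_cluster_spec = {
    "new_cluster": {
        "spark_version": "14.3.x-scala2.12",   # placeholder LTS runtime
        "node_type_id": "i3.xlarge",           # placeholder instance type
        "autoscale": {"min_workers": 2, "max_workers": 8},
        "spark_conf": {"spark.sql.adaptive.enabled": "true"},
    }
}
```

Because the cluster exists only for the job's duration, the configuration is versioned with the job itself and terminates automatically.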

✅ DO: Right-Size Your Clusters

Anti-Pattern: Provisioning a large fixed-size cluster "just in case" and leaving it at that size for every workload.

Best Practice: Start small, enable auto-scaling, and resize based on observed utilization.

How to Size:

  1. Start with 2-4 workers
  2. Monitor cluster metrics (CPU, memory, shuffle)
  3. Enable auto-scaling
  4. Adjust based on actual utilization
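The sizing loop above can be roughed out with simple arithmetic for a starting point. The 3x memory-to-data ratio here is an illustrative rule of thumb (headroom for shuffle and caching), not a Databricks-documented constant:

```python
import math

def estimate_workers(input_gb: float, worker_mem_gb: float = 64.0,
                     mem_to_data_ratio: float = 3.0) -> int:
    """Rough starting worker count: enough total cluster memory to hold
    ~3x the input size, with a floor of 2 workers."""
    needed_mem = input_gb * mem_to_data_ratio
    return max(2, math.ceil(needed_mem / worker_mem_gb))

print(estimate_workers(500))   # 500 GB input -> 24 workers at 64 GB each
```

Treat the result as the starting point for step 1, then let cluster metrics drive the adjustments in steps 2-4.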

✅ DO: Use Spot/Preemptible Instances

Savings: 60-90% on compute costs

When NOT to use spot:

  • Real-time streaming with strict SLAs
  • Short-running jobs (startup overhead)
  • Jobs that can't tolerate interruptions
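On AWS, for example, spot usage is controlled through the cluster's aws_attributes; first_on_demand keeps the driver (and the first N nodes) on on-demand capacity so a spot reclaim cannot take down the whole job. Field names follow the Databricks Clusters API; verify the equivalents for your cloud:

```python
# Spot configuration sketch (AWS field names from the Databricks Clusters API).
aws_attributes = {
    "availability": "SPOT_WITH_FALLBACK",  # fall back to on-demand if spot is unavailable
    "first_on_demand": 1,                  # keep the driver on an on-demand node
    "spot_bid_price_percent": 100,         # bid up to 100% of the on-demand price
}
```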

✅ DO: Configure Auto-Termination

Guidelines:

  • Interactive clusters: 30-60 minutes
  • Shared team clusters: 60-120 minutes
  • Job clusters: Always auto-terminate (default)

❌ DON'T: Share Clusters Across Environments

Anti-Pattern: Running dev, staging, and production workloads on one shared cluster, where a runaway dev job can starve production.

Best Practice: Give each environment its own clusters (enforced via cluster policies), with production work isolated on job clusters.


2. Delta Lake Optimization

✅ DO: Use OPTIMIZE and ZORDER

Why: Improves query performance by compacting small files and co-locating data.

When to OPTIMIZE:

  • After many small writes (streaming, upserts)
  • When you see thousands of small files
  • Query performance degrades

ZORDER Guidelines:

  • Choose 2-4 most-filtered columns
  • Put highest-cardinality columns first
  • Re-run when filter patterns change
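Applying the guidelines above to a hypothetical events table (table and column names are illustrative):

```sql
-- Compact small files and co-locate rows by the most-filtered columns.
OPTIMIZE events
ZORDER BY (user_id, event_date);
```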

✅ DO: Enable Auto-Optimize

Trade-offs:

  • ✅ Automatic file compaction
  • ✅ Better query performance
  • ❌ Slightly slower writes
  • ❌ Higher write costs

Recommendation: Enable for production tables
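Auto-optimize is enabled per table through Delta table properties (table name illustrative):

```sql
ALTER TABLE events SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',  -- write fewer, larger files
  'delta.autoOptimize.autoCompact'   = 'true'   -- compact small files after writes
);
```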

✅ DO: Vacuum Regularly (But Carefully)

Guidelines:

  • Default retention: 7 days (168 hours)
  • Extend if you use time travel frequently
  • Never reduce below your longest-running job
  • Coordinate with backup strategies
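A typical vacuum keeping the default 7-day window (table name illustrative):

```sql
-- Remove files no longer referenced by versions newer than 7 days.
VACUUM events RETAIN 168 HOURS;
```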

✅ DO: Partition Large Tables Thoughtfully

Anti-Pattern: Partitioning by a high-cardinality column (e.g., user_id), producing millions of tiny partitions and files.

Best Practice: Partition by a low-cardinality column that queries actually filter on, such as an event date.

Guidelines:

  • Aim for 1GB+ per partition
  • Partition only when typical filters prune >80% of the data
  • For <1TB tables, ZORDER often better than partitioning
  • Max 10,000 partitions per table
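Putting the guidelines together, a date-partitioned Delta table might look like this (names illustrative):

```sql
CREATE TABLE events (
  user_id    STRING,
  event_type STRING,
  event_time TIMESTAMP,
  event_date DATE
)
USING DELTA
PARTITIONED BY (event_date);  -- low cardinality, heavily filtered
```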

✅ DO: Use Constraints for Data Quality
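Delta supports NOT NULL and CHECK constraints that reject bad writes at the source (names illustrative):

```sql
ALTER TABLE events ALTER COLUMN user_id SET NOT NULL;

ALTER TABLE events ADD CONSTRAINT valid_event_time
  CHECK (event_time > '2020-01-01');
```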

✅ DO: Enable Change Data Feed for CDC

Use cases:

  • Incremental processing
  • Downstream system sync
  • Audit trails
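Change Data Feed is a table property; once enabled, row-level changes are queryable through table_changes (names illustrative):

```sql
ALTER TABLE events SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');

-- Read inserts/updates/deletes since table version 5.
SELECT * FROM table_changes('events', 5);
```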

3. Performance Tuning

✅ DO: Enable Adaptive Query Execution (AQE)

Benefits:

  • Automatically adjusts shuffle partitions
  • Handles data skew in joins
  • 2-10x faster for complex queries
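AQE is on by default in recent Databricks runtimes; these are the relevant flags, settable per session:

```sql
SET spark.sql.adaptive.enabled = true;
SET spark.sql.adaptive.coalescePartitions.enabled = true;  -- merge tiny shuffle partitions
SET spark.sql.adaptive.skewJoin.enabled = true;            -- split skewed join partitions
```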

✅ DO: Use Photon for SQL Workloads

Performance gains:

  • 3-8x faster for aggregations
  • 10x+ for complex SQL
  • Lower cost per query

When to use:

  • SQL-heavy workloads
  • BI and analytics
  • Large aggregations

✅ DO: Broadcast Small Tables

Auto-broadcast threshold: Spark automatically broadcasts tables smaller than spark.sql.autoBroadcastJoinThreshold (10 MB by default); raise it cautiously for larger dimension tables.
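Beyond the automatic threshold, a broadcast can be forced with a join hint (table and column names illustrative):

```sql
-- Force the small dimension table to be copied to every executor,
-- avoiding a shuffle of the large fact table.
SELECT /*+ BROADCAST(d) */ f.*, d.region
FROM fact_sales f
JOIN dim_store d ON f.store_id = d.store_id;
```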

✅ DO: Cache Strategically

Anti-Pattern: Calling cache() on every intermediate DataFrame, filling executor memory and forcing constant eviction.

Best Practice: Cache only DataFrames that are reused several times, and unpersist them when done.

Guidelines:

  • Cache only if used 3+ times
  • Monitor memory usage
  • Unpersist after use
  • Consider persist(StorageLevel.DISK_ONLY) for large data
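In SQL, caching is explicit and should be paired with an uncache once the reuse is over (names illustrative):

```sql
CACHE TABLE hot_events AS
SELECT * FROM events WHERE event_date >= current_date() - INTERVAL 7 DAYS;

-- ... several queries against hot_events ...

UNCACHE TABLE hot_events;
```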

✅ DO: Minimize Shuffles

Anti-Pattern: Chaining repeated wide operations (multiple groupBys, joins, and repartitions) so each one triggers a full shuffle.

Best Practice: Combine aggregations into a single pass, filter early to shrink shuffle volume, and broadcast small join inputs.
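One concrete shuffle-reduction pattern: compute related aggregates in a single GROUP BY pass instead of scanning and shuffling once per metric (names illustrative):

```sql
-- One scan and one shuffle for all three metrics,
-- instead of three separate GROUP BY queries.
SELECT
  user_id,
  COUNT(*)                   AS total_events,
  COUNT(DISTINCT event_type) AS distinct_event_types,
  MAX(event_time)            AS last_seen
FROM events
GROUP BY user_id;
```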

✅ DO: Manage Shuffle Partitions
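The default of 200 shuffle partitions is rarely right at scale. With AQE's partition coalescing enabled, this setting mostly acts as an upper bound; without AQE, size partitions toward roughly 128-200 MB each:

```sql
-- Static setting when AQE is off; with AQE on, treat it as an initial/upper bound.
SET spark.sql.shuffle.partitions = 400;
```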


4. Cost Optimization

✅ DO: Monitor and Set Budget Alerts
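On Unity Catalog workspaces, consumption is queryable from the system.billing.usage system table, which is a convenient basis for alerting; column names below follow that schema:

```sql
-- Daily DBU consumption by SKU over the last 30 days.
SELECT usage_date, sku_name, SUM(usage_quantity) AS dbus
FROM system.billing.usage
WHERE usage_date >= current_date() - INTERVAL 30 DAYS
GROUP BY usage_date, sku_name
ORDER BY usage_date;
```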

✅ DO: Use Cluster Policies

Benefits:

  • Prevent expensive cluster configurations
  • Enforce spot instances
  • Ensure auto-termination
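Policies are JSON documents of per-attribute rules. A sketch enforcing auto-termination, capping size, and requiring spot instances (values illustrative; rule types follow the cluster policy definition language):

```python
# Hypothetical cluster policy: cap cluster size, force auto-termination,
# and require spot instances with on-demand fallback.
policy = {
    "autotermination_minutes": {"type": "range", "maxValue": 60, "defaultValue": 30},
    "num_workers": {"type": "range", "maxValue": 10},
    "aws_attributes.availability": {"type": "fixed", "value": "SPOT_WITH_FALLBACK"},
}
```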

✅ DO: Choose the Right Workload Type

Workload              Recommended                     DBU Rate
Ad-hoc queries        SQL Warehouse (Serverless)      Medium
Scheduled ETL         Job Cluster                     Low
Interactive dev       All-Purpose (auto-terminate)    High
Real-time streaming   Dedicated cluster               Medium
ML training           Job Cluster with ML Runtime     Low-Medium

✅ DO: Optimize Storage Costs

Vacuum stale Delta file versions, compact small files with OPTIMIZE, and move cold data to cheaper storage tiers with cloud lifecycle policies.

❌ DON'T: Leave Clusters Running

Anti-Pattern: Interactive clusters left running overnight and through weekends, billing DBUs while idle.

Best Practice: Set aggressive auto-termination on interactive clusters and move scheduled work to job clusters.


5. Code Organization

✅ DO: Use Modular Notebooks

Anti-Pattern: One monolithic notebook that ingests, transforms, validates, and writes in a single thousand-line scroll.

Best Practice: Split logic into small, single-purpose notebooks or Python modules (ingest, transform, validate) that a pipeline orchestrates.

Import pattern:
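With Repos, shared code can live in plain .py files and be imported like any package; %run remains the older notebook-inclusion alternative. Paths and module names below are illustrative, and this runs only inside a Databricks workspace:

```python
# In a repo laid out as: my_repo/utils/transforms.py
# (Databricks adds the repo root to sys.path automatically.)
from utils.transforms import clean_events  # hypothetical helper module

# Older style: inline another notebook's definitions
# %run ./setup_notebook
```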

✅ DO: Use Git Integration (Repos)

Benefits:

  • Version control for notebooks
  • Code review process
  • CI/CD integration
  • Environment promotion (dev → staging → prod)

✅ DO: Parameterize Notebooks

Run programmatically:
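A sketch of both halves using the dbutils widgets and notebook APIs (available only inside a Databricks notebook; paths and parameter names illustrative):

```python
# Inside the notebook: declare and read parameters.
dbutils.widgets.text("run_date", "2024-01-01")
dbutils.widgets.text("env", "dev")

run_date = dbutils.widgets.get("run_date")
env = dbutils.widgets.get("env")

# From a caller: run the notebook with explicit arguments (600 s timeout).
result = dbutils.notebook.run("/pipelines/daily_load", 600,
                              {"run_date": "2024-06-01", "env": "prod"})
```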

✅ DO: Implement Data Quality Checks
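A framework-agnostic sketch of the fail-fast idea: check row-level expectations and quarantine failures before writing downstream. In Spark you would express the same checks as filters or Delta constraints; this is plain Python so the logic is easy to see:

```python
def check_quality(rows, required=("user_id", "event_time")):
    """Return (ok_rows, bad_rows): a row fails if any required field is missing."""
    ok, bad = [], []
    for row in rows:
        if all(row.get(col) is not None for col in required):
            ok.append(row)
        else:
            bad.append(row)
    if bad:
        print(f"Quarantined {len(bad)} bad rows")
    return ok, bad

ok, bad = check_quality([
    {"user_id": "u1", "event_time": "2024-06-01T00:00:00"},
    {"user_id": None, "event_time": "2024-06-01T00:01:00"},
])
```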


6. Security & Governance

✅ DO: Use Unity Catalog

Unity Catalog centralizes access control, lineage, and auditing across workspaces behind a single metastore, replacing per-workspace permission sprawl.
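Unity Catalog permissions are standard SQL grants on the three-level catalog.schema.table namespace (names illustrative):

```sql
GRANT USE CATALOG ON CATALOG prod TO `data_analysts`;
GRANT USE SCHEMA ON SCHEMA prod.sales TO `data_analysts`;
GRANT SELECT ON TABLE prod.sales.orders TO `data_analysts`;

-- Least privilege: writers get MODIFY only where they need it.
GRANT MODIFY ON TABLE prod.sales.orders TO `etl_service`;
```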

✅ DO: Use Secrets for Credentials

Never hard-code tokens or passwords in notebooks; store them in a Databricks secret scope and read them at runtime with dbutils.secrets.get(scope, key).

✅ DO: Implement Audit Logging

Enable audit log delivery (or, on Unity Catalog workspaces, query the system.access.audit system table) to track who accessed which data, and when.


7. Streaming Best Practices

✅ DO: Use Structured Streaming with Delta
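A minimal Delta-to-Delta stream with the pieces that matter in production: an explicit checkpoint location and an availableNow trigger for batch-style incremental runs (table names and paths illustrative; runs only on a Spark/Databricks cluster):

```python
(spark.readStream
    .table("raw_events")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/clean_events")  # required for exactly-once
    .trigger(availableNow=True)   # process the backlog, then stop
    .toTable("clean_events"))
```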

✅ DO: Set Appropriate Triggers

Guidelines:

  • Use availableNow for cost-effective near-real-time
  • Use processingTime for consistent latency
  • Monitor lag and adjust trigger interval

✅ DO: Handle Late Data with Watermarks
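Watermarks bound how long aggregation state is kept for late arrivals; in this sketch, events more than 10 minutes late are dropped from the windowed count (names illustrative; runs only on a Spark/Databricks cluster):

```python
from pyspark.sql import functions as F

windowed = (spark.readStream.table("clean_events")
    .withWatermark("event_time", "10 minutes")          # tolerate 10 min of lateness
    .groupBy(F.window("event_time", "5 minutes"), "event_type")
    .count())
```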


8. ML & MLflow

✅ DO: Use MLflow for Experiment Tracking
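The core tracking loop is a few lines of the mlflow API; the run name, parameter, and metric values below are illustrative:

```python
import mlflow

with mlflow.start_run(run_name="churn_baseline"):
    mlflow.log_param("max_depth", 6)          # hyperparameters
    mlflow.log_metric("auc", 0.87)            # evaluation results (illustrative value)
    # mlflow.sklearn.log_model(model, "model")  # log the fitted model artifact
```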

✅ DO: Use Feature Store

A feature store gives training and serving the same feature definitions, preventing online/offline skew and making features discoverable across teams.


Summary of Key Best Practices

Performance

  • ✅ Enable AQE and Photon
  • ✅ Use OPTIMIZE and ZORDER
  • ✅ Broadcast small tables
  • ✅ Minimize shuffles

Cost

  • ✅ Use job clusters and spot instances
  • ✅ Enable auto-termination
  • ✅ Right-size clusters
  • ✅ Monitor with budget alerts

Quality

  • ✅ Implement data quality checks
  • ✅ Use Delta constraints
  • ✅ Enable Change Data Feed
  • ✅ Parameterize notebooks

Security

  • ✅ Use Unity Catalog
  • ✅ Store credentials in Secrets
  • ✅ Implement audit logging
  • ✅ Apply least-privilege access

Next Steps:

Need help optimizing your Databricks platform? Contact me for consulting services.
