
Databricks Best Practices

Professional patterns, optimization techniques, and anti-patterns to avoid when working with Databricks at scale.

12 min read



Table of Contents

  1. Cluster Configuration & Management
  2. Delta Lake Optimization
  3. Performance Tuning
  4. Cost Optimization
  5. Code Organization
  6. Security & Governance
  7. Streaming Best Practices
  8. ML & MLflow

1. Cluster Configuration & Management

✅ DO: Use Job Clusters for Production Workloads

Why: Lower DBU costs, isolated environments, automatic termination.

Benefits:

  • ~50% lower DBU rate vs all-purpose clusters
  • Automatic start/stop with job
  • Reproducible configurations
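As a sketch, a job cluster is declared inline in the job definition (the "new_cluster" block of the Jobs API) rather than pointing at a long-running cluster. The runtime version and node type below are placeholders; use values available in your workspace:

```python
# Hypothetical job cluster spec for the Databricks Jobs API ("new_cluster" block).
# spark_version and node_type_id are placeholders; list valid values with the
# Clusters API in your own workspace.
job_cluster_spec = {
    "new_cluster": {
        "spark_version": "14.3.x-scala2.12",   # placeholder LTS runtime
        "node_type_id": "i3.xlarge",           # placeholder instance type
        "autoscale": {"min_workers": 2, "max_workers": 8},
        "spark_conf": {"spark.sql.adaptive.enabled": "true"},
    }
}
```

Because the cluster exists only for the job's duration, the configuration is versioned with the job itself and terminates automatically.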

✅ DO: Right-Size Your Clusters

Anti-Pattern: Provisioning a large fixed-size cluster "just in case" and leaving it at that size for every workload.

Best Practice: Start small, enable auto-scaling, and resize based on observed utilization.

How to Size:

  1. Start with 2-4 workers
  2. Monitor cluster metrics (CPU, memory, shuffle)
  3. Enable auto-scaling
  4. Adjust based on actual utilization
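The sizing loop above can be roughed out with simple arithmetic for a starting point. The 3x memory-to-data ratio here is an illustrative rule of thumb (headroom for shuffle and caching), not a Databricks-documented constant:

```python
import math

def estimate_workers(input_gb: float, worker_mem_gb: float = 64.0,
                     mem_to_data_ratio: float = 3.0) -> int:
    """Rough starting worker count: enough total cluster memory to hold
    ~3x the input size, with a floor of 2 workers."""
    needed_mem = input_gb * mem_to_data_ratio
    return max(2, math.ceil(needed_mem / worker_mem_gb))

print(estimate_workers(500))   # 500 GB input -> 24 workers at 64 GB each
```

Treat the result as the starting point for step 1, then let cluster metrics drive the adjustments in steps 2-4.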

✅ DO: Use Spot/Preemptible Instances

Savings: 60-90% on compute costs

When NOT to use spot:

  • Real-time streaming with strict SLAs
  • Short-running jobs (startup overhead)
  • Jobs that can't tolerate interruptions
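On AWS, for example, spot usage is controlled through the cluster's aws_attributes; first_on_demand keeps the driver (and the first N nodes) on on-demand capacity so a spot reclaim cannot take down the whole job. Field names follow the Databricks Clusters API; verify the equivalents for your cloud:

```python
# Spot configuration sketch (AWS field names from the Databricks Clusters API).
aws_attributes = {
    "availability": "SPOT_WITH_FALLBACK",  # fall back to on-demand if spot is unavailable
    "first_on_demand": 1,                  # keep the driver on an on-demand node
    "spot_bid_price_percent": 100,         # bid up to 100% of the on-demand price
}
```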

✅ DO: Configure Auto-Termination

Guidelines:

  • Interactive clusters: 30-60 minutes
  • Shared team clusters: 60-120 minutes
  • Job clusters: Always auto-terminate (default)

❌ DON'T: Share Clusters Across Environments

Anti-Pattern: Running dev, staging, and production workloads on one shared cluster, where a runaway dev job can starve production.

Best Practice: Give each environment its own clusters (enforced via cluster policies), with production work isolated on job clusters.


2. Delta Lake Optimization

✅ DO: Use OPTIMIZE and ZORDER

Why: Improves query performance by compacting small files and co-locating data.

When to OPTIMIZE:

  • After many small writes (streaming, upserts)
  • When you see thousands of small files
  • Query performance degrades

ZORDER Guidelines:

  • Choose 2-4 most-filtered columns
  • Put highest-cardinality columns first
  • Re-run when filter patterns change
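Applying the guidelines above to a hypothetical events table (table and column names are illustrative):

```sql
-- Compact small files and co-locate rows by the most-filtered columns.
OPTIMIZE events
ZORDER BY (user_id, event_date);
```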

✅ DO: Enable Auto-Optimize

Trade-offs:

  • ✅ Automatic file compaction
  • ✅ Better query performance
  • ❌ Slightly slower writes
  • ❌ Higher write costs

Recommendation: Enable for production tables
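Auto-optimize is enabled per table through Delta table properties (table name illustrative):

```sql
ALTER TABLE events SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',  -- write fewer, larger files
  'delta.autoOptimize.autoCompact'   = 'true'   -- compact small files after writes
);
```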

✅ DO: Vacuum Regularly (But Carefully)

Guidelines:

  • Default retention: 7 days (168 hours)
  • Extend if you use time travel frequently
  • Never reduce below your longest-running job
  • Coordinate with backup strategies
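A typical vacuum keeping the default 7-day window (table name illustrative):

```sql
-- Remove files no longer referenced by versions newer than 7 days.
VACUUM events RETAIN 168 HOURS;
```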

✅ DO: Partition Large Tables Thoughtfully

Anti-Pattern: Partitioning by a high-cardinality column (e.g., user_id), producing millions of tiny partitions and files.

Best Practice: Partition by a low-cardinality column that queries actually filter on, such as an event date.

Guidelines:

  • Aim for 1GB+ per partition
  • Partition only when typical filters prune >80% of the data
  • For <1TB tables, ZORDER often better than partitioning
  • Max 10,000 partitions per table
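Putting the guidelines together, a date-partitioned Delta table might look like this (names illustrative):

```sql
CREATE TABLE events (
  user_id    STRING,
  event_type STRING,
  event_time TIMESTAMP,
  event_date DATE
)
USING DELTA
PARTITIONED BY (event_date);  -- low cardinality, heavily filtered
```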

✅ DO: Use Constraints for Data Quality
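Delta supports NOT NULL and CHECK constraints that reject bad writes at the source (names illustrative):

```sql
ALTER TABLE events ALTER COLUMN user_id SET NOT NULL;

ALTER TABLE events ADD CONSTRAINT valid_event_time
  CHECK (event_time > '2020-01-01');
```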

✅ DO: Enable Change Data Feed for CDC

Use cases:

  • Incremental processing
  • Downstream system sync
  • Audit trails
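Change Data Feed is a table property; once enabled, row-level changes are queryable through table_changes (names illustrative):

```sql
ALTER TABLE events SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');

-- Read inserts/updates/deletes since table version 5.
SELECT * FROM table_changes('events', 5);
```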

3. Performance Tuning

✅ DO: Enable Adaptive Query Execution (AQE)

Benefits:

  • Automatically adjusts shuffle partitions
  • Handles data skew in joins
  • 2-10x faster for complex queries
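AQE is on by default in recent Databricks runtimes; these are the relevant flags, settable per session:

```sql
SET spark.sql.adaptive.enabled = true;
SET spark.sql.adaptive.coalescePartitions.enabled = true;  -- merge tiny shuffle partitions
SET spark.sql.adaptive.skewJoin.enabled = true;            -- split skewed join partitions
```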

✅ DO: Use Photon for SQL Workloads

Performance gains:

  • 3-8x faster for aggregations
  • 10x+ for complex SQL
  • Lower cost per query

When to use:

  • SQL-heavy workloads
  • BI and analytics
  • Large aggregations

✅ DO: Broadcast Small Tables

Auto-broadcast threshold: Spark automatically broadcasts tables smaller than spark.sql.autoBroadcastJoinThreshold (10 MB by default); raise it cautiously for larger dimension tables.
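Beyond the automatic threshold, a broadcast can be forced with a join hint (table and column names illustrative):

```sql
-- Force the small dimension table to be copied to every executor,
-- avoiding a shuffle of the large fact table.
SELECT /*+ BROADCAST(d) */ f.*, d.region
FROM fact_sales f
JOIN dim_store d ON f.store_id = d.store_id;
```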

✅ DO: Cache Strategically

Anti-Pattern: Calling cache() on every intermediate DataFrame, filling executor memory and forcing constant eviction.

Best Practice: Cache only DataFrames that are reused several times, and unpersist them when done.

Guidelines:

  • Cache only if used 3+ times
  • Monitor memory usage
  • Unpersist after use
  • Consider persist(StorageLevel.DISK_ONLY) for large data
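In SQL, caching is explicit and should be paired with an uncache once the reuse is over (names illustrative):

```sql
CACHE TABLE hot_events AS
SELECT * FROM events WHERE event_date >= current_date() - INTERVAL 7 DAYS;

-- ... several queries against hot_events ...

UNCACHE TABLE hot_events;
```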

✅ DO: Minimize Shuffles

Anti-Pattern: Chaining repeated wide operations (multiple groupBys, joins, and repartitions) so each one triggers a full shuffle.

Best Practice: Combine aggregations into a single pass, filter early to shrink shuffle volume, and broadcast small join inputs.
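One concrete shuffle-reduction pattern: compute related aggregates in a single GROUP BY pass instead of scanning and shuffling once per metric (names illustrative):

```sql
-- One scan and one shuffle for all three metrics,
-- instead of three separate GROUP BY queries.
SELECT
  user_id,
  COUNT(*)                   AS total_events,
  COUNT(DISTINCT event_type) AS distinct_event_types,
  MAX(event_time)            AS last_seen
FROM events
GROUP BY user_id;
```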

✅ DO: Manage Shuffle Partitions
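The default of 200 shuffle partitions is rarely right at scale. With AQE's partition coalescing enabled, this setting mostly acts as an upper bound; without AQE, size partitions toward roughly 128-200 MB each:

```sql
-- Static setting when AQE is off; with AQE on, treat it as an initial/upper bound.
SET spark.sql.shuffle.partitions = 400;
```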


4. Cost Optimization

✅ DO: Monitor and Set Budget Alerts
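On Unity Catalog workspaces, consumption is queryable from the system.billing.usage system table, which is a convenient basis for alerting; column names below follow that schema:

```sql
-- Daily DBU consumption by SKU over the last 30 days.
SELECT usage_date, sku_name, SUM(usage_quantity) AS dbus
FROM system.billing.usage
WHERE usage_date >= current_date() - INTERVAL 30 DAYS
GROUP BY usage_date, sku_name
ORDER BY usage_date;
```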

✅ DO: Use Cluster Policies

Benefits:

  • Prevent expensive cluster configurations
  • Enforce spot instances
  • Ensure auto-termination
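Policies are JSON documents of per-attribute rules. A sketch enforcing auto-termination, capping size, and requiring spot instances (values illustrative; rule types follow the cluster policy definition language):

```python
# Hypothetical cluster policy: cap cluster size, force auto-termination,
# and require spot instances with on-demand fallback.
policy = {
    "autotermination_minutes": {"type": "range", "maxValue": 60, "defaultValue": 30},
    "num_workers": {"type": "range", "maxValue": 10},
    "aws_attributes.availability": {"type": "fixed", "value": "SPOT_WITH_FALLBACK"},
}
```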

✅ DO: Choose the Right Workload Type

Workload              Recommended                     DBU Rate
Ad-hoc queries        SQL Warehouse (Serverless)      Medium
Scheduled ETL         Job Cluster                     Low
Interactive dev       All-Purpose (auto-terminate)    High
Real-time streaming   Dedicated cluster               Medium
ML training           Job Cluster with ML Runtime     Low-Medium

✅ DO: Optimize Storage Costs

Vacuum stale Delta file versions, compact small files with OPTIMIZE, and move cold data to cheaper storage tiers with cloud lifecycle policies.

❌ DON'T: Leave Clusters Running

Anti-Pattern: Interactive clusters left running overnight and through weekends, billing DBUs while idle.

Best Practice: Set aggressive auto-termination on interactive clusters and move scheduled work to job clusters.


5. Code Organization

✅ DO: Use Modular Notebooks

Anti-Pattern: One monolithic notebook that ingests, transforms, validates, and writes in a single thousand-line scroll.

Best Practice: Split logic into small, single-purpose notebooks or Python modules (ingest, transform, validate) that a pipeline orchestrates.

Import pattern:
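With Repos, shared code can live in plain .py files and be imported like any package; %run remains the older notebook-inclusion alternative. Paths and module names below are illustrative, and this runs only inside a Databricks workspace:

```python
# In a repo laid out as: my_repo/utils/transforms.py
# (Databricks adds the repo root to sys.path automatically.)
from utils.transforms import clean_events  # hypothetical helper module

# Older style: inline another notebook's definitions
# %run ./setup_notebook
```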

✅ DO: Use Git Integration (Repos)

Benefits:

  • Version control for notebooks
  • Code review process
  • CI/CD integration
  • Environment promotion (dev → staging → prod)

✅ DO: Parameterize Notebooks

Run programmatically:
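A sketch of both halves using the dbutils widgets and notebook APIs (available only inside a Databricks notebook; paths and parameter names illustrative):

```python
# Inside the notebook: declare and read parameters.
dbutils.widgets.text("run_date", "2024-01-01")
dbutils.widgets.text("env", "dev")

run_date = dbutils.widgets.get("run_date")
env = dbutils.widgets.get("env")

# From a caller: run the notebook with explicit arguments (600 s timeout).
result = dbutils.notebook.run("/pipelines/daily_load", 600,
                              {"run_date": "2024-06-01", "env": "prod"})
```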

✅ DO: Implement Data Quality Checks
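A framework-agnostic sketch of the fail-fast idea: check row-level expectations and quarantine failures before writing downstream. In Spark you would express the same checks as filters or Delta constraints; this is plain Python so the logic is easy to see:

```python
def check_quality(rows, required=("user_id", "event_time")):
    """Return (ok_rows, bad_rows): a row fails if any required field is missing."""
    ok, bad = [], []
    for row in rows:
        if all(row.get(col) is not None for col in required):
            ok.append(row)
        else:
            bad.append(row)
    if bad:
        print(f"Quarantined {len(bad)} bad rows")
    return ok, bad

ok, bad = check_quality([
    {"user_id": "u1", "event_time": "2024-06-01T00:00:00"},
    {"user_id": None, "event_time": "2024-06-01T00:01:00"},
])
```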


6. Security & Governance

✅ DO: Use Unity Catalog

Unity Catalog centralizes access control, lineage, and auditing across workspaces behind a single metastore, replacing per-workspace permission sprawl.
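Unity Catalog permissions are standard SQL grants on the three-level catalog.schema.table namespace (names illustrative):

```sql
GRANT USE CATALOG ON CATALOG prod TO `data_analysts`;
GRANT USE SCHEMA ON SCHEMA prod.sales TO `data_analysts`;
GRANT SELECT ON TABLE prod.sales.orders TO `data_analysts`;

-- Least privilege: writers get MODIFY only where they need it.
GRANT MODIFY ON TABLE prod.sales.orders TO `etl_service`;
```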

✅ DO: Use Secrets for Credentials

Never hard-code tokens or passwords in notebooks; store them in a Databricks secret scope and read them at runtime with dbutils.secrets.get(scope, key).

✅ DO: Implement Audit Logging

Enable audit log delivery (or, on Unity Catalog workspaces, query the system.access.audit system table) to track who accessed which data, and when.


7. Streaming Best Practices

✅ DO: Use Structured Streaming with Delta
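A minimal Delta-to-Delta stream with the pieces that matter in production: an explicit checkpoint location and an availableNow trigger for batch-style incremental runs (table names and paths illustrative; runs only on a Spark/Databricks cluster):

```python
(spark.readStream
    .table("raw_events")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/clean_events")  # required for exactly-once
    .trigger(availableNow=True)   # process the backlog, then stop
    .toTable("clean_events"))
```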

✅ DO: Set Appropriate Triggers

Guidelines:

  • Use availableNow for cost-effective near-real-time
  • Use processingTime for consistent latency
  • Monitor lag and adjust trigger interval

✅ DO: Handle Late Data with Watermarks
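Watermarks bound how long aggregation state is kept for late arrivals; in this sketch, events more than 10 minutes late are dropped from the windowed count (names illustrative; runs only on a Spark/Databricks cluster):

```python
from pyspark.sql import functions as F

windowed = (spark.readStream.table("clean_events")
    .withWatermark("event_time", "10 minutes")          # tolerate 10 min of lateness
    .groupBy(F.window("event_time", "5 minutes"), "event_type")
    .count())
```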


8. ML & MLflow

✅ DO: Use MLflow for Experiment Tracking
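The core tracking loop is a few lines of the mlflow API; the run name, parameter, and metric values below are illustrative:

```python
import mlflow

with mlflow.start_run(run_name="churn_baseline"):
    mlflow.log_param("max_depth", 6)          # hyperparameters
    mlflow.log_metric("auc", 0.87)            # evaluation results (illustrative value)
    # mlflow.sklearn.log_model(model, "model")  # log the fitted model artifact
```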

✅ DO: Use Feature Store

A feature store gives training and serving the same feature definitions, preventing online/offline skew and making features discoverable across teams.


Summary of Key Best Practices

Performance

  • ✅ Enable AQE and Photon
  • ✅ Use OPTIMIZE and ZORDER
  • ✅ Broadcast small tables
  • ✅ Minimize shuffles

Cost

  • ✅ Use job clusters and spot instances
  • ✅ Enable auto-termination
  • ✅ Right-size clusters
  • ✅ Monitor with budget alerts

Quality

  • ✅ Implement data quality checks
  • ✅ Use Delta constraints
  • ✅ Enable Change Data Feed
  • ✅ Parameterize notebooks

Security

  • ✅ Use Unity Catalog
  • ✅ Store credentials in Secrets
  • ✅ Implement audit logging
  • ✅ Apply least-privilege access

Next Steps:

Need help optimizing your Databricks platform? Contact me for consulting services.
