Databricks Best Practices
Professional patterns, optimization techniques, and anti-patterns to avoid when working with Databricks at scale.
Table of Contents
- Cluster Configuration & Management
- Delta Lake Optimization
- Performance Tuning
- Cost Optimization
- Code Organization
- Security & Governance
- Streaming Best Practices
- ML & MLflow
1. Cluster Configuration & Management
✅ DO: Use Job Clusters for Production Workloads
Why: Lower DBU costs, isolated environments, automatic termination.
Benefits:
- ~50% lower DBU rate vs all-purpose clusters
- Automatic start/stop with job
- Reproducible configurations
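As a sketch, a Jobs API 2.1 payload that runs a notebook on an ephemeral job cluster might look like the following (the job name, notebook path, node type, and runtime version are all illustrative):

```python
# Illustrative Jobs API 2.1 payload: the cluster is created for the run and
# destroyed afterwards, so it bills at the jobs DBU rate and cannot be left
# running by accident.
job_payload = {
    "name": "nightly_etl",  # hypothetical job name
    "tasks": [
        {
            "task_key": "load_events",
            "notebook_task": {"notebook_path": "/Repos/etl/load_events"},
            "new_cluster": {
                "spark_version": "14.3.x-scala2.12",  # pick your workspace's LTS runtime
                "node_type_id": "i3.xlarge",          # cloud-specific instance type
                "autoscale": {"min_workers": 2, "max_workers": 8},
            },
        }
    ],
}
```

Submitting this payload to the `POST /api/2.1/jobs/create` endpoint registers the job; each run then provisions and tears down its own cluster.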
✅ DO: Right-Size Your Clusters
Anti-Pattern: provisioning a large fixed-size cluster "just in case" and leaving it mostly idle.
Best Practice: start small, enable auto-scaling, and grow only when metrics show sustained pressure.
How to Size:
- Start with 2-4 workers
- Monitor cluster metrics (CPU, memory, shuffle)
- Enable auto-scaling
- Adjust based on actual utilization
✅ DO: Use Spot/Preemptible Instances
Savings: 60-90% on compute costs
When NOT to use spot:
- Real-time streaming with strict SLAs
- Short-running jobs (startup overhead)
- Jobs that can't tolerate interruptions
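On AWS, for example, a mixed spot configuration can keep the driver on-demand while workers use spot capacity; a hedged sketch of the relevant `new_cluster` fragment (field values illustrative):

```python
# Keep the first node (the driver) on-demand and fall back to on-demand
# capacity if spot instances are reclaimed, so the job survives interruptions.
aws_attributes = {
    "availability": "SPOT_WITH_FALLBACK",
    "first_on_demand": 1,           # driver stays on-demand
    "spot_bid_price_percent": 100,  # bid up to the on-demand price
}
```

Azure and GCP expose equivalent settings under their own attribute blocks.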
✅ DO: Configure Auto-Termination
Guidelines:
- Interactive clusters: 30-60 minutes
- Shared team clusters: 60-120 minutes
- Job clusters: Always auto-terminate (default)
❌ DON'T: Share Clusters Across Environments
Anti-Pattern: running dev, staging, and production workloads on the same cluster, where an experiment can starve or crash a production job.
Best Practice: give each environment its own cluster (and ideally its own workspace), governed by environment-specific cluster policies.
2. Delta Lake Optimization
✅ DO: Use OPTIMIZE and ZORDER
Why: Improves query performance by compacting small files and co-locating data.
When to OPTIMIZE:
- After many small writes (streaming, upserts)
- When you see thousands of small files
- Query performance degrades
ZORDER Guidelines:
- Choose 2-4 most-filtered columns
- Put highest-cardinality columns first
- Re-run when filter patterns change
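Following those guidelines, the statement itself is a one-liner; a sketch (table and column names are illustrative):

```python
table = "sales.events"                       # hypothetical Delta table
zorder_cols = ["customer_id", "event_date"]  # most-filtered columns, highest cardinality first

optimize_sql = f"OPTIMIZE {table} ZORDER BY ({', '.join(zorder_cols)})"
# In a notebook: spark.sql(optimize_sql)
```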
✅ DO: Enable Auto-Optimize
Trade-offs:
- ✅ Automatic file compaction
- ✅ Better query performance
- ❌ Slightly slower writes
- ❌ Higher write costs
Recommendation: Enable for production tables
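Auto-optimize is controlled by two Delta table properties; a sketch that enables both on an existing (illustrative) table:

```python
table = "sales.events"  # hypothetical Delta table
props = {
    "delta.autoOptimize.optimizeWrite": "true",  # bin-pack files as they are written
    "delta.autoOptimize.autoCompact": "true",    # compact small files after writes
}
settings = ", ".join(f"'{k}' = '{v}'" for k, v in props.items())
auto_optimize_sql = f"ALTER TABLE {table} SET TBLPROPERTIES ({settings})"
# In a notebook: spark.sql(auto_optimize_sql)
```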
✅ DO: Vacuum Regularly (But Carefully)
Guidelines:
- Default retention: 7 days (168 hours)
- Extend if you use time travel frequently
- Never reduce below your longest-running job
- Coordinate with backup strategies
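A sketch of a vacuum that keeps the default retention (table name illustrative):

```python
retention_hours = 168  # 7 days -- the Delta default; keep >= your time-travel needs
vacuum_sql = f"VACUUM sales.events RETAIN {retention_hours} HOURS"
# In a notebook: spark.sql(vacuum_sql)
# Note: Delta refuses a retention below the configured minimum unless you disable
# spark.databricks.delta.retentionDurationCheck.enabled -- think twice before doing so.
```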
✅ DO: Partition Large Tables Thoughtfully
Anti-Pattern: partitioning by a high-cardinality column such as user_id, which creates millions of tiny partitions and files.
Best Practice: partition by a low-cardinality column that queries actually filter on, such as an event date.
Guidelines:
- Aim for 1GB+ per partition
- Partition only when typical filters prune >80% of the data
- For <1TB tables, ZORDER often better than partitioning
- Max 10,000 partitions per table
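Putting the guidelines together, a date-partitioned table sketch (schema illustrative):

```python
create_sql = """
CREATE TABLE IF NOT EXISTS sales.events (
  event_id    STRING,
  customer_id STRING,
  payload     STRING,
  event_date  DATE
)
USING DELTA
PARTITIONED BY (event_date)  -- low cardinality, filtered by almost every query
"""
# In a notebook: spark.sql(create_sql)
```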
✅ DO: Use Constraints for Data Quality
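Delta supports NOT NULL and CHECK constraints that reject bad writes at the source; a sketch (table, column, and constraint names illustrative):

```python
constraint_sql = [
    # Reject rows with a missing key outright
    "ALTER TABLE sales.events ALTER COLUMN event_id SET NOT NULL",
    # Reject obviously impossible dates
    "ALTER TABLE sales.events ADD CONSTRAINT valid_date "
    "CHECK (event_date >= '2020-01-01')",
]
# In a notebook: for stmt in constraint_sql: spark.sql(stmt)
```

A write that violates a constraint fails the whole transaction, so bad records never land in the table.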
✅ DO: Enable Change Data Feed for CDC
Use cases:
- Incremental processing
- Downstream system sync
- Audit trails
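CDF is a table property plus the `table_changes` function for reading; a sketch (table name and starting version illustrative):

```python
enable_cdf_sql = (
    "ALTER TABLE sales.events "
    "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"
)
# Read inserts/updates/deletes committed since table version 5:
read_changes_sql = "SELECT * FROM table_changes('sales.events', 5)"
# In a notebook: spark.sql(enable_cdf_sql); spark.sql(read_changes_sql)
```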
3. Performance Tuning
✅ DO: Enable Adaptive Query Execution (AQE)
Benefits:
- Automatically adjusts shuffle partitions
- Handles data skew in joins
- 2-10x faster for complex queries
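AQE is on by default in recent runtimes, but the relevant settings are worth knowing; a sketch of the confs as a dict:

```python
# AQE settings (enabled by default on current Databricks runtimes).
aqe_conf = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",  # merge tiny shuffle partitions
    "spark.sql.adaptive.skewJoin.enabled": "true",            # split skewed join partitions
}
# In a notebook:
#   for key, value in aqe_conf.items():
#       spark.conf.set(key, value)
```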
✅ DO: Use Photon for SQL Workloads
Performance gains:
- 3-8x faster for aggregations
- 10x+ for complex SQL
- Lower cost per query
When to use:
- SQL-heavy workloads
- BI and analytics
- Large aggregations
✅ DO: Broadcast Small Tables
Auto-broadcast threshold: Spark automatically broadcasts any table smaller than `spark.sql.autoBroadcastJoinThreshold` (10 MB by default); raise it, or use an explicit broadcast hint, for slightly larger dimension tables.
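A sketch of lifting the threshold for a dimension table just over the default (the 50 MB figure is illustrative):

```python
threshold_bytes = 50 * 1024 * 1024  # raise the default 10 MB ceiling to 50 MB
broadcast_conf = {"spark.sql.autoBroadcastJoinThreshold": str(threshold_bytes)}

# For a table Spark cannot size ahead of time, hint explicitly:
#   from pyspark.sql.functions import broadcast
#   fact.join(broadcast(dim), "dim_id")
```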
✅ DO: Cache Strategically
Anti-Pattern: calling cache() on every intermediate DataFrame, filling executor memory with data that is read once.
Best Practice: cache only DataFrames that are reused several times, and unpersist them when done.
Guidelines:
- Cache only if used 3+ times
- Monitor memory usage
- Unpersist after use
- Consider persist(StorageLevel.DISK_ONLY) for large data
✅ DO: Minimize Shuffles
Anti-Pattern: joining or aggregating full tables and filtering afterwards, shuffling far more data than the query needs.
Best Practice: filter rows and select columns before wide operations, and broadcast the small side of a join.
✅ DO: Manage Shuffle Partitions
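With AQE coalescing enabled, manual tuning matters less, but the common rule of thumb targets roughly 128 MB per shuffle partition; a sketch:

```python
def shuffle_partitions(shuffle_bytes: int, target_mb: int = 128) -> int:
    """Rule of thumb: ~128 MB per shuffle partition, never fewer than 1."""
    return max(1, shuffle_bytes // (target_mb * 1024 * 1024))

# A 64 GB shuffle works out to 512 partitions of ~128 MB each.
partitions = shuffle_partitions(64 * 1024**3)
# In a notebook: spark.conf.set("spark.sql.shuffle.partitions", str(partitions))
```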
4. Cost Optimization
✅ DO: Monitor and Set Budget Alerts
✅ DO: Use Cluster Policies
Benefits:
- Prevent expensive cluster configurations
- Enforce spot instances
- Ensure auto-termination
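A cluster policy is a JSON document mapping cluster attributes to rules; a hedged sketch of a cost-control policy (names and values illustrative):

```python
# Policy definition: cap auto-termination, force spot capacity, and restrict
# instance types for every cluster created under this policy.
cost_policy = {
    "autotermination_minutes": {"type": "range", "maxValue": 60, "defaultValue": 30},
    "aws_attributes.availability": {"type": "fixed", "value": "SPOT_WITH_FALLBACK"},
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},
}
```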
✅ DO: Choose the Right Workload Type
| Workload | Recommended | DBU Rate |
|---|---|---|
| Ad-hoc queries | SQL Warehouse (Serverless) | Medium |
| Scheduled ETL | Job Cluster | Low |
| Interactive dev | All-Purpose (auto-terminate) | High |
| Real-time streaming | Dedicated cluster | Medium |
| ML training | Job Cluster with ML Runtime | Low-Medium |
✅ DO: Optimize Storage Costs
❌ DON'T: Leave Clusters Running
Anti-Pattern: all-purpose clusters left running overnight and over weekends, billing DBUs while idle.
Best Practice: enable auto-termination everywhere and move scheduled work to job clusters that shut down on completion.
5. Code Organization
✅ DO: Use Modular Notebooks
Anti-Pattern: one monolithic notebook holding ingestion, transformation, and reporting logic that nobody can test or reuse.
Best Practice: split logic into small, single-purpose notebooks or Python modules in Repos.
Import pattern: use `%run ./utils` for shared notebooks, or standard Python imports for modules checked into Repos.
✅ DO: Use Git Integration (Repos)
Benefits:
- Version control for notebooks
- Code review process
- CI/CD integration
- Environment promotion (dev → staging → prod)
✅ DO: Parameterize Notebooks
Run programmatically: pass parameter values through the Jobs API (`notebook_params`) or `dbutils.notebook.run`.
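Widgets surface parameters in the UI and through the Jobs API; since `dbutils` only exists inside a workspace, here is a pure-Python stand-in for the default-handling pattern (parameter names illustrative):

```python
# In a notebook:
#   dbutils.widgets.text("run_date", "")
#   run_date = dbutils.widgets.get("run_date")
# A job run then supplies {"notebook_params": {"run_date": "2024-06-30"}}.

def get_param(params: dict, name: str, default: str) -> str:
    """Return the supplied value, falling back to a default for blank input."""
    value = params.get(name, "").strip()
    return value or default

run_date = get_param({"run_date": ""}, "run_date", "2024-01-01")
```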
✅ DO: Implement Data Quality Checks
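A minimal shape for such checks, sketched in plain Python over rows as dicts (column names illustrative); in practice the same assertions run against DataFrame aggregates or a framework like Great Expectations:

```python
def check_quality(rows, required=("event_id",), min_rows=1):
    """Fail fast instead of writing bad data downstream."""
    problems = []
    if len(rows) < min_rows:
        problems.append(f"expected >= {min_rows} rows, got {len(rows)}")
    for col in required:
        nulls = sum(1 for row in rows if row.get(col) is None)
        if nulls:
            problems.append(f"column {col}: {nulls} null values")
    if problems:
        raise ValueError("; ".join(problems))
    return True
```

Raising on violation stops the pipeline before bad data reaches consumers, which is almost always cheaper than cleaning up afterwards.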
6. Security & Governance
✅ DO: Use Unity Catalog
✅ DO: Use Secrets for Credentials
✅ DO: Implement Audit Logging
7. Streaming Best Practices
✅ DO: Use Structured Streaming with Delta
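The essential ingredients are a Delta source, a Delta sink, and a checkpoint that makes the stream restartable; sketched with the writer options as a dict (paths and table names illustrative):

```python
# Every production stream needs a durable, stream-specific checkpoint location.
stream_options = {"checkpointLocation": "/mnt/checkpoints/silver_events"}

# In a notebook:
#   (spark.readStream.table("bronze.events")
#         .writeStream.format("delta")
#         .options(**stream_options)
#         .toTable("silver.events"))
```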
✅ DO: Set Appropriate Triggers
Guidelines:
- Use `availableNow` for cost-effective near-real-time processing
- Use `processingTime` for consistent latency
- Monitor lag and adjust the trigger interval
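The two trigger modes above correspond to keyword arguments of `DataStreamWriter.trigger()`; a sketch:

```python
# Drain whatever data is available, then stop -- good for scheduled, cost-effective runs.
batch_trigger = {"availableNow": True}

# Fire a micro-batch on a fixed cadence -- good for consistent latency.
latency_trigger = {"processingTime": "1 minute"}

# In a notebook: df.writeStream.trigger(**batch_trigger)...
```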
✅ DO: Handle Late Data with Watermarks
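In Spark this is `withWatermark("event_time", "10 minutes")` on the streaming DataFrame; as a simplified pure-Python model of the semantics (real streams track the watermark across micro-batches in the state store):

```python
def apply_watermark(events, delay_seconds):
    """Keep events no older than (max event time seen) - delay; drop the rest as too late."""
    high_water = max(t for t, _ in events)
    cutoff = high_water - delay_seconds
    return [(t, v) for t, v in events if t >= cutoff]

# With a 10-second allowance, the event at t=40 is dropped as late.
kept = apply_watermark([(100, "a"), (40, "late"), (95, "b")], 10)
```

The trade-off is the same as in Spark: a longer delay tolerates more lateness but holds aggregation state longer.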
8. ML & MLflow
✅ DO: Use MLflow for Experiment Tracking
✅ DO: Use Feature Store
Summary of Key Best Practices
Performance
- ✅ Enable AQE and Photon
- ✅ Use OPTIMIZE and ZORDER
- ✅ Broadcast small tables
- ✅ Minimize shuffles
Cost
- ✅ Use job clusters and spot instances
- ✅ Enable auto-termination
- ✅ Right-size clusters
- ✅ Monitor with budget alerts
Quality
- ✅ Implement data quality checks
- ✅ Use Delta constraints
- ✅ Enable Change Data Feed
- ✅ Parameterize notebooks
Security
- ✅ Use Unity Catalog
- ✅ Store credentials in Secrets
- ✅ Implement audit logging
- ✅ Apply least-privilege access
Next Steps:
Need help optimizing your Databricks platform? Contact me for consulting services.