Tutorial 1: Build a Complete ETL Pipeline with Prefect
In this tutorial, you'll build a production-ready ETL pipeline that extracts data from a REST API, transforms it, and loads it into a database. You'll learn core Prefect concepts including flows, tasks, error handling, caching, and deployment.
Time: 60-90 minutes
Level: Beginner
Prerequisites: Python 3.8+, basic SQL knowledge
What You'll Build
A daily ETL pipeline that:
- Extracts user data from JSONPlaceholder API
- Transforms and enriches the data
- Loads data to SQLite database
- Includes error handling and retries
- Implements caching for efficiency
- Runs on a schedule
Tech Stack:
- Prefect 2.x
- Pandas for data manipulation
- SQLite for storage
- Requests for API calls
Step 1: Project Setup
Create Project Directory
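The directory name below is just a suggestion:

```shell
mkdir prefect-etl-tutorial
cd prefect-etl-tutorial
```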
Create Virtual Environment
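Use a virtual environment so the tutorial's dependencies stay isolated:

```shell
python -m venv .venv
source .venv/bin/activate   # on Windows: .venv\Scripts\activate
```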
Install Dependencies
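Install the tutorial's stack (pinning Prefect to the 2.x line, since that is what this tutorial targets):

```shell
pip install "prefect>=2.0,<3" pandas requests
```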
Create Project Files
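Create the empty files you'll fill in during the following steps:

```shell
touch etl_pipeline.py validation.py test_pipeline.py
```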
Step 2: Configure Prefect
Option A: Use Prefect Cloud (Recommended)
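If you have a Prefect Cloud account, authenticate the CLI:

```shell
prefect cloud login
```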
Follow the prompts to authenticate.
Option B: Use Local Server
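Alternatively, run the open-source server locally and point the client at it:

```shell
prefect server start
# in a second terminal, point the client at the local API:
prefect config set PREFECT_API_URL=http://127.0.0.1:4200/api
```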
Step 3: Build the Extract Task
Create etl_pipeline.py:
Step 4: Build the Transform Task
Add to etl_pipeline.py:
Step 5: Build the Load Task
Add to etl_pipeline.py:
Step 6: Create the Main Flow
Add to etl_pipeline.py:
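A sketch of the flow, assuming the tasks from Steps 3-5 are defined above it in the same file:

```python
# etl_pipeline.py -- Step 6: wire the tasks into a flow.
from prefect import flow

@flow(name="etl-pipeline", log_prints=True)  # log_prints routes print() into Prefect logs
def etl_flow(db_path: str = "users.db") -> int:
    raw = extract_users()
    df = transform_users(raw)
    return load_users(df, db_path=db_path)

if __name__ == "__main__":
    etl_flow()
```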
Step 7: Test the Pipeline
Run your pipeline locally:
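From the project directory, with your virtual environment active:

```shell
python etl_pipeline.py
```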
Expected output: Prefect prints a log line as each task run is created and another as it finishes in a `Completed` state, ending with the flow run itself completing.
Verify Data in Database
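Query the database with the `sqlite3` CLI (the table and column names match the assumptions made in the load and transform steps above):

```shell
sqlite3 users.db "SELECT COUNT(*) FROM users;"
sqlite3 users.db "SELECT id, name, email_domain FROM users LIMIT 5;"
```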
Step 8: Add Data Validation
Create validation.py:
Add validation to your flow:
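The flow from Step 6 can then be updated so the run fails fast before anything is written (a sketch; it assumes the task names used earlier):

```python
# etl_pipeline.py -- Step 8: insert validation between transform and load.
from prefect import flow
from validation import validate_users

@flow(name="etl-pipeline", log_prints=True)
def etl_flow(db_path: str = "users.db") -> int:
    raw = extract_users()
    df = transform_users(raw)
    df = validate_users(df)   # raises ValidationError before load_users runs
    return load_users(df, db_path=db_path)
```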
Step 9: Create a Deployment
Create Deployment Configuration
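One way to generate a deployment spec with the Prefect 2.x CLI; flag names can vary slightly across 2.x minor versions (check `prefect deployment build --help`), and the pool name `etl-pool` and cron schedule are assumptions:

```shell
prefect deployment build ./etl_pipeline.py:etl_flow \
  --name daily-etl \
  --pool etl-pool \
  --cron "0 6 * * *" \
  --output deployment.yaml
```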
Edit deployment.yaml
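Open the generated file and confirm (or adjust) the schedule and pool; the exact field layout differs slightly between Prefect 2.x versions, so treat this excerpt as a guide:

```yaml
# deployment.yaml (excerpt)
name: daily-etl
schedule:
  cron: 0 6 * * *
  timezone: UTC
work_pool_name: etl-pool
```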
Apply Deployment
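Register the deployment with your Prefect API:

```shell
prefect deployment apply deployment.yaml
```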
Step 10: Start a Worker and Run
Start Worker
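Assuming the `etl-pool` name from the deployment step: create the work pool once, then start a worker that polls it:

```shell
prefect work-pool create etl-pool --type process
prefect worker start --pool etl-pool
```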
Trigger a Run
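Kick off an ad-hoc run from the CLI, using the flow and deployment names from the earlier steps:

```shell
prefect deployment run 'etl-pipeline/daily-etl'
```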
Monitor in UI
Visit the Prefect UI to see:
- Flow run status
- Task execution timeline
- Logs from each task
- Run duration and performance
Step 11: Add Notifications
Create a notification for failures:
Step 12: Testing
Create test_pipeline.py:
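A sketch of a test module; it assumes the task names defined in the earlier steps. Calling `task.fn(...)` runs the underlying function directly, without the Prefect engine:

```python
# test_pipeline.py -- unit tests against the raw task functions.
from etl_pipeline import transform_users, load_users

SAMPLE = [{
    "id": 1, "name": "Leanne Graham", "username": "Bret",
    "email": "Sincere@april.biz",
    "address": {"street": "Kulas Light", "city": "Gwenborough"},
    "company": {"name": "Romaguera-Crona"},
}]

def test_transform_flattens_and_derives_domain():
    df = transform_users.fn(SAMPLE)
    assert df.loc[0, "city"] == "Gwenborough"
    assert df.loc[0, "email_domain"] == "april.biz"

def test_load_writes_rows(tmp_path):
    df = transform_users.fn(SAMPLE)
    assert load_users.fn(df, db_path=str(tmp_path / "test.db")) == 1
```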
Run tests:
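With pytest installed in your environment:

```shell
pip install pytest
pytest test_pipeline.py -v
```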
Exercises
Exercise 1: Add Incremental Loading
Modify the pipeline to only load new/updated records based on a timestamp.
Hint:
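One possible shape for the hint, assuming the `loaded_at` column from Step 4 and the `users` table from Step 5:

```python
# Incremental-loading sketch: keep only rows newer than what's already loaded.
import sqlite3

import pandas as pd

def incremental_filter(df: pd.DataFrame, db_path: str = "users.db") -> pd.DataFrame:
    """Drop rows at or before the latest loaded_at already in the table."""
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute("SELECT MAX(loaded_at) FROM users").fetchone()
    except sqlite3.OperationalError:
        return df          # table doesn't exist yet: first run loads everything
    finally:
        conn.close()
    if row[0] is None:     # table exists but is empty
        return df
    # ISO-8601 timestamps compare correctly as strings
    return df[df["loaded_at"] > row[0]]
```

Pair this with `if_exists="append"` (instead of `"replace"`) in the load task so earlier rows survive.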
Exercise 2: Add More Data Sources
Extract and merge data from the /comments endpoint.
Exercise 3: Implement Error Notifications
Send an email or Slack message when the pipeline fails.
Exercise 4: Add Data Profiling
Generate statistics about the data (row counts, null percentages, etc.).
Common Issues & Solutions
Issue: "Cannot connect to Prefect API"
Solution:
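Check which API the client is pointed at, then start or re-point as needed:

```shell
prefect config view
# if the API URL is missing or wrong, either start a local server...
prefect server start
# ...or point the client at a running one:
prefect config set PREFECT_API_URL=http://127.0.0.1:4200/api
```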
Issue: "Task cached but data changed"
Solution: Clear the cache or change the cache key function:
Issue: "Worker not picking up runs"
Solution:
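Confirm the pool exists, that the deployment targets the same pool, and that a worker is actually polling it (names assume the earlier steps):

```shell
prefect work-pool ls
prefect deployment inspect 'etl-pipeline/daily-etl'
prefect worker start --pool etl-pool
```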
Next Steps
Congratulations! You've built a production-ready ETL pipeline with Prefect.
What you learned:
- ✅ Creating flows and tasks
- ✅ Error handling and retries
- ✅ Caching for efficiency
- ✅ Deploying and scheduling
- ✅ Monitoring and logging
- ✅ Data validation
Continue learning:
- Tutorial 2: ML Model Training Pipeline
- Best Practices - Production patterns
- Use Cases - More real-world examples
Complete Code
Find the complete code for this tutorial on GitHub: prefect-tutorials