
Tutorial 2: Data Pipeline Automation

Build a production-ready data pipeline that extracts customer data from an API, transforms it, loads it into PostgreSQL, and generates automated reports.



Duration: 90 minutes · Difficulty: Intermediate · Prerequisites: Tutorial 1 completed, basic SQL knowledge


Table of Contents

  1. What You'll Build
  2. Prerequisites
  3. Part 1: Setup Database Environment
  4. Part 2: Create Pipeline Workflow
  5. Part 3: Extract Data from API
  6. Part 4: Batch Processing
  7. Part 5: Data Transformation
  8. Part 6: Load to Database
  9. Part 7: Generate Reports
  10. Part 8: Error Handling & Monitoring
  11. Part 9: Production Deployment
  12. Advanced Topics
  13. Troubleshooting

What You'll Build

A complete ETL (Extract, Transform, Load) pipeline that:

  • ✅ Extracts customer data from REST API (hourly)
  • ✅ Processes large datasets with batching
  • ✅ Transforms and validates data
  • ✅ Loads data to PostgreSQL with upsert logic
  • ✅ Tracks sync status and handles errors
  • ✅ Generates daily summary reports
  • ✅ Sends alerts on failures
  • ✅ Production-ready with monitoring

Architecture: Schedule Trigger → Extract (REST API) → Split In Batches → Transform & Validate → Load (PostgreSQL upsert) → Report, with error alerts on failure.

What You'll Learn:

  • Database connections and operations
  • Batch processing for scalability
  • Data validation and cleansing
  • Upsert (insert or update) patterns
  • Error handling strategies
  • Incremental sync logic
  • Performance optimization
  • Production monitoring
  • Report generation

Prerequisites

Required Knowledge

  • ✅ Tutorial 1 completed (n8n basics)
  • ✅ Basic SQL (SELECT, INSERT, UPDATE)
  • ✅ Understanding of JSON
  • ✅ Familiarity with REST APIs

Required Tools

1. n8n Running

2. PostgreSQL Database

We'll set up PostgreSQL via Docker:

3. Mock API (We'll use JSONPlaceholder)

Free test API: https://jsonplaceholder.typicode.com/

  • No API key required
  • Provides sample user data
  • Perfect for learning

Optional Tools

  • Database Client: DBeaver, pgAdmin, or TablePlus
  • API Client: Postman or Insomnia (for testing)
  • Slack: For notifications (or use Email)

Part 1: Setup Database Environment

Step 1.1: Start PostgreSQL
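A minimal Docker setup; the user, password, and database names here are illustrative, but keep them consistent with the n8n credential configured in Step 1.4:

```bash
# Start a local PostgreSQL container for the pipeline
docker run -d \
  --name customers-postgres \
  -e POSTGRES_USER=n8n_user \
  -e POSTGRES_PASSWORD=n8n_password \
  -e POSTGRES_DB=customers_db \
  -p 5432:5432 \
  postgres:16
```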

Step 1.2: Create Database Schema

Connect to database and create tables:

Execute this SQL:
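The exact schema was not preserved; the following is a sketch consistent with the tables and columns the later steps reference (customers for the data, sync_logs for the audit trail):

```sql
-- Connect first, e.g.: psql -h localhost -U n8n_user -d customers_db

CREATE TABLE IF NOT EXISTS customers (
  id          INTEGER PRIMARY KEY,
  name        TEXT NOT NULL,
  email       TEXT NOT NULL,
  phone       TEXT,
  city        TEXT,
  company     TEXT,
  created_at  TIMESTAMPTZ DEFAULT NOW(),
  updated_at  TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE IF NOT EXISTS sync_logs (
  id                 SERIAL PRIMARY KEY,
  started_at         TIMESTAMPTZ DEFAULT NOW(),
  completed_at       TIMESTAMPTZ,
  status             TEXT DEFAULT 'running',
  records_processed  INTEGER,
  error_message      TEXT
);
```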

Step 1.3: Insert Sample Data (Optional)
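If you want a row to poke at before the first sync, something like this works (the values are arbitrary):

```sql
-- One throwaway row to verify the connection and schema
INSERT INTO customers (id, name, email, city, company)
VALUES (9001, 'Test User', 'test@example.com', 'Springfield', 'Acme Inc')
ON CONFLICT (id) DO NOTHING;
```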

Step 1.4: Configure n8n PostgreSQL Credential

  1. In n8n, go to Credentials (left sidebar)
  2. Click "Create New"
  3. Search for "PostgreSQL"
  4. Fill in connection details:
    • Host: localhost (or Docker network if needed)
    • Database: customers_db
    • User: n8n_user
    • Password: n8n_password
    • Port: 5432
    • SSL Mode: Disable (for local testing)
  5. Test Connection
  6. Save as "PostgreSQL - Customers DB"

Docker Network Note: If n8n and PostgreSQL are in different containers:
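One way to wire them up (the container names are assumptions; use your own):

```bash
# Put both containers on a shared network so n8n can reach PostgreSQL by name
docker network create n8n-net
docker network connect n8n-net customers-postgres
docker network connect n8n-net n8n
# Then set Host in the n8n credential to "customers-postgres" instead of localhost
```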


Part 2: Create Pipeline Workflow

Step 2.1: Create New Workflow

  1. Click "New Workflow"
  2. Rename to: Customer Data Pipeline
  3. Add sticky note:

Step 2.2: Add Schedule Trigger

  1. Add "Schedule Trigger" node
  2. Configure:
    • Trigger Interval: Hours
    • Hours Between Triggers: 1 (hourly)
    • Trigger at Minute: 0 (top of hour)

For testing: Set to "Every 5 minutes" initially


Part 3: Extract Data from API

Step 3.1: Log Pipeline Start

Add PostgreSQL node after trigger:

  1. Name it: "Log Pipeline Start"
  2. Operation: Insert
  3. Table: sync_logs
  4. Columns:

This creates an audit trail for each run!

Step 3.2: Fetch Customer Data

Add HTTP Request node:

  1. Name: "Fetch Customers from API"
  2. Method: GET
  3. URL: https://jsonplaceholder.typicode.com/users

Response Format: JSON

For Production APIs:

  • Add authentication (API key, OAuth)
  • Add pagination if needed
  • Set timeouts appropriately

Step 3.3: Test API Call

Execute the node - you should get 10 users:
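Each user looks roughly like this (abridged from the JSONPlaceholder response):

```json
[
  {
    "id": 1,
    "name": "Leanne Graham",
    "username": "Bret",
    "email": "Sincere@april.biz",
    "address": { "street": "Kulas Light", "city": "Gwenborough", "zipcode": "92998-3874" },
    "phone": "1-770-736-8031 x56442",
    "company": { "name": "Romaguera-Crona" }
  }
]
```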


Part 4: Batch Processing

For large datasets (1000s+ records), process in batches to avoid memory issues.

Step 4.1: Add Split In Batches Node

After HTTP Request, add "Split In Batches":

  1. Name: "Process in Batches"
  2. Batch Size: 100
  3. Options:
    • Reset: ✓ (important for loops)

This splits data into chunks of 100 records.

Step 4.2: Understanding Batching

Why batch?

  • Prevents memory overflow
  • Better error recovery (partial success)
  • Faster database operations (bulk inserts)
  • More predictable performance

How it works:

  • First execution: Processes first 100 records
  • Loops back until all records processed
  • Each batch is independent

Part 5: Data Transformation

Step 5.1: Add Code Node

After "Split In Batches", add Code node:

Name: "Transform & Validate Data"
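The original code block was not preserved; below is a sketch of the transformation logic written as a plain function for readability. In the actual Code node you would read records via $input.all() and return the transformed items. The output field names (city, company, phone) are assumptions chosen to match the upsert step later.

```javascript
// Sketch of the "Transform & Validate Data" logic for one record.
function transformUser(user) {
  // Validation: drop records missing an id or a plausible email
  if (!user.id || !user.email || !user.email.includes('@')) {
    return null;
  }
  return {
    id: user.id,
    // Data cleaning: trim whitespace, normalize email casing
    name: (user.name || '').trim(),
    email: user.email.trim().toLowerCase(),
    // Flattening nested JSON: lift address/company fields to the top level
    city: user.address ? user.address.city : null,
    company: user.company ? user.company.name : null,
    // Null handling: default missing optional fields explicitly
    phone: user.phone || null,
  };
}

const sample = {
  id: 1,
  name: '  Leanne Graham ',
  email: 'Sincere@april.biz',
  address: { city: 'Gwenborough' },
  company: { name: 'Romaguera-Crona' },
};

console.log(transformUser(sample));
```

In the Code node itself, map over $input.all(), call transformUser(item.json), filter out the nulls, and return the survivors as items.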

Step 5.2: Understanding the Transformation

Key operations:

  1. Validation:

  2. Data Cleaning:

  3. Flattening Nested JSON:

  4. Null Handling:

Step 5.3: Add Data Quality Checks

Enhance validation:
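A sketch of what "enhanced" could mean here: collect per-record errors plus a reject rate so the report step can surface data quality. The thresholds and field names are assumptions.

```javascript
// Stricter validation with a per-run quality summary.
const EMAIL_RE = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

function checkQuality(records) {
  const valid = [];
  const rejected = [];
  for (const r of records) {
    const errors = [];
    if (!r.id) errors.push('missing id');
    if (!EMAIL_RE.test(r.email || '')) errors.push('invalid email');
    if (r.name && r.name.length > 255) errors.push('name too long');
    if (errors.length) {
      rejected.push({ record: r, errors });
    } else {
      valid.push(r);
    }
  }
  // The reject rate feeds the summary report
  return { valid, rejected, rejectRate: rejected.length / records.length };
}
```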


Part 6: Load to Database

Step 6.1: Add PostgreSQL Upsert Node

After transformation, add PostgreSQL node:

Name: "Upsert Customers to Database"

Operation: Execute Query

Query:
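A sketch of the upsert, assuming the customers columns used elsewhere in this tutorial; the $1… placeholders presume you pass values via the node's query-parameters option (inline expressions also work, but are injection-prone):

```sql
INSERT INTO customers (id, name, email, phone, city, company, updated_at)
VALUES ($1, $2, $3, $4, $5, $6, NOW())
ON CONFLICT (id) DO UPDATE SET
  name       = EXCLUDED.name,
  email      = EXCLUDED.email,
  phone      = EXCLUDED.phone,
  city       = EXCLUDED.city,
  company    = EXCLUDED.company,
  updated_at = NOW();
```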

Important: Use PostgreSQL Insert operation instead for easier mapping.

Step 6.2: Configure Insert with Conflict Resolution

Better approach - use Insert operation:

  1. Operation: Insert
  2. Table: customers
  3. Columns: Map all fields:
    • id → {{ $json.id }}
    • name → {{ $json.name }}
    • email → {{ $json.email }}
    • etc.
  4. Options:
    • Insert Mode: Upsert
    • Conflict Columns: id
    • Update Columns: All fields

This automatically handles insert or update!

Step 6.3: Loop Back for Next Batch

After PostgreSQL node:

  1. Connect it back to "Split In Batches" node
  2. Creates a loop that processes all batches

Flow: Split In Batches → Transform & Validate → Upsert → back to Split In Batches, repeating until every batch is processed.


Part 7: Generate Reports

After all batches complete, generate a summary report.

Step 7.1: Calculate Statistics

Add Code node after loop completes:

Name: "Generate Summary Report"
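The node's code wasn't preserved; here is a plain-function sketch of the idea (the field names are assumptions):

```javascript
// Summarize the pipeline run for the report.
function summarize(results, startedAtMs) {
  const failed = results.filter((r) => r.error).length;
  return {
    processed: results.length,
    succeeded: results.length - failed,
    failed,
    durationSeconds: Math.round((Date.now() - startedAtMs) / 1000),
  };
}
```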

Step 7.2: Query Database Statistics

Add PostgreSQL node:

Name: "Get Database Stats"

Operation: Execute Query

Query:
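A plausible stats query against the assumed customers schema:

```sql
SELECT
  COUNT(*)                                                 AS total_customers,
  COUNT(*) FILTER (WHERE updated_at::date = CURRENT_DATE)  AS updated_today
FROM customers;
```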

Step 7.3: Format Report Message

Add Code node:

Name: "Format Report"
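A sketch of the formatting logic (field names follow the summary sketch above and are assumptions):

```javascript
// Turn the run summary into a human-readable message for Slack/Email.
function formatReport(stats) {
  return [
    '📊 Customer Sync Report',
    `Processed: ${stats.processed}`,
    `Succeeded: ${stats.succeeded}`,
    `Failed: ${stats.failed}`,
    `Duration: ${stats.durationSeconds}s`,
  ].join('\n');
}
```

In n8n, return [{ json: { message: formatReport(stats) } }] so the downstream Slack/Email node can reference {{ $json.message }}.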

Step 7.4: Send Report

Add Slack or Email node:

For Slack:

  • Channel: #data-pipelines
  • Message: ={{ $json.message }}

For Email:

  • To: data-team@company.com
  • Subject: Customer Sync Report - {{ $now.format('YYYY-MM-DD HH:mm') }}
  • Text: ={{ $json.message }}

Part 8: Error Handling & Monitoring

Step 8.1: Create Error Workflow

  1. Create new workflow: "Pipeline Error Handler"
  2. Add Error Trigger
  3. Add Code to format error:
  4. Send to Slack/Email with high priority
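A sketch of the error-formatting code; the payload paths follow the shape the Error Trigger typically emits, but verify them against your n8n version:

```javascript
// Build an alert message from the Error Trigger payload.
function formatError(payload) {
  return [
    '🚨 Pipeline failure',
    `Workflow: ${payload.workflow?.name ?? 'unknown'}`,
    `Error: ${payload.execution?.error?.message ?? 'no message'}`,
    `Execution: ${payload.execution?.url ?? 'n/a'}`,
  ].join('\n');
}
```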

Step 8.2: Log Errors to Database

In error workflow, add PostgreSQL node:

Operation: Update

Table: sync_logs

Query:
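One way to write it, assuming the sync_logs columns used earlier; picking the latest 'running' row is itself an assumption:

```sql
UPDATE sync_logs
SET status        = 'failed',
    completed_at  = NOW(),
    error_message = $1
WHERE id = (SELECT MAX(id) FROM sync_logs WHERE status = 'running');
```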

Step 8.3: Set Error Workflow

In main pipeline:

  1. Workflow Settings (⚙️)
  2. Error Workflow: Select "Pipeline Error Handler"
  3. Save

Now errors are automatically caught and logged!

Step 8.4: Add Continue On Fail

For non-critical nodes:

  1. Select node (e.g., "Send Report")
  2. Settings tab
  3. Continue On Fail: ✓ (enable)

Pipeline won't stop if Slack is down.


Part 9: Production Deployment

Step 9.1: Update Sync Log on Success

Before sending report, add PostgreSQL node:

Name: "Log Sync Success"

Operation: Update

Query:
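A sketch mirroring the failure case, again assuming the sync_logs columns:

```sql
UPDATE sync_logs
SET status            = 'success',
    completed_at      = NOW(),
    records_processed = $1
WHERE id = (SELECT MAX(id) FROM sync_logs WHERE status = 'running');
```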

Step 9.2: Add Monitoring Dashboard Query

Create view for monitoring:
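For example (column names follow the assumed sync_logs schema):

```sql
-- Daily pipeline health rollup
CREATE OR REPLACE VIEW pipeline_health AS
SELECT
  DATE(started_at)                                          AS day,
  COUNT(*)                                                  AS runs,
  COUNT(*) FILTER (WHERE status = 'success')                AS successes,
  COUNT(*) FILTER (WHERE status = 'failed')                 AS failures,
  ROUND(AVG(EXTRACT(EPOCH FROM completed_at - started_at))) AS avg_seconds
FROM sync_logs
GROUP BY DATE(started_at)
ORDER BY day DESC;
```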

Query this for daily health check!

Step 9.3: Set Up Alerting Thresholds

Add IF node before sending report:

Condition:
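The original condition wasn't preserved; a reasonable one, assuming the summary fields from the report step:

```
={{ $json.failed > 0 || $json.processed === 0 }}
```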

True: Send high-priority alert (something's wrong)
False: Normal report

Step 9.4: Implement Incremental Sync (Optional)

For APIs that support it:

  1. Store last sync timestamp in database:

  2. Query before API call:

  3. Use in API request:

  4. Update after successful sync

This fetches only changed records!
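Put together, the four steps might look like this. The sync_state table and the updated_since parameter are assumptions; JSONPlaceholder itself doesn't support filtering, so this applies to real APIs:

```sql
-- 1. Store the last sync timestamp
CREATE TABLE IF NOT EXISTS sync_state (
  key            TEXT PRIMARY KEY,
  last_synced_at TIMESTAMPTZ
);

-- 2. Query it before the API call
SELECT last_synced_at FROM sync_state WHERE key = 'customers';

-- 3. In the HTTP Request node, pass it along, e.g.
--    ?updated_since={{ $json.last_synced_at }}

-- 4. Advance it only after a successful load
INSERT INTO sync_state (key, last_synced_at)
VALUES ('customers', NOW())
ON CONFLICT (key) DO UPDATE SET last_synced_at = EXCLUDED.last_synced_at;
```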


Advanced Topics

Incremental Sync Pattern

Full implementation: build on Step 9.4 by persisting a last-sync timestamp, filtering the API request by it, and advancing it only after a successful load.

Data Deduplication
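A common Postgres idiom, deduplicating on email and keeping the newest row; treat the choice of key as an assumption:

```sql
-- Delete older duplicates, keeping the row with the highest id per email
DELETE FROM customers a
USING customers b
WHERE a.email = b.email
  AND a.id < b.id;
```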

Parallel Processing

For multiple data sources: run one HTTP Request branch per source from the same trigger, then combine the results with a Merge node before transformation.

Data Retention Policy
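For example, pruning old audit rows (the 90-day window is arbitrary):

```sql
-- Keep sync_logs bounded; run this from a scheduled workflow
DELETE FROM sync_logs
WHERE started_at < NOW() - INTERVAL '90 days';
```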

Add as scheduled workflow.


Troubleshooting

Issue: Database Connection Fails

Error: Connection refused or Timeout

Solutions:

  1. Confirm the container is running: docker ps
  2. If n8n also runs in Docker, use the PostgreSQL container name as Host instead of localhost
  3. Verify port 5432 is published and the credentials match the n8n credential

Issue: Batch Loop Never Completes

Cause: "Reset" not enabled on Split In Batches

Solution:

  1. Select "Split In Batches" node
  2. Options → Reset: ✓ (enabled)
  3. Save and re-run

Issue: Duplicate Records Despite Upsert

Cause: Conflict column mismatch

Solution: ensure the conflict column (id) has a PRIMARY KEY or UNIQUE constraint, and that the node's Conflict Columns setting names exactly that column.

Issue: Out of Memory on Large Datasets

Solutions:

  1. Reduce batch size (e.g., drop Split In Batches from 100 to 25–50)
  2. Limit saved execution data (Workflow Settings: save execution data only on error)
  3. Paginate the API request instead of loading the full response at once

Issue: Slow Database Inserts

Optimize:
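Typical fixes: keep batching on (bulk inserts) and index the columns you filter or conflict on. The indexes below are an assumption; id is already the primary key:

```sql
CREATE INDEX IF NOT EXISTS idx_customers_email ON customers (email);
CREATE INDEX IF NOT EXISTS idx_sync_logs_started_at ON sync_logs (started_at);
```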


Testing the Complete Workflow

Test Checklist

Smoke Test:

  1. Execute workflow manually
  2. Verify all nodes execute successfully
  3. Check database has data
  4. Confirm report received

Data Validation:

  1. Compare the customers row count with the number of records the API returned
  2. Spot-check a few rows for correct transformation and cleaning
  3. Re-run the workflow and confirm upserts update rows instead of duplicating them

Error Handling:

  1. Temporarily break API URL
  2. Verify error workflow triggers
  3. Check error logged to database
  4. Confirm alert received

Performance:

  1. Test with large dataset (100+ records)
  2. Verify batching works
  3. Check execution time
  4. Monitor memory usage

Production Checklist

Before deploying to production:

  • Database credentials use environment variables
  • Error workflow configured
  • Monitoring and alerting set up
  • Workflow tested with production-size data
  • Execution data retention configured
  • Schedule set appropriately
  • Documentation updated
  • Team trained on workflow
  • Backup and recovery plan in place
  • Performance benchmarked

Key Concepts Learned

✅ ETL Patterns

  • Extract: API data fetching
  • Transform: Validation, cleaning, flattening
  • Load: Database upserts

✅ Batch Processing

  • Split large datasets
  • Process in chunks
  • Loop until complete

✅ Data Quality

  • Validation rules
  • Error handling
  • Data cleansing

✅ Production Practices

  • Logging and auditing
  • Error alerting
  • Performance monitoring
  • Incremental sync

✅ Database Operations

  • INSERT, UPDATE, UPSERT
  • Indexes for performance
  • Transactions and consistency

Next Steps

Congratulations! 🎉

You've built a production-ready data pipeline! You now understand:

  • End-to-end ETL workflows
  • Database integration
  • Batch processing
  • Error handling
  • Production monitoring

Continue Learning

Apply These Skills:

  1. Modify for your data:

    • Connect to your actual API
    • Adjust transformation logic
    • Add your business rules
  2. Expand the pipeline:

    • Add data enrichment (API lookups)
    • Implement data quality scoring
    • Create data lineage tracking
  3. Build related workflows:

    • Data validation monitors
    • Scheduled reports
    • Alert workflows
  4. Explore advanced topics:

Recommended Projects

Build these to practice:

Data Engineering:

  • Multi-source data aggregation
  • Real-time sync with webhooks
  • Change data capture (CDC)
  • Data quality monitoring

Business Intelligence:

  • Automated reporting dashboards
  • KPI tracking workflows
  • Anomaly detection alerts

Additional Resources

Tools & Extensions

  • dbt: SQL transformations in warehouse
  • Great Expectations: Data validation framework
  • Airflow: Complex orchestration
  • Metabase/Superset: BI dashboards

Further Reading


Get Help

Questions? Issues?

Need production assistance? Contact me for consulting, architecture review, or team training.



Happy building! 🚀
