Getting Started with Airbyte
This comprehensive guide will walk you through installing Airbyte, setting up your first data pipeline, and understanding core configuration options.
Time: 45-60 minutes
Prerequisites: Docker installed, basic command-line knowledge
Installation Options
Option 1: Docker Compose (Recommended for Getting Started)
The fastest way to get Airbyte running locally.
Step 1: Install Prerequisites
Docker Desktop: download and install from docker.com (on Linux, Docker Engine plus the Compose plugin also works)
Verify Installation:
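A quick check that Docker and Compose are available (version numbers will differ):

```bash
docker --version
docker compose version
```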
Step 2: Download Airbyte
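At the time of writing, the Docker Compose distribution is fetched by cloning the repository and running the bundled helper script; the script name may differ in newer releases:

```bash
git clone --depth=1 https://github.com/airbytehq/airbyte.git
cd airbyte
./run-ab-platform.sh
```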
What Happens:
- Downloads latest Airbyte docker-compose configuration
- Pulls Docker images (~2-3 GB)
- Starts all Airbyte services
- Creates PostgreSQL database for metadata
Wait for startup (usually 2-3 minutes):
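One way to wait for startup from a script is to poll the health endpoint (path assumed from the public API; adjust if your version differs):

```bash
until curl -sf http://localhost:8000/api/v1/health > /dev/null; do
  echo "Waiting for Airbyte to start..."
  sleep 5
done
echo "Airbyte is up"
```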
Step 3: Access Airbyte UI
- Open a browser to http://localhost:8000
- Create an admin account (first-time setup)
- Set an email and password
- Optionally sign up for product updates
Option 2: Airbyte Cloud (Fully Managed)
No infrastructure management required.
Step 1: Sign Up
- Visit cloud.airbyte.com
- Sign up with email or Google/GitHub
- Verify email address
- Create workspace
Step 2: Connect Data Warehouse (Optional)
Cloud version runs in Airbyte's infrastructure but can write to your warehouse.
Free Tier Includes:
- 1 workspace
- Unlimited sources and destinations
- 5 GB synced data per month
- Community support
Option 3: Kubernetes (Production)
For production deployments requiring scalability.
Prerequisites
- Kubernetes cluster (1.19+)
- kubectl configured
- Helm 3.0+
Install with Helm
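A minimal install sketch, assuming the official chart repository and default values (service names vary between chart versions, so check kubectl get svc before port-forwarding):

```bash
helm repo add airbyte https://airbytehq.github.io/helm-charts
helm repo update
helm install airbyte airbyte/airbyte --namespace airbyte --create-namespace

# Expose the web UI locally; substitute the webapp service name from your cluster
kubectl -n airbyte port-forward svc/airbyte-airbyte-webapp-svc 8000:80
```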
Access at http://localhost:8000
Your First Data Pipeline
Let's create a complete pipeline: Postgres → Snowflake
Step 1: Add a Source (PostgreSQL)
Navigate to Sources
- Click "Sources" in left sidebar
- Click "+ New Source"
- Search for "Postgres"
- Click "Postgres"
Configure PostgreSQL Source
Connection Details:
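Example values for a local development database (placeholders, not real credentials):

```
Host:     host.docker.internal   (reaches a database on the Docker host; use the real hostname otherwise)
Port:     5432
Database: mydb
Username: airbyte_user
Password: ********
Schemas:  public
```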
SSL Configuration:
- For production: Enable SSL
- For local dev: Disable SSL
Advanced Options:
Test Connection
Click "Test" button:
- ✅ Connection successful
- ✅ Can list schemas
- ✅ Can read table metadata
Click "Set up source"
Step 2: Add a Destination (Snowflake)
Navigate to Destinations
- Click "Destinations" in left sidebar
- Click "+ New Destination"
- Search for "Snowflake"
- Click "Snowflake (tables)"
Configure Snowflake Destination
Account Information:
Authentication:
Advanced Options:
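Taken together, the fields above might look like this (example values; your account identifier, role, and warehouse will differ):

```
Host:      <account>.snowflakecomputing.com
Role:      AIRBYTE_ROLE
Warehouse: AIRBYTE_WAREHOUSE
Database:  ANALYTICS
Schema:    AIRBYTE_SCHEMA
Username:  AIRBYTE_USER
Auth:      password or key-pair, per your security policy
```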
Test Connection
Click "Test":
- ✅ Can connect to Snowflake
- ✅ Can create schema
- ✅ Can write test record
Click "Set up destination"
Step 3: Create a Connection
Start Connection Setup
- From Sources page, click your PostgreSQL source
- Click "Set up connection" or "+ New connection"
- Select "My Snowflake Warehouse" as destination
- Click "Set up connection"
Configure Sync Settings
Replication Frequency:
Or for manual:
Destination Namespace:
Destination Stream Prefix:
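An example of the sync settings above (illustrative values):

```
Replication frequency:     Every 24 hours   (or Manual to trigger syncs yourself)
Destination namespace:     Mirror source structure
Destination stream prefix: raw_             (optional; prepended to every destination table name)
```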
Select Streams (Tables)
All Tables View:
- Shows all tables from source
- Check tables to sync
- Configure sync mode per table
Example:
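A selection might look like this (hypothetical tables):

```
[x] users      Incremental | Append + Dedup
[x] orders     Incremental | Append + Dedup
[x] events     Incremental | Append
[ ] temp_jobs  (not synced)
```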
Configure Sync Modes
Full Refresh | Overwrite:
- Deletes destination table
- Reloads all source data
- Use for: Small dimension tables
Full Refresh | Append:
- Keeps existing data
- Appends new full snapshot
- Use for: Historical snapshots
Incremental | Append:
- Only syncs new rows (based on cursor)
- Appends to destination
- Use for: Event logs, append-only tables
Incremental | Append + Dedup:
- Syncs new/updated rows
- Deduplicates based on primary key
- Use for: Transactional tables with updates
Example Configuration:
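For instance, a transactional orders table (hypothetical column names):

```
Stream:      orders
Sync mode:   Incremental | Append + Dedup
Cursor:      updated_at
Primary key: order_id
```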
Configure Transformations (Optional)
Basic Normalization:
- ✅ Normalize Data
- Creates typed tables from JSON
- Uses dbt under the hood
Custom dbt Transformations:
- Link GitHub repo with dbt models
- Run transformations after sync
- [Advanced feature]
Finalize Connection
- Review configuration
- Click "Set up connection"
- Connection created!
Step 4: Run Your First Sync
Manual Sync
- Go to "Connections" page
- Find your new connection
- Click "Sync now"
Watch the Progress: the job status view shows records synced and elapsed time while the sync runs.
Verify in Snowflake
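Once the sync completes, spot-check the destination; adjust the database, schema, and table names to match your configuration:

```sql
USE DATABASE ANALYTICS;

SELECT COUNT(*) FROM AIRBYTE_SCHEMA.USERS;

SELECT *
FROM AIRBYTE_SCHEMA.ORDERS
ORDER BY UPDATED_AT DESC
LIMIT 10;
```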
Understanding Sync Modes
Full Refresh Sync Modes
Full Refresh | Overwrite:
Full Refresh | Append:
Use Full Refresh When:
- Table is small (<1M rows)
- Data changes unpredictably
- No cursor field available
- Need historical snapshots
Incremental Sync Modes
Incremental | Append:
Incremental | Append + Dedup:
Cursor Field Requirements:
- Monotonically increasing (timestamp, auto-increment ID)
- Never null
- Updated whenever the row changes (use updated_at, not created_at)
Primary Key for Dedup:
- Unique identifier
- Composite keys supported: [user_id, order_id]
Configuration Deep Dive
Source Configuration
Replication Method:
Standard:
- Queries database directly
- SELECT statements
- Works for all databases
- Can impact source database performance
Change Data Capture (CDC):
- Uses database transaction logs
- Minimal source impact
- Captures deletes
- Requires database configuration
CDC Setup (PostgreSQL Example):
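A sketch of the source-side setup for logical replication; exact steps depend on your PostgreSQL version and hosting provider, and wal_level requires a server restart:

```sql
-- Enable logical decoding (restart required)
ALTER SYSTEM SET wal_level = 'logical';

-- Create a replication slot and a publication for Airbyte to read from
SELECT pg_create_logical_replication_slot('airbyte_slot', 'pgoutput');
CREATE PUBLICATION airbyte_publication FOR ALL TABLES;
```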
Destination Configuration
Staging:
Some destinations use staging (S3, GCS) before loading:
Configuration:
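An S3 staging setup might be configured like this (bucket name and region are placeholders):

```
Staging type:      S3
S3 bucket:         my-airbyte-staging
Bucket region:     us-east-1
Credentials:       IAM access key with read/write on the bucket
Purge staged data: enabled (delete staged files after loading)
```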
Loading Method:
COPY (Recommended):
- Fastest for large datasets
- Uses warehouse's native COPY command
- Requires staging
INSERT:
- Direct INSERT statements
- Slower but no staging needed
- Good for small datasets
Monitoring and Troubleshooting
View Sync History
- Go to connection
- Click "Sync History" tab
- View past sync runs:
- Status (Success, Failed, Partial Success)
- Records synced
- Duration
- Error logs
Debug Failed Syncs
Check Logs:
- Click failed sync job
- View "Logs" tab
- Look for error messages
Common Issues:
Connection Timeout: the source host is unreachable from Airbyte; check network access, firewall rules, and whether the database accepts remote connections.
Permission Denied: the database user lacks privileges; grant SELECT on the schemas being synced (and replication rights for CDC).
Out of Memory: a sync worker exceeded its memory limit; raise the worker memory settings or sync fewer streams at once.
Sync Job Retry
Failed syncs don't auto-retry by default:
- Fix the issue
- Click "Sync now" to retry
Essential CLI Commands
Airbyte provides a CLI for advanced operations:
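The server also exposes an HTTP API, so most operations can be scripted. A sketch that triggers a sync, assuming you copy the connection ID from the connection's URL in the UI (endpoint path may vary by version):

```bash
curl -X POST http://localhost:8000/api/v1/connections/sync \
  -H "Content-Type: application/json" \
  -d '{"connectionId": "<your-connection-id>"}'
```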
Next Steps
Immediate (Today)
- ✅ Set up second source (try File/CSV or API)
- ✅ Configure incremental sync with cursor field
- ✅ Enable normalization and explore dbt models
- ✅ Set up scheduled sync
Short-term (This Week)
- Read Best Practices
- Explore Use Cases
- Build a custom connector
- Set up monitoring and alerting
Medium-term (This Month)
- Deploy to production (Kubernetes)
- Implement CI/CD for connector configs
- Set up dbt transformations
- Optimize sync schedules and performance
Quick Reference
Common Sync Frequencies
Cursor Field Best Practices
Resource Requirements
Docker Compose:
- 8 GB RAM minimum
- 20 GB disk space
- 2 CPU cores
Kubernetes:
- 3 worker nodes (2 vCPU, 8 GB RAM each)
- 50 GB persistent storage
- Load balancer
Troubleshooting Guide
Issue: Can't Access UI at localhost:8000
Check Docker:
Solution:
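Typical first checks with Docker Compose (service names may differ by release):

```bash
docker compose ps                             # are all containers Up?
docker compose logs --tail 100 server         # recent server logs
docker compose down && docker compose up -d   # restart the stack
```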
Issue: Sync Fails with "Out of Memory"
Solution:
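With Docker Compose, one knob is the worker memory settings in the .env file that ships with the platform (variable names may vary by version):

```
JOB_MAIN_CONTAINER_MEMORY_REQUEST=2g
JOB_MAIN_CONTAINER_MEMORY_LIMIT=4g
```

Restart the stack after changing them. Alternatively, sync fewer streams per connection.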
Issue: Can't Connect to Source Database
Check Network:
Solution:
- Check firewall rules
- Use host.docker.internal for localhost databases
- Verify credentials
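Quick connectivity checks from the machine running Airbyte (<db-host> is a placeholder for your database hostname):

```bash
nc -zv <db-host> 5432                                      # is the port reachable?
psql -h <db-host> -U airbyte_user -d mydb -c 'SELECT 1'    # do the credentials work?
```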
Resources
Ready to build complex pipelines? → See Best Practices
Want real-world examples? → Explore Use Cases
Build a custom connector? → Start Tutorial 1