Getting Started with Airbyte
This comprehensive guide will walk you through installing Airbyte, setting up your first data pipeline, and understanding core configuration options.
Time: 45-60 minutes
Prerequisites: Docker installed, basic command-line knowledge
Installation Options
Option 1: Docker Compose (Recommended for Getting Started)
The fastest way to get Airbyte running locally.
Step 1: Install Prerequisites
Docker Desktop: download and install from docker.com (on Linux, Docker Engine plus the Compose plugin also works)
Verify Installation:
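A quick check that Docker and Compose are available (version numbers will differ):

```bash
docker --version
docker compose version
```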
Step 2: Download Airbyte
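At the time of writing, the Docker Compose distribution is fetched by cloning the repository and running the bundled helper script; the script name may differ in newer releases:

```bash
git clone --depth=1 https://github.com/airbytehq/airbyte.git
cd airbyte
./run-ab-platform.sh
```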
What Happens:
- Downloads latest Airbyte docker-compose configuration
- Pulls Docker images (~2-3 GB)
- Starts all Airbyte services
- Creates PostgreSQL database for metadata
Wait for startup (usually 2-3 minutes):
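One way to wait for startup from a script is to poll the health endpoint (path assumed from the public API; adjust if your version differs):

```bash
until curl -sf http://localhost:8000/api/v1/health > /dev/null; do
  echo "Waiting for Airbyte to start..."
  sleep 5
done
echo "Airbyte is up"
```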
Step 3: Access Airbyte UI
- Open a browser to http://localhost:8000
- Create an admin account (first-time setup)
- Set an email and password
- Optionally sign up for product updates
Option 2: Airbyte Cloud (Fully Managed)
No infrastructure management required.
Step 1: Sign Up
- Visit cloud.airbyte.com
- Sign up with email or Google/GitHub
- Verify email address
- Create workspace
Step 2: Connect Data Warehouse (Optional)
Cloud version runs in Airbyte's infrastructure but can write to your warehouse.
Free Tier Includes:
- 1 workspace
- Unlimited sources and destinations
- 5 GB synced data per month
- Community support
Option 3: Kubernetes (Production)
For production deployments requiring scalability.
Prerequisites
- Kubernetes cluster (1.19+)
- kubectl configured
- Helm 3.0+
Install with Helm
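A minimal install sketch, assuming the official chart repository and default values (service names vary between chart versions, so check kubectl get svc before port-forwarding):

```bash
helm repo add airbyte https://airbytehq.github.io/helm-charts
helm repo update
helm install airbyte airbyte/airbyte --namespace airbyte --create-namespace

# Expose the web UI locally; substitute the webapp service name from your cluster
kubectl -n airbyte port-forward svc/airbyte-airbyte-webapp-svc 8000:80
```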
Access at http://localhost:8000
Your First Data Pipeline
Let's create a complete pipeline: Postgres → Snowflake
Step 1: Add a Source (PostgreSQL)
Navigate to Sources
- Click "Sources" in left sidebar
- Click "+ New Source"
- Search for "Postgres"
- Click "Postgres"
Configure PostgreSQL Source
Connection Details:
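Example values for a local development database (placeholders, not real credentials):

```
Host:     host.docker.internal   (reaches a database on the Docker host; use the real hostname otherwise)
Port:     5432
Database: mydb
Username: airbyte_user
Password: ********
Schemas:  public
```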
SSL Configuration:
- For production: Enable SSL
- For local dev: Disable SSL
Advanced Options:
Test Connection
Click "Test" button:
- ✅ Connection successful
- ✅ Can list schemas
- ✅ Can read table metadata
Click "Set up source"
Step 2: Add a Destination (Snowflake)
Navigate to Destinations
- Click "Destinations" in left sidebar
- Click "+ New Destination"
- Search for "Snowflake"
- Click "Snowflake (tables)"
Configure Snowflake Destination
Account Information:
Authentication:
Advanced Options:
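Taken together, the fields above might look like this (example values; your account identifier, role, and warehouse will differ):

```
Host:      <account>.snowflakecomputing.com
Role:      AIRBYTE_ROLE
Warehouse: AIRBYTE_WAREHOUSE
Database:  ANALYTICS
Schema:    AIRBYTE_SCHEMA
Username:  AIRBYTE_USER
Auth:      password or key-pair, per your security policy
```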
Test Connection
Click "Test":
- ✅ Can connect to Snowflake
- ✅ Can create schema
- ✅ Can write test record
Click "Set up destination"
Step 3: Create a Connection
Start Connection Setup
- From Sources page, click your PostgreSQL source
- Click "Set up connection" or "+ New connection"
- Select "My Snowflake Warehouse" as destination
- Click "Set up connection"
Configure Sync Settings
Replication Frequency:
Or for manual:
Destination Namespace:
Destination Stream Prefix:
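An example of the sync settings above (illustrative values):

```
Replication frequency:     Every 24 hours   (or Manual to trigger syncs yourself)
Destination namespace:     Mirror source structure
Destination stream prefix: raw_             (optional; prepended to every destination table name)
```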
Select Streams (Tables)
All Tables View:
- Shows all tables from source
- Check tables to sync
- Configure sync mode per table
Example:
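A selection might look like this (hypothetical tables):

```
[x] users      Incremental | Append + Dedup
[x] orders     Incremental | Append + Dedup
[x] events     Incremental | Append
[ ] temp_jobs  (not synced)
```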
Configure Sync Modes
Full Refresh | Overwrite:
- Deletes destination table
- Reloads all source data
- Use for: Small dimension tables
Full Refresh | Append:
- Keeps existing data
- Appends new full snapshot
- Use for: Historical snapshots
Incremental | Append:
- Only syncs new rows (based on cursor)
- Appends to destination
- Use for: Event logs, append-only tables
Incremental | Append + Dedup:
- Syncs new/updated rows
- Deduplicates based on primary key
- Use for: Transactional tables with updates
Example Configuration:
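For instance, a transactional orders table (hypothetical column names):

```
Stream:      orders
Sync mode:   Incremental | Append + Dedup
Cursor:      updated_at
Primary key: order_id
```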
Configure Transformations (Optional)
Basic Normalization:
- ✅ Normalize Data
- Creates typed tables from JSON
- Uses dbt under the hood
Custom dbt Transformations:
- Link GitHub repo with dbt models
- Run transformations after sync
- [Advanced feature]
Finalize Connection
- Review configuration
- Click "Set up connection"
- Connection created!
Step 4: Run Your First Sync
Manual Sync
- Go to "Connections" page
- Find your new connection
- Click "Sync now"
Watch the Progress: the job status view shows records synced and elapsed time while the sync runs.
Verify in Snowflake
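Once the sync completes, spot-check the destination; adjust the database, schema, and table names to match your configuration:

```sql
USE DATABASE ANALYTICS;

SELECT COUNT(*) FROM AIRBYTE_SCHEMA.USERS;

SELECT *
FROM AIRBYTE_SCHEMA.ORDERS
ORDER BY UPDATED_AT DESC
LIMIT 10;
```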
Understanding Sync Modes
Full Refresh Sync Modes
Full Refresh | Overwrite:
Full Refresh | Append:
Use Full Refresh When:
- Table is small (<1M rows)
- Data changes unpredictably
- No cursor field available
- Need historical snapshots
Incremental Sync Modes
Incremental | Append:
Incremental | Append + Dedup:
Cursor Field Requirements:
- Monotonically increasing (timestamp, auto-increment ID)
- Never null
- Updated whenever the row changes (use updated_at, not created_at)
Primary Key for Dedup:
- Unique identifier
- Composite keys supported: [user_id, order_id]
Configuration Deep Dive
Source Configuration
Replication Method:
Standard:
- Queries database directly
- SELECT statements
- Works for all databases
- Can impact source database performance
Change Data Capture (CDC):
- Uses database transaction logs
- Minimal source impact
- Captures deletes
- Requires database configuration
CDC Setup (PostgreSQL Example):
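A sketch of the source-side setup for logical replication; exact steps depend on your PostgreSQL version and hosting provider, and wal_level requires a server restart:

```sql
-- Enable logical decoding (restart required)
ALTER SYSTEM SET wal_level = 'logical';

-- Create a replication slot and a publication for Airbyte to read from
SELECT pg_create_logical_replication_slot('airbyte_slot', 'pgoutput');
CREATE PUBLICATION airbyte_publication FOR ALL TABLES;
```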
Destination Configuration
Staging:
Some destinations use staging (S3, GCS) before loading:
Configuration:
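An S3 staging setup might be configured like this (bucket name and region are placeholders):

```
Staging type:      S3
S3 bucket:         my-airbyte-staging
Bucket region:     us-east-1
Credentials:       IAM access key with read/write on the bucket
Purge staged data: enabled (delete staged files after loading)
```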
Loading Method:
COPY (Recommended):
- Fastest for large datasets
- Uses warehouse's native COPY command
- Requires staging
INSERT:
- Direct INSERT statements
- Slower but no staging needed
- Good for small datasets
Monitoring and Troubleshooting
View Sync History
- Go to connection
- Click "Sync History" tab
- View past sync runs:
- Status (Success, Failed, Partial Success)
- Records synced
- Duration
- Error logs
Debug Failed Syncs
Check Logs:
- Click failed sync job
- View "Logs" tab
- Look for error messages
Common Issues:
Connection Timeout: the source host is unreachable from Airbyte; check network access, firewall rules, and whether the database accepts remote connections.
Permission Denied: the database user lacks privileges; grant SELECT on the schemas being synced (and replication rights for CDC).
Out of Memory: a sync worker exceeded its memory limit; raise the worker memory settings or sync fewer streams at once.
Sync Job Retry
Failed syncs don't auto-retry by default:
- Fix the issue
- Click "Sync now" to retry
Essential CLI Commands
Airbyte provides a CLI for advanced operations:
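The server also exposes an HTTP API, so most operations can be scripted. A sketch that triggers a sync, assuming you copy the connection ID from the connection's URL in the UI (endpoint path may vary by version):

```bash
curl -X POST http://localhost:8000/api/v1/connections/sync \
  -H "Content-Type: application/json" \
  -d '{"connectionId": "<your-connection-id>"}'
```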
Next Steps
Immediate (Today)
- ✅ Set up second source (try File/CSV or API)
- ✅ Configure incremental sync with cursor field
- ✅ Enable normalization and explore dbt models
- ✅ Set up scheduled sync
Short-term (This Week)
- Read Best Practices
- Explore Use Cases
- Build a custom connector
- Set up monitoring and alerting
Medium-term (This Month)
- Deploy to production (Kubernetes)
- Implement CI/CD for connector configs
- Set up dbt transformations
- Optimize sync schedules and performance
Quick Reference
Common Sync Frequencies
Cursor Field Best Practices
Resource Requirements
Docker Compose:
- 8 GB RAM minimum
- 20 GB disk space
- 2 CPU cores
Kubernetes:
- 3 worker nodes (2 vCPU, 8 GB RAM each)
- 50 GB persistent storage
- Load balancer
Troubleshooting Guide
Issue: Can't Access UI at localhost:8000
Check Docker:
Solution:
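Typical first checks with Docker Compose (service names may differ by release):

```bash
docker compose ps                             # are all containers Up?
docker compose logs --tail 100 server         # recent server logs
docker compose down && docker compose up -d   # restart the stack
```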
Issue: Sync Fails with "Out of Memory"
Solution:
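With Docker Compose, one knob is the worker memory settings in the .env file that ships with the platform (variable names may vary by version):

```
JOB_MAIN_CONTAINER_MEMORY_REQUEST=2g
JOB_MAIN_CONTAINER_MEMORY_LIMIT=4g
```

Restart the stack after changing them. Alternatively, sync fewer streams per connection.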
Issue: Can't Connect to Source Database
Check Network:
Solution:
- Check firewall rules
- Use host.docker.internal for localhost databases
- Verify credentials
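Quick connectivity checks from the machine running Airbyte (<db-host> is a placeholder for your database hostname):

```bash
nc -zv <db-host> 5432                                      # is the port reachable?
psql -h <db-host> -U airbyte_user -d mydb -c 'SELECT 1'    # do the credentials work?
```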
Resources
Ready to build complex pipelines? → See Best Practices
Want real-world examples? → Explore Use Cases
Build a custom connector? → Start Tutorial 1