
Getting Started with Airbyte

This comprehensive guide will walk you through installing Airbyte, setting up your first data pipeline, and understanding core configuration options.


Time: 45-60 minutes
Prerequisites: Docker installed, basic command line knowledge


Installation Options

Option 1: Docker Compose (Recommended for Getting Started)

The fastest way to get Airbyte running locally.

Step 1: Install Prerequisites

Docker Desktop:

Verify Installation:
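A quick way to confirm Docker is ready before proceeding (assumes a recent Docker Desktop, which bundles the Compose v2 plugin):

```shell
# Confirm Docker and the Compose plugin are installed, and the daemon is running
docker --version          # e.g. Docker version 24.x
docker compose version    # Compose v2 ships with recent Docker Desktop
docker info > /dev/null && echo "Docker daemon is running"
```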

Step 2: Download Airbyte
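The commands below sketch the classic Docker Compose quickstart (the repository URL is Airbyte's public GitHub; newer releases may point you to the `abctl` installer instead, so check the current docs):

```shell
# Clone the Airbyte repo and start the platform with Docker Compose
git clone --depth=1 https://github.com/airbytehq/airbyte.git
cd airbyte
./run-ab-platform.sh   # wraps docker compose up with the right env files
```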

What Happens:

  • Downloads latest Airbyte docker-compose configuration
  • Pulls Docker images (~2-3 GB)
  • Starts all Airbyte services
  • Creates PostgreSQL database for metadata

Wait for startup (usually 2-3 minutes):
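You can poll the web server until it responds (a simple readiness loop, not an official health endpoint):

```shell
# Wait until the Airbyte UI answers on port 8000
until curl -sSf http://localhost:8000 > /dev/null; do
  echo "Waiting for Airbyte to start..."
  sleep 5
done
echo "Airbyte is up"
```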

Step 3: Access Airbyte UI

  1. Open browser to http://localhost:8000
  2. Create admin account (first-time setup)
  3. Set email and password
  4. Optionally sign up for product updates

Option 2: Airbyte Cloud (Fully Managed)

No infrastructure management required.

Step 1: Sign Up

  1. Visit cloud.airbyte.com
  2. Sign up with email or Google/GitHub
  3. Verify email address
  4. Create workspace

Step 2: Connect Data Warehouse (Optional)

Cloud version runs in Airbyte's infrastructure but can write to your warehouse.

Free Tier Includes:

  • 1 workspace
  • Unlimited sources and destinations
  • 5 GB synced data per month
  • Community support

Option 3: Kubernetes (Production)

For production deployments requiring scalability.

Prerequisites

  • Kubernetes cluster (1.19+)
  • kubectl configured
  • Helm 3.0+

Install with Helm
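A minimal install using Airbyte's public Helm chart repository (the release name, namespace, and port-forward service name are illustrative; check the chart's install output for the exact service names in your chart version):

```shell
# Add the Airbyte Helm repo and install the chart
helm repo add airbyte https://airbytehq.github.io/helm-charts
helm repo update
helm install airbyte airbyte/airbyte --namespace airbyte --create-namespace

# Forward the web app locally (service name may differ by chart version)
kubectl -n airbyte port-forward svc/airbyte-airbyte-webapp-svc 8000:80
```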

Access at http://localhost:8000


Your First Data Pipeline

Let's create a complete pipeline: Postgres → Snowflake

Step 1: Add a Source (PostgreSQL)

Navigate to Sources

  1. Click "Sources" in left sidebar
  2. Click "+ New Source"
  3. Search for "Postgres"
  4. Click "Postgres"

Configure PostgreSQL Source

Connection Details:
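Typical values for the source form (host, database, and user are placeholders; this mirrors the UI fields, it is not a file Airbyte reads):

```yaml
host: mydb.example.com        # or host.docker.internal for a DB on your laptop
port: 5432
database: analytics
schemas: [public]
username: airbyte_user        # prefer a read-only user created for Airbyte
password: ********
```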

SSL Configuration:

  • For production: Enable SSL
  • For local dev: Disable SSL

Advanced Options:

Test Connection

Click "Test" button:

  • ✅ Connection successful
  • ✅ Can list schemas
  • ✅ Can read table metadata

Click "Set up source"


Step 2: Add a Destination (Snowflake)

Navigate to Destinations

  1. Click "Destinations" in left sidebar
  2. Click "+ New Destination"
  3. Search for "Snowflake"
  4. Click "Snowflake (tables)"

Configure Snowflake Destination

Account Information:

Authentication:
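Illustrative values for the account and authentication fields (the account locator, role, and warehouse names are placeholders):

```yaml
host: abc12345.us-east-1.snowflakecomputing.com
role: AIRBYTE_ROLE
warehouse: AIRBYTE_WAREHOUSE
database: AIRBYTE_DATABASE
default_schema: AIRBYTE_SCHEMA
username: AIRBYTE_USER
auth: password or key-pair    # key-pair is generally preferred for production
```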

Advanced Options:

Test Connection

Click "Test":

  • ✅ Can connect to Snowflake
  • ✅ Can create schema
  • ✅ Can write test record

Click "Set up destination"


Step 3: Create a Connection

Start Connection Setup

  1. From Sources page, click your PostgreSQL source
  2. Click "Set up connection" or "+ New connection"
  3. Select "My Snowflake Warehouse" as destination
  4. Click "Set up connection"

Configure Sync Settings

Replication Frequency:

Or for manual:

Destination Namespace:

Destination Stream Prefix:

Select Streams (Tables)

All Tables View:

  • Shows all tables from source
  • Check tables to sync
  • Configure sync mode per table

Example:

Configure Sync Modes

Full Refresh | Overwrite:

  • Deletes destination table
  • Reloads all source data
  • Use for: Small dimension tables

Full Refresh | Append:

  • Keeps existing data
  • Appends new full snapshot
  • Use for: Historical snapshots

Incremental | Append:

  • Only syncs new rows (based on cursor)
  • Appends to destination
  • Use for: Event logs, append-only tables

Incremental | Append + Dedup:

  • Syncs new/updated rows
  • Deduplicates based on primary key
  • Use for: Transactional tables with updates

Example Configuration:
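A per-stream setup might look like this (table and column names come from a hypothetical schema, and the notation is shorthand for the UI choices):

```yaml
streams:
  - name: customers
    sync_mode: full_refresh | overwrite      # small dimension table
  - name: orders
    sync_mode: incremental | append_dedup
    cursor_field: updated_at
    primary_key: [id]
  - name: page_views
    sync_mode: incremental | append          # append-only event log
```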

Configure Transformations (Optional)

Basic Normalization:

  • ✅ Normalize Data
  • Creates typed tables from JSON
  • Uses dbt under the hood

Custom dbt Transformations:

  • Link GitHub repo with dbt models
  • Run transformations after sync
  • [Advanced feature]

Finalize Connection

  1. Review configuration
  2. Click "Set up connection"
  3. Connection created!

Step 4: Run Your First Sync

Manual Sync

  1. Go to "Connections" page
  2. Find your new connection
  3. Click "Sync now"

Watch the Progress:

Verify in Snowflake
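Assuming the SnowSQL CLI is installed and configured, a quick row-count check (database, schema, and table names are illustrative):

```shell
# Count rows landed by the first sync
snowsql -q "SELECT COUNT(*) FROM AIRBYTE_DATABASE.AIRBYTE_SCHEMA.CUSTOMERS;"
```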


Understanding Sync Modes

Full Refresh Sync Modes

Full Refresh | Overwrite:

Full Refresh | Append:

Use Full Refresh When:

  • Table is small (<1M rows)
  • Data changes unpredictably
  • No cursor field available
  • Need historical snapshots

Incremental Sync Modes

Incremental | Append:

Incremental | Append + Dedup:

Cursor Field Requirements:

  • Monotonically increasing (timestamp, auto-increment ID)
  • Never null
  • Not updated on row updates (use updated_at not created_at)

Primary Key for Dedup:

  • Unique identifier
  • Composite keys supported: [user_id, order_id]
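The append-plus-dedup semantics can be sketched with plain shell tools: sort by primary key and cursor, then keep the newest row per key. This is a toy illustration, not Airbyte's actual implementation (which dedups in the warehouse via dbt):

```shell
# Simulate Incremental | Append + Dedup on a tiny CSV.
# Columns: id,updated_at,value. The latest updated_at per primary key (id) wins.
cat > /tmp/raw_records.csv <<'EOF'
id,updated_at,value
1,2024-01-01,alice
2,2024-01-01,bob
1,2024-01-02,alice-updated
EOF

# Sort by primary key, then cursor descending; keep the first row seen per key
deduped=$(tail -n +2 /tmp/raw_records.csv | sort -t, -k1,1 -k2,2r | awk -F, '!seen[$1]++')
echo "$deduped"
```

Row `1,2024-01-01,alice` is dropped because a newer record for id 1 exists, which is exactly why the cursor field must reflect updates (`updated_at`, not `created_at`).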

Configuration Deep Dive

Source Configuration

Replication Method:

Standard:

  • Queries database directly
  • SELECT statements
  • Works for all databases
  • Can impact source database performance

Change Data Capture (CDC):

  • Uses database transaction logs
  • Minimal source impact
  • Captures deletes
  • Requires database configuration

CDC Setup (PostgreSQL Example):
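The usual PostgreSQL prerequisites, shown via psql (the slot and publication names follow Airbyte's documented convention; adjust the user, database, and tables to your setup):

```shell
# 1. Enable logical decoding (requires a Postgres restart)
psql -U postgres -c "ALTER SYSTEM SET wal_level = logical;"

# 2. Create a replication slot using the pgoutput plugin
psql -U postgres -d analytics \
  -c "SELECT pg_create_logical_replication_slot('airbyte_slot', 'pgoutput');"

# 3. Create a publication covering the tables to replicate
psql -U postgres -d analytics \
  -c "CREATE PUBLICATION airbyte_publication FOR ALL TABLES;"

# 4. Give the Airbyte user replication rights
psql -U postgres -c "ALTER USER airbyte_user REPLICATION;"
```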

Destination Configuration

Staging:

Some destinations use staging (S3, GCS) before loading:

Configuration:

Loading Method:

COPY (Recommended):

  • Fastest for large datasets
  • Uses warehouse's native COPY command
  • Requires staging

INSERT:

  • Direct INSERT statements
  • Slower but no staging needed
  • Good for small datasets

Monitoring and Troubleshooting

View Sync History

  1. Go to connection
  2. Click "Sync History" tab
  3. View past sync runs:
    • Status (Success, Failed, Partial Success)
    • Records synced
    • Duration
    • Error logs

Debug Failed Syncs

Check Logs:

  1. Click failed sync job
  2. View "Logs" tab
  3. Look for error messages

Common Issues:

Connection Timeout: verify the host and port are reachable from the Airbyte containers; for a database running on the Docker host, use host.docker.internal instead of localhost.

Permission Denied: confirm the database user has SELECT privileges on the source schemas (or CREATE/INSERT privileges on the destination schema).

Out of Memory: increase the memory available to Docker, or reduce the number of streams synced in parallel.

Sync Job Retry

Failed syncs don't auto-retry by default:

  1. Fix the issue
  2. Click "Sync now" to retry

Essential CLI Commands

Airbyte provides a CLI for advanced operations:
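For example, the `abctl` tool (current at the time of writing; the exact command set varies by version, so run `abctl --help` to confirm):

```shell
# Install and manage a local Airbyte instance with abctl
abctl local install          # install or upgrade a local deployment
abctl local credentials      # print the generated UI login credentials
abctl local status           # check whether the deployment is healthy
abctl local uninstall        # tear the deployment down
```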


Next Steps

Immediate (Today)

  1. ✅ Set up second source (try File/CSV or API)
  2. ✅ Configure incremental sync with cursor field
  3. ✅ Enable normalization and explore dbt models
  4. ✅ Set up scheduled sync

Short-term (This Week)

  1. Read Best Practices
  2. Explore Use Cases
  3. Build a custom connector
  4. Set up monitoring and alerting

Medium-term (This Month)

  1. Deploy to production (Kubernetes)
  2. Implement CI/CD for connector configs
  3. Set up dbt transformations
  4. Optimize sync schedules and performance

Quick Reference

Common Sync Frequencies

Cursor Field Best Practices

Resource Requirements

Docker Compose:

  • 8 GB RAM minimum
  • 20 GB disk space
  • 2 CPU cores

Kubernetes:

  • 3 worker nodes (2 vCPU, 8 GB RAM each)
  • 50 GB persistent storage
  • Load balancer

Troubleshooting Guide

Issue: Can't Access UI at localhost:8000

Check Docker:

Solution:
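A typical check-and-restart sequence (container names vary by install method):

```shell
# Are the Airbyte containers running?
docker ps --filter "name=airbyte"

# If not, restart the stack from the airbyte checkout
docker compose down
docker compose up -d

# Inspect server logs if the UI still doesn't load
docker logs airbyte-server --tail 100
```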

Issue: Sync Fails with "Out of Memory"

Solution: increase Docker Desktop's memory allocation (8 GB minimum, more for large tables), or split the connection into smaller groups of streams.

Issue: Can't Connect to Source Database

Check Network:
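From inside the Airbyte server container, test whether the source database is reachable (container name, host, and port are illustrative; this assumes `nc` is present in the image):

```shell
# Can the Airbyte container reach the database host and port?
docker exec -it airbyte-server nc -zv mydb.example.com 5432
```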

Solution:

  • Check firewall rules
  • Use host.docker.internal for localhost databases
  • Verify credentials

Resources


Ready to build complex pipelines? See Best Practices

Want real-world examples? Explore Use Cases

Build a custom connector? Start Tutorial 1


← Back to Airbyte Overview | Next: Best Practices →
