dlt (data load tool)

Overview

dlt (data load tool) is an open-source Python library that makes building data pipelines simple and maintainable. Created by dltHub in 2022, dlt takes a code-first approach to data ingestion, allowing you to build reliable, scalable pipelines with just Python functions.

What is dlt?

dlt is a Python library for building data pipelines that handles the heavy lifting of data extraction, schema management, incremental loading, and data quality.

Core Philosophy:

  • Code over config: Write Python, not YAML
  • Automatic schema inference: dlt discovers your data structure
  • Incremental by default: Only load new/changed data
  • Developer-friendly: Pythonic API, works with your existing tools
  • Production-ready: Built-in observability, error handling, retries

What dlt handles automatically:

  • ✅ Schema inference and evolution
  • ✅ Incremental loading (only new data)
  • ✅ Data type conversion
  • ✅ Normalization and flattening
  • ✅ Error handling and retries
  • ✅ State management
  • ✅ Data quality checks

Why Use dlt?

The Problem dlt Solves

Before dlt: hand-written ingestion scripts juggle HTTP calls, pagination, retries, schema DDL, type casting, and state tracking, and every upstream change breaks something.

With dlt: you write a Python generator that yields records; schema inference, incremental state, retries, and loading are handled for you.

Key Benefits

  1. Rapid Development: Build pipelines in minutes, not days
  2. Automatic Schema Management: Schema inference and evolution
  3. Incremental Loading: Built-in state tracking
  4. Pythonic: Familiar Python syntax, integrates with existing code
  5. Flexible: Works with APIs, databases, files, custom sources
  6. Production-Ready: Observability, retries, error handling included

When to Use dlt

Perfect For:

  • Building custom data pipelines from APIs
  • Extracting data from databases
  • Loading data from files (JSON, CSV, Parquet)
  • Incremental data synchronization
  • Rapid prototyping of data pipelines
  • Python-first data teams
  • ELT workflows (extract, load, transform with dbt)

Not Ideal For:

  • Real-time streaming (use Kafka, Flink)
  • Orchestration (use Airflow, Prefect, Dagster)
  • No-code requirements (use Fivetran, Airbyte)
  • Pure SQL transformations (use dbt)

dlt vs Alternatives

| Feature        | dlt          | Fivetran      | Airbyte    | Custom Scripts |
|----------------|--------------|---------------|------------|----------------|
| Cost           | Free (OSS)   | $$$           | Free (OSS) | Free           |
| Setup Time     | Minutes      | Minutes       | Hours      | Days           |
| Flexibility    | High         | Low           | Medium     | Highest        |
| Code-First     | Yes          | No            | No         | Yes            |
| Maintenance    | Low          | None          | Medium     | High           |
| Custom Sources | Easy         | Hard          | Medium     | Easy           |
| Best For      | Python teams | Non-technical | Self-host  | Complex logic  |

Core Concepts

1. Sources

Source: A Python function that yields data.

Key Points:

  • Sources are Python generators or iterators
  • Can yield dicts, lists, or Pandas DataFrames
  • dlt handles pagination, retries, state automatically
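Because a source is just a generator, the simplest one needs no dlt-specific code at all. A stdlib-only sketch, where the pages list stands in for real API responses:

```python
def orders():
    # Simulated API pages; a real source would fetch each page over HTTP.
    pages = [
        [{"order_id": 1, "total": 9.99}],
        [{"order_id": 2, "total": 14.50}],
    ]
    for page in pages:
        yield from page  # records stream one at a time, never all in memory
```

`pipeline.run(orders(), table_name="orders")` would load these two rows.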

2. Resources

Resource: Individual data entities within a source (tables).

Write Dispositions:

  • replace: Truncate and reload (full refresh)
  • append: Insert all records (duplicates possible)
  • merge: Upsert based on primary key (deduplicate)

3. Pipelines

Pipeline: Manages the flow from source to destination.

4. Destinations

Destination: Where dlt loads your data.

Supported Destinations:

  • Warehouses: Snowflake, BigQuery, Redshift, Synapse
  • Databases: PostgreSQL, DuckDB, MotherDuck
  • Lakes: Databricks, Filesystem (Parquet)
  • Others: ClickHouse, Weaviate (vector DB)

Configuration:
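Credentials live outside your code, typically in `.dlt/secrets.toml`. A hypothetical Snowflake layout, following dlt's `destination.<name>.credentials` convention (values are placeholders):

```toml
# .dlt/secrets.toml
[destination.snowflake.credentials]
database = "analytics"
username = "loader"
password = "..."
host = "myaccount-xy12345"
warehouse = "COMPUTE_WH"
```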

5. Schema

Schema: Automatically inferred from your data, but customizable.

  • Automatic inference: dlt derives table and column types from the first records it sees
  • Custom schema: column hints on a resource override what inference would choose
  • Schema evolution: new fields arriving in later records become new columns automatically

6. Incremental Loading

Incremental Loading: Only load new/changed data.

Incremental Strategies:

  • Append: Track last seen value (timestamp, ID)
  • Merge: Upsert based on primary key + incremental key
  • Custom: Implement your own state management

7. Transformations

Transformations: Modify data before loading.

Use dbt for complex transformations: dlt includes a dbt runner, so a dbt project can run against the freshly loaded dataset immediately after the pipeline finishes.

Architecture

How dlt Works

Pipeline Execution

  1. Extract Phase:

    • Call source functions
    • Yield data (streaming, not all in memory)
    • Track state for incremental loading
  2. Normalize Phase:

    • Infer schema from data
    • Convert types
    • Flatten nested structures
    • Handle schema evolution
  3. Load Phase:

    • Create/update tables
    • Batch insert data
    • Handle retries and errors
    • Update state

State Management
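dlt persists a small state dictionary alongside the loaded data and restores it on the next run; that is what powers incremental cursors. A stdlib-only sketch of the pattern (the real mechanism is dlt's pipeline state, e.g. `dlt.current.resource_state()`; the file-based store here is only for illustration):

```python
import json
import os
import tempfile

STATE_FILE = os.path.join(tempfile.gettempdir(), "demo_pipeline_state.json")

def load_state():
    try:
        with open(STATE_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def save_state(state):
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)

def fetch_new(rows):
    # Return only rows past the stored cursor, then advance the cursor.
    state = load_state()
    cursor = state.get("last_id", 0)
    new = [r for r in rows if r["id"] > cursor]
    if new:
        state["last_id"] = max(r["id"] for r in new)
        save_state(state)
    return new
```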

Common Patterns

REST API Integration
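A sketch of the pattern (`fetch_page` is a stub standing in for a real HTTP call such as `requests.get(url).json()`; dlt also ships a generic REST API source for this):

```python
def fetch_page(url):
    # Stub: replace with a real HTTP call in production.
    canned = {"https://api.example.com/users": [{"id": 1, "name": "Ada"}]}
    return canned.get(url, [])

def users_from_api():
    yield from fetch_page("https://api.example.com/users")

# pipeline.run(users_from_api(), table_name="users") would load the records.
```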

Database Replication
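dlt's verified `sql_database` source handles this generically; the underlying pattern is just yielding each table row as a dict, sketched here with stdlib sqlite3:

```python
import sqlite3

def table_rows(conn, table):
    # Yield every row of `table` as a dict -- the shape dlt resources emit.
    cur = conn.execute(f"SELECT * FROM {table}")  # use trusted table names only
    cols = [d[0] for d in cur.description]
    for row in cur:
        yield dict(zip(cols, row))
```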

File Processing
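File sources are generators too; a stdlib-only sketch for CSV and JSON Lines (dlt would infer the schema from the yielded dicts):

```python
import csv
import json

def csv_rows(path):
    # Each CSV row becomes a dict keyed by the header row.
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def jsonl_rows(path):
    # One JSON object per line (JSON Lines format).
    with open(path) as f:
        for line in f:
            yield json.loads(line)
```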

Custom Source with Pagination
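The loop that dlt's REST helpers automate can be sketched with an injected fetch function (injected so the pattern is testable without a live API; stop-on-empty-page is one common convention):

```python
def paginated(fetch, endpoint):
    # `fetch(url)` returns one decoded JSON page as a list; stop on an empty page.
    page = 1
    while True:
        records = fetch(f"{endpoint}?page={page}")
        if not records:
            break
        yield from records
        page += 1
```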

Integration with Modern Data Stack

dlt + dbt

Perfect combination: dlt for EL (extract, load), dbt for T (transform)

Why this works:

  • dlt: Gets data into warehouse quickly
  • dbt: Transforms data with SQL (version-controlled, tested)
  • Separation of concerns

dlt + Airflow

Orchestrate dlt pipelines with Airflow by wrapping each pipeline.run() in a task; dlt also provides an Airflow helper that expands a source into a task group.

dlt + Dagster

Dagster works the same way: run the pipeline inside an asset or op, or use Dagster's embedded dlt integration.

Verified Sources

dlt provides 50+ pre-built, verified sources:

Popular Verified Sources:

  • SaaS: GitHub, Slack, Notion, Asana, Zendesk
  • Databases: PostgreSQL, MySQL, MongoDB
  • Marketing: Google Analytics, Facebook Ads, Google Ads
  • CRM: Salesforce, HubSpot, Pipedrive
  • Ecommerce: Shopify, Stripe
  • Data: REST API (generic), SQL Database (generic)

When Not to Use dlt

  • Real-time streaming: Use Kafka, Flink, Spark Streaming
  • Orchestration: dlt doesn't schedule; use Airflow, Prefect, Dagster
  • No-code required: Use Fivetran, Airbyte UI
  • Complex transformations: Use dbt (or dlt + dbt)
  • Managed connectors preferred: Use Fivetran (less maintenance)

Getting Help

Ready to build pipelines with dlt? The official documentation at dlthub.com/docs, the dltHub community Slack, and the verified sources gallery are the best places to start.
