Unlocking Data Transformation Magic with dbt-core, ClickHouse, and Dagster
In this blog post, we embark on a journey to explore the seamless integration of dbt-core with Dagster, complemented by the powerful analytics capabilities of ClickHouse. Let's dive in and witness the synergy of these tools in action.
Introduction
At the heart of our data transformation endeavors lies dbt Core, a versatile tool that empowers data teams to wield analytics engineering best practices with ease. Coupled with dagster, an orchestration platform tailored for data asset development and observation, and ClickHouse, an open-source column-oriented database management system.
Minimum Software Requirements
Ensure you have the following components installed to get started.
Installations
Install the dbt ClickHouse plugin.
pip install dbt-clickhouse
Install the dagster-dbt library.
pip install dagster-dbt dagster-webserver
Getting Started
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
Initialize the dbt project and configure it to interface with ClickHouse.
dbt init dbt_data_practitioner
Navigate to Your dbt Project Directory
cd dbt_data_practitioner
touch profiles.yml
Add ClickHouse configuration details to profiles.yml
. Make sure all dependencies are correctly set up in your profiles.yml
file for your database connections.
dbt_data_practitioner:
target: dev
outputs:
dev:
type: clickhouse
schema: sakila_db
host: localhost
port: 8123
user: default
password: root
secure: False
Running dbt debug
provides valuable insights into your dbt project's configuration and environment, ensuring everything is set up correctly before diving into data transformations. When you run dbt debug, it performs a series of checks and validations, including: Configuration Validation, Connection Testing, Adapter Information, Environment Information.
By running dbt debug
as part of your setup process, you can catch any configuration errors or connectivity issues early on, ensuring a smoother experience when running dbt commands and executing data transformations.
dbt debug
After executing this command, carefully review the output to ensure that everything is configured correctly and that dbt can successfully connect to your ClickHouse instance.
dbt docs
Generate documentation for the dbt project:
dbt docs generate
This command generates the documentation artifacts for your project. It creates a manifest.json
and catalog.json
file in the target directory.
To view the documentation, you need to serve it locally using the following command:
dbt docs serve
By default, this will start a local server at http://localhost:8080 where you can view the documentation.
models
Create models and define transformations:
cd models
mkdir sakila_db
cd sakila_db
touch actor_film_actor_join.sql
touch point_of_interest_1.sql
Delete the examples folder present inside the models folder.
cd ..
cd ..
Build the project:
The following command runs a series of tasks to build your models, including running models, tests, and snapshots. This command essentially combines the functionality of dbt run
, dbt test
, and dbt snapshot
.
The dbt build command will perform the following tasks:
- Run Models: Executes SQL scripts to create views or tables based on your model definitions.
- Run Tests: Executes tests defined in your project to validate data quality and integrity.
- Run Snapshots: Executes snapshot scripts to capture and store historical data.
dbt build
The tables and views defined are now generated in ClickHouse DB.
dbt docs generate
dbt docs serve
Lineage Graph and other details.
Dagster Integration
Now, let's integrate Dagster into our data transformation pipeline:
cd dbt_data_practitioner
dagster-dbt project scaffold --project-name dagster_data_practitioner
cd dagster_data_practitioner
DAGSTER_DBT_PARSE_PROJECT_ON_LOAD=1 dagster dev
To access the Dagster UI from your browser, navigate to: http://127.0.0.1:3000
Dagster UI
Click on the black "Materialize all" button to materialize the transformations.
Sample Project
DataPractitioner is the sample project i've used to illustrate the usage of the aforementioned tools.
Noticed an issue with this Sample Project? Open an issue or a PR on GitHub!