Anantha Raju C

Building a Real-Time Data Pipeline Using Python, MySQL, Kafka, and ClickHouse

Anantha Raju C — Sun, 09 Jun 2024 03:48:53 GMT

In the modern data-driven world, real-time data processing and analytics are critical for making timely and informed decisions. This post explores a data pipeline architecture for real-time data streaming, processing, and visualization using popular open-source tools like Python, MySQL, Kafka, and ClickHouse.

Code and Documentation

The complete code for this data pipeline, along with detailed documentation of all steps, is available on GitHub.

Key Features of the GitHub Project

Step-by-Step Documentation: The GitHub repository includes detailed instructions for setting up and running the entire pipeline. This makes it easy for users to replicate the setup and understand each component's role.
Sample Data Generator: Python scripts to simulate real-time data generation are provided, helping users to test and experiment with the pipeline.
Configuration Files: Pre-configured settings for MySQL, Debezium, Kafka, and ClickHouse are included to streamline the setup process.
Stream Processing with ksql: Examples of how to use ksql for real-time data enrichment and transformation are provided, showcasing the power of stream processing.
Integration with BI Tools: Guidance on how to connect Metabase and Apache Superset to the data pipeline for visualization and reporting is available.

By following the instructions and leveraging the provided code, users can quickly set up a robust and scalable real-time data pipeline tailored to their specific needs.

Data Generation

The journey begins with the Data Generator. In our example, Python is used to simulate and generate event data. This could be any real-time data source, such as IoT devices, transaction logs, or user interactions on a website.

OLTP (Online Transaction Processing) Data Source

The generated data is initially stored in a MySQL database. MySQL serves as the OLTP data source, managing high transaction volumes and maintaining data integrity. It logs changes in real-time using a binlog (binary log), which captures all updates made to the database.

Change Data Capture (CDC) with Debezium

To stream changes from the MySQL database in real-time, Debezium is employed. Debezium is a distributed platform that captures row-level changes in databases. It acts as a MySQL Source Connector, monitoring the MySQL binlog for data changes and converting these changes into events.

Stream Processing Platform with Kafka

These change events are then sent to Kafka, a distributed streaming platform capable of handling large volumes of data with low latency. Kafka organizes these events into topics. Each topic is a log of messages, which can be processed in real-time.

To enrich and process these messages further, a stream processing tool like ksql (Kafka Stream Processing) can be used. ksql enables real-time data transformations and filtering directly on Kafka topics.

Analytical Database with ClickHouse

For analytical processing, the data is transferred from Kafka to ClickHouse, an OLAP (Online Analytical Processing) data warehouse known for its high performance and efficiency in handling analytical queries. ClickHouse consumes Kafka topics through its Kafka Table Engine. The data is then transformed into Materialized Views and stored in MergeTree Tables, optimized for fast query performance.

Business Intelligence and Data Visualization

Finally, the processed and aggregated data is ready for visualization and analysis. Tools like Metabase and Apache Superset can connect to ClickHouse, providing interactive dashboards and reports. Additionally, tools like dbt (data build tool) can be used for data transformation and modeling, enabling more advanced analytics and insights.

Other applications of this data pipeline include feeding into ML models for predictive analytics, performing operational analytics, and various other data-driven applications.

Conclusion

This data pipeline demonstrates a architecture for real-time data processing. By leveraging tools like Python, MySQL, Debezium, Kafka, ClickHouse, and various BI tools, organizations can ensure that they have timely and accurate data for decision-making and analytics. The flexibility of this setup make it suitable for a wide range of use cases, from operational monitoring to analytics etc.,

Unlocking Data Transformation Magic with dbt-core, ClickHouse, and Dagster

Anantha Raju C — Mon, 25 Dec 2023 12:22:54 GMT

In this blog post, we embark on a journey to explore the seamless integration of dbt-core with Dagster, complemented by the powerful analytics capabilities of ClickHouse. Let's dive in and witness the synergy of these tools in action.

Introduction

At the heart of our data transformation endeavors lies dbt Core, a versatile tool that empowers data teams to wield analytics engineering best practices with ease. Coupled with dagster, an orchestration platform tailored for data asset development and observation, and ClickHouse, an open-source column-oriented database management system.

Minimum Software Requirements

Ensure you have the following components installed to get started.

Installations

Install the dbt ClickHouse plugin.

pip install dbt-clickhouse

Install the dagster-dbt library.

pip install dagster-dbt dagster-webserver

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Initialize the dbt project and configure it to interface with ClickHouse.

dbt init dbt_data_practitioner

Navigate to Your dbt Project Directory

cd dbt_data_practitioner

touch profiles.yml

Add ClickHouse configuration details to profiles.yml. Make sure all dependencies are correctly set up in your profiles.yml file for your database connections.

dbt_data_practitioner:
  target: dev
  outputs:
    dev:
      type: clickhouse
      schema: sakila_db
      host: localhost
      port: 8123
      user: default
      password: root
      secure: False

Running dbt debug provides valuable insights into your dbt project's configuration and environment, ensuring everything is set up correctly before diving into data transformations. When you run dbt debug, it performs a series of checks and validations, including: Configuration Validation, Connection Testing, Adapter Information, Environment Information.

By running dbt debug as part of your setup process, you can catch any configuration errors or connectivity issues early on, ensuring a smoother experience when running dbt commands and executing data transformations.

dbt debug

After executing this command, carefully review the output to ensure that everything is configured correctly and that dbt can successfully connect to your ClickHouse instance.

dbt docs

Generate documentation for the dbt project:

dbt docs generate

This command generates the documentation artifacts for your project. It creates a manifest.json and catalog.json file in the target directory.

To view the documentation, you need to serve it locally using the following command:

dbt docs serve

By default, this will start a local server at http://localhost:8080 where you can view the documentation.

models

Create models and define transformations:

cd models  
mkdir sakila_db  
cd sakila_db

touch actor_film_actor_join.sql
touch point_of_interest_1.sql

Delete the examples folder present inside the models folder.

cd ..
cd ..

Build the project:

The following command runs a series of tasks to build your models, including running models, tests, and snapshots. This command essentially combines the functionality of dbt run, dbt test, and dbt snapshot.

The dbt build command will perform the following tasks:

Run Models: Executes SQL scripts to create views or tables based on your model definitions.
Run Tests: Executes tests defined in your project to validate data quality and integrity.
Run Snapshots: Executes snapshot scripts to capture and store historical data.

dbt build

The tables and views defined are now generated in ClickHouse DB.

dbt docs generate

dbt docs serve

Lineage Graph and other details.

Dagster Integration

Now, let's integrate Dagster into our data transformation pipeline:

cd dbt_data_practitioner

dagster-dbt project scaffold --project-name dagster_data_practitioner

cd dagster_data_practitioner

DAGSTER_DBT_PARSE_PROJECT_ON_LOAD=1 dagster dev

To access the Dagster UI from your browser, navigate to: http://127.0.0.1:3000

Dagster UI

Click on the black "Materialize all" button to materialize the transformations.

Sample Project

DataPractitioner is the sample project i've used to illustrate the usage of the aforementioned tools.

Noticed an issue with this Sample Project? Open an issue or a PR on GitHub!

Replicate MySQL database in ClickHouse with [experimental] MaterializedMySQL Database Engine

Anantha Raju C — Mon, 25 Dec 2023 07:34:10 GMT

In this guide, we'll walk through the seamless process of replicating a MySQL database into ClickHouse, leveraging the experimental MaterializedMySQL Database Engine. By replicating MySQL data into ClickHouse, we harness the real-time analytical capabilities of ClickHouse while preserving data integrity and consistency.

Introduction

MySQL stands as a stalwart in the realm of relational database management systems, renowned for its open-source nature and robust features. On the other hand, ClickHouse shines as a lightning-fast, column-oriented database management system, empowering users to generate analytical data reports in real-time through SQL queries.

Minimum Software Requirements

Ensure you have the following components installed and configured:

MySQL database.
ClickHouse database.

mysql.cnf MySQL server configuration for Replication

The following are mandatory MySQL server configuration which must be set.

Stop MySQL service.

systemctl status mysql
systemctl stop mysql.service

Update the configuration.

Edit the MySQL configuration file mysqld.cnf:

sudo nano /etc/mysql/mysql.conf.d/mysqld.cnf

Add/Update the following parameters:

[mysqld]
server-id                     = 223344
log_bin                       = mysql-bin
expire_logs_days              = 1
binlog_format                 = row
binlog_row_image              = FULL
default_authentication_plugin = mysql_native_password
gtid_mode                     = on
enforce_gtid_consistency      = on

ClickHouse reads binlog and performs DDL and DML queries.

Start MySQL service.

systemctl start mysql.service

Reference: https://clickhouse.com/docs/en/engines/database-engines/materialized-mysql

MySQL user creation for replication

Create a MySQL user specifically for replication:

create user clickhouse_replication@'localhost' identified with mysql_native_password by 'ChRep$316';

grant replication slave, replication client, reload, select on *.* to clickhouse_replication@'localhost';

flush privileges;

Sample MySQL Data Setup

Setup a sample table and insert data into MySQL:

Create a table.

CREATE TABLE `user` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `created_date` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `last_modified_date` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `username` varchar(255) DEFAULT NULL,
  `email` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `UK_sb8bbouer5wak8vyiiy4pf2bx` (`username`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

Insert sample data.

INSERT INTO `user`(`username`,`email`)VALUES('John Doe','johndoe@example.com');

INSERT INTO `user`(`username`,`email`)VALUES('Jane Doe','janedoe@example.com');

CLickHouse server configuration

sudo systemctl status clickhouse-server
sudo systemctl stop clickhouse-server

Update ClickHouse server configuration to allow experimental MaterializedMySQL engine:

sudo nano /etc/clickhouse-server/users.xml

Add/Update the following value within :

sudo systemctl start clickhouse-server

clickhouse-client --password

SELECT
    name,
    value,
    changed,
    description
FROM system.settings
WHERE name = 'allow_experimental_database_materialized_mysql'

Activating the replication in ClickHouse

Create a ClickHouse database with MaterializedMySQL engine for replication:

CREATE DATABASE mysqlCH
ENGINE = MaterializeMySQL('127.0.0.1', 'ch', 'clickhouse_replication', 'ChRep$316')
SETTINGS allows_query_when_mysql_lost = 1, max_wait_time_when_mysql_unavailable = 10000, materialized_mysql_tables_list = 'user'

Verify Replication

Check if the database and the table with data got replicated into ClickHouse:

select * from mysqlCH.user;

Additional Operations

Perform additional operations in MySQL and verify replication status in ClickHouse:

INSERT additional rows in MySQL table and check the replication status.

INSERT INTO `user`(`username`,`email`)VALUES('Alice','alice@example.com');
INSERT INTO `user`(`username`,`email`)VALUES('Bob','alice@example.com');

UPDATE Alice's email in MySQL user table.

UPDATE `ch`.`user` SET `email` = 'alice@domain.com' WHERE `username` = 'Alice';

DELETE Bob in MySQL user table.

DELETE FROM `ch`.`user` WHERE (`id` = '4');

With these steps, you've successfully set up MySQL replication into ClickHouse using the MaterializedMySQL engine. Harness the power of ClickHouse's real-time analytical capabilities while ensuring data consistency and integrity. Explore the vast possibilities of real-time analytics with ClickHouse and unlock insights from your replicated MySQL data effortlessly.

ClickHouse

Anantha Raju C — Mon, 13 Nov 2023 13:59:16 GMT

This post briefly documents the process of using ClickHouse.

Introduction

ClickHouse is a fast open-source column-oriented database management system that allows generating analytical data reports in real-time using SQL queries.

Minimum Software Requirements

ClickHouse

Download the software

Running the application

Accessing the client

Resources

New York Public Library "What's on the Menu?" Dataset an example of denormalizing data.

Airbyte

Anantha Raju C — Mon, 13 Nov 2023 13:23:17 GMT

Introduction

Airbyte is a powerful, open-source data integration engine designed to help you seamlessly consolidate your data into data warehouses, lakes, and databases. Whether you're managing data from various sources or need to ensure your data is easily accessible for analysis, Airbyte simplifies the process.

In this blog post, I will demonstrate how to use Airbyte with MySQL installed locally on Ubuntu as the source and a local JSON file as the output.

Minimum Software Requirements

To get started with Airbyte, ensure you have the following software installed:

Docker
MySQL (installed locally on Ubuntu)

Running the application

First, ensure Docker is running on your machine. If Docker isn't installed, download and install it from the official Docker website.

Open a terminal and run the following command's to download and start Airbyte:

Accessing the UI

Once Airbyte is up and running, you can access the user interface by navigating to the following URL in your web browser:

http://localhost:8000/

Use the following credentials to log in:

username airbyte
password password

note: For security reasons, it's recommended to change these credentials. You can do this by modifying the .env file located in your Airbyte directory.

Source Configuration

After logging in, you'll need to configure your data sources. Airbyte supports a wide variety of data sources, allowing you to integrate data from multiple platforms effortlessly.

1. Adding a Source

Click on the "Sources" tab in the UI.
Select "MySQL" as the type of source you want to add.
Fill in the required details, such as connection name, host, port, database name, username, and password.
Test the connection to ensure it's working correctly.

2. Setting Up a Destination

For this example, we'll use a local JSON file as the destination.

Click on the "Destinations" tab.
Select "Local JSON" as your desired destination.
Enter the necessary details, such as the file path where the JSON file will be stored.

3. Creating a Connection

Once your source and destination are configured, navigate to the "Connections" tab.
Click "New Connection" and select your source and destination.
Configure the sync frequency and any other options as needed.

4. Running Your First Sync

After setting up your connection, you can run your first sync manually by clicking "Sync Now" on your connection page.
Monitor the progress and check for any errors that may need to be addressed.

Conclusion

Airbyte is an excellent tool for integrating and consolidating your data from various sources. By following this quick start guide, you can set up and run your first data sync, ensuring your data is centralized and ready for analysis. For more advanced configurations and troubleshooting, refer to the official Airbyte documentation.

By leveraging Airbyte's capabilities, you can streamline your data workflows and focus on deriving insights from your data, rather than managing the complexities of data integration.

Exploratory data analysis with sweetviz

Anantha Raju C — Fri, 03 Nov 2023 15:38:01 GMT

"Exploratory Data Analysis (EDA) is an analysis approach that identifies general patterns in the data. These patterns include outliers and features of the data that might be unexpected. EDA is an important first step in any data analysis." 1

"Sweetviz is an open-source Python library that generates beautiful, high-density visualizations to kickstart EDA (Exploratory Data Analysis) with just two lines of code. Output is a fully self-contained HTML application."

Installation

Before installing Sweetviz, you need to ensure that Python is installed on your system. You can check this by opening your command line or terminal and running the following command:

python --version

If Python is installed, this command will display the version number. For example, you might see something like Python 3.8.10. If Python is not installed, you can download and install it from the official Python website.

Once Python is installed, you can proceed to install Sweetviz using pip. Run the following command in your command line or terminal:

pip install sweetviz

Basic Program

Once Sweetviz is installed, you can create a simple program to generate a data report. We'll use the Titanic dataset for this example.

import pandas as pd
import sweetviz as sv

# Load the dataset
data = pd.read_csv('titanic.csv')

# Generate the Sweetviz report
report = sv.analyze(data)

# Display the report in the browser
report.show_html()

Explanation of the Code

Import Libraries: Import the necessary libraries, pandas for data manipulation and sweetviz for generating the report.
Load Dataset: Read the Titanic dataset into a Pandas DataFrame.
Generate Report: Use Sweetviz's analyze function to create a report based on the DataFrame.
Show Report: Generate an HTML report that is automatically opened in your web browser.

Output

After running the script, Sweetviz will generate a self-contained HTML report and open it in your default web browser. The report provides visualizations and analysis of your dataset.

Resources

To further enhance your understanding and usage of Sweetviz and Python scripts, here are some additional resources:

Download the dataset used in this example: https://github.com/datasciencedojo/datasets/blob/master/titanic.csv
A comprehensive guide on running Python scripts, useful for beginners: https://www.knowledgehut.com/blog/programming/run-python-scripts#how-to-run-python-script-by-the-interpreter

Conclusion

Sweetviz is a powerful tool that simplifies the process of exploratory data analysis by generating interactive reports. By following the steps outlined in this article, you can quickly start analyzing your datasets and gain valuable insights. Feel free to explore more features and customize your reports to suit your needs.

Automating Your Ubuntu Setup: A Post-Installation Shell Script

Anantha Raju C — Sat, 03 Dec 2022 07:30:42 GMT

Streamline your Ubuntu setup process with a post-installation shell script designed to automate the installation of essential software and configurations. This script simplifies the setup of a new Ubuntu system, saving you time and effort by handling common post-install tasks automatically.

Understanding Shell Scripts

A shell script is a computer program designed to be run by a Unix shell, serving as a command-line interpreter. By leveraging shell scripting, you can automate repetitive tasks and streamline your workflow on Ubuntu.

Minimum Software Requirements

To utilize this post-installation shell script, you need a system running Ubuntu, a popular Linux distribution known for its user-friendly interface and robustness.

Sample Project

Explore the Ubuntu Post Install Scripts repository on GitHub to access the sample script and customize it according to your preferences.

Basic Usage

Open a Terminal Window: Launch a terminal window on Ubuntu by pressing either of the following key combinations on your keyboard:
- Ctrl + Alt + T
- Ctrl + Shift + T
Set Script Permissions: Update the permissions of the script to make it executable using the chmod command.
- sudo chmod +x ubuntu-post-install.sh
Run the script with the bash command.
- sudo bash ubuntu-post-install.sh

Customization

While the provided script includes a set of preferential packages and configurations, you can customize it to suit your specific requirements. Feel free to update the script with your preferred software packages and system configurations by modifying the script available on the GitHub repository.

Conclusion

By leveraging the Ubuntu Post Install Script, you can expedite the setup process of your Ubuntu system, ensuring that it is configured according to your preferences and equipped with essential software packages. Embrace automation to enhance your productivity and make the most out of your Ubuntu experience.

List: Gnome Shell Extensions, IntelliJ IDEA Plugins, Web Browser Plugins, Online IDE's, API Tools, Dev Tools, Websites

Anantha Raju C — Sun, 27 Nov 2022 05:25:15 GMT

Gnome Shell Extensions

GNOME Shell extensions provide a powerful way to customize your GNOME desktop environment, tailoring it to your preferences and enhancing its functionality. Here, i’ve compiled a list of some of useful GNOME Shell extensions that can improve your user experience.

Extensions

ArcMenu

Description: ArcMenu is a customizable application menu for GNOME Shell, offering a modern and intuitive way to access your applications and system settings. It can be configured to resemble traditional application menus found in other desktop environments, making it ideal for users transitioning to GNOME.

Features:

Highly customizable menu layout
Search functionality for quick access to applications
Support for themes and icon packs

Control Blur Effect On Lock Screen
Description: This extension allows you to control the blur effect on the lock screen, providing a customizable aesthetic touch to your GNOME desktop. You can adjust the blur intensity to your liking, ensuring a perfect balance between style and readability.

Features:

Adjustable blur intensity
Simple and straightforward configuration
Enhances visual appeal of the lock screen

Dash to Panel
Description: Dash to Panel transforms your GNOME Shell dash into a unified taskbar, combining application launchers and the system tray into a single panel. This extension is perfect for users who prefer a more traditional desktop layout similar to Windows or KDE Plasma.

Features:

Combines dash and system tray into one panel
Highly customizable appearance and behavior
Supports multi-monitor setups

Google Earth Wallpaper
Description: Bring the beauty of the world to your desktop with the Google Earth Wallpaper extension. This extension sets your wallpaper to a random, high-quality photo from the curated Google Earth collection, ensuring your background is always stunning and unique.

Features:

Automatically updates wallpaper with random Google Earth photos
High-resolution images
Option to manually refresh the wallpaper

OpenWeather
Description: Stay updated with real-time weather information directly on your GNOME Shell with the OpenWeather extension. It displays weather data for any location worldwide, offering detailed forecasts and current conditions.

Features:

Displays current weather and forecasts
Supports multiple locations
Detailed weather information including temperature, humidity, and wind speed

Lock Keys
Description: The Lock Keys extension displays the status of Numlock and Capslock on the GNOME panel, providing a convenient way to check if these keys are active. This is especially useful for users who frequently switch between text and numeric input.

Features:

Visual indicator for Numlock and Capslock status
Simple and unobtrusive design
Essential for users with keyboards lacking LED indicators

Desktop Icons
Description: The Desktop Icons extension brings back the ability to add and manage icons on your desktop, a feature that is missing in the default GNOME Shell. This extension is perfect for users who prefer to have quick access to files, folders, and applications directly from their desktop.

Features:

Add and manage desktop icons
Support for dragging and dropping files
Customizable icon placement

With these GNOME Shell extensions, you can significantly enhance your GNOME desktop environment, making it more functional and personalized to suit your needs. Whether you're looking for aesthetic improvements, productivity boosts, or additional features, these extensions have got you covered.

IntelliJ IDEA Plugins

IntelliJ IDEA: stands as a premier integrated development environment (IDE) crafted by JetBrains, designed to facilitate the creation of software using Java, Kotlin, Groovy, and other JVM-based languages. It comes in two editions: a community edition licensed under Apache 2 and a proprietary commercial edition. One of its most compelling features is its extensibility through plugins, which enrich its capabilities and tailor it to individual developer needs.

Understanding Plugins

Plugins are software extensions that augment the functionality of a program. In the realm of IntelliJ IDEA, they serve as indispensable tools, providing developers with additional features, productivity enhancements, and integration with external services.

Minimum Software Requirements

To take advantage of these plugins, ensure you have IntelliJ IDEA installed on your system.

Plugins

Rainbow CSV
Description: Rainbow CSV is a versatile plugin that enhances the readability of CSV files by highlighting them in different colors. This visual distinction makes it easier to interpret and manipulate large datasets, thereby improving productivity during data analysis and manipulation tasks.
SonarLint
Description: SonarLint is a free IDE extension that acts as a guardian for your codebase, continuously analyzing it to detect and rectify bugs, vulnerabilities, and code smells in real-time. Similar to a spell checker, SonarLint identifies issues as you write code, offering quick fixes and actionable insights to ensure clean and robust code.
RoboPOJOGenerator
Description: This indispensable plugin streamlines the process of transforming JSON data structures into Plain Old Java Objects (POJOs). Ideal for IntelliJ IDEA and Android Studio users, RoboPOJOGenerator automates the tedious task of manual POJO generation, saving developers valuable time and effort.
intellij-generateAllSetMethod
Description: Simplify the task of generating setter method calls for class properties with this intuitive IntelliJ IDEA plugin. By automating the creation of setter method invocations, it accelerates the development process and reduces the likelihood of manual errors.
Key Promoter X
Description: Mastering keyboard shortcuts is key to efficient coding, and Key Promoter X is here to help. This plugin facilitates the learning process by displaying keyboard shortcuts whenever you perform an action using the mouse within the IDE. By encouraging the use of keyboard shortcuts, it promotes a faster, mouse-free development workflow.
Maven Helper
Description: Maven Helper is an indispensable tool for managing dependencies within your IntelliJ IDEA projects. It provides valuable insights into dependency conflicts, allowing you to analyze and exclude conflicting dependencies effortlessly. Additionally, it offers advanced features for optimizing Maven builds, ensuring smooth and efficient project development.

IntelliJ IDEA plugins play a crucial role in enhancing productivity, streamlining development workflows, and empowering developers to write clean, efficient code. By incorporating IntelliJ IDEA plugins into your IntelliJ IDEA setup, you can unlock new capabilities, automate repetitive tasks.

Web Browser Plugins

Enhance your browsing experience and streamline your workflow with these essential web browser plugins. From improving SEO practices to organizing your favorite links, these tools are designed to make your web browsing more efficient and enjoyable.

General Plugins

Checkbot
Browser extension that tests if your website follows 50+ SEO, speed, and security best practices. Crawls your site checking multiple pages at once.
FVD Speed Dial
Tool for saving and organizing the links to your favorite sites.
Hover Zoom
Browse image galleries with ease: move the mouse cursor over thumbnails to view images in their full size without loading a new page.
Floating for YouTube™
Always on top Floating Mini Player for YouTube™.
Floating for YouTube™ Extension
Open YouTube videos in Floating Mini Player for YouTube™.
WhatFont
The easiest way to identify fonts on web pages.
Alexa Traffic Rank
The Official Alexa Traffic Rank Extension, providing Alexa Traffic Rank and site information when clicked.
Clear Cache
Clear your cache and browsing data with a single click of a button.
Speedtest by Ookla
Take a Speedtest directly from your Google Chrome toolbar to quickly test your internet performance.
Tampermonkey
Userscript manager with over 10 million users. Available for Chrome, Microsoft Edge, Safari, Opera Next, and Firefox.

Developer Plugins

GitHub Repository Size
Automatically adds repository size to GitHub's repository summary.
LiveReload
LiveReload monitors changes in the file system. As soon as you save a file, it is preprocessed as needed, and the browser is refreshed.
Isometric Contributions
Toggle between the normal GitHub contribution chart and an isometric pixel art version.
FireShot
Capture full web page screenshots in Firefox, Chrome, Opera, or IE, and then edit and save them.

Web Browsers

Blisk
A free browser for web developers, offering tools for development, debugging, and testing: emulation, sync, analytics, and screenshots.
Brave
A free browser that automatically blocks ads and trackers, making it faster and safer.

These plugins can significantly enhance your web browsing experience and boost your productivity. Whether you're a developer, a designer, or just someone who loves exploring the web, these tools are sure to be invaluable additions to your toolkit.

Online IDE's

Online Coding Environments

Try It Online
Try It Online (TIO) is an online compiler that supports a wide range of practical and recreational programming languages. It provides a convenient platform for quickly testing and running code snippets without the need for local installations.
Ideone
Ideone is a versatile online compiler and debugging tool that supports over 60 programming languages. It allows developers to compile and execute code directly in the browser, making it an excellent choice for rapid prototyping and sharing code snippets.
Codenvy
Codenvy offers self-service Eclipse Che workspaces in the cloud. It provides a collaborative environment for teams to develop, build, and deploy applications using the power of container-based development.
Coding Ground - TutorialsPoint
TutorialsPoint's Coding Ground is a comprehensive online platform that supports coding in multiple popular programming languages. It offers a seamless editing, compiling, executing, and sharing experience, all within a browser-based interface.

Code Fiddles

PhpFiddle
PhpFiddle is a versatile online PHP IDE that provides a range of tools and resources for PHP, MySQL, SQLite, HTML, CSS, and JavaScript development. It enables users to test, debug, and share PHP code snippets and web applications with ease.
.NET Fiddle
.NET Fiddle is a convenient online sandbox for experimenting with .NET code snippets. It offers a lightweight development environment for quickly trying out C#, F#, and Visual Basic code without the need for local installations.
JSFiddle
JSFiddle is a popular online playground for testing and sharing JavaScript, CSS, HTML, and CoffeeScript code snippets. It provides a simple yet powerful code editor and live preview functionality, making it ideal for frontend development and prototyping.
SQL Fiddle
SQL Fiddle is a handy online tool for testing and sharing SQL database queries and schema designs. It supports various database systems, including MySQL, PostgreSQL, Oracle, and SQLite, allowing users to experiment with SQL code in a collaborative environment.
Python Fiddle
Python Fiddle offers a user-friendly web-based Python IDE for experimenting with Python code snippets. It allows users to run, edit, and share Python scripts in real-time, making it an excellent choice for learning Python and testing small programs.
R-Fiddle
R-Fiddle provides a convenient online platform for coding, testing, and sharing R code snippets and data analysis scripts. It offers an interactive R environment with support for plotting, statistical analysis, and package management, making it ideal for statistical computing tasks.
Go Playground
Go Playground is an official online tool provided by the Go programming language community. It offers a clean and minimalist environment for writing and executing Go code, making it easy to experiment with Go features and syntax without setting up a local development environment.

API Tools

Enhancing Your API Workflow: Essential Tools for Developers

APIs (Application Programming Interfaces) and web services play a crucial role in modern software development, enabling seamless communication and integration between different applications and systems. To streamline your API development process and enhance productivity, consider incorporating the following essential tools into your toolkit:

down for everyone or just me
Description: This simple tool allows you to quickly check if a website is down for everyone or just for you. It provides valuable insights into website availability, helping you troubleshoot connectivity issues more efficiently.
Uptime
Description: Ensure your service-level agreements (SLAs) are met with this uptime calculator tool. It helps you calculate and track uptime metrics, enabling you to maintain high availability and reliability for your applications and services.
Apigee API Console
Description: Discover, learn, test, and debug any API with interactive developer tools and documentation provided by Apigee API Console. It offers a comprehensive set of features for API exploration and testing, empowering developers to build and integrate APIs seamlessly.
PostBin
Description: Programatically test your API clients or webhooks with PostBin. It provides a convenient platform for sending and receiving HTTP requests, allowing you to validate and debug your API integrations effectively.
httpbin
Description: httpbin is a versatile HTTP request and response service designed for client testing. It offers a wide range of endpoints for testing various HTTP methods and parameters, making it an invaluable tool for API development and debugging.
RequestBin
Description: RequestBin provides a URL that collects requests made to it, allowing you to inspect and debug them in a human-friendly way. Use RequestBin to monitor and analyze HTTP requests sent by your clients or webhooks, facilitating effective troubleshooting and debugging.
Mockbin
Description: Mockbin enables you to generate custom endpoints to test, mock, and track HTTP requests and responses. It's a powerful tool for simulating different scenarios and behaviors in your API integrations, helping you ensure robustness and reliability.
Statuspage Description: Statuspage is a comprehensive status and incident communication tool that enables you to keep your stakeholders informed about service availability and performance. It provides real-time status updates and incident notifications, helping you maintain transparency and trust with your users.
Rapid API
Description: RapidAPI is the world’s largest API marketplace, offering a vast selection of APIs for various use cases and industries. Explore and integrate APIs seamlessly into your applications, accelerating development and unlocking new capabilities.
APIembed
Description: APIembed provides embeddable API code snippets in multiple programming languages for your website, blog, or API documentation. Quickly generate code examples and documentation to simplify API integration for developers.
Loader
Description: Loader is a free load testing service that allows you to stress test your web apps and APIs with thousands of concurrent connections. Ensure your applications can handle peak loads and maintain performance under stress with Loader.
REQ RES
Description: REQ RES is a hosted REST API ready to respond to your AJAX requests. It's a convenient tool for testing and prototyping API interactions, providing a reliable endpoint for your development and testing needs.

Conclusion

By incorporating API tools into your development workflow, you can streamline your API development process, enhance productivity, and ensure the reliability and performance of your applications and services.

Dev Tools

Useful Software Tool's for Software Development

Version control

Sourcetree
Sourcetree is a user-friendly Git GUI client available for both Windows and Mac users. It simplifies the process of managing Git repositories through its intuitive interface.
GitHub Desktop
GitHub Desktop provides a graphical interface for interacting with GitHub repositories, allowing users to perform common Git operations without using the command line.

Virtualization

VMware Workstation
VMware Workstation is a powerful desktop virtualization application that enables users to run multiple operating systems simultaneously on a single computer without the need for rebooting.
virt-manager virt-manager is a desktop application for managing virtual machines using the libvirt API. It supports various virtualization technologies, including KVM, Xen, and LXC.

Java

Java Decompiler
Java Decompiler is a tool for decompiling and analyzing Java bytecode, allowing developers to understand and modify Java applications more effectively.
Bytecode Viewer
Bytecode Viewer is a comprehensive suite for reverse engineering Java 8 JAR files and Android APKs. It includes features such as decompilation, editing, debugging, and more.
jadx - Dex to Java decompiler
jadx is a command-line and GUI tool for converting Android DEX and APK files into human-readable Java source code, aiding in the analysis and modification of Android applications.

IDE's

IntelliJ IDEA
IntelliJ IDEA is a feature-rich integrated development environment (IDE) for Java, Kotlin, Groovy, and other JVM-based languages. It offers advanced coding assistance, productivity tools, and seamless integration with version control systems.
PyCharm
PyCharm is a powerful Python IDE designed for professional developers. It provides intelligent code completion, debugging, testing, and other features to streamline Python development workflows.

Project Management

Open Project
OpenProject is an open-source project management software solution that facilitates collaboration, task tracking, and project planning. It offers features such as Gantt charts, Agile boards, and time tracking.

Various

WinSCP
WinSCP is a free and open-source SFTP client and FTP client for Windows, enabling secure file transfers between local and remote systems. It supports various transfer protocols and offers a user-friendly interface.
PuTTY
PuTTY is a widely-used SSH and telnet client for Windows, developed to provide secure remote access to Unix-based systems. It offers a simple yet powerful interface for managing SSH connections.
KeyStore Explorer
KeyStore Explorer is a GUI tool that serves as a replacement for the Java command-line utilities keytool and jarsigner. It simplifies the management of cryptographic keys, certificates, and keystores.
APNS/GCM Online Tester
APNS/GCM Online Tester is a web-based tool for testing Apple Push Notification Service (APNS) and Google Cloud Messaging (GCM) integrations. It allows developers to send test notifications and verify their implementation.

Data - RDBMS

DBeaver
DBeaver is a versatile multi-platform database tool designed for developers, SQL programmers, and database administrators. It supports various relational database management systems (RDBMS) and offers advanced SQL editing capabilities.
MySQL Notifier
MySQL Notifier is a lightweight utility that monitors Windows and MySQL services, providing notifications about changes in their status. It helps administrators keep track of MySQL server operations and troubleshoot issues efficiently.

Data - NoSQL

MongoDB Compass
MongoDB Compass is a graphical user interface (GUI) for MongoDB that simplifies database management and query building. It offers visual tools for exploring data, creating indexes, and analyzing performance metrics.

Others

HOW SECURE IS MY PASSWORD?
Password strength meter.
DevOps Bookmarks
Discover tools and frameworks in the DevOps landscape.
Lambda Test
Cross Browser Testing Cloud
Epoch Converter
Epoch & Unix Timestamp Conversion Tools
quicktype
Generate types and converters from JSON, Schema, and GraphQL
User Agent String
Analyze User Agent of you browser or any other user agent string.

Editor

GitBook Editor
Expressive Markup (Markdown/Asciidoc), Advanced proofreading, Version Control
Laverna
Markdown note taking app focused on privacy. Consider it like open source alternative to Evernote.

Design

Behance
Showcase and discover the latest work from top online portfolios by creative professionals across industries.

Markdown

Markdown PDF
This Visual Studio Code extension convert Markdown file to pdf, html, png or jpeg file.
markdownlint
markdown lint.
markdown-toc
Markdown Table of Contents generator.
Markdown to HTML Converter
Markdown to HTML Converter.

Useful Websites/Services for Software Development

Discover a curated list of websites and services that can enhance your software development workflow. From platforms offering cloud services to tools for data analysis and design inspiration, these resources cover various aspects of the development lifecycle.

Platform as a Service (PaaS - Free Tier)

Heroku
Heroku is a cloud platform that lets you build, deliver, monitor, and scale apps effortlessly. With its easy-to-use interface and extensive documentation, Heroku is a popular choice among developers for deploying applications.
Amazon Web Services
Amazon Web Services (AWS) provides a secure cloud services platform, offering a wide range of services including compute power, database storage, content delivery, and more. With its pay-as-you-go pricing model and global infrastructure, AWS is trusted by businesses of all sizes.
mLab
mLab is a Database-as-a-Service for MongoDB, offering a simple and scalable solution for managing MongoDB databases in the cloud. With features like automated backups and 24/7 support, mLab is a reliable choice for hosting MongoDB databases.
Netlify
Build, deploy, and manage modern web projects with Netlify. Featuring continuous deployment, serverless functions, and form handling, Netlify simplifies the process of building and deploying websites.

Data

Kaggle Datasets
Kaggle Datasets is the premier destination for discovering and analyzing publicly-available data. With thousands of datasets covering a wide range of topics, Kaggle Datasets is a valuable resource for data scientists and analysts.
Academic Torrents
Academic Torrents is a distributed system for sharing enormous datasets for researchers, by researchers. With its peer-to-peer architecture and comprehensive dataset collection, Academic Torrents is a valuable resource for academic research.
Mockaroo
Mockaroo is a powerful tool for generating realistic test data in various formats, including CSV, JSON, SQL, and Excel. With features like custom data generation and batch processing, Mockaroo is an essential tool for testing and development.
JSON viewer
JSON viewer is a plugin for Notepad++ that allows you to view JSON strings in a tree view format. With its intuitive interface and powerful features, JSON viewer makes it easy to navigate and analyze JSON data.
JSON to CSV Converter
JSON to CSV Converter is a free, in-browser tool for converting JSON data to CSV format. With its simple and intuitive interface, JSON to CSV Converter makes it easy to convert and download JSON data in CSV format.
JSON2table
JSON2table is a tool for visualizing JSON data in table and tree view format. With its intuitive interface and powerful features, JSON2table makes it easy to analyze and validate JSON data.

Design

Dribbble
Dribbble is a community of designers sharing their work and providing inspiration to others. With its vast collection of design samples and active community, Dribbble is a valuable resource for designers looking for inspiration and feedback.
UpLabs
UpLabs curates the best of design and development inspiration, resources, and freebies. With its comprehensive collection of design assets and active community, UpLabs is a valuable resource for designers and developers alike.
MaterialUp
MaterialUp curates the best of Material Design, Google's design language for creating intuitive and beautiful user interfaces. With its extensive collection of Material Design examples and resources, MaterialUp is a valuable resource for designers and developers working with Material Design.
iOSUp
iOSUp curates the best of iOS and macOS design, providing inspiration and resources for designers and developers working on Apple platforms. With its comprehensive collection of iOS and macOS design examples and resources, iOSUp is a valuable resource for Apple developers.
SiteUp
SiteUp curates the best of web design, providing inspiration and resources for designers and developers working on web projects. With its extensive collection of web design examples and resources, SiteUp is a valuable resource for web developers.

Java

Codota
Codota is an AI Pair Programmer that provides contextual code suggestions based on your code. With its machine learning algorithms and extensive code database, Codota helps developers write code faster and more efficiently.
java2s
java2s is a repository of code examples and tutorials organized by topic. With its extensive collection of Java code examples and tutorials, java2s is a valuable resource for Java developers.

Various

CodeBeautify
CodeBeautify offers a suite of online tools for developers, including beautifiers, validators, minifiers, and converters for various programming languages and formats. With its comprehensive collection of tools, CodeBeautify simplifies common development tasks and improves productivity.
Free Formatter
Free Formatter offers a variety of tools for developers, including formatters, validators, code minifiers, string escapers, encoders and decoders, and more. With its extensive collection of tools, Free Formatter is a valuable resource for developers working with various data formats and programming languages.
JS Bin
JS Bin is an open-source collaborative web development debugging tool. With its live editing and real-time collaboration features, JS Bin makes it easy to debug and experiment with web code.
Google Fonts
Google Fonts offers a vast collection of free, open-source fonts optimized for the web. With its easy-to-use interface and extensive collection of fonts, Google Fonts is a valuable resource for designers and developers looking to enhance their web projects with beautiful typography.
Coggle
Coggle is a collaborative mind-mapping tool that helps you make sense of complex ideas and processes. With its intuitive interface and real-time collaboration features, Coggle makes it easy to brainstorm, organize, and visualize your ideas.
CII Best Practices Badge Program
The CII Best Practices Badge Program is a way for FLOSS projects to show that they follow best practices. With its comprehensive set of guidelines and criteria, the CII Best Practices Badge Program helps projects improve their security, reliability, and sustainability.

Text to ASCII Art

TAAG
TAAG is a text-to-ASCII art generator that allows you to convert text into visually appealing ASCII art. With its customizable options and easy-to-use interface, TAAG is a fun tool for creating unique text designs.

Graphic Resources

Freepik
Freepik offers a wide range of graphic resources for everyone, including free vectors, stock photos, PSD files, and icons. With its extensive collection of high-quality graphics, Freepik is a valuable resource for designers and developers alike.

Productivity Tracking

WakaTime
WakaTime helps you quantify your coding by providing metrics, insights, and time tracking automatically generated from your programming activity. With its detailed reports and integrations with popular IDEs, WakaTime helps you understand and improve your coding habits.

Development Utilities

Online Spring Boot Banner Generator
The Online Spring Boot Banner Generator allows you to create custom banners for your Spring Boot applications. With its easy-to-use interface and customizable options, the generator makes it simple to add a personalized touch to your projects.
GitHub and BitBucket HTML Preview
GitHub and BitBucket HTML Preview is a handy tool that allows you to preview HTML files hosted on GitHub or BitBucket. With its simple interface and seamless integration with version control platforms, HTML Preview makes it easy to view and share HTML content online.

Unlocking Data Visualization Brilliance with Apache Superset

Anantha Raju C — Sun, 13 Nov 2022 14:16:20 GMT

Introduction

In this guide, we'll embark on an exploration of Apache Superset, a cutting-edge platform for data exploration and visualization. By leveraging Apache Superset, we empower users to uncover insights and communicate data stories effectively through stunning visualizations.

"Data and information visualization is an interdisciplinary field that deals with the graphic representation of data and information. It is a particularly efficient way of communicating when the data or information is numerous as for example a time series."

Minimum Software Requirements

Ensure you have the following components installed to kickstart your Apache Superset journey:

Getting Started

Setup

First, ensure Docker and Docker Compose are installed on your system by checking their versions:

docker version
docker-compose version

Database Setup

In this post we will use docker to deploy Apache Superset, in order to allow Apache Superset to connect to locally installed MySQL database we will have to perform the following operations.

Enable MySQL to be able to listen for an external IP address where the server can be reached:

sudo nano /etc/mysql/mysql.conf.d/mysqld.cnf

update bind-address directive to a wildcard IP address, either *****, ::, or 0.0.0.0 to reference an external IP address.

By default, bind-address is set to 127.0.0.1, meaning that the server will only look for local connections.

Reference: https://www.digitalocean.com/community/tutorials/how-to-allow-remote-access-to-mysql

Create MySQL users for Apache Superset:

CREATE USER 'myuser'@'localhost' IDENTIFIED BY 'mypass';
CREATE USER 'myuser'@'%' IDENTIFIED BY 'mypass';

GRANT ALL ON *.* TO 'myuser'@'localhost';
GRANT ALL ON *.* TO 'myuser'@'%';

FLUSH PRIVILEGES;

Reference: https://stackoverflow.com/a/55742963

Bringing up Apache Superset

Updating docker-compose-non-dev.yml to connect to localhost.

Clone the Apache Superset repository and navigate to the directory:

git clone https://github.com/apache/superset.git
cd superset/

Pull and start Apache Superset using Docker Compose:

docker-compose -f docker-compose-non-dev.yml pull
docker-compose -f docker-compose-non-dev.yml up

Reference: https://superset.apache.org/docs/installation/installing-superset-using-docker-compose

Accessing Superset GUI

Navigate to localhost:8088/login/ in your web browser to access the Apache Superset GUI.

Connect to Database

To connect Apache Superset to your MySQL database, use the following connection details:

Host: 172.17.0.1
Port: 3306
Database Name: mysqmpledb
Username: root
Password: root
Display Name: MySQL - LocalHost

With Apache Superset, you unlock a world of possibilities for data exploration and visualization. Harness its capabilities to transform your data into actionable insights and captivating visualizations that drive informed decision-making and empower data-driven storytelling.

Empower Real-Time Stream Processing with ksqlDB and ksql-cli

Anantha Raju C — Fri, 04 Nov 2022 14:10:01 GMT

In this guide, we'll delve into setting up ksqlDB—a powerful database purpose-built for stream processing applications—alongside its command-line interface, ksql-cli. With these tools, you can seamlessly explore, query, and transform streaming data to derive valuable insights in real-time.

Introduction

ksqlDB stands as a cutting-edge database designed specifically for stream processing applications. It offers a streamlined platform for processing and analyzing streaming data, enabling users to unlock insights and respond to events as they occur. Coupled with ksql-cli, a command-line interface for interacting with ksqlDB, you gain unparalleled flexibility and control over your stream processing workflows.

Minimum Software Requirements

Ensure you have the following software components installed to kickstart your ksqlDB journey:

Docker (for ksqldb-server, ksqldb-cli)
Docker Compose

Getting Started

Setup

Running the application via docker compose

Leverage Docker Compose to orchestrate the deployment of ksqlDB and ksql-cli:

 ksqldb-server:
    image: confluentinc/ksqldb-server:latest
    container_name: ksqldb-server
    hostname: ksqldb-server
    depends_on: [kafka]
    ports:
      - 8088:8088
    networks:
      - webproxy
    environment:
      KSQL_LISTENERS: http://0.0.0.0:8088
      KSQL_BOOTSTRAP_SERVERS: kafka:9092
      KSQL_KSQL_LOGGING_PROCESSING_STREAM_AUTO_CREATE: "true"
      KSQL_KSQL_LOGGING_PROCESSING_TOPIC_AUTO_CREATE: "true"
      KSQL_KSQL_CONNECT_URL: http://kafka-connect-01:8084
      KSQL_KSQL_SCHEMA_REGISTRY_URL: http://schema-registry:8081

  ksqldb-cli:
    image: confluentinc/ksqldb-cli:latest
    container_name: ksqldb-cli
    networks:
      - webproxy
    depends_on: [kafka, ksqldb-server]
    entrypoint: /bin/sh
    tty: true
    environment:
      KSQL_KSQL_CONNECT_URL: http://kafka-connect-01:8084

Pull all required docker images

$ docker-compose pull

Start up the environment

Initiate the Docker containers to start up the ksqlDB environment.

The first time that you do this, the Docker images will be pulled down from the remote server. This may take a while!

$ docker-compose up

Creating ksqldb-server      ... done
Creating ksqldb-cli         ... done

Accessing ksqlDb via ksqldb-cli

Interact with ksqlDB using ksql-cli:

$ docker exec -it ksqldb-cli ksql http://ksqldb-server:8088

Check topics, streams and tables

show topics;

show streams;

show tables;

Streams

Declare Streams

SET 'auto.offset.reset' = 'earliest';

CREATE STREAM PERSON_STREAM (id bigint,uuid VARCHAR,created_date_time TIMESTAMP,last_modified_date_time TIMESTAMP,name VARCHAR,username VARCHAR,address_id bigint) WITH (KAFKA_TOPIC='mysql.streaming_etl_db.person',VALUE_FORMAT='JSON');

CREATE STREAM ADDRESS_STREAM (id bigint,uuid VARCHAR,created_date_time TIMESTAMP,last_modified_date_time TIMESTAMP,city VARCHAR,street VARCHAR,suite VARCHAR,zipcode VARCHAR,geo_id bigint) WITH (KAFKA_TOPIC='mysql.streaming_etl_db.address',VALUE_FORMAT='JSON');

Queries

DESCRIBE PERSON_STREAM;

select * from PERSON_STREAM;

SELECT * FROM PERSON_STREAM EMIT CHANGES LIMIT 1;

+-------------------------+-------------------------+-------------------------+-------------------------+-------------------------+-------------------------+-------------------------+
|ID                       |UUID                     |CREATED_DATE_TIME        |LAST_MODIFIED_DATE_TIME  |NAME                     |USERNAME                 |ADDRESS_ID               |
+-------------------------+-------------------------+-------------------------+-------------------------+-------------------------+-------------------------+-------------------------+
|1                        |ce8d2120-1f93-11ed-8647-0|2022-08-19T13:22:00.000  |2022-08-19T13:22:00.000  |d14                      |dbz14                    |1                        |
|                         |c9a3cfadc50              |                         |                         |                         |                         |                         |
Limit Reached
Query terminated

stream-stream join

CREATE STREAM PERSON_ADDRESS_ENRICHED_STREAM WITH (FORMAT='JSON', KAFKA_TOPIC='person_address_enriched', PARTITIONS=1, REPLICAS=1) AS 
SELECT
  P.ID P_ID,
  A.ID A_ID,
  P.NAME NAME,
  A.CITY CITY
FROM PERSON_STREAM P
LEFT OUTER JOIN ADDRESS_STREAM A WITHIN 1 HOURS GRACE PERIOD 30 MINUTES ON ((A.ID = P.ADDRESS_ID))
EMIT CHANGES;

Tables

Declare Tables

CREATE TABLE PERSON (id bigint PRIMARY KEY,uuid VARCHAR,created_date_time TIMESTAMP,last_modified_date_time TIMESTAMP,name VARCHAR,username VARCHAR,address_id bigint) WITH (KAFKA_TOPIC='mysql.streaming_etl_db.person',VALUE_FORMAT='JSON');

CREATE TABLE ADDRESS (id bigint PRIMARY KEY,uuid VARCHAR,created_date_time TIMESTAMP,last_modified_date_time TIMESTAMP,city VARCHAR,street VARCHAR,suite VARCHAR,zipcode VARCHAR,geo_id bigint) WITH (KAFKA_TOPIC='mysql.streaming_etl_db.address',VALUE_FORMAT='JSON');

Query Tables

SELECT * FROM PERSON EMIT CHANGES LIMIT 1;

SELECT * FROM ADDRESS EMIT CHANGES LIMIT 1;

Table Joins

SELECT 
	P.NAME,
	A.CITY
FROM PERSON P
LEFT JOIN ADDRESS A on A.id = P.address_id
EMIT CHANGES 
LIMIT 1;

SELECT 
  P.NAME, 
  A.CITY
FROM PERSON P
INNER JOIN ADDRESS A
ON A.id = P.address_id
EMIT CHANGES
LIMIT 1;

CREATE TABLE PERSON_ADDRESS_ENRICHED (P_ID bigint,A_ID bigint,NAME VARCHAR,CITY VARCHAR) WITH (KAFKA_TOPIC='person_address_enriched',VALUE_FORMAT='JSON');

Others

DROP TABLE IF EXISTS PERSON;

Tear down the stack

When done, tear down the ksqlDB stack:

$ docker compose-down

Stopping ksqldb-cli       ... done
Stopping ksqldb-server    ... done
Removing ksqldb-cli         ... done
Removing ksqldb-server      ... done

If you want to preserve the state of all containers, run docker-compose stop instead.

By following these steps, you can harness the power of ksqlDB and ksql-cli to perform real-time stream processing, enabling you to extract insights and derive value from streaming data with ease and efficiency.

Learning & Reference Resources for Developers

Anantha Raju C — Mon, 31 Oct 2022 03:26:15 GMT

Expand your knowledge and enhance your skills with these curated learning and reference resources. From versioning guidelines to design patterns and web tutorials, these resources provide valuable information for developers at all levels.

Various Resources

Semantic Versioning 2.0.0

Description: Understand the guidelines for semantic versioning. This resource provides a standardized versioning system that helps you manage project releases and dependencies effectively.

Java Design Patterns

Description: Explore a comprehensive collection of design patterns implemented in Java. This site offers clear examples and explanations, making it an invaluable resource for mastering design patterns in Java.

Common Words

Description: Visualize the common words used in different programming languages. This interactive tool helps you understand language syntax and common terminology across various programming languages.

Web Resources

REST API Tutorial

Description: Learn about RESTful APIs with this detailed tutorial. It covers the fundamentals of REST architecture, design principles, and best practices for building and consuming RESTful web services.

HTTP Status Codes

Description: Familiarize yourself with HTTP status codes using this comprehensive resource. It provides detailed descriptions and examples of various HTTP status codes, helping you understand their meanings and proper usage in web development.

Conclusion

By leveraging these learning and reference resources, you can deepen your understanding of key concepts and best practices in software development. Whether you're looking to master versioning guidelines, design patterns, or RESTful APIs, these resources offer valuable insights to help you grow as a developer.

Software Development - Read & Publish

Anantha Raju C — Sun, 30 Oct 2022 12:14:04 GMT

Explore a variety of resources that can enhance your software development journey. Whether you're looking for entertainment, tools for creating presentations, community platforms, or documentation solutions, these resources have you covered.

Entertainment

CommitStrip

Description: Enjoy a daily strip that humorously captures the life of a coder. CommitStrip combines funny anecdotes with topical tech news, making it a delightful read for developers.

Don't Hit Save

Description: Dive into a comic that brings humor to tech, the workplace, and indie game development. Don't Hit Save offers a light-hearted take on the challenges and quirks of being a developer.

Presentation

Speaker Deck

Description: Share your presentations online effortlessly with Speaker Deck. This platform makes it easy to upload and share your slides with a global audience, ensuring your ideas reach far and wide.

SlideShare

Description: Discover, share, and present professional content with SlideShare. As the world's largest community for sharing presentations and infographics, SlideShare is an invaluable resource for professionals looking to disseminate knowledge and insights.

Marp

Description: Create beautiful presentations using Markdown with Marp. This tool transforms your Markdown files into stunning slide decks, making it perfect for developers who prefer writing in plain text.

Community

DZone

Description: Join a vibrant community of software professionals on DZone. This platform publishes technical content and provides a space for developers to share knowledge, stay updated on industry trends, and enhance their skills.

CodeProject

Description: Access free source code and tutorials on CodeProject. This resource offers extensive programming help, particularly for Windows developers working with Visual Basic .NET and other .NET languages.

ProgrammableWeb

Description: Stay informed about the API economy with ProgrammableWeb. This site is a leading source of news and information about APIs, offering a comprehensive API directory and chronicling the evolution of web services.

StackShare

Description: Discover and discuss the best software tools and services on StackShare. This platform allows developers to compare tools, share experiences, and make informed decisions about the technologies they use.

Documentation

Read the Docs

Description: Host and browse documentation effortlessly with Read the Docs. This service makes your documentation fully searchable and easy to find, supporting imports from major version control systems.

Docusaurus

Description: Build and maintain open-source documentation websites with ease using Docusaurus. This tool helps you create beautiful documentation sites that are easy to update and manage.

asciinema

Description: Record and share terminal sessions with asciinema. This free and open-source solution captures terminal activity and allows you to share it on the web, making it an excellent tool for creating tutorials and demonstrating workflows.

Conclusion

Utilize these resources to enhance your software development process, stay entertained, and connect with the community. Whether you need tools for presentations, community support, or comprehensive documentation solutions, these platforms provide the necessary support to help you succeed in your projects.

Change Data Capture (CDC) - MySQL

Anantha Raju C — Mon, 24 Oct 2022 12:08:33 GMT

Introduction

Change Data Capture (CDC) is a powerful process that tracks and captures changes made to data in a database, delivering those changes in real-time to downstream processes or systems. This capability is crucial for maintaining data consistency across distributed systems, enabling real-time analytics, and more.

In this post, we will briefly document the process of setting up CDC with MySQL using various tools such as Docker, Kafka, and Debezium.

Minimum Software Requirements

Before we get started, ensure you have the following software installed on your machine:

Docker: Used to containerize applications (MySQL, Zookeeper, Kafka, Debezium, schema-registry, kafka-ui, ksqldb-server, ksqldb-cli)
Docker Compose: A tool for defining and running multi-container Docker applications.
MySQL Database: : The relational database management system.
MySQL Workbench: Or on any other MySQL database client/console.

MySQL Configuration (`mysql.cnf`)

To enable CDC on your MySQL database, you need to configure the MySQL server appropriately. Here’s an example of what your mysql.cnf file should look like:

[mysqld]
server-id         = 223344
log_bin           = mysql-bin
expire_logs_days  = 1
binlog_format     = row

server-id: A unique identifier for the server.
log_bin: Enables binary logging.
expire_logs_days: Specifies the number of days to retain binary logs.
binlog_format: Sets the binary log format to 'row', which is necessary for CDC.

Sample Project Setup

Follow these steps to set up the CDC environment

1. Clone the repository

First, clone the repository containing the project setup:

git clone https://github.com/AnanthaRajuC/Streaming_ETL_pipeline_MySQL.git
cd Streaming_ETL_pipeline_MySQL

Running the application via docker compose

2. Pull all required docker images

Pull all the necessary Docker images defined in the Docker Compose file:

docker compose -f docker-compose.yaml pull

3. Start up the environment

Start the Docker containers. The first time you do this, the images will be downloaded from the remote server, which may take some time:

docker compose -f docker-compose.yaml up

You should see output indicating that various services are being created and started:

Creating network "streaming_etl_pipeline_mysql_webproxy" with driver "bridge"
Creating zookeeper ... done
Creating kafka     ... done
Creating debezium           ... done
Creating cp-schema-registry ... done
Creating kafka-connect-01   ... done
Creating ksqldb-server      ... done
Creating ksqldb-cli         ... done

4. Accessing Kafka Topics via Kakfka-UI

Optionally, start Kafka UI, an open-source web UI for Apache Kafka Management

docker run --name=kafka-ui --network=streaming_etl_pipeline_mysql_webproxy -p 8080:8080 -e KAFKA_CLUSTERS_0_NAME=local -e KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS=kafka:9092 -d provectuslabs/kafka-ui:latest

Access the Kafka-UI console at: http://localhost:8080

5. Verify Everything is Running

Ensure all containers are up and running:

docker ps

IMPORTANT: If any components do not show "Up" under the Status column (e.g., they say "Exit") then you must rectify this before continuing. As a first solution, try re-issuing the docker-compose up -d command.

6. Access ksqlDB via ksqlDB-CLI

Launch the KSQL CLI in another terminal window.

docker exec -it ksqldb-cli ksql http://ksqldb-server:8088

Tear down the stack

To stop and remove the containers, run:

docker stop kafka-ui

docker rm kafka-ui

Then, bring down the rest of the environment:

docker compose -f docker-compose.yaml down

You should see output indicating that the containers are being stopped and removed:

Stopping ksqldb-cli       ... done
Stopping ksqldb-server    ... done
Stopping kafka-connect-01 ... done
Stopping debezium         ... done
Stopping kafka            ... done
Stopping zookeeper        ... done
Removing ksqldb-cli         ... done
Removing ksqldb-server      ... done
Removing kafka-connect-01   ... done
Removing cp-schema-registry ... done
Removing debezium           ... done
Removing kafka              ... done
Removing zookeeper          ... done
Removing network streaming_etl_pipeline_mysql_webproxy

If you want to preserve the state of all containers, run docker-compose stop instead.

Initial MySQL preparation

To get started with our MySQL database, we'll set up the initial schema, tables, and populate them with sample data. This preparation is crucial for ensuring that our database is ready for further development and integration.

MySQL

1. Declare schema, user and permissions.

First, create a new schema and user, then grant the necessary permissions to the user.

-- create schema
CREATE SCHEMA streaming_etl_db;

-- use schema
USE streaming_etl_db;

-- Create user 
CREATE USER 'debezium' IDENTIFIED WITH mysql_native_password BY 'Debezium@123#';

-- Grant privileges to user
GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'debezium';

-- Reload the grant tables in the mysql database enabling the changes to take effect without reloading or restarting mysql service
FLUSH PRIVILEGES;

2. Create Tables

Next, create the tables for storing geographical data, addresses, and personal information. Each table includes fields and constraints tailored to the data it will store.

-- Table for geographical data
CREATE TABLE `geo` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT COMMENT 'Unique ID for each entry.',
  `uuid` VARCHAR(50) DEFAULT (uuid()),
  `created_date_time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'Field representing the date the entity containing the field was created.',
  `last_modified_date_time` datetime DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP ,
  `lat` varchar(255) DEFAULT NULL,
  `lng` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COMMENT='Application Log.';

-- Table for addresses
CREATE TABLE `address` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT COMMENT 'Unique ID for each entry.',
  `uuid` VARCHAR(50) DEFAULT (uuid()),
  `created_date_time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'Field representing the date the entity containing the field was created.',
  `last_modified_date_time` datetime DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP ,
  `city` varchar(255) DEFAULT NULL,
  `zipcode` varchar(255) DEFAULT NULL,
  `state` varchar(255) DEFAULT NULL,
  `geo_id` bigint(20) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `FK_geo_id` (`geo_id`),
  CONSTRAINT `FKC_geo_id` FOREIGN KEY (`geo_id`) REFERENCES `geo` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

-- Table for personal information
CREATE TABLE `person` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT COMMENT 'Unique ID for each entry.',
  `uuid` VARCHAR(50) DEFAULT (uuid()),
  `created_date_time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'Field representing the date the entity containing the field was created.',
  `last_modified_date_time` datetime DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP ,
  `first_name` varchar(255) NOT NULL,
  `last_name` varchar(255) DEFAULT NULL,
  `email` varchar(255) DEFAULT NULL,
  `gender` varchar(255) DEFAULT NULL,
  `registration` datetime DEFAULT NULL,
  `age` int DEFAULT NULL,
  `address_id` bigint(20) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `FK_address_id` (`address_id`),
  CONSTRAINT `FKC_address_id` FOREIGN KEY (`address_id`) REFERENCES `address` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

3. Insert Sample Data

Populate the tables with some sample data to verify the structure and relationships.

-- Insert sample data into geo table
INSERT INTO `streaming_etl_db`.`geo`(`lat`,`lng`)VALUES('la14','lo14');

-- Insert sample data into address table
INSERT INTO `streaming_etl_db`.`address`(`city`,`zipcode`,`state`,`geo_id`)VALUES('c14','z14','s14',1);

-- Insert sample data into person table
INSERT INTO `streaming_etl_db`.`person`(`first_name`,`last_name`,`email`,`gender`,`registration`,`age`,`address_id`)VALUES('fn14','ln14','example@domain.com','M',now(),34,1);

4. Select Statements

Retrieve data from the tables to ensure everything is set up correctly.

-- Join query to retrieve data from person, address, and geo tables
SELECT * 
FROM streaming_etl_db.person p
LEFT JOIN streaming_etl_db.address a on a.id = p.address_id
LEFT JOIN streaming_etl_db.geo g on g.id = a.geo_id;

-- Select all data from person table
SELECT * FROM streaming_etl_db.person;

-- Select all data from address table
SELECT * FROM streaming_etl_db.address;

By following these steps, you will have a fully prepared MySQL database with the necessary schema, tables, and sample data. This setup will serve as a foundation for further development and data integration.

Debezium Registration

In this section, we'll set up and register the Debezium connector to monitor changes in our MySQL database and stream them to Kafka. Debezium is an open-source distributed platform for change data capture (CDC).

Registering the Debezium Connector

To register the Debezium connector, use the following curl command.

curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" 127.0.0.1:8083/connectors/ -d '{
  "name": "streaming_ETL_pipeline_MySQL-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "172.17.0.1",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "Debezium@123#",
    "database.server.name": "mysql",
	  "database.server.id": "223344",
    "database.include.list": "streaming_etl_db",
	  "database.allowPublicKeyRetrieval": true,
	  "database.history.kafka.bootstrap.servers": "kafka:9092",
	  "database.history.kafka.topic": "mysql-streaming_etl_db-person",
	  "time.precision.mode": "connect",
    "include.schema.changes": false,
    "transforms": "unwrap,dropTopicPrefix",
	  "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
	  "transforms.dropTopicPrefix.type":"org.apache.kafka.connect.transforms.RegexRouter",
	  "transforms.dropTopicPrefix.regex":"asgard.demo.(.*)",
	  "transforms.dropTopicPrefix.replacement":"$1",
	  "key.converter":"org.apache.kafka.connect.json.JsonConverter",
	  "key.converter.schemas.enable": "false",
	  "value.converter":"org.apache.kafka.connect.json.JsonConverter",
	  "value.converter.schemas.enable": "false"
  }
}'

This command will:

Create a new connector named streaming_ETL_pipeline_MySQL-connector.
Configure it to monitor the streaming_etl_db schema on the MySQL instance.
Stream the change data to a Kafka topic named mysql-streaming_etl_db-person.

After registering the Debezium connector, you can check the status and information of the connectors.

Checking Connector Status
To verify the status of the connectors, visit:

http://localhost:8083/connectors?expand=info&expand=status

To check the status of the specific Debezium connector, visit:

http://localhost:8083/connectors/streaming_ETL_pipeline_MySQL-connector/status

Kafka UI

You can also use the Kafka UI to inspect the topics and messages being streamed. Visit the Kafka UI at:

http://localhost:8080

Viewing the Person Topic

Finally, check the messages in the person topic to ensure data is being streamed correctly:

Accessing ksqlDb via ksqldb-cli

In this section, we will explore how to access and use ksqlDB via the ksqlDB-CLI to interact with Kafka topics, create streams, perform stream-stream joins, and finally, sink the enriched stream back to MySQL.

Checking Topics, Streams, and Tables
Start by checking the available topics, streams, and tables in ksqlDB.

-- Display available topics
SHOW TOPICS;

-- Display available streams
SHOW STREAMS;

-- Display available tables
SHOW TABLES;

Declaring Streams
To begin processing data, declare the necessary streams. Set the offset to the earliest to ensure you capture all existing messages.

-- Set offset to the earliest
SET 'auto.offset.reset' = 'earliest';

Create streams to capture data from the Kafka topics corresponding to the MySQL tables.

-- Create stream for the person topic
CREATE STREAM PERSON_STREAM (
  id BIGINT,
  uuid VARCHAR,
  created_date_time TIMESTAMP,
  last_modified_date_time TIMESTAMP,
  name VARCHAR,
  username VARCHAR,
  address_id BIGINT
) WITH (KAFKA_TOPIC='mysql.streaming_etl_db.person', VALUE_FORMAT='JSON');

-- Create stream for the address topic
CREATE STREAM ADDRESS_STREAM (
  id BIGINT,
  uuid VARCHAR,
  created_date_time TIMESTAMP,
  last_modified_date_time TIMESTAMP,
  city VARCHAR,
  street VARCHAR,
  suite VARCHAR,
  zipcode VARCHAR,
  geo_id BIGINT
) WITH (KAFKA_TOPIC='mysql.streaming_etl_db.address', VALUE_FORMAT='JSON');

Querying Streams
Retrieve data from the streams to ensure they are correctly set up.

-- Select a single record from the PERSON_STREAM
SELECT * FROM PERSON_STREAM EMIT CHANGES LIMIT 1;

+-------------------------+-------------------------+-------------------------+-------------------------+-------------------------+-------------------------+-------------------------+
|ID                       |UUID                     |CREATED_DATE_TIME        |LAST_MODIFIED_DATE_TIME  |NAME                     |USERNAME                 |ADDRESS_ID               |
+-------------------------+-------------------------+-------------------------+-------------------------+-------------------------+-------------------------+-------------------------+
|1                        |ce8d2120-1f93-11ed-8647-0|2022-08-19T13:22:00.000  |2022-08-19T13:22:00.000  |d14                      |dbz14                    |1                        |
|                         |c9a3cfadc50              |                         |                         |                         |                         |                         |
Limit Reached
Query terminated

Describe the stream to get details about its schema.

-- Describe the PERSON_STREAM
DESCRIBE PERSON_STREAM;

-- Select all records from PERSON_STREAM
SELECT * FROM PERSON_STREAM;

Stream-Stream Join
Perform a join between the PERSON_STREAM and ADDRESS_STREAM to create an enriched stream combining data from both.

-- Create an enriched stream by joining PERSON_STREAM and ADDRESS_STREAM
CREATE STREAM PERSON_ADDRESS_ENRICHED_STREAM WITH (
FORMAT='JSON', 
KAFKA_TOPIC='person_address_enriched', 
PARTITIONS=1, 
REPLICAS=1
) AS 
SELECT
  P.ID P_ID,
  A.ID A_ID,
  P.NAME NAME,
  A.CITY CITY
FROM PERSON_STREAM P
LEFT OUTER JOIN ADDRESS_STREAM A WITHIN 1 HOURS GRACE PERIOD 30 MINUTES 
ON ((A.ID = P.ADDRESS_ID))
EMIT CHANGES;

Kafka Sink MySQL DB

Finally, create a sink connector to write the enriched stream back to the MySQL database.

-- Create a sink connector for the enriched stream
CREATE SINK CONNECTOR SINK_PERSON_ADDRESS_ENRICHED_STREAM WITH (
  'connector.class'            = 'io.confluent.connect.jdbc.JdbcSinkConnector',
  'connection.url'             = 'jdbc:mysql://172.17.0.1:3306/',
  'connection.user'            = 'debezium',
  'connection.password'        = 'Debezium@123#',
  'topics'                     = 'PERSON_ADDRESS_ENRICHED_STREAM',
  'key.converter'              = 'org.apache.kafka.connect.json.JsonConverter',
  'key.converter.schemas.enable' = 'false',
  'value.converter'            = 'org.apache.kafka.connect.json.JsonConverter',
  'value.converter.schemas.enable' = 'false'
);

By following these steps, you can access ksqlDB via the ksqlDB-CLI, create and query streams, perform joins, and sink the enriched data back to MySQL.

Accessing ksqlDb via ksqldb-cli

In this section, we'll delve into how to access ksqlDB using the ksqlDB-CLI, focusing on creating and querying tables, and performing joins.

Checking Topics, Streams, and Tables
Start by listing the available topics, streams, and tables in your ksqlDB environment to ensure you have everything set up correctly.

-- Display available topics
SHOW TOPICS;

-- Display available streams
SHOW STREAMS;

-- Display available tables
SHOW TABLES;

Declaring Tables
Next, declare tables to represent your Kafka topics in ksqlDB. These tables will enable you to perform SQL-like queries on streaming data.

-- Create table for the person topic
CREATE TABLE PERSON (
  id BIGINT PRIMARY KEY,
  uuid VARCHAR,
  created_date_time TIMESTAMP,
  last_modified_date_time TIMESTAMP,
  name VARCHAR,
  username VARCHAR,
  address_id BIGINT
) WITH (KAFKA_TOPIC='mysql.streaming_etl_db.person', VALUE_FORMAT='JSON');

-- Create table for the address topic
CREATE TABLE ADDRESS (
  id BIGINT PRIMARY KEY,
  uuid VARCHAR,
  created_date_time TIMESTAMP,
  last_modified_date_time TIMESTAMP,
  city VARCHAR,
  street VARCHAR,
  suite VARCHAR,
  zipcode VARCHAR,
  geo_id BIGINT
) WITH (KAFKA_TOPIC='mysql.streaming_etl_db.address', VALUE_FORMAT='JSON');

Querying Tables
Retrieve data from the tables to verify that they are correctly set up and ingesting data from the corresponding Kafka topics.

-- Select a single record from the PERSON table
SELECT * FROM PERSON EMIT CHANGES LIMIT 1;

-- Select a single record from the ADDRESS table
SELECT * FROM ADDRESS EMIT CHANGES LIMIT 1;

Performing Joins
Join the PERSON and ADDRESS tables to enrich the data, combining fields from both tables based on a common key.

-- Perform a left join between PERSON and ADDRESS tables
SELECT 
  P.NAME,
  A.CITY
FROM PERSON P
LEFT JOIN ADDRESS A 
ON A.id = P.address_id
EMIT CHANGES 
LIMIT 1;

-- Perform an inner join between PERSON and ADDRESS tables
SELECT 
  P.NAME, 
  A.CITY
FROM PERSON P
INNER JOIN ADDRESS A
ON A.id = P.address_id
EMIT CHANGES
LIMIT 1;

Creating Enriched Table
Create a new table to store the results of the join, providing a persistent view of the enriched data.

-- Create a table for the enriched person and address data
CREATE TABLE PERSON_ADDRESS_ENRICHED (
  P_ID BIGINT,
  A_ID BIGINT,
  NAME VARCHAR,
  CITY VARCHAR
) WITH (KAFKA_TOPIC='person_address_enriched', VALUE_FORMAT='JSON');

Managing Tables
You can also manage your tables by dropping them when they are no longer needed.

-- Drop the PERSON table if it exists
DROP TABLE IF EXISTS PERSON;

By following these steps, you can effectively use ksqlDB to create and manage tables, perform joins, and enrich your streaming data, enabling powerful real-time data processing capabilities in your ETL pipeline.

Conclusion

In this tutorial, we walked through the process of setting up a streaming ETL pipeline using MySQL, Debezium, Kafka, and ksqlDB. We began by preparing our MySQL database with the necessary schema, tables, and sample data. Then, we registered the Debezium connector to monitor changes and stream them to Kafka. Using ksqlDB, we created and queried streams and tables, performed joins to enrich our data, and finally, sank the enriched data back into MySQL.

By leveraging these powerful tools, you can build robust and scalable real-time data processing pipelines that enable you to respond to changes in your data as they happen. This setup provides a foundation for more complex stream processing tasks and opens up possibilities for real-time analytics, monitoring, and more.

Change Data Capture (CDC) with PostgreSQL and Debezium

Anantha Raju C — Sun, 14 Aug 2022 11:09:21 GMT

Change Data Capture (CDC) has revolutionized the way we interact with data, enabling real-time tracking and propagation of changes within databases. In this blog post, we'll delve into harnessing the power of CDC using two formidable tools: PostgreSQL and Debezium.

Introduction

Change Data Capture (CDC) is the backbone of modern data architectures, allowing seamless tracking and delivery of database changes to downstream systems. PostgreSQL, revered for its extensibility and SQL compliance, serves as our foundation for data storage. Complementing it is Debezium, a versatile open-source platform dedicated to change data capture.

Minimum Software Requirements

To embark on this journey, ensure you have Docker installed, facilitating the effortless setup of PostgreSQL, Zookeeper, Kafka, Debezium, and schema-registry components. Docker Compose streamlines the orchestration of these services, ensuring a hassle-free setup process.

Getting Started

Setup

To kickstart our CDC adventure, we utilize a Docker Compose configuration, encapsulating the required services. This configuration, detailed in the provided docker-compose.yaml file, lays the groundwork for our PostgreSQL and Debezium setup.

DB Configuration

In PostgreSQL, we define a STUDENT table with essential columns, setting the stage for our CDC operations. Leveraging PostgreSQL's robust functionality, we configure the STUDENT table with full REPLICA IDENTITY, ensuring comprehensive change tracking.

CREATE TABLE STUDENT(id INTEGER PRIMARY KEY, name VARCHAR);

ALTER TABLE public.student REPLICA IDENTITY FULL;

SELECT * FROM STUDENT;

Setup debezium Connector

Debezium seamlessly integrates with PostgreSQL via a dedicated connector. With the provided script, debezium-postgresql-connector.sh, we initiate and configure the connector, enabling seamless communication between our database and Debezium.

Tail kafka CDC topic

Monitoring the CDC topic in Kafka provides real-time insights into database changes. By tailing the Kafka CDC topic, as illustrated in tail_cdc_kafka_topic.sh, we gain immediate visibility into data modifications, empowering us to track changes effectively.

Add, Update, Delete Records

To witness CDC in action, we simulate data modifications within our PostgreSQL database. By executing SQL statements to add, update, and delete records in the STUDENT table, we trigger CDC events that are seamlessly propagated to the Kafka CDC topic.

-- Add records
INSERT INTO STUDENT(id,name) VALUES (1,'JOHN');
INSERT INTO STUDENT(id,name) VALUES (2,'JANE');

-- Update record
UPDATE STUDENT SET name='JOHNNY' WHERE ID=1;

In conclusion, the synergy between PostgreSQL and Debezium unlocks unparalleled capabilities in Change Data Capture, revolutionizing how we perceive and interact with real-time data changes. Through meticulous setup and intuitive configuration, harnessing CDC becomes not just a possibility but a powerful reality, propelling data-driven decision-making to new heights.

PostgreSQL, pgAdmin on Docker

Anantha Raju C — Sun, 14 Aug 2022 08:40:44 GMT

This post briefly documents the process of using Postgres, pgAdmin via Docker container.

Introduction

PostgreSQL, also known as Postgres, is a free and open-source relational database management system emphasizing extensibility and SQL compliance.

pgAdmin is the most popular and feature rich Open Source administration and development platform for PostgreSQL

Goals

Pull Docker Images from Docker Hub
Run the downloaded Postgres, pgAdmin Docker Image
Connect to the Database via pgAdmin
Interact with the MySQL Database via pgAdmin
Container Management (Stop, Restart, Stats)
Cleanup (Remove)

Minimum Software Requirements

Docker

Getting Started

Setup

Connect to Database via pgAdmin

http://localhost:5050/

Email Address : root@admin.com
Password : root

CREATE TABLE accounts (
	user_id serial PRIMARY KEY,
	username VARCHAR ( 50 ) UNIQUE NOT NULL,
	password VARCHAR ( 50 ) NOT NULL,
	email VARCHAR ( 255 ) UNIQUE NOT NULL,
	created_on TIMESTAMP NOT NULL DEFAULT now()
);

INSERT INTO accounts(username,password,email) VALUES('John Doe','mySecret','example@domain.com');

SELECT * FROM accounts;

Connect to DB via Terminal

docker exec -it postgres-latest /bin/sh

psql -U admin -d sample_db -W

Password: admin

sample_db=# \l

sample_db=# \dt

sample_db=# SELECT * FROM accounts;

sample_db=# exit

# exit

Cleanup (Remove)

docker-compose down