Thursday, June 18, 2026

 Demystifying dbt Fundamentals: A Multi-Warehouse Hands-On Guide from Setup to Production DevOps

Whether you're migrating an existing data pipeline or building a modern data stack from scratch, dbt (Data Build Tool) has become the industry standard for managing the T (Transformation) layer in ELT pipelines. By bringing software engineering best practices—like modularity, testing, version control, and CI/CD—to SQL workflows, dbt bridges the gap between data analytics and software engineering.

Recently, I decided to dive deep into the dbt Fundamentals ecosystem. Instead of testing a single warehouse environment, I pushed the boundaries to connect dbt with three of the market's leading cloud data platforms: Snowflake, Google BigQuery, and Databricks.

Here is a complete step-by-step documentation of my hands-on journey, the architectural configurations I used, and how I took raw data all the way into a scheduled production environment.

1: Preparing Data Platforms & Project Initialization

The initial phase required setting up raw data tables across different cloud environments to mimic production ingestion

Snowflake Setup: SQL worksheet to load data using S3 bucket

Next is to use dbt cloud interface, build snowflake connection, link GitHub repo and initialize project

Saturday, June 13, 2026

 

Building Project PulseStream: An End-to-End Real-Time Healthcare Data Pipeline on Azure

In my latest lab exercise, I designed and implemented Project PulseStream—an end-to-end real-time healthcare data ingestion and analytics pipeline.

The core business objective is to simulate real-time patient data streams to monitor hospital admissions, minimize patient waiting times, identify bottleneck departments (like the Emergency Room, Surgery, or ICU), and expose demographic insights using age and gender-based KPIs.

The full repository—complete with the simulator scripts, Spark notebooks, orchestration layouts, and SQL warehouse queries—is open-source and hosted on my GitHub:

👉 Project PulseStream GitHub Repository

Architecture & Data Flow

The pipeline follows a modern stream processing paradigm mapped to a classic Medallion Lakehouse structure:

  1. Data Ingestion: Local Python Kafka Simulator $\rightarrow$ Azure Event Hubs.

  2. Medallion Lakehouse Storage: Azure Data Lake Storage (ADLS Gen2) structured into bronze, silver, and gold zones.

  3. Stream Processing & Cleansing: Azure Databricks Spark compute integrated with Azure Key Vault for credential rotation.

  4. Orchestration: Azure Data Factory (ADF) pipeline triggers.

  5. Data Warehousing & BI: Serverless SQL Pools in Azure Synapse Analytics connecting directly to Power BI.

Real-Time Patient Admission Simulation

To mimic live hospital events, I generated a Python script that acts as an active patient record simulator.

  • Prerequisites: The local environment requires the Kafka library (pip install kafka-python).

  • Azure Event Hub: I spun up an Event Hubs Namespace (EH-namespace-pulse) and created an instance named EventHub-PulseStream with a partition count of 1 and a cleanup policy set to delete after a 2-hour retention window.



Copy the Event HUB configurations to the Python Simulator code and generate events.

Data Warehousing Workshop (Badge 1) – Snowflake (Issued: June 2026) 



Thursday, June 11, 2026

Design and Implement In-Database RAG Architecture with Azure SQL and Azure OpenAI

Implementing Retrieval-Augmented Generation (RAG) traditionally requires orchestrating complex external vector databases and separate middleware code. However, modern database engineering allows us to implement the entire RAG pipeline—from vector search to LLM generation—directly inside the database engine using T-SQL.

This post analyzes an enterprise-grade RAG pattern built for Adventure Works, leveraging Azure SQL Database and Azure OpenAI via Microsoft Foundry.

System Architecture Flow Diagram:


Step 1: Provision an Azure SQL Database

Login to Azure Portal and go to the Azure SQL Hub and then select Create SQL Database.

Wednesday, May 27, 2026

Architecting an E2E Clinical Data Pipeline using Azure Stack, Power BI

Modern healthcare data engineering demands robust, secure, and scalable architectures to transform raw patient data into structured, actionable intelligence. In this post, I will walk you through a production-ready cloud architecture designed to ingest, transform, clean, and visualize clinical data using the Microsoft Azure data ecosystem.

The High-Level Architecture

  1. Data Source: Raw text files are tracked in a git repository.

  2. Ingestion & Raw Storage: An Azure Data Factory (ADF) pipeline pulls files via HTTP and dumps them into Azure Data Lake Storage (ADLS Gen2).

  3. Data Transformation: ADF Data Flows process the files to clean missing parameters, perform schema mapping, and optimize attributes.

  4. Structured Storage: The enriched data is loaded to an optimized Azure SQL Database table.

  5. Visualization: Power BI Desktop establishes a connection to Azure SQL to render a live analytical dashboard for decision-makers



Phase 1: Environment Setup & Infrastructure Provisioning

Step 1: Establish the Version Control Repository

Before establishing cloud connections, upload your source clinical dataset (patients.csv) directly to a repository on GitHub. This acts as our persistent HTTP-accessible source layer.

Step 2: Creating the Resource Group in Azure

To simplify infrastructure life-cycle, log into the Azure Portal and provision a dedicated Resource Group. This acts as a single logical container holding every cloud service needed for this project.

Step 3 & 4: Provisioning the Storage and Processing Layer

Next, create the orchestration and raw storage engines:

  1. Azure Data Factory V2: Create a workspace (pharmahub-adf) to build our pipelines.

  2. Azure Data Lake Storage Gen2: Provision a standard-tier general-purpose v2 storage account (pharmahubdatalake) with locally redundant storage (LRS) to manage files efficiently.


Step 5 & 6: Deploying the Target Azure SQL Database

Deploy an Azure SQL Database (PharmaHUB) hosted on a new logical server (pharmaserver).

  • Workload Environment: Development (Serverless compute tier to reduce cost).

  • Networking Configuration: Enable Public endpoint access and ensure the rule "Allow Azure services and resources to access this server" is checked to grant Azure Data Factory an uninterrupted path to load data.

After deployment, open Azure Query editor and execute a test to confirm system accessibility:

SQL
SELECT * FROM sys.objects