Thursday, June 18, 2026

 Demystifying dbt Fundamentals: A Multi-Warehouse Hands-On Guide from Setup to Production DevOps

Whether you're migrating an existing data pipeline or building a modern data stack from scratch, dbt (Data Build Tool) has become the industry standard for managing the T (Transformation) layer in ELT pipelines. By bringing software engineering best practices—like modularity, testing, version control, and CI/CD—to SQL workflows, dbt bridges the gap between data analytics and software engineering.

Recently, I decided to dive deep into the dbt Fundamentals ecosystem. Instead of testing a single warehouse environment, I pushed the boundaries to connect dbt with three of the market's leading cloud data platforms: Snowflake, Google BigQuery, and Databricks.

Here is a complete step-by-step documentation of my hands-on journey, the architectural configurations I used, and how I took raw data all the way into a scheduled production environment.

1: Preparing Data Platforms & Project Initialization

The initial phase required setting up raw data tables across different cloud environments to mimic production ingestion

Snowflake Setup: SQL worksheet to load data using S3 bucket

Next is to use dbt cloud interface, build snowflake connection, link GitHub repo and initialize project

Saturday, June 13, 2026

 

Building Project PulseStream: An End-to-End Real-Time Healthcare Data Pipeline on Azure

In my latest lab exercise, I designed and implemented Project PulseStream—an end-to-end real-time healthcare data ingestion and analytics pipeline.

The core business objective is to simulate real-time patient data streams to monitor hospital admissions, minimize patient waiting times, identify bottleneck departments (like the Emergency Room, Surgery, or ICU), and expose demographic insights using age and gender-based KPIs.

The full repository—complete with the simulator scripts, Spark notebooks, orchestration layouts, and SQL warehouse queries—is open-source and hosted on my GitHub:

👉 Project PulseStream GitHub Repository

Architecture & Data Flow

The pipeline follows a modern stream processing paradigm mapped to a classic Medallion Lakehouse structure:

  1. Data Ingestion: Local Python Kafka Simulator $\rightarrow$ Azure Event Hubs.

  2. Medallion Lakehouse Storage: Azure Data Lake Storage (ADLS Gen2) structured into bronze, silver, and gold zones.

  3. Stream Processing & Cleansing: Azure Databricks Spark compute integrated with Azure Key Vault for credential rotation.

  4. Orchestration: Azure Data Factory (ADF) pipeline triggers.

  5. Data Warehousing & BI: Serverless SQL Pools in Azure Synapse Analytics connecting directly to Power BI.

Real-Time Patient Admission Simulation

To mimic live hospital events, I generated a Python script that acts as an active patient record simulator.

  • Prerequisites: The local environment requires the Kafka library (pip install kafka-python).

  • Azure Event Hub: I spun up an Event Hubs Namespace (EH-namespace-pulse) and created an instance named EventHub-PulseStream with a partition count of 1 and a cleanup policy set to delete after a 2-hour retention window.



Copy the Event HUB configurations to the Python Simulator code and generate events.

Data Warehousing Workshop (Badge 1) – Snowflake (Issued: June 2026) 



Thursday, June 11, 2026

Design and Implement In-Database RAG Architecture with Azure SQL and Azure OpenAI

Implementing Retrieval-Augmented Generation (RAG) traditionally requires orchestrating complex external vector databases and separate middleware code. However, modern database engineering allows us to implement the entire RAG pipeline—from vector search to LLM generation—directly inside the database engine using T-SQL.

This post analyzes an enterprise-grade RAG pattern built for Adventure Works, leveraging Azure SQL Database and Azure OpenAI via Microsoft Foundry.

System Architecture Flow Diagram:


Step 1: Provision an Azure SQL Database

Login to Azure Portal and go to the Azure SQL Hub and then select Create SQL Database.