Saturday, June 13, 2026

 

Building Project PulseStream: An End-to-End Real-Time Healthcare Data Pipeline on Azure

In my latest lab exercise, I designed and implemented Project PulseStream—an end-to-end real-time healthcare data ingestion and analytics pipeline.

The core business objective is to simulate real-time patient data streams to monitor hospital admissions, minimize patient waiting times, identify bottleneck departments (like the Emergency Room, Surgery, or ICU), and expose demographic insights using age and gender-based KPIs.

The full repository—complete with the simulator scripts, Spark notebooks, orchestration layouts, and SQL warehouse queries—is open-source and hosted on my GitHub:

👉 Project PulseStream GitHub Repository

Architecture & Data Flow

The pipeline follows a modern stream processing paradigm mapped to a classic Medallion Lakehouse structure:

  1. Data Ingestion: Local Python Kafka Simulator $\rightarrow$ Azure Event Hubs.

  2. Medallion Lakehouse Storage: Azure Data Lake Storage (ADLS Gen2) structured into bronze, silver, and gold zones.

  3. Stream Processing & Cleansing: Azure Databricks Spark compute integrated with Azure Key Vault for credential rotation.

  4. Orchestration: Azure Data Factory (ADF) pipeline triggers.

  5. Data Warehousing & BI: Serverless SQL Pools in Azure Synapse Analytics connecting directly to Power BI.

Real-Time Patient Admission Simulation

To mimic live hospital events, I generated a Python script that acts as an active patient record simulator.

  • Prerequisites: The local environment requires the Kafka library (pip install kafka-python).

  • Azure Event Hub: I spun up an Event Hubs Namespace (EH-namespace-pulse) and created an instance named EventHub-PulseStream with a partition count of 1 and a cleanup policy set to delete after a 2-hour retention window.



Copy the Event HUB configurations to the Python Simulator code and generate events.

Next major item is to Create Git Repository:


Setting Up Medallion Storage (ADLS Gen2):

  • /bronze: Captures raw binary payloads pushed from the Event Hub.

  • /silver: Holds cleansed, filtered, and parsed Parquet partitions.

  • /gold: Stores aggregated, business-level dimensional tables ready for analytics.

  • /synapseworkspace: Backing storage for our Synapse workspace.


Developing Spark Processing Notebooks

Using Azure Databricks (databricks-pulsestream) running a Spark 4.0.0 (Scala 2.13, Databricks Runtime 17.3 LTS) compute cluster, I built three logical processing stages:
  • Bronze Notebook: Subscribes to the Event Hub connection stream using the Spark-Kafka integration client, consumes live data packets, and writes them directly into /bronze.
  • Silver Notebook: Reads the raw stream files, parses the nested JSON string payloads, enforces schema schemas, and outputs clean columnar Parquet tables to /silver.
  • Gold Notebook: Consumes clean Silver Parquet files and structures them into highly optimized Dimensional and Fact tables within /gold.

Security Integration (Azure Key Vault & Databricks)

To avoid hardcoding storage access keys and connection strings in my PySpark notebooks, I deployed an Azure Key Vault named pharmavaulthub:

  1. RBAC Permissions: I assigned myself the Key Vault Administrator role to set up secrets.

  2. Secrets: I added two critical secrets—pulseeventconnstr (for the Event Hub connection) and adlskey (ADLS Access Key 1).

  3. Databricks Secret Scope: I mapped Key Vault directly to Databricks by navigating to https://<databricks-instance>/#secrets/createScope and pasting the Key Vault's DNS Name and Resource ID.

  4. Data Plane Access: I assigned the AzureDatabricks application the role of Key Vault Secrets User so the Spark cluster could programmatically pull credentials.

Now, my notebooks load connection keys dynamically at runtime:

dbutils.secrets.get(scope = "vaultscope", key = "keyname")

Next refer them in the databricks. To create a secret scope in Databricks using the user interface, navigate to https://<databricks-instance>/#secrets/createScope 


Copy the Azure key vault properties from key vault settings

Pipeline Orchestration in Azure Data Factory

Next is to Initialize Azure data factory to execute these notebooks.

Start with Linked services , dataset and create pipeline.



Test the trigger and Publish all. May require to run the event simulator again.


OLAP Data Warehousing with Azure Synapse:

Next is to start Azure Synapse datawarehouse for loading OLAP data.

Create SQL Pool and fact, Dim external tables. Queries are uploaded in the GitHub.


Next Task to start Power BI dashboard connecting to Synapse endpoints.

Interactive Power BI Dashboard

The final step was mapping our Synapse SQL endpoint directly into Power BI Desktop. By connecting directly to the Synapse views, I created an interactive, operational dashboard showing real-time statistics, admission rates, age/gender distributions, and immediate bottleneck alerts when specific department thresholds are breached.

https://github.com/ranjit78/pulse-stream-analytics


No comments: