YouTube Trending Data Pipeline on Azure

Data Bootcamp · Cloud Data Engineering · 8 min read

An end-to-end serverless Azure data pipeline that ingests YouTube trending data across 5+ global regions, processes it through a three-tier medallion lakehouse (Bronze / Silver / Gold), and serves the results to Power BI through DirectQuery. Access is secured entirely through Managed Identity, with zero hardcoded credentials.

Figure 1: End-to-end pipeline architecture

Regions Ingested: 5+
Videos per Run: 250+
Lakehouse Layers: 3
Idle Cost: ~$0

The Challenge

Building a production-grade YouTube analytics pipeline required solving three interrelated problems: (1) securely accessing the YouTube Data API from a serverless function without exposing credentials in code or environment variables; (2) designing a scalable landing zone that partitions data by region and date for efficient downstream processing; and (3) establishing a clean medallion lakehouse architecture on ADLS Gen2 that enforces data quality gates between the Bronze (raw), Silver (cleaned), and Gold (analytics-ready) layers. All of this had to keep costs near zero during idle periods.

The Solution

We built the ingestion layer as an Azure Function (Python 3.11) that authenticates to the YouTube Data API using a key retrieved at runtime from Azure Key Vault via Managed Identity, so no secrets live in code or config. Raw trending video JSON lands in ADLS Gen2 Bronze containers partitioned by region, date, and hour. Synapse Spark notebooks handle the Bronze-to-Silver transformation (schema enforcement, null handling, Parquet output) and the Silver-to-Gold aggregations. Azure Data Factory orchestrates the full daily pipeline with parallel per-region ingestion, Synapse activity chaining, and Logic Apps notifications on success or failure. Synapse Serverless SQL exposes the Gold Parquet files as views that Power BI queries via DirectQuery.

Pipeline Architecture

1
Ingestion

YouTube API Ingestion

Azure Functions (Python 3.11) expose two HTTP-triggered routes — one for trending videos, one for category reference data. On each invocation, the function retrieves the YouTube API key from Azure Key Vault via Managed Identity at runtime. No credentials are stored in code, config, or environment variables.

Azure Functions · Python 3.11 · YouTube Data API v3 · Azure Key Vault
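
The ingestion step might look roughly like the sketch below. It assumes the `azure-identity` and `azure-keyvault-secrets` packages are available in the Function app. The helper names, vault URL, and secret name are hypothetical, and the request parameters follow the public YouTube Data API v3 `videos.list` endpoint rather than the project's actual code.

```python
from urllib.parse import urlencode

YOUTUBE_VIDEOS_ENDPOINT = "https://www.googleapis.com/youtube/v3/videos"


def get_youtube_api_key(vault_url: str, secret_name: str) -> str:
    """Fetch the API key from Key Vault using the Function's managed identity."""
    # Imported lazily so the URL helper below works without the Azure SDK.
    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient

    # DefaultAzureCredential resolves to the managed identity when running in Azure.
    client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())
    return client.get_secret(secret_name).value


def build_trending_url(region: str, api_key: str, max_results: int = 50) -> str:
    """Build a videos.list request for the most popular videos in one region."""
    params = {
        "part": "snippet,statistics",
        "chart": "mostPopular",
        "regionCode": region,
        "maxResults": max_results,  # the API allows at most 50 per page
        "key": api_key,
    }
    return f"{YOUTUBE_VIDEOS_ENDPOINT}?{urlencode(params)}"
```

A call such as `build_trending_url("US", get_youtube_api_key(vault_url, "youtube-api-key"))` would produce the request URL for one region; the secret name here is only a placeholder.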
2
Bronze

Raw Landing Zone

Raw JSON responses land in ADLS Gen2 Bronze containers partitioned by region, date, and hour. This immutable landing zone preserves the original API payload exactly as received, so any run can be fully reprocessed later without calling the API again or spending additional quota.

ADLS Gen2 · Managed Identity · JSON · RBAC
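
To make the region / date / hour partitioning concrete, here is a small sketch of how a blob path could be derived. The Hive-style `key=value` folder names and the `trending.json` file name are assumptions for illustration; the source does not specify the exact layout.

```python
from datetime import datetime, timezone


def bronze_blob_path(region: str, fetched_at: datetime, name: str = "trending.json") -> str:
    """Return a partitioned path of the form region=/date=/hour=/file."""
    # Normalise to UTC so every region is partitioned on the same clock.
    # Pass a timezone-aware datetime; naive values are treated as local time.
    ts = fetched_at.astimezone(timezone.utc)
    return (
        f"region={region.upper()}/"
        f"date={ts:%Y-%m-%d}/"
        f"hour={ts:%H}/"
        f"{name}"
    )
```

Because the path depends only on the region and the fetch timestamp, re-running a transformation for one region and day is a matter of reading a single folder prefix.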
3
Silver

Data Cleansing & Normalization

Synapse Spark notebooks read Bronze JSON, enforce schema, parse nested fields (tags, thumbnails, statistics), handle nulls and duplicates, and write cleansed Parquet files to the Silver layer partitioned by region and ingestion date. Rejected records are written to a separate reject log.

Azure Synapse Spark · PySpark · Parquet · Delta Lake
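
The record-level rules behind this quality gate can be illustrated in plain Python. This is only a sketch of the logic, not the Spark notebook itself: the required fields, the output column names, and the choice to send duplicates to the reject log are assumptions. The input field names (`snippet`, `statistics`, `viewCount`, and so on) follow the YouTube Data API v3 video resource, which returns counts as strings.

```python
REQUIRED_FIELDS = ("id", "snippet", "statistics")  # assumed required keys


def to_int(value):
    """The API returns counts as strings; missing counts become None."""
    return int(value) if value is not None else None


def cleanse(items, region):
    """Split raw API items into flattened clean rows and rejected records."""
    clean, rejects, seen = [], [], set()
    for item in items:
        missing = [f for f in REQUIRED_FIELDS if not item.get(f)]
        if missing:
            rejects.append({"record": item, "reason": f"missing: {', '.join(missing)}"})
            continue
        if item["id"] in seen:
            rejects.append({"record": item, "reason": "duplicate id"})
            continue
        seen.add(item["id"])
        snippet, stats = item["snippet"], item["statistics"]
        clean.append({
            "video_id": item["id"],
            "region": region,
            "title": snippet.get("title"),
            "category_id": snippet.get("categoryId"),
            "tags": snippet.get("tags") or [],  # tags are optional in the API
            "view_count": to_int(stats.get("viewCount")),
            "like_count": to_int(stats.get("likeCount")),
            "comment_count": to_int(stats.get("commentCount")),
        })
    return clean, rejects
```

In the notebook, the same rules would be expressed as DataFrame operations, with the clean rows written as Parquet and the rejects written to the reject log.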
4
Gold

Analytics Aggregations

Gold-layer notebooks aggregate Silver Parquet into analytics-ready fact tables: trending rankings by region, channel performance metrics, category distribution, and engagement rate calculations across all five ingested regions — ready for direct BI consumption.

Azure Synapse Spark · PySpark · SQL
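
As an illustration of the kind of aggregation a Gold notebook performs, the sketch below computes an engagement rate and a per-category summary in plain Python. The formula, (likes + comments) / views, is an assumed definition since the source does not state one, and the row keys are hypothetical Silver column names.

```python
from collections import defaultdict


def engagement_rate(views, likes, comments):
    """(likes + comments) / views; None when views are missing or zero."""
    if not views:
        return None
    return ((likes or 0) + (comments or 0)) / views


def category_summary(rows):
    """Aggregate rows into per-(region, category) counts and mean engagement."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["region"], row["category_id"])].append(
            engagement_rate(row["view_count"], row["like_count"], row["comment_count"])
        )
    summary = []
    for (region, category_id), rates in sorted(groups.items()):
        valid = [r for r in rates if r is not None]
        summary.append({
            "region": region,
            "category_id": category_id,
            "video_count": len(rates),
            "avg_engagement_rate": sum(valid) / len(valid) if valid else None,
        })
    return summary
```

Returning `None` rather than zero for videos without view counts keeps missing data from dragging the category average down.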
5
Serving

Synapse Serverless SQL Views

Synapse Serverless SQL exposes Gold Parquet files through external tables and OPENROWSET-based views, allowing analysts to query trending data with standard T-SQL without provisioning or paying for a dedicated SQL pool. Views enforce consistent column naming across all regions.

Synapse Serverless SQL · T-SQL · OPENROWSET · External Tables
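
A view over the Gold files could be generated along the lines of the sketch below, which renders the T-SQL from Python so that each Gold table gets the same shape. The view name, storage URL, and `region=` folder layout are hypothetical. The statement relies on `OPENROWSET(BULK ..., FORMAT = 'PARQUET')` and the `filepath()` function, which returns the folder segment matched by the first wildcard.

```python
def gold_view_sql(view_name: str, storage_url: str, table_folder: str) -> str:
    """Render a CREATE VIEW statement over partitioned Gold Parquet files."""
    # Assumed layout: <storage_url>/<table_folder>/region=<code>/<files>.parquet
    bulk_path = f"{storage_url.rstrip('/')}/{table_folder.strip('/')}/region=*/*.parquet"
    return (
        f"CREATE VIEW {view_name} AS\n"
        f"SELECT *, r.filepath(1) AS region\n"
        f"FROM OPENROWSET(\n"
        f"    BULK '{bulk_path}',\n"
        f"    FORMAT = 'PARQUET'\n"
        f") AS r;"
    )
```

This assumes the region lives only in the folder name, as it does when Spark writes with `partitionBy`. If the files also carry a `region` column, the extra `filepath(1)` column should be dropped or renamed. The function interpolates its arguments directly, so it suits deployment scripts with trusted inputs, not user-supplied values.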
6
Orchestration

ADF Pipeline Orchestration

Azure Data Factory orchestrates the full pipeline on a daily schedule. Parallel ingestion activities run per region simultaneously, followed by chained Synapse notebook activities for each layer. Logic Apps fire success or failure email and webhook notifications at pipeline completion.

Azure Data Factory · Logic Apps · ADF Triggers · Webhooks
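
The fan-out-then-chain shape of the pipeline can be sketched as a Python function that builds a fragment of ADF pipeline JSON. This is an illustrative outline under assumptions, not the project's definition: the pipeline, activity, function, and notebook names are hypothetical; linked services, the request body, the trigger, and the Logic Apps notification step are omitted; and the activity type strings should be checked against an exported pipeline.

```python
def depends_on(activity_name):
    """ADF dependency entry: run only after the named activity succeeds."""
    return [{"activity": activity_name, "dependencyConditions": ["Succeeded"]}]


def build_pipeline(regions, notebooks):
    """Return a pipeline fragment: a parallel ForEach, then a notebook chain."""
    activities = [{
        "name": "IngestRegions",
        "type": "ForEach",
        "typeProperties": {
            "isSequential": False,  # iterate over regions in parallel
            "items": {"value": "@pipeline().parameters.regions", "type": "Expression"},
            "activities": [{
                "name": "CallTrendingFunction",
                "type": "AzureFunctionActivity",
                # Hypothetical function name; the region would be passed per item.
                "typeProperties": {"functionName": "trending", "method": "POST"},
            }],
        },
    }]
    previous = "IngestRegions"
    for notebook in notebooks:
        # Each notebook waits for the previous stage to succeed.
        activities.append({
            "name": notebook,
            "type": "SynapseNotebook",
            "dependsOn": depends_on(previous),
            "typeProperties": {
                "notebook": {"referenceName": notebook, "type": "NotebookReference"}
            },
        })
        previous = notebook
    return {
        "name": "youtube_trending_daily",
        "properties": {
            "parameters": {"regions": {"type": "Array", "defaultValue": list(regions)}},
            "activities": activities,
        },
    }
```

Calling `build_pipeline(["US", "GB", "CA", "IN", "PK"], ["bronze_to_silver", "silver_to_gold"])` yields a structure in which ingestion fans out per region and each notebook starts only after the stage before it has succeeded.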
7
Serving

Power BI Dashboards

Power BI connects to the Synapse Serverless SQL views via DirectQuery and renders dashboards for regional trending rankings, category insights, and engagement analytics. Because DirectQuery reads the views at query time, the dashboards show the output of the latest daily pipeline run, so stakeholders see updated global YouTube trend data each morning.

Power BI · DirectQuery · Synapse SQL Connector

The Results

The pipeline runs daily across five regions (US, GB, CA, IN, PK), ingesting 250+ trending videos per run with full category reference data. The serverless architecture auto-scales on demand and incurs near-zero cost when idle. The medallion architecture ensures malformed or incomplete API responses never propagate to analytical layers, and Silver reject logs capture all dropped records for auditability. Power BI dashboards refresh in under 30 seconds against Synapse Serverless SQL, giving the analytics team up-to-date visibility into global YouTube trending patterns each morning.

Technologies Used

Azure Functions · Python 3.11 · YouTube Data API v3 · ADLS Gen2 · Azure Key Vault · Managed Identity · Microsoft Entra ID · Azure Synapse Spark · PySpark · Synapse Serverless SQL · Azure Data Factory · Logic Apps · Power BI · Application Insights

Project Screenshots

Screenshots 1-5: pipeline infrastructure and execution views.
