i. Processed 1 Trillion Rows Using AWS Glue, PySpark, and Python
Scaling Data Engineering for Deterministic Results
When dealing with financial data, precision and determinism are non-negotiable. We architected a solution leveraging AWS Glue 4.0 and PySpark to manage extreme scale at speed, processing 1 trillion rows in just 55 minutes. This was not merely a speed challenge, but a precision mandate.
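For context, the sketch below shows the general shape of a Glue 4.0 PySpark job entry point for this kind of workload; the catalog database, table, column names, and S3 path are placeholders, not the actual job definition.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

# Standard Glue job bootstrap: resolve arguments and build the Spark/Glue contexts.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the partitioned source dataset registered in the Glue Data Catalog.
# "finance_db" and "transactions" are placeholder names.
source = glue_context.create_dynamic_frame.from_catalog(
    database="finance_db", table_name="transactions"
).toDF()

# Example transformation: a deterministic per-account, per-day aggregation.
aggregated = source.groupBy("account_id", "trade_date").agg(
    F.sum("amount").alias("total_amount")
)

# Write the result back to S3 as Parquet, partitioned by trade date (placeholder path).
aggregated.write.mode("overwrite").partitionBy("trade_date").parquet(
    "s3://example-bucket/output/transactions_agg/"
)

job.commit()
```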
ii. Engineering for Financial Precision
A critical component of this project involved safeguarding the integrity of sensitive financial calculations. We identified and addressed common pitfalls in high-volume processing:
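One representative pitfall of this kind (offered here as an illustration, not an exhaustive account) is accumulating monetary values in binary floating point, where rounding error compounds across billions of rows and can even vary with partition ordering. A minimal sketch of the safeguard, declaring amounts as fixed-precision decimals (schema and values are placeholders):

```python
from decimal import Decimal

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DecimalType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Declare monetary amounts as fixed-precision decimals up front. DoubleType would
# introduce binary floating-point rounding error; DecimalType keeps sums exact.
schema = StructType([
    StructField("account_id", StringType(), nullable=False),
    StructField("amount", DecimalType(38, 6), nullable=False),
])

rows = [
    ("A-1", Decimal("0.10")),
    ("A-1", Decimal("0.20")),
    ("A-2", Decimal("1000000000.01")),
]
df = spark.createDataFrame(rows, schema)

# Aggregations over DecimalType columns stay exact and reproducible.
totals = df.groupBy("account_id").agg(F.sum("amount").alias("total"))
totals.show(truncate=False)
```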
iii. Designing and Implementing the Data Mesh with Apache Iceberg
To ensure long-term agility and data ownership, we evolved the architecture from a central Data Lake to a decentralized Data Mesh. This shift provided domain-oriented data products for the organization:
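As an illustrative sketch of what a domain-owned data product looks like under this model (the catalog name, warehouse path, domain, and table are assumptions, and the Iceberg runtime with its AWS bundle is assumed to be available on the cluster):

```python
from pyspark.sql import SparkSession

# Configure a Spark session with an Iceberg catalog backed by AWS Glue.
# "glue_catalog" and the warehouse bucket are placeholder names.
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://example-warehouse/iceberg/")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate()
)

# Each domain owns its own namespace and publishes tables as data products.
spark.sql("CREATE NAMESPACE IF NOT EXISTS glue_catalog.payments_domain")

spark.sql("""
    CREATE TABLE IF NOT EXISTS glue_catalog.payments_domain.settlements (
        settlement_id STRING,
        account_id    STRING,
        amount        DECIMAL(38, 6),
        settled_at    TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(settled_at))
""")

# Downstream consumers address the product by its fully qualified name.
spark.table("glue_catalog.payments_domain.settlements").printSchema()
```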
iv. Data Integration and API Development
We extended the data lake’s capabilities by implementing robust, on-demand data retrieval services:
i. Global Data Ingestion Framework: Architecting a Scalable, Metadata-Driven Data Pipeline using Data Mesh on Azure
We designed and implemented a metadata-driven data pipeline to transition the organization to a Data Mesh architecture on Azure. This system unifies data ingestion and processing across multiple business groups, ensuring both extreme flexibility and stringent governance via the Medallion Architecture.
ii. The Dynamic Ingestion Framework (Azure Synapse)
To onboard any number of diverse data sources (with requirements for full-load, incremental, and CDC runs) without one-off development, we built a highly efficient orchestration layer:
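The orchestration itself is built from Synapse pipeline activities (a lookup over the metadata store feeding a parameterized loop), but the core pattern can be sketched in PySpark terms as below; the control-table name, columns, and load-type values are illustrative assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Control (metadata) table: one row per source describing how it should be ingested.
# All table and column names below are illustrative placeholders.
control = spark.table("ingestion_control").where("is_active = true").collect()

for src in control:
    reader = (
        spark.read.format("jdbc")
        .option("url", src["jdbc_url"])
        .option("dbtable", src["source_table"])
    )

    if src["load_type"] == "full":
        # Full load: replace the bronze table entirely.
        df = reader.load()
        mode = "overwrite"
    elif src["load_type"] == "incremental":
        # Incremental: filter on a watermark column recorded from the previous run.
        query = (
            f"(SELECT * FROM {src['source_table']} "
            f"WHERE {src['watermark_column']} > '{src['last_watermark']}') AS incr"
        )
        df = reader.option("dbtable", query).load()
        mode = "append"
    else:  # "cdc"
        # CDC: land the change feed as-is; merges are applied downstream in silver.
        df = reader.load()
        mode = "append"

    df.write.format("delta").mode(mode).saveAsTable(f"bronze.{src['target_table']}")
```

Adding a new source then means inserting a row into the control table rather than writing a new pipeline.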
iii. Unified Processing and Governance (Azure Databricks)
The entire transformation workflow is implemented using Azure Databricks Workflows governed by Delta Live Tables (DLT), consolidating all logic into just two highly adaptable notebooks.
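A condensed sketch of how one of these parameterized notebooks might look inside a DLT pipeline; the table names, landing path, and expectation are illustrative, and spark is provided by the Databricks notebook runtime:

```python
import dlt
from pyspark.sql import functions as F

# The notebook is parameterized so the same code serves every source and business group.
# "pipeline.source_name" is a hypothetical pipeline configuration key.
source_name = spark.conf.get("pipeline.source_name", "sales_orders")


@dlt.table(name=f"bronze_{source_name}", comment="Raw landed data, as ingested.")
def bronze():
    # Auto Loader incrementally picks up new files from the landing zone.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .load(f"abfss://landing@exampleaccount.dfs.core.windows.net/{source_name}/")
    )


@dlt.table(name=f"silver_{source_name}", comment="Validated, deduplicated records.")
@dlt.expect_or_drop("valid_key", "business_key IS NOT NULL")
def silver():
    # DLT expectations enforce data-quality rules as part of the Medallion flow.
    return (
        dlt.read_stream(f"bronze_{source_name}")
        .dropDuplicates(["business_key"])
        .withColumn("ingested_at", F.current_timestamp())
    )
```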
i. Migration of Hive Metastore Tables, ADF Pipelines and Azure Databricks Notebooks to Unity Catalog
Custom Migration of 10,000+ Tables, ADF Pipelines & Databricks Notebooks to Databricks Unity Catalog
We led a complex, high-stakes project to transition the organization’s core data infrastructure by migrating over 10,000 legacy tables and their dependent pipelines from an existing Hive Metastore to the secure, centralized governance model of Databricks Unity Catalog (UC).
This initiative evolved beyond a simple data transfer into a massive DevOps and Code Refactoring effort across the entire Azure data stack.
ii. The Migration Challenge: Building a Custom Solution
The native Unity Catalog migration tool (UCX) failed to execute the transfer because the legacy environment presented two key roadblocks: non-standardized external table paths and conflicting schema names. This required a custom, programmatic approach:
1. The Master Mapping Document: We first created a comprehensive Mapping Document—the single source of truth for the entire migration. This document precisely mapped the Hive Metastore Database/Table Name to the new Unity Catalog Name, UC Schema, and the standardized target UC External Table Path.
2. Custom Data Migration Scripting: Using this map, we executed a unified migration script (sketched after this list) designed to maximize efficiency for each table type:
Delta Tables (Deep Clone): For existing Delta tables, we used the Deep Clone operation. This was critical because it copies the table metadata and underlying data files to the new UC location in a single, consistent, high-speed operation with no reprocessing of the data. (A deep clone begins its own Delta history at the target, so the legacy table's full version history and time travel remain queryable at the source for audit purposes.)
Non-Delta Tables (Conversion): For legacy formats (like Parquet or ORC), the script performed a read-and-write transformation: it read the table data into a PySpark DataFrame, wrote it to the target UC external path (abfss://…/schema/table_name), and then created the final external table definition under Unity Catalog governance.
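A condensed sketch of this unified script, driven by rows from the Mapping Document; the mapping table and column names are illustrative, and the non-Delta branch is shown converting to Delta, which is one common choice:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each row of the Mapping Document drives one table migration.
# Table and column names here are illustrative placeholders.
mappings = spark.table("migration.mapping_document").collect()

for m in mappings:
    source = f"hive_metastore.{m['hive_database']}.{m['hive_table']}"
    target = f"{m['uc_catalog']}.{m['uc_schema']}.{m['uc_table']}"

    if m["table_format"] == "delta":
        # Delta tables: DEEP CLONE copies metadata and data files to the
        # standardized UC external path in one consistent operation.
        spark.sql(f"""
            CREATE OR REPLACE TABLE {target}
            DEEP CLONE {source}
            LOCATION '{m['uc_external_path']}'
        """)
    else:
        # Non-Delta formats (Parquet/ORC): read into a DataFrame, rewrite to the
        # target external path, then register the external table under UC.
        df = spark.read.table(source)
        df.write.format("delta").mode("overwrite").save(m["uc_external_path"])
        spark.sql(f"""
            CREATE TABLE IF NOT EXISTS {target}
            USING DELTA
            LOCATION '{m['uc_external_path']}'
        """)
```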
iii. Mass Code Refactoring and Pipeline Automation (1,000+ Assets)
The primary complexity was refactoring all downstream assets. We had to convert over 1,000 affected Databricks Notebooks and corresponding Azure Data Factory (ADF) pipeline definitions from the old Hive syntax to the new UC three-part naming convention (catalog.schema.table).
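A simplified sketch of how such automated rewriting can work against exported notebook and ADF pipeline sources; the mapping entries, file layout, and regex strategy are illustrative rather than the exact tooling used:

```python
import re
from pathlib import Path

# Mapping derived from the Master Mapping Document (entries are illustrative):
# old "hive_db.table" reference -> new "catalog.schema.table" reference.
RENAMES = {
    "sales_db.orders": "finance_prod.sales.orders",
    "sales_db.customers": "finance_prod.sales.customers",
}


def refactor_text(text: str) -> str:
    """Rewrite two-part Hive references to UC three-part names, longest first."""
    for old, new in sorted(RENAMES.items(), key=lambda kv: len(kv[0]), reverse=True):
        # Word boundaries keep e.g. "sales_db.orders_v2" from being rewritten.
        text = re.sub(rf"\b{re.escape(old)}\b", new, text)
    return text


def refactor_tree(root: str) -> None:
    """Apply the rewrite in place to exported notebooks and ADF pipeline JSON."""
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in {".py", ".sql", ".json"}:
            original = path.read_text(encoding="utf-8")
            updated = refactor_text(original)
            if updated != original:
                path.write_text(updated, encoding="utf-8")
                print(f"refactored: {path}")


if __name__ == "__main__":
    refactor_tree("./exported_workspace")
```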
This automated refactoring process successfully managed dependencies across thousands of files, eliminating manual errors, ensuring consistency, and rapidly enabling the new centralized governance model of Unity Catalog.
i. Performance Tuning for Terabyte-Scale Data Assets
I possess extensive experience in optimizing high-volume data workloads across both modern Delta Lake architectures and traditional relational databases, focusing on resource efficiency and query latency reduction for terabyte-scale financial data.
ii. Delta Lake Performance Optimization (Cloud)
For data stored in Delta tables within a cloud environment (such as Azure Databricks), tuning focuses on maximizing file organization and managing memory state to ensure the fastest possible processing and lowest scan cost.
We enforced strict in-memory persistence of hot intermediate DataFrames so that the data stays in fast memory, avoiding slow spillover to external storage (disk). Note that df.cache() on a DataFrame uses the MEMORY_AND_DISK storage level, so strictly memory-only retention calls for df.persist(StorageLevel.MEMORY_ONLY).
To guarantee that an intermediate DataFrame is fully materialized in the cache before the next transformation begins, we established a strict optimization sequence: df.repartition() -> df.cache() -> df.count().
Forcing an action (df.count()) materializes the pre-processed data so it is served directly from memory by subsequent steps (e.g., Step 4), which drastically reduces latency and compute cost for massive, multi-step workflows.
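A minimal sketch of this sequence on a hypothetical intermediate DataFrame (table names and the partition count are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical intermediate result produced by earlier steps of the workflow.
intermediate = spark.table("silver.transactions").filter("amount > 0")

# 1. Repartition to a sensible parallelism/key for the joins that follow.
# 2. Cache the result (cache() stores MEMORY_AND_DISK; use
#    persist(StorageLevel.MEMORY_ONLY) when spill to disk must be avoided).
# 3. Force an action so the cache is populated now rather than lazily later.
intermediate = intermediate.repartition(400, "account_id").cache()
row_count = intermediate.count()

# Subsequent steps read the materialized, pre-partitioned data from the cache.
enriched = intermediate.join(spark.table("silver.accounts"), "account_id")
```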
iii. Relational Database Optimization (SQL Server, Oracle)
For complex Stored Procedures and ETL processes involving traditional relational databases, the focus was on intelligent query rewriting and index optimization to minimize resource consumption: