Expertise

Use Case 1

## Processed 1 Trillion Rows Using AWS Glue, PySpark, and Python

### Scaling Data Engineering for Deterministic Results

When dealing with financial data, precision and determinism are non-negotiable. We architected a solution on AWS Glue 4.0 and PySpark to manage extreme scale and speed, processing 1 trillion rows in 55 minutes. This was not merely a speed challenge but a precision mandate.

### Engineering for Financial Precision

A critical component of this project involved safeguarding the integrity of sensitive financial calculations. We identified and addressed common pitfalls in high-volume processing:

  • Eliminating Floating-Point Noise: We mandated the use of DECIMAL data types across all processing pipelines instead of standard floating-point types (FLOAT or DOUBLE). This seemingly small change eliminated non-deterministic rounding errors and ensured that financial aggregations were exact and auditable (see the sketch after this list).
  • Traceability and Auditability: The deterministic pipeline ensures that every run produces the identical, precise result, which is crucial for compliance and regulatory reporting in the financial sector.
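
As a minimal illustration (the table shape and column names here are hypothetical, not taken from the actual pipeline), casting amounts to DecimalType in PySpark keeps aggregations exact and deterministic:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DecimalType

spark = SparkSession.builder.appName("decimal-demo").getOrCreate()

# Hypothetical transaction amounts; in the real pipeline these arrive from source systems.
df = spark.createDataFrame(
    [("t1", "0.10"), ("t2", "0.20"), ("t3", "0.30")],
    ["txn_id", "amount_str"],
)

# Casting to DECIMAL(38, 2) gives exact, repeatable aggregation results,
# whereas DOUBLE sums can accumulate floating-point noise at scale.
exact = df.withColumn("amount", F.col("amount_str").cast(DecimalType(38, 2)))
exact.agg(F.sum("amount").alias("total")).show()  # exactly 0.60
```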

### Designing and Implementing the Data Mesh with Apache Iceberg

To ensure long-term agility and data ownership, we evolved the architecture from a central Data Lake to a decentralized Data Mesh. This shift provided domain-oriented data products for the organization:

  • Data Mesh Foundation: The data domains were implemented using Apache Iceberg tables. Iceberg provides transactional capabilities for data lakes, allowing for schema evolution, hidden partitioning, and the powerful time-travel features necessary for financial record keeping (see the sketch after this list).
  • ADLS Gen2 Integration: The entire data flow from the various sources (SQL, Oracle, Cosmos DB, APIs, SharePoint, and ADLS) was standardized, with Azure Data Lake Storage Gen2 serving as the foundation for the Bronze/Silver/Gold layers.
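
For illustration only, here is a minimal PySpark sketch of the Iceberg pattern. It assumes the Iceberg Spark runtime JAR is on the classpath and uses a hypothetical catalog name, namespace, and local warehouse path (the real deployment targets ADLS Gen2):

```python
from pyspark.sql import SparkSession

# Hypothetical "finance" catalog backed by a local Hadoop warehouse for demonstration.
spark = (
    SparkSession.builder.appName("iceberg-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.finance", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.finance.type", "hadoop")
    .config("spark.sql.catalog.finance.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS finance.ledger")

# Hidden partitioning: consumers never need to know the physical partition layout.
spark.sql("""
    CREATE TABLE IF NOT EXISTS finance.ledger.transactions (
        txn_id STRING, amount DECIMAL(38, 2), txn_ts TIMESTAMP
    ) USING iceberg
    PARTITIONED BY (days(txn_ts))
""")

# Schema evolution is a metadata-only operation.
spark.sql("ALTER TABLE finance.ledger.transactions ADD COLUMN currency STRING")

# Each commit creates a snapshot; time travel queries use VERSION AS OF <snapshot_id>.
spark.sql("SELECT snapshot_id, committed_at FROM finance.ledger.transactions.snapshots").show()
```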

### Data Integration and API Development

We extended the data lake’s capabilities by implementing robust, on-demand data retrieval services:

  • PySpark API Development: Using PySpark within AWS Glue, we developed production-ready APIs to retrieve and incorporate crucial external data, such as real-time weather information and structured insurance-related records. This allowed us to enrich existing financial datasets without complex, brittle ETL scripts (a simplified sketch follows).
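
A simplified sketch of the enrichment pattern; the endpoint URL, response fields, table names, and join key below are placeholders rather than the real service:

```python
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("weather-enrichment").getOrCreate()

def fetch_weather(city: str) -> dict:
    """Call a hypothetical weather API and return a flat record for one city."""
    resp = requests.get("https://api.example.com/v1/weather",  # placeholder URL
                        params={"city": city}, timeout=10)
    resp.raise_for_status()
    body = resp.json()
    return {"city": city, "temp_c": body.get("temp_c"), "conditions": body.get("conditions")}

# Pull weather for the cities present in the financial dataset, then join it in as a dimension.
cities = ["London", "New York"]
weather_df = spark.createDataFrame([fetch_weather(c) for c in cities])

claims_df = spark.table("silver.insurance_claims")          # placeholder existing table
enriched = claims_df.join(weather_df, on="city", how="left")
enriched.write.mode("overwrite").saveAsTable("gold.claims_with_weather")
```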

This framework demonstrates expertise in building high-performance, compliant data platforms ready for modern architectural demands.

Use Case 2

## Architecting a Scalable, Metadata-Driven Data Pipeline using Data Mesh on Azure

We designed and implemented a metadata-driven data pipeline to transition the organization to a Data Mesh architecture on Azure. This system unifies data ingestion and processing across multiple business groups, ensuring both extreme flexibility and stringent governance via the Medallion Architecture.

### The Dynamic Ingestion Framework (Azure Synapse)

To manage an arbitrary number of diverse data sources (with requirements for full load, incremental, and CDC runs) without one-off development, we built a highly efficient orchestration layer:

  • Metadata Control: The entire process is dictated by a central Metadata Control Table. This table determines the connection details, source object, and load type for every ingestion job.
  • Master/Child Pipelines: The architecture relies on one Master Azure Synapse Pipeline that reads the metadata and dispatches the work to a single, reusable Child Pipeline.
  • Universal Connector for Parallelism: Crucially, the Master Pipeline uses a ForEach activity configured for parallel execution, so the single Child Pipeline can be dynamically triggered for multiple tables simultaneously. When we pull multiple source tables, they run in parallel, maximizing throughput and shrinking the overall processing window (a simplified Python analogue follows this list).
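
The production orchestration is built entirely from Azure Synapse pipeline activities; the snippet below is only a small Python analogue of the pattern, with illustrative metadata rows and a stub child routine, to make the Master/Child dispatch and the parallel ForEach behaviour concrete:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative rows from the Metadata Control Table: one row per ingestion job.
control_table = [
    {"source": "sqlserver", "object": "dbo.Customers", "load_type": "full"},
    {"source": "oracle",    "object": "FIN.LEDGER",    "load_type": "incremental"},
    {"source": "cosmosdb",  "object": "orders",        "load_type": "cdc"},
]

def run_child_pipeline(job: dict) -> str:
    """Stand-in for the single reusable Child Pipeline: connect, extract, land to the lake."""
    # In the real system this is a parameterized Synapse Copy activity driven by the metadata row.
    return f"ingested {job['object']} from {job['source']} ({job['load_type']})"

# Stand-in for the Master Pipeline's ForEach activity with parallel execution enabled.
with ThreadPoolExecutor(max_workers=8) as pool:
    for result in pool.map(run_child_pipeline, control_table):
        print(result)
```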

### Unified Processing and Governance (Azure Databricks)

The entire transformation workflow is implemented using Azure Databricks Workflows governed by Delta Live Tables (DLT), consolidating all logic into just two highly adaptable notebooks.

| Layer | Technology | Key Functionality |
| :--- | :--- | :--- |
| Bronze | Single notebook (Auto Loader) | The dedicated Bronze notebook utilizes Auto Loader to reliably and incrementally copy data from the Landing Zone into the Bronze Delta tables, automatically handling schema evolution. This single notebook dynamically processes any source data passed to it. |
| Silver | Single notebook (dynamic DQ/RDM) | The Silver notebook applies core business logic, including comprehensive Data Quality (DQ) and Reference Data Management (RDM) rules. It reads from the Bronze layer and dynamically transforms the data based on configuration rules tied to the table name. |
| Gold | Delta Live Tables (SCD Type 2) | The final layer focuses on creating optimized consumption data products (Fact and Dimension tables). It uses Delta Live Tables with an SCD Type 2 implementation to maintain robust historical records. |
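
For illustration, here is a condensed sketch of the Bronze and Gold patterns from the table above, written against the Delta Live Tables Python API as commonly documented. Paths, table names, and keys are placeholders, spark and dlt are provided by the DLT runtime, and the real Gold layer reads from the Silver table rather than directly from Bronze:

```python
import dlt
from pyspark.sql import functions as F

# Placeholder landing-zone path; in the real design, paths come from configuration.
LANDING = "abfss://landing@account.dfs.core.windows.net/customers/"

# --- Bronze pattern: Auto Loader incrementally picks up new files and evolves the schema.
#     (DLT manages the checkpoint and schema locations automatically.)
@dlt.table(name="bronze_customers")
def bronze_customers():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load(LANDING)
        .withColumn("_ingested_at", F.current_timestamp())
    )

# --- Gold pattern: SCD Type 2 history maintained declaratively with apply_changes.
dlt.create_streaming_table("gold_dim_customer")

dlt.apply_changes(
    target="gold_dim_customer",
    source="bronze_customers",        # the production pipeline sources this from the Silver table
    keys=["customer_id"],             # placeholder business key
    sequence_by=F.col("_ingested_at"),
    stored_as_scd_type=2,             # keeps history via __START_AT / __END_AT columns
)
```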

This modular, single-notebook approach allows for rapid onboarding of new data domains, dramatically accelerating development cycles while maintaining consistency and high data quality. The resulting Gold layer directly feeds Power BI dashboards, providing the business with timely and trustworthy data for critical decision-making.

Use Case 3

## Custom Migration of 10,000+ Tables, ADF Pipelines, and Databricks Notebooks to Databricks Unity Catalog

We led a complex, high-stakes project to transition the organization’s core data infrastructure by migrating over 10,000 legacy tables and their dependent pipelines from an existing Hive Metastore to the secure, centralized governance model of Databricks Unity Catalog (UC).

This initiative evolved beyond a simple data transfer into a massive DevOps and Code Refactoring effort across the entire Azure data stack.

### The Migration Challenge: Building a Custom Solution

The native Unity Catalog migration tool (UCX) failed to execute the transfer because the legacy environment presented two key roadblocks: non-standardized external table paths and conflicting schema names. This required a custom, programmatic approach:

1. The Master Mapping Document: We first created a comprehensive Mapping Document—the single source of truth for the entire migration. This document precisely mapped the Hive Metastore Database/Table Name to the new Unity Catalog Name, UC Schema, and the standardized target UC External Table Path.
2. Custom Data Migration Scripting: Using this map, we executed a unified migration script designed to maximize efficiency for each table type (a condensed sketch follows the two cases below):

  • Delta Tables (Deep Clone): For existing Delta tables, we used the Deep Clone operation, which atomically replicates the table's data files and metadata to the new UC location, ensuring a high-speed, repeatable transfer. Because the legacy source tables were left untouched, their full Delta version history and time-travel capability remained available for audit throughout the migration (a clone itself starts a fresh Delta history from the point of cloning).
  • Non-Delta Tables (Conversion): For legacy formats (such as Parquet or ORC), the script performed a read-and-write conversion: it read the table data into a PySpark DataFrame, wrote it to the target UC external path (abfss://…/schema/table_name), and then created the final external table definition under Unity Catalog governance.
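
A condensed sketch of that unified script; the mapping row, catalog, schema, target format, and storage path below are placeholders, and the real script iterated over the full Mapping Document:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("uc-migration").getOrCreate()

# One illustrative row from the Mapping Document (the real document covers 10,000+ tables).
mapping = [{
    "hive_db": "finance_db", "hive_table": "ledger",
    "uc_catalog": "prod", "uc_schema": "finance",
    "uc_path": "abfss://data@account.dfs.core.windows.net/finance/ledger",  # placeholder
    "format": "delta",
}]

for m in mapping:
    src = f"hive_metastore.{m['hive_db']}.{m['hive_table']}"
    tgt = f"{m['uc_catalog']}.{m['uc_schema']}.{m['hive_table']}"
    path = m["uc_path"]

    if m["format"] == "delta":
        # Deep Clone atomically copies the data files and metadata to the UC external location;
        # the untouched source table keeps its full Delta history.
        spark.sql(f"CREATE OR REPLACE TABLE {tgt} DEEP CLONE {src} LOCATION '{path}'")
    else:
        # Legacy Parquet/ORC: read, rewrite to the UC external path (assumed Delta here), register.
        spark.read.table(src).write.format("delta").mode("overwrite").save(path)
        spark.sql(f"CREATE TABLE IF NOT EXISTS {tgt} USING DELTA LOCATION '{path}'")
```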

### Mass Code Refactoring and Pipeline Automation (1,000+ Assets)

The primary complexity was refactoring all downstream assets. We had to convert over 1,000 affected Databricks Notebooks and corresponding Azure Data Factory (ADF) pipeline definitions from the old Hive syntax to the new UC three-part naming convention (catalog.schema.table).

  • ADLS/GitOps Integration: We downloaded all existing ADF pipeline definitions (stored in JSON format) and staged them alongside the Databricks notebook code in an ADLS location.
  • Regex-Powered Conversion: We wrote a Python script that used the Mapping Document and regular expressions (regex) to automate the code changes (see the sketch after this list): it iterated through every notebook file and ADF JSON definition, identified instances of the old Hive Metastore database and table names, and replaced each legacy reference with the equivalent Unity Catalog name, schema, and path from the Mapping Document.
  • Automated Deployment: After successful conversion, the script automatically redeployed the modified assets: the updated ADF pipeline JSON was pushed back to Azure Data Factory, and the converted Databricks notebook code was updated in the workspace.
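
A simplified sketch of the conversion pass; the mapping entries and staging directory are hypothetical, and the real script also pushed the updated ADF JSON and notebooks back through the Azure and Databricks APIs:

```python
import re
from pathlib import Path

# Hypothetical entries from the Mapping Document: hive db.table -> UC catalog.schema.table
mapping = {
    "finance_db.ledger": "prod.finance.ledger",
    "sales_db.orders": "prod.sales.orders",
}

# Pre-compile one pattern per legacy name; word boundaries avoid partial matches.
patterns = {old: re.compile(rf"\b{re.escape(old)}\b") for old in mapping}

def convert_text(text: str) -> str:
    """Rewrite legacy Hive Metastore references to Unity Catalog three-part names."""
    for old, pattern in patterns.items():
        text = pattern.sub(mapping[old], text)
    return text

# Apply to the staged notebook exports and ADF pipeline JSON (placeholder directory).
for path in Path("./staged_assets").rglob("*"):
    if path.suffix in {".py", ".sql", ".json"}:
        path.write_text(convert_text(path.read_text()))
```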

This automated refactoring process successfully managed dependencies across thousands of files, eliminating manual errors, ensuring consistency, and rapidly enabling the new centralized governance model of Unity Catalog.

Use Case 4

## Performance Tuning for Terabyte-Scale Data Assets

I possess extensive experience in optimizing high-volume data workloads across both modern Delta Lake architectures and traditional relational databases, focusing on resource efficiency and query latency reduction for terabyte-scale financial data.

### Delta Lake Performance Optimization (Cloud)

For data stored in Delta tables within a cloud environment (such as Azure Databricks), tuning focuses on maximizing file organization and managing memory state to ensure the fastest possible processing and lowest scan cost.

  • Adaptive Clustering (Liquid Clustering) & Data Skipping: We moved beyond fixed, static partitioning schemes and implemented Liquid Clustering on large, frequently queried Delta tables; clustering keys can be adjusted as query patterns evolve without rewriting the table. Combined with Delta Lake's data-skipping statistics, this ensured queries scanned only the necessary files, achieving orders-of-magnitude reductions in query execution time (see the sketch after this section).
  • Advanced In-Memory Trillion-Row Caching: For processing pipelines involving multiple complex, chained calculations (e.g., Step 1 to Step 4) over massive datasets (e.g., trillion rows), intermediate caching becomes crucial. Reprocessing steps 1 through 3 for every run is prohibitively expensive and time-consuming at this scale.

We enforced strict in-memory persistence by persisting intermediate DataFrames with StorageLevel.MEMORY_ONLY rather than relying on the default df.cache() (which is equivalent to persist with MEMORY_AND_DISK), ensuring the hot data stays in fast memory rather than spilling to external storage (disk).

To guarantee that each intermediate DataFrame is fully computed and held in the cache before the next transformation begins, we established a strict sequence: df.repartition() -> df.persist() -> df.count().
Persisting first and then forcing an action (df.count()) materializes the repartitioned data into the cache, so it is available instantly for subsequent steps (e.g., Step 4), drastically reducing latency and compute cost for massive, multi-step workflows (see the sketch below).
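
A minimal sketch of the clustering setup and the repartition -> persist -> count sequence described above; the table, columns, and partition count are placeholders, and the clustering DDL assumes a Databricks runtime that supports liquid clustering:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("perf-tuning").getOrCreate()

# Liquid clustering on a large Delta table, replacing a static partitioning scheme.
spark.sql("ALTER TABLE gold.fact_positions CLUSTER BY (trade_date, account_id)")  # placeholder names
spark.sql("OPTIMIZE gold.fact_positions")  # incrementally clusters existing data

# Multi-step pipeline: materialize the expensive intermediate result exactly once.
step3_df = (
    spark.table("gold.fact_positions")     # stands in for Steps 1-3 of the real pipeline
    .repartition(2000, "account_id")       # right-size partitions before persisting
    .persist(StorageLevel.MEMORY_ONLY)     # pin in memory; the default cache() may spill to disk
)
step3_df.count()                           # action forces full materialization into the cache

# Step 4 (and any later steps) reuse the cached intermediate instead of recomputing Steps 1-3.
step4_df = step3_df.groupBy("trade_date").sum("notional")
```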

### Relational Database Optimization (SQL Server, Oracle)

For complex Stored Procedures and ETL processes involving traditional relational databases, the focus was on intelligent query rewriting and index optimization to minimize resource consumption:

  • Strategic Join Condition Re-sequencing: I specialized in diagnosing and resolving performance bottlenecks in complex stored procedures. The core strategy involved ensuring the database engine executes joins in an optimal order: joining smaller result sets to large tables first.

By joining the smaller table (or pre-filtered result set) to the large table containing billions of records first, we eliminated a vast number of rows early in the execution plan.
This avoids the expensive intermediate result sets produced when large-to-large joins execute first, significantly reducing disk I/O and CPU load.

  • Index Management: We conducted thorough analyses to create the right types of indexes (Clustered, Non-Clustered, etc.) necessary for accelerating specific query patterns, ensuring high write performance was maintained while maximizing read efficiency.