Expertise

Use Case 1

Self-Healing Modern Data Pipelines

<<Coming Soon …>>

Use Case 2

i. Processed 1 Trillion Rows Using AWS Glue, PySpark, and Python

Scaling Data Engineering for Deterministic Results

When dealing with financial data, precision and determinism are non-negotiable. We architected a solution on AWS Glue 4.0 and PySpark to handle extreme scale at speed, processing 1 Trillion rows in just 55 minutes. This was not merely a speed challenge, but a precision mandate.

ii. Engineering for Financial Precision

A critical component of this project involved safeguarding the integrity of sensitive financial calculations. We identified and addressed common pitfalls in high-volume processing:

  • Eliminating Floating-Point Noise: We mandated the use of DECIMAL data types across all processing pipelines instead of standard floating-point types (FLOAT or DOUBLE). This seemingly small change eliminated non-deterministic rounding errors and ensured that financial aggregations were exact and auditable (a minimal sketch follows this list).
  • Traceability and Auditability: The deterministic pipeline ensures that every run produces the same precise result, which is crucial for compliance and regulatory reporting in the financial sector.
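
To illustrate the first point, a minimal PySpark sketch of the DECIMAL-based approach (the column names and precision shown here are placeholders, not the production schema):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType

spark = SparkSession.builder.appName("decimal-precision-sketch").getOrCreate()

# Hypothetical transaction data; in production this is read from the source tables.
df = spark.createDataFrame(
    [("ACC-1", "10.10"), ("ACC-1", "0.20"), ("ACC-2", "99.99")],
    ["account_id", "amount_str"],
)

# Cast amounts to DECIMAL(38, 10) instead of FLOAT/DOUBLE so aggregations are
# exact and every run produces the same, auditable result.
df = df.withColumn("amount", F.col("amount_str").cast(DecimalType(38, 10)))

totals = df.groupBy("account_id").agg(F.sum("amount").alias("total_amount"))
totals.show()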


iii. Designing and Implementing the Data Mesh with Apache Iceberg

To ensure long-term agility and data ownership, we evolved the architecture from a central Data Lake to a decentralized Data Mesh. This shift provided domain-oriented data products for the organization:

  • Data Mesh Foundation: The data domains were implemented using Apache Iceberg tables. Iceberg provides transactional capabilities for data lakes, allowing for schema evolution, hidden partitioning, and powerful time-travel features necessary for financial record keeping (see the sketch after this list).
  • S3 Integration: The entire data flow from various sources (SQL, Oracle, Cosmos DB, APIs and SharePoint) was standardized, with Amazon S3 acting as the foundation for the Bronze/Silver/Gold layers.
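
As an illustration, a minimal sketch of the kind of Iceberg domain table involved, assuming Spark 3.3+ with an Iceberg-enabled catalog registered against the AWS Glue Data Catalog and S3 (the catalog, schema, table, and column names are placeholders):

from pyspark.sql import SparkSession

# Assumes the session is already configured with an Iceberg catalog named
# "glue_catalog" whose warehouse points at S3.
spark = SparkSession.builder.appName("iceberg-domain-sketch").getOrCreate()

# Transactional Iceberg table with hidden partitioning on the posting date.
spark.sql("""
    CREATE TABLE IF NOT EXISTS glue_catalog.finance.transactions (
        txn_id     BIGINT,
        account_id STRING,
        amount     DECIMAL(38, 10),
        posted_at  TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(posted_at))
""")

# Time travel: query the table exactly as it looked at an earlier point in time,
# which underpins auditing and financial record keeping.
spark.sql("""
    SELECT * FROM glue_catalog.finance.transactions
    TIMESTAMP AS OF '2024-01-01 00:00:00'
""").show()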


iv. Data Integration and API Development

We extended the data lake’s capabilities by implementing robust, on-demand data retrieval services:

  • PySpark API Development: Using PySpark within AWS Glue, we developed production-ready APIs to retrieve and incorporate crucial external data, such as real-time Weather information and structured Insurance-related records, allowing us to enrich existing financial datasets without complex, brittle ETL scripts (a simplified sketch follows below). This framework demonstrates expertise in building high-performance, compliant data platforms ready for modern architectural demands.
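
A simplified sketch of how such an enrichment step can look inside a Glue PySpark script (the endpoint URL, S3 paths, fields, and join key are placeholders, and the requests library is assumed to be available to the job):

import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("weather-enrichment-sketch").getOrCreate()

# Hypothetical external weather endpoint; the real service and schema differ.
WEATHER_API_URL = "https://example.com/api/weather"

def fetch_weather(cities):
    """Call the external API once per city and return rows for a DataFrame."""
    rows = []
    for city in cities:
        resp = requests.get(WEATHER_API_URL, params={"city": city}, timeout=30)
        resp.raise_for_status()
        rows.append((city, resp.json().get("temperature_c")))
    return rows

# Existing curated financial dataset (placeholder path).
claims = spark.read.parquet("s3://example-bucket/silver/insurance_claims/")

cities = [r["city"] for r in claims.select("city").distinct().collect()]
weather = spark.createDataFrame(fetch_weather(cities), ["city", "temperature_c"])

# Enrich the financial records with the external weather attributes.
enriched = claims.join(weather, on="city", how="left")
enriched.write.mode("overwrite").parquet("s3://example-bucket/gold/claims_weather/")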

Use Case 3

i. Global Data Ingestion Framework: Architecting a Scalable, Metadata-Driven Data Pipeline using Data Mesh on Azure

We designed and implemented a metadata-driven data pipeline to transition the organization to a Data Mesh architecture on Azure. This system unifies data ingestion and processing across multiple business groups, ensuring both extreme flexibility and stringent governance via the Medallion Architecture.

ii. The Dynamic Ingestion Framework (Azure Synapse)

To manage any number of diverse data sources (with requirements for full-load, incremental, and CDC runs) without one-off development for each source, we built a highly efficient orchestration layer:

  • Metadata Control: The entire process is dictated by a central Metadata Control Table. This table determines the connection details, source object, and load type for every ingestion job (illustrated in the sketch after this list).
  • Master/Child Pipelines: The architecture relies on one Master Azure Synapse Pipeline that reads the metadata and dispatches the work to a single, reusable Child Pipeline.
  • Universal Connector for Parallelism: Crucially, the Master Pipeline uses a ForEach activity configured for parallel execution. This allows the single Child Pipeline to be dynamically triggered for multiple tables simultaneously. When we pull multiple source tables, they run in parallel, significantly increasing throughput and shrinking the overall processing window.
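
For illustration only, the control table can be pictured as rows like the following (the columns and values are hypothetical, not the production schema); conceptually, the Master Pipeline's ForEach activity fans each row out to the single Child Pipeline in parallel:

# Hypothetical rows from the Metadata Control Table: each row fully describes
# one ingestion job for the reusable Child Pipeline.
control_table = [
    {"source_system": "SQLServer", "connection": "ls_sql_finance",
     "source_object": "dbo.Customers", "load_type": "full"},
    {"source_system": "Oracle", "connection": "ls_oracle_sales",
     "source_object": "SALES.ORDERS", "load_type": "incremental",
     "watermark_column": "LAST_MODIFIED"},
    {"source_system": "SQLServer", "connection": "ls_sql_finance",
     "source_object": "dbo.Transactions", "load_type": "cdc"},
]

# Rough equivalent of the ForEach dispatch (each iteration runs in parallel in
# Synapse; shown sequentially here for clarity).
for row in control_table:
    print(f"Trigger Child Pipeline for {row['source_object']} ({row['load_type']})")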


iii. Unified Processing and Governance (Azure Databricks)

The entire transformation workflow is implemented using Azure Databricks Workflows governed by Delta Live Tables (DLT), consolidating all logic into just two highly adaptable notebooks.

  • Bronze Layer
    The Bronze notebook(s) incrementally copy data from the source systems into the Bronze Delta tables, automatically handling schema evolution. This notebook dynamically processes any source data passed to it (see the first sketch after this list).
  • Silver Layer
    The Silver notebook(s) apply core business logic, including business transformation rules. They read from the Bronze layer and dynamically transform the data based on configuration rules tied to the table name.
  • Gold Layer | Delta Live Tables (SCD Type 2)
    The final layer focuses on creating optimized consumption data products (Business Tables – Fact and Dimension tables). Delta Live Tables with SCD Type 2 are implemented to maintain robust historical records (see the second sketch after this list). This modular approach, with all logic consolidated in two reusable notebooks, allows for rapid onboarding of new data domains, dramatically accelerating development cycles while maintaining consistency and high data quality. The resulting Gold layer directly feeds Power BI / Tableau dashboards, providing the business with timely and trustworthy data for critical decision-making.
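
A minimal sketch of the Bronze pattern, assuming files landed by the ingestion framework in an ADLS path (the path, table name, and options are illustrative, and the code only runs inside a Delta Live Tables pipeline, where the spark session is provided by the runtime):

import dlt

@dlt.table(
    name="customers_bronze",
    comment="Raw, incrementally ingested copy of the customer source feed",
)
def customers_bronze():
    # Auto Loader picks up only new files on each run and evolves the table
    # schema as new source columns appear.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
        .load("abfss://landing@examplestorage.dfs.core.windows.net/customers/")
    )

Building on that, a minimal sketch of the Gold-layer SCD Type 2 pattern, assuming a customer_id business key and a last_updated_ts ordering column (both placeholders):

import dlt
from pyspark.sql.functions import col

# Streaming target table that will hold the full SCD Type 2 history.
dlt.create_streaming_table("dim_customer")

# apply_changes merges incoming changes and maintains history rows
# (effective/end timestamps) automatically when stored_as_scd_type = 2.
dlt.apply_changes(
    target="dim_customer",
    source="customers_bronze",           # upstream table from the Bronze sketch
    keys=["customer_id"],                # placeholder business key
    sequence_by=col("last_updated_ts"),  # placeholder ordering column
    stored_as_scd_type=2,
)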

Use Case 4

i. Performance Tuning for Terabyte-Scale Data Assets

We have extensive experience in optimizing high-volume data workloads across both modern Delta Lake architectures and traditional relational databases, focusing on resource efficiency and query-latency reduction for terabyte-scale financial data.

ii. Delta Lake Performance Optimization (Cloud)

For data stored in Delta tables within a cloud environment (such as Azure Databricks), tuning focuses on maximizing file organization and managing memory state to ensure the fastest possible processing and lowest scan cost.

  • Adaptive Clustering & Data Skipping: We moved beyond fixed, static partitioning schemes to implement Liquid Clustering on large, frequently queried Delta tables. This feature dynamically manages data layout based on query patterns, improving query performance by co-locating related records and reducing the amount of data scanned. Combined with Delta Lake’s Data Skipping features, this ensured queries only scanned the necessary files, achieving orders-of-magnitude reductions in query execution time (see the sketch after this list).
  • Advanced In-Memory Trillion-Row Caching: For processing pipelines involving multiple complex, chained calculations (e.g., Step 1 to Step 4) over massive datasets (e.g., trillion rows), intermediate caching becomes crucial. Reprocessing steps 1 through 3 for every run is prohibitively expensive and time-consuming at this scale.
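
To illustrate the clustering point above, a minimal sketch of enabling Liquid Clustering on a Delta table, assuming a Databricks Runtime that supports it (the schema, table, and column names are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder fact table that is usually filtered by account and posting date.
spark.sql("""
    CREATE TABLE IF NOT EXISTS finance.fact_transactions (
        txn_id       BIGINT,
        account_id   STRING,
        posting_date DATE,
        amount       DECIMAL(38, 10)
    )
    CLUSTER BY (account_id, posting_date)
""")

# OPTIMIZE incrementally reclusters newly written data on the clustering keys,
# so selective queries skip files that cannot contain matching rows.
spark.sql("OPTIMIZE finance.fact_transactions")
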

For the in-memory caching strategy described above, we pinned intermediate DataFrames in fast memory with an explicit memory-only storage level (df.persist(StorageLevel.MEMORY_ONLY)) rather than relying on the default df.cache(), which can fall back to disk, so that hot intermediate data is never served from slow external storage.

To guarantee the DataFrame is fully computed and cached before the next transformation begins, we established a strict optimization sequence for intermediate DataFrames: df.repartition() -> df.persist() -> df.count().
By forcing an action (df.count()), the pre-processed data is materialized and instantly available for subsequent steps (e.g., Step 4), which drastically reduces latency and compute cost for massive, multi-step workflows.
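
A condensed sketch of that sequence (the input path, partition count, and column name are placeholders standing in for the real Steps 1-3 output):

from pyspark.sql import SparkSession
from pyspark.storagelevel import StorageLevel

spark = SparkSession.builder.getOrCreate()

# Placeholder for the expensive intermediate result produced by Steps 1-3.
intermediate = spark.read.parquet("s3://example-bucket/step3_output/")

# 1. Repartition to a sensible parallelism before caching.
intermediate = intermediate.repartition(2000)

# 2. Pin the DataFrame in memory with an explicit memory-only storage level.
intermediate = intermediate.persist(StorageLevel.MEMORY_ONLY)

# 3. Force an action so the cache is fully materialized now, not lazily on
#    first downstream use.
intermediate.count()

# Step 4 (and any later steps) now read from the in-memory copy.
step4_result = intermediate.groupBy("account_id").count()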

iii. Relational Database Optimization (SQL Server, Oracle)

For complex Stored Procedures and ETL processes involving traditional relational databases, the focus was on intelligent query rewriting and index optimization to minimize resource consumption:

  • Strategic Join Condition Re-sequencing: 
    We specialized in diagnosing and resolving performance bottlenecks in complex stored procedures. The core strategy involved ensuring the database engine executes joins in an optimal order: joining smaller result sets to large tables first.
    By joining a smaller table to a large table (containing billions of records) first, we filtered out a vast number of records early in the execution plan.
    This prevents the expensive intermediate result sets that occur when large-to-large table joins are executed first, significantly reducing disk I/O and CPU load.
  • Index Management: We conducted thorough analyses to create the right types of indexes (Clustered, Non-Clustered, etc.) necessary for accelerating specific query patterns, ensuring high write performance was maintained while maximizing read efficiency.