Optimising Data Transformation: Streamlining the Medallion Architecture
- Cameron Price
- Aug 19, 2024
- 3 min read

In a typical data architecture, a single piece of data is physically moved and stored as many as 15 times. There are many historical reasons for this, but the result is a solution that is cumbersome and complex to manage, with many points of failure across scripts, programs, ETL/ELT processes, scheduling, and dependencies.
In the era of the data warehouse, multiple layers of data storage and separation were common: a raw/landing zone, a staging zone, a curated/processed zone, a refined zone, a consumption zone, and perhaps an archive zone.
Then came the Data Lake, which promised to simplify this architecture through greater flexibility. Unfortunately, most customers followed the same pattern, deploying a raw zone, a staging zone, a curated zone, a refined/trusted zone, a discovery/sandbox zone, and perhaps an archive zone.
This was no less complex to manage, and arguably more so, given the immaturity and complexity of the tooling involved.
When the same pattern was carried into the lakehouse architecture, where data lake and data warehouse are combined, those customers found it impossible to realise any return on such architectures or to support the critical digital transformation requirements of their organisations.
Within a Medallion Architecture, the industry is following a similar pattern, with a Bronze (Raw) zone, a Silver (Refined) zone, and a Gold (Aggregated) zone. As an industry and as data practitioners, we have an opportunity to use this architecture to its full potential, rather than repeating the patterns of the last 30 years and finding ourselves, in 2-5 years, deeply disappointed at having failed to deliver the data transformation activities that are so important to our organisations.
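To make the three layers concrete, here is a minimal PySpark sketch of a single Bronze-to-Silver-to-Gold flow. The paths, column names, and cleansing rules are hypothetical, and a production deployment would typically use Delta or a similar table format; the point to notice is that every hop creates another physical copy of the data.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

# Bronze: land the raw data exactly as received (hypothetical source path).
bronze = spark.read.json("/landing/orders/")
bronze.write.mode("append").parquet("/lake/bronze/orders")

# Silver: cleanse and conform in a single pass.
silver = (
    spark.read.parquet("/lake/bronze/orders")
    .dropDuplicates(["order_id"])
    .filter(F.col("order_total").isNotNull())
    .withColumn("order_date", F.to_date("order_ts"))
)
silver.write.mode("overwrite").parquet("/lake/silver/orders")

# Gold: aggregate for consumption.
gold = silver.groupBy("order_date").agg(F.sum("order_total").alias("daily_revenue"))
gold.write.mode("overwrite").parquet("/lake/gold/daily_revenue")
```

Three zones means three physical copies of broadly the same data, and that is before any staging, sandbox, or archive zones are added.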
The most significant challenge with such layered data architectures, including the Medallion Architecture, is the amount of ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) processing required to manage and move data between these layers. This challenge manifests in several ways:
Complexity: As data progresses through different layers, it often undergoes multiple transformations and checks, which can make the ETL/ELT processes complex and difficult to manage.
Resource Intensive: Each transformation step consumes computational resources. With large volumes of data, this can become quite resource-intensive, requiring significant computing power and potentially increasing costs.
Latency: Multiple ETL/ELT stages can introduce latency. The time taken to process and move data from one layer to another can impact the timeliness of the data, which is particularly critical for real-time analytics.
Maintenance Overhead: Maintaining multiple ETL/ELT pipelines, especially in dynamic environments where data schemas and business requirements change frequently, can be challenging and labour-intensive.
Data Quality and Consistency: Ensuring data quality and consistency across multiple transformations and layers is challenging. Errors or inconsistencies introduced at any stage can propagate through the system.
Governance and Compliance: As data moves through different layers, keeping track of its lineage, ensuring compliance with regulations, and managing access and security become more complex.
To address these challenges, Data Tiles advocates new approaches that keep data architectures, including the Medallion Architecture, efficient. These include:
Streamlining ETL Processes: Simplifying and optimising ETL/ELT processes to reduce complexity and resource consumption.
ELT (Extract, Load, Transform): Don’t process for processing’s sake. Use smart storage rules to decide where data should be stored and in which technology, rather than applying a one-size-fits-all approach. Avoid moving to an ELT model simply as the solution: that just shifts the challenge downstream to a different technology without reducing the complexity or the amount of physical data created. A sketch of rule-based storage routing appears after this list.
Data Mesh: Move towards a data mesh architecture, allowing users to access and analyse data without moving it through multiple layers. This reduces the need for extensive ETL/ELT processing whilst maintaining governance, lineage, and quality. Databricks, for example, advocates this approach: “The Medallion architecture is compatible with the concept of a data mesh, bronze and silver tables can be joined together in a one-to-many fashion” (Databricks public website, 2023). A sketch of such a join appears after this list.
Automation and AI: Automating ETL/ELT processes and using AI to manage data pipelines can reduce manual overhead and improve efficiency. A simple schema-drift check, one small example of such automation, appears after this list.
Incremental and Real-Time Processing: Processing data incrementally as it changes and adopting real-time techniques can reduce latency and improve data timeliness. A minimal incremental-load sketch appears after this list.
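As a sketch of the “smart storage rules” idea above, the following routes each dataset to a storage tier from a few explicit attributes. The attributes, tier names, and thresholds here are invented for illustration; real rules would be derived from your own access patterns and cost model.

```python
from dataclasses import dataclass

@dataclass
class DatasetProfile:
    name: str
    size_gb: float
    reads_per_day: int
    latency_sensitive: bool

def choose_storage(profile: DatasetProfile) -> str:
    """Route a dataset to a storage tier using simple, explicit rules."""
    if profile.latency_sensitive:
        return "warehouse"       # low-latency serving layer
    if profile.reads_per_day >= 100:
        return "lake-hot"        # read often: keep on a fast object-storage tier
    if profile.size_gb > 500:
        return "lake-cold"       # large and rarely read: cheap storage class
    return "lake-standard"

print(choose_storage(DatasetProfile("orders", 120.0, 800, False)))   # lake-hot
print(choose_storage(DatasetProfile("audit_log", 900.0, 2, False)))  # lake-cold
```

Because the rules live in one place as data rather than being scattered across pipelines, they can be reviewed, versioned, and changed centrally.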
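Next, a sketch of the bronze-and-silver join described in the data mesh point: both tables are read in place and joined for consumers, rather than copying the bronze data forward through yet another layer. The paths and the customer_id join key are assumptions for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mesh-join-sketch").getOrCreate()

bronze_events = spark.read.parquet("/lake/bronze/events")        # raw, as landed
silver_customers = spark.read.parquet("/lake/silver/customers")  # cleansed dimension

# One-to-many join: each silver customer row matches many bronze events.
enriched = bronze_events.join(silver_customers, on="customer_id", how="left")

# Expose the result as a view for downstream consumers: no extra physical copy.
enriched.createOrReplaceTempView("enriched_events")
```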
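One small but practical slice of the automation point is letting a machine, not a person, catch schema drift before it propagates downstream. This plain-Python sketch assumes a simple column-name-to-type contract; a real pipeline would more likely use its framework’s own expectation or constraint features.

```python
# Hypothetical contract for an incoming table.
EXPECTED_SCHEMA = {"order_id": "string", "order_total": "double", "order_ts": "timestamp"}

def check_schema(actual: dict[str, str]) -> list[str]:
    """Return human-readable schema-drift problems; an empty list means OK."""
    problems = []
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in actual:
            problems.append(f"missing column: {col}")
        elif actual[col] != dtype:
            problems.append(f"type drift on {col}: expected {dtype}, got {actual[col]}")
    for col in actual.keys() - EXPECTED_SCHEMA.keys():
        problems.append(f"unexpected new column: {col}")
    return problems

# Example: upstream changed a type and added a column.
print(check_schema({"order_id": "string", "order_total": "string", "channel": "string"}))
```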
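Finally, a minimal incremental-processing sketch: persist a high-watermark timestamp and process only the rows that arrived since the previous run, instead of rebuilding whole layers. The table path, the ingested_at column, and how the watermark is stored are all assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("incremental-sketch").getOrCreate()

# Watermark persisted from the previous run (e.g. in a small control table).
last_watermark = "2024-08-01T00:00:00"

# Read and transform only the rows that arrived since the last run.
new_rows = (
    spark.read.parquet("/lake/bronze/orders")
    .filter(F.col("ingested_at") > F.lit(last_watermark))
)
new_rows.write.mode("append").parquet("/lake/silver/orders")

# Advance the watermark for the next run.
next_watermark = new_rows.agg(F.max("ingested_at")).first()[0]
```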
By adopting such approaches, organisations gain the ability to organise and manage data effectively, especially in complex environments with diverse data types and large volumes. The key is to balance the need for data processing with efficient architecture and technology choices.