Evolution of Data Architectures

The separation of data for business operations from data for analytical workloads (BI, data science, cognitive solutions, etc.) is as old as IT systems and business applications themselves. Because analytical workloads are resource intensive, they need to be separated from the IT systems that run business operations so that operational workloads run smoothly, without resource contention, thereby ensuring a positive customer experience.

Our dependency on big data and business analytics has grown significantly over the years, with the market size expected to reach USD 684.12 billion by 2030. Across industries, organizations invest in analyzing their massive volumes of data and creating effective data strategies. Data architectures are the frameworks through which IT infrastructure supports those data strategies. As the foundation of a data strategy, the data architecture plays an essential role in its effective implementation, and the evolution of data architectures over the years has accordingly shaped the effectiveness of data strategies.

Data models, architectures, and storage have evolved over time to cater to diverse analytical workloads. In this article, we introduce the various data architectures that have evolved to meet continuously growing analytical needs. Each of them deserves a book of its own, and their complete details cannot be covered in one article; the purpose here is to describe each at a high level and point to additional literature. Let's begin.

Evolution of Data Architecture

1. ODS, EDW and Data Marts

In the early days, the operational data store (ODS) was developed to cater to decision support systems, mainly targeting operational users who needed predefined reports. An ODS stores only current data (typically about 6 months) for operational reporting and tactical decision making, such as a bank teller deciding whether to offer an overdraft facility or increase the credit limit for a customer standing in the queue.

The arrival of Business Intelligence (BI) tools broadened the analytics user base to senior executives who prefer summary information in graphical form. Data marts and dimensional modeling techniques such as star and snowflake schemas were developed to support this user base. Data marts are typically used for descriptive and diagnostic analytics focused on specific subject areas, helping users understand what happened, what is happening, and why, and also to conduct what-if analysis.
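
The star schema idea above can be sketched with plain SQL. The tables and names below are hypothetical, and SQLite is used only to keep the example self-contained; a real data mart would use a warehouse engine:

```python
import sqlite3

# Hypothetical retail star schema: one fact table joined to two dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE fact_sales  (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    quantity   INTEGER,
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO dim_date    VALUES (1, '2024-01-15', '2024-01', 2024), (2, '2024-02-03', '2024-02', 2024);
INSERT INTO fact_sales  VALUES (1, 1, 2, 2400.0), (1, 2, 1, 1200.0), (2, 2, 3, 900.0);
""")

# Descriptive analytics ("what happened?"): revenue by category and month,
# computed by joining the fact table to its dimensions.
rows = conn.execute("""
    SELECT p.category, d.month, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_date d    ON f.date_id = d.date_id
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()
print(rows)  # [('Electronics', '2024-01', 2400.0), ('Electronics', '2024-02', 1200.0), ('Furniture', '2024-02', 900.0)]
```

A snowflake schema would further normalize the dimensions (e.g., splitting category out of `dim_product` into its own table).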

While the ODS and data marts serve two different sets of analytical users, they tend to be limited to specific functional areas. Enterprise Data Warehouses (EDWs) were developed to cater to cross-functional analysis. Data warehouses store historical data to uncover long-term patterns and are designed with ER or dimensional modeling techniques, depending on organizational preference.

A typical enterprise data analytics architecture looks like this at this stage:

Data Analytics Architecture

2. MDDBs

An ODS, data marts, and an EDW implemented on traditional RDBMSs such as Db2, Oracle, and SQL Server serve the purpose of canned reports and executive dashboards that can be delivered in batch mode on predefined schedules, typically daily. For ad-hoc reports and interactive analysis, however, they have severe performance constraints.

To serve these needs, multi-dimensional databases (MDDBs) such as Oracle Express, Cognos PowerPlay, and Hyperion Essbase were developed. Because of their size limitations (each cube could typically hold 2 GB of data), these databases were used for data marts covering specific subject areas such as financial planning and budgeting, accounting, procurement, sales, and marketing. Users could perform interactive analysis with drill up/down/through, what-if analysis, and scenario planning, though limited to a specific functional area.

3. Data Warehouse Appliances

Analytical applications have to process data at an aggregate level to find new patterns. Traditional RDBMSs like Db2, Oracle, and SQL Server that run on general-purpose/commodity hardware lag behind in meeting the demands of these analytical workloads. DBMSs like Teradata and appliances like Netezza, Neoview, Parallel Data Warehouse, and SAP HANA came into the market to address those needs. They run on special-purpose hardware that uses massively parallel architectures and in-memory processing, giving the required performance boost. These appliances have been used to implement a flavor of the Enterprise Data Warehouse; however, except for Teradata, the appliance technologies have had minimal success.

4. Data Lakes

ODS, EDW, and Data Marts deal with enterprise structured data only. They cannot process and analyze semi-structured (JSON, XML, etc.) and unstructured data (text, documents, images, videos, audio, etc.).

In addition, they were developed before the cloud came into existence, so storage and computing resources were tightly integrated. As a result, these resources had to be planned for the application's peak load, leaving them underutilized whenever the load was not high.

With the arrival of big data technologies, another variant of data architecture came into existence: the data lake. While the purpose of a data lake is similar to that of an EDW or data marts, it also caters to semi-structured and unstructured data. Data lakes are most prominently implemented on cloud object storage such as AWS S3, Azure ADLS, or Google Cloud Storage (GCS).

While data warehouses and data marts are built with a predefined purpose, a data lake is raw storage for all types of data (at the lowest possible storage cost), which can be processed for specific purposes by spinning up a data warehouse, data marts, or data pipelines for data science and cognitive applications.

Since a data lake holds raw data, it does not require a schema on write, unlike data warehouses and data marts, which need a predefined schema before data can be loaded into them. Instead, a schema is applied only when the data is read (schema-on-read).
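
The schema-on-read idea can be illustrated with a toy example: heterogeneous records land in raw storage as-is, and a schema is applied only by the consumer that reads them. All names and fields below are illustrative:

```python
import json
import io

# Raw "lake" storage: heterogeneous records are written as-is,
# with no schema enforced on write.
raw_lake = io.StringIO()
for record in [
    {"event": "click", "user": "u1", "ts": 1},
    {"event": "purchase", "user": "u2", "amount": 19.99, "ts": 2},
    {"event": "click", "user": "u1", "ts": 3, "page": "/home"},  # extra field: still accepted
]:
    raw_lake.write(json.dumps(record) + "\n")

# Schema-on-read: a purchase-analysis pipeline projects only the fields it
# needs, applying types and defaults at read time.
def read_purchases(stream):
    stream.seek(0)
    for line in stream:
        rec = json.loads(line)
        if rec.get("event") == "purchase":
            yield {"user": str(rec["user"]), "amount": float(rec.get("amount", 0.0))}

purchases = list(read_purchases(raw_lake))
print(purchases)  # [{'user': 'u2', 'amount': 19.99}]
```

A warehouse, by contrast, would have rejected the click events (or the extra `page` field) at load time unless its schema anticipated them.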

The low cost, object storage, and open formats of a data lake make it popular, as opposed to expensive and proprietary data warehouses. However, data lakes come with their own challenges, such as:

  • lack of support for transactional applications
  • low reliability
  • poor data quality
  • difficulty enforcing security
  • performance degradation as data volumes grow

This is what a typical data lake architecture looks like:

Data Lake Architecture

5. Data Mesh

Data warehouse and data lake architectures are centralized implementations that limit the scalability and availability of data for consumption. These implementations take a long time, limit domain understanding of the data, and are more technology-oriented than end-user-oriented. They are designed and owned by data engineering specialists, who are not available in large numbers, which further limits scalability and the democratization of data for analysis. These data engineers are also far removed from the business applications that generate the data; hence, they lack the business context and meaning of the data.

The Data Mesh architecture/concept was developed to address these challenges. In this approach, data is organized as data products along functional/subject areas, or domains. Data products are owned by those responsible for the business applications, who understand the business context, meaning, and usage of the data. These data product owners take help from data engineers to design and distribute analytical data products. A catalog of these analytical data products lets every consumer in the organization discover a given data product, understand its context, use it, and interpret the results accordingly.
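
The catalog idea can be sketched minimally. The fields and names below are hypothetical illustrations, not a prescribed data product specification:

```python
from dataclasses import dataclass

# Hypothetical sketch: each domain publishes a self-describing data product
# into a shared catalog that any consumer in the organization can browse.
@dataclass
class DataProduct:
    name: str
    domain: str
    owner: str            # team accountable for the originating business application
    description: str
    schema: dict          # column name -> type, published with the product
    endpoint: str         # where consumers fetch it (table, topic, API, ...)

catalog: list[DataProduct] = []

def publish(product: DataProduct) -> None:
    catalog.append(product)

publish(DataProduct(
    name="orders_daily",
    domain="sales",
    owner="sales-apps-team",
    description="One row per order, refreshed daily.",
    schema={"order_id": "string", "amount": "decimal", "order_date": "date"},
    endpoint="warehouse://sales/orders_daily",
))

# Consumers discover products by domain rather than asking a central team.
sales_products = [p.name for p in catalog if p.domain == "sales"]
print(sales_products)  # ['orders_daily']
```

The point is that ownership and documentation travel with the product; the central platform only hosts the catalog and the self-service infrastructure.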

Data Mesh Approach

Core principles of Data Mesh are essentially:

  • data as a product
  • domain-oriented decentralized data ownership
  • self-service platform
  • federated governance of design and architecture

Data Mesh is still an approach to data architecture; there are no products available in the market yet that implement it.

6. Data Fabric

Data Fabric tries to solve the same problems as Data Mesh, but the approaches are quite different. While Data Mesh is a domain- and business-oriented distributed approach, Data Fabric is a centralized, metadata-driven, technology-centric approach.

A Data Fabric is built from metadata, a catalog, a logical data model, and data delivery APIs. Part of the data is virtualized, while the rest is centralized, just like in a data warehouse. It is complemented by centrally managed data life cycle management policies, such as:

  • data governance - active metadata management, access controls, lineage, quality
  • privacy - mask/redact sensitive information such as credit card and personal data
  • compliance - GDPR, HIPAA, FCRA
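
A privacy policy of the kind listed above might look like the following toy sketch. Real fabrics drive such policies centrally from metadata tags rather than from regexes alone, and the pattern and function names here are illustrative only:

```python
import re

# Toy redaction policy: mask all but the last four digits of anything that
# looks like a 16-digit card number (digits optionally separated by spaces
# or hyphens). A production fabric would apply this at the delivery layer,
# driven by "sensitive" tags in the central metadata catalog.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12}(\d{4})\b")

def mask_cards(text: str) -> str:
    return CARD_RE.sub(lambda m: "**** **** **** " + m.group(1), text)

masked = mask_cards("Customer paid with 4111 1111 1111 1234 yesterday.")
print(masked)  # Customer paid with **** **** **** 1234 yesterday.
```

Because the policy is enforced centrally, every consumer of the data sees the masked form regardless of which delivery API they use.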

SAP Data Intelligence, IBM’s Cloud Pak for Data, Oracle Coherence, and Talend Data Fabric are some of the products available in this space.

Denodo is another product in this space, focused on data virtualization, which is a core part of the data fabric approach.

Data Fabric Approach

7. Lakehouse Architecture

In the Data Lake architecture, each type of analytical workload requires its own data pipeline due to different data access requirements, leading to inconsistent understanding and usage of the same data.

It also introduces one more layer of data storage between analytical applications (consumers) and the business applications (sources) that generate the data. Data first has to land in the data lake and then move to the consuming applications, which can reduce the value of key insights by the time they are acted upon.

Data lakes do not support transactional applications and have many other limitations, as described in the section above.

The Lakehouse architecture tries to address these issues by providing a common interface for all types of data analytics workloads. It supports the ACID properties required by transactional applications. It essentially combines the advantages of data warehouse and data lake architectures while addressing the challenges of both.

Lakehouse Architecture
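
The ACID support is typically achieved with a transaction log over immutable data files, as in open table formats such as Delta Lake. The following is a deliberately simplified, single-writer sketch of that idea; all names are illustrative and no real table-format library is used:

```python
import json
import os
import tempfile

# Toy sketch of a Delta-style transaction log: data files on the lake are
# immutable; a numbered JSON log records which files each committed version
# adds. Readers reconstruct a consistent snapshot from the log, so a
# half-finished write (data file present, no commit entry yet) is invisible.
class ToyLakehouseTable:
    def __init__(self, root):
        self.root = root
        os.makedirs(os.path.join(root, "_log"), exist_ok=True)

    def _commits(self):
        return sorted(os.listdir(os.path.join(self.root, "_log")))

    def write(self, rows):
        # 1) Write an immutable data file (safe: not yet visible to readers).
        version = len(self._commits())
        data_file = f"part-{version:05d}.json"
        with open(os.path.join(self.root, data_file), "w") as f:
            json.dump(rows, f)
        # 2) Publish it atomically by adding a commit entry to the log.
        commit = os.path.join(self.root, "_log", f"{version:05d}.json")
        with open(commit, "w") as f:
            json.dump({"add": [data_file]}, f)

    def snapshot(self):
        # Replay the log in order to assemble the committed view of the table.
        rows = []
        for name in self._commits():
            with open(os.path.join(self.root, "_log", name)) as f:
                entry = json.load(f)
            for data_file in entry["add"]:
                with open(os.path.join(self.root, data_file)) as f:
                    rows.extend(json.load(f))
        return rows

root = tempfile.mkdtemp()
table = ToyLakehouseTable(root)
table.write([{"id": 1}, {"id": 2}])
table.write([{"id": 3}])
print(table.snapshot())  # [{'id': 1}, {'id': 2}, {'id': 3}]
```

Real formats add concurrent-writer conflict detection, deletes/updates, and checkpointing, but the core mechanism is the same: cheap object storage plus a log that provides warehouse-like transactional guarantees.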

Conclusion

Data architectures have been evolving to meet the growing demands of various analytical and cognitive workloads, leveraging innovations in cloud and big data technologies. Depending on where an organization stands in its data analytics maturity, the variety of data it holds, and the kinds of analytical workloads it requires, a specific type of data architecture can be chosen. While the Lakehouse architecture holds the promise of the best of all worlds, it is new and has yet to mature for broader adoption.

Data architectures are at the core of all business data strategies; hence, paying attention to them is crucial. With the right data architectures for your specific use case, you can ensure the successful implementation of data strategies.
