Data Mesh defined
The two latest trends in emerging data platform architectures are the Data Lakehouse (the subject of my last blog Data Lakehouse defined), and the Data Mesh, the subject of this blog.
Data Mesh was first introduced by ThoughtWorks via the blog How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh, which includes the graphic "Data mesh architecture from 30,000 foot view".
The data mesh is an exciting new approach to designing and developing data architectures. Unlike a centralized and monolithic architecture based on a data warehouse and/or a data lake, a data mesh is a highly decentralized data architecture.
Data mesh tries to solve three challenges with a centralized data lake/warehouse:
- Lack of ownership: who owns the data – the data source team or the infrastructure team?
- Lack of quality: the infrastructure team is responsible for quality but does not know the data well
- Organizational scaling: the central team becomes the bottleneck, such as with an enterprise data lake/warehouse
Its goal is to treat data as a product. Each source has its own data product manager/owner (who is part of a cross-functional team of data engineers) and is its own clearly focused domain with an autonomous offering. These domains become the fundamental building blocks of a mesh, leading to a domain-driven distributed architecture. Note that for performance reasons, you could have a domain that aggregates data from multiple sources. Each domain should be discoverable, addressable, self-describing, secure (governed by global access control), trustworthy, and interoperable (governed by an open standard). Each domain will store its data in a data lake and in many cases will also have a copy of some of the data in a relational database (see Data Lakehouse defined for why you still want a relational database in most cases).
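To make those attributes a bit more concrete, here is a minimal Python sketch of how a domain's data product might describe itself and be registered in a catalog. This is only an illustration under my own assumptions: the `DataProduct` and `Catalog` classes, the field names, the `lake://` address scheme, and the role names are all hypothetical and do not come from the ThoughtWorks blog or any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataProduct:
    """Hypothetical descriptor for one domain's data product."""
    domain: str
    name: str
    owner: str                      # the data product manager/owner
    address: str                    # addressable: a stable, unique location
    schema: dict[str, str]          # self-describing: column name -> type
    allowed_roles: frozenset[str]   # secure: roles granted by global access control
    data_format: str = "parquet"    # interoperable: an agreed open format


class Catalog:
    """Hypothetical registry that makes data products discoverable."""

    def __init__(self) -> None:
        self._products: dict[str, DataProduct] = {}

    def register(self, product: DataProduct) -> None:
        key = f"{product.domain}.{product.name}"
        if key in self._products:
            raise ValueError(f"{key} is already registered")
        self._products[key] = product

    def find(self, domain: str) -> list[DataProduct]:
        """Discover every data product published by a domain."""
        return [p for p in self._products.values() if p.domain == domain]

    def resolve(self, key: str, role: str) -> str:
        """Return the product's address if the role is allowed to read it."""
        product = self._products[key]
        if role not in product.allowed_roles:
            raise PermissionError(f"role {role!r} may not read {key}")
        return product.address


# Example usage with made-up values.
catalog = Catalog()
catalog.register(
    DataProduct(
        domain="sales",
        name="orders",
        owner="sales-data-team",
        address="lake://sales/orders",  # hypothetical address scheme
        schema={"order_id": "int", "amount": "decimal"},
        allowed_roles=frozenset({"analyst"}),
    )
)
print(catalog.resolve("sales.orders", role="analyst"))
```

The point of the sketch is that ownership, schema, and access rules travel with the data product itself, while the catalog only knows how to find and hand out products.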
Another component in a data mesh is data infrastructure as a platform, which provides storage, pipeline, data catalog, and access control to the domains. The main idea is to avoid duplicating effort. This will allow each data product team to build its data products quickly. Note this data infrastructure platform should not become a data platform (it stays domain agnostic).
It’s a mindset shift where you go from:
- Centralized ownership to decentralized ownership
- Pipelines as first-class concern to domain data as first-class concern
- Data as a by-product to data as a product
- A siloed data engineering team to cross-functional domain-data teams
- A centralized data lake/warehouse to an ecosystem of data products
As for my opinion on Data Mesh (to clarify, this is my opinion and not that of Microsoft), it’s something that sounds great in theory but I’m really interested to see how companies are going to solve it technically. It would seem to require a full proprietary virtualization software and doing data virtualization has many issues (I already blogged about those at Data Virtualization vs Data Warehouse and Data Virtualization vs. Data Movement). There is also a large gap in open-source or commercial tooling to accelerate implementation of a data mesh (for example, implementation of a universal access model to time-based polyglot data). And you also have the challenge of Master Data Management (MDM) and Conformed dimensions. However, I think technology is on the way to solve this.
I am seeing some exciting attempts from Microsoft customers at building a data mesh. One example is an excellent slide by John Mallinder from Microsoft for a customer building a data mesh, in which he uses the name “Harmonized Mesh”.
Creating this in the Azure world, Azure Purview would be your starting point for discovering data. If you need to do cross-domain queries, also called federated queries, you would use Synapse serverless with Azure virtual network peering if querying data from storage accounts (by linking the storage accounts in each Synapse workspace). If querying data from Synapse relational dedicated pools, that would currently require extra work, such as using Synapse Spark notebooks, Databricks, Power BI, or Azure Data Factory data flows to call multiple databases hosted in separate dedicated pools (but there are easier solutions on the way).
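If the idea of a cross-domain (federated) query is new to you, the following small Python sketch shows the shape of it. It is an analogy only and not how Synapse works: I am using SQLite from the Python standard library, with each attached in-memory database standing in for a separate domain's store and the single connection standing in for the query engine. The domain names, tables, and rows are made up.

```python
import sqlite3

# One connection stands in for the federated query engine.
engine = sqlite3.connect(":memory:")

# Each attached in-memory database stands in for a separate domain's store.
engine.execute("ATTACH DATABASE ':memory:' AS sales")
engine.execute("ATTACH DATABASE ':memory:' AS customers")

# Each domain owns and populates its own tables (made-up sample data).
engine.execute(
    "CREATE TABLE sales.orders (order_id INTEGER, customer_id INTEGER, amount REAL)"
)
engine.execute(
    "CREATE TABLE customers.customer (customer_id INTEGER, region TEXT)"
)
engine.executemany(
    "INSERT INTO sales.orders VALUES (?, ?, ?)",
    [(1, 10, 25.0), (2, 11, 40.0), (3, 10, 15.0)],
)
engine.executemany(
    "INSERT INTO customers.customer VALUES (?, ?)",
    [(10, "EMEA"), (11, "APAC")],
)


def revenue_by_region(conn: sqlite3.Connection) -> list[tuple[str, float]]:
    """Join data owned by two different domains in a single query."""
    query = """
        SELECT c.region, SUM(o.amount) AS revenue
        FROM sales.orders AS o
        JOIN customers.customer AS c ON c.customer_id = o.customer_id
        GROUP BY c.region
        ORDER BY c.region
    """
    return conn.execute(query).fetchall()


print(revenue_by_region(engine))  # [('APAC', 40.0), ('EMEA', 40.0)]
```

The consumer writes one query and the data stays where each domain put it, which is the behavior you are after when linking storage accounts across workspaces.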
Keep in mind a data mesh only makes sense for companies with many large domains of data, and where there might be a lot of political infighting over who controls the data and/or data sovereignty is needed. So typically a data mesh is only for the largest companies, as it can be difficult and time-consuming to set up this environment. I am aware of a number of companies that have been building a data mesh. Please comment below if you have a data mesh in production!
More info:
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes Beyond the Data Lake
The Data Mesh – The New Kid On The Data Architecture Block
The Distributed Data Mesh as a Solution to Centralized Data Monoliths
Video Keynote – Data Mesh by Zhamak Dehghani
Data Mesh Paradigm Shift in Data Platform Architecture
The Data Mesh: Re-Thinking Data Integration
What is a Data Mesh — and How Not to Mesh it Up
Video Introduction to Data Mesh: A Paradigm Shift in Analytical Data Management (Part I)
Video How to Build a Foundation for Data Mesh: A Principled Approach (Part II)
Retrospective: My Experience Writing Data Management at Scale
Data Mesh: Design, Benefits, Hype, and Reality
Will the Data Mesh save organizations from the Data Mess?
Video How to Make a Data Mesh; Not Data Mess. MIT CDOIQ 2021
Excellent view on a strategy to shorten the gap between corporate and self-service BI as two ends of a spectrum. I would suggest CDM as one major vehicle to enable easy reuse of published data. As you pointed out, data domain knowledge might be low, so additional metadata is crucial for using the data at a high quality level.
For companies that have many independent data domains that are akin to political or data fiefdoms, a data exchange might be a better approach. We wrote a report on this recently. See https://www.eckerson.com/register?content=the-rise-of-data-exchanges-frictionless-integration-of-third-party-data
Agreed! Great differentiator, Wayne.
Very enlightening article James, thanks for that. My thoughts are that Data Governance aims to achieve exactly the same goals, albeit without prescribing the architecture but by creating the same responsibilities for data (domains), treating data as an asset (sort of product), training staff and ministration and administration of the data assets.
So if you enter in Data Mesh Architecture, you automatically have half of your Data Governance effort and if you start a Data Governance program, you are likely to end up with a structure like Data Mesh.
Interesting. The data mesh term and concept has been around since the early 2000s. What led you to believe it’s a ThoughtWorks invention?
Hi Doug, I have not heard the term “data mesh” before, but certainly the concepts have been around a while (i.e. data marts with Kimball). Would be interested to see if you have heard the data mesh term before.
Hi James, Data marts are very different. Data mesh seems to be a much more academic/theoretical concept, rather than an actual architecture, that looks a lot like pseudo-controlled data chaos–at least how ThoughtWorks has attempted to redefine it. I just watched Zhamak Dehghani’s presentation and have read and re-read her posts on the topic. Either she’s making a simple concept much more convoluted and buzzword-laden than necessary (likely to claim propriety), or the concept itself is way over-baked and riddled with impracticality holes. I’ve made some notes with the intent to post a blog sometime soon. Stay tuned. -Doug
Hi Doug,
I feel data marts have some things similar to data mesh (i.e. separating data by domains) but certainly some big differences. Other concepts of data mesh have been around a while under different names. I also have many concerns over data mesh, and I posted just some of them at https://www.jamesserra.com/archive/2021/07/data-mesh-centralized-ownership-vs-decentralized-ownership/
Would love to see your blog on it!
As I’ve looked into Data Mesh more, including talking deeply with key folks at one of its biggest proponents, Starburst, I’ve come to realize it’s an age-old approach with newer technology (i.e., old wine in new wineskins). Call it VDW, EII, or data virtualization. Guess who Starburst competes with? Denodo and Dremio, both DV vendors, although Dremio has moved away from that term because it’s “old school”. Starburst is getting a lot of traction with the new term but its actual customer use cases are all DV ones.
Wayne, you are ruining things for companies selling their product, training, or services with “data mesh” in the title 🙂
Data mesh is driving a lot of leads to DV companies. It’s good marketecture. I like what Nikos says below — companies that do DG, DW, and DL right already federate a lot of capabilities, knowledge, and standards. How could you not? I think Occam’s razor applies here: the simplest solution is the best.
I am happy that these ideas are being grouped together and re-branded as “Data Mesh” because I think there’s merit to them. They are not new ideas, e.g. aligning domains to business capabilities, cross-functional teams, achieving data interoperability via semantic metadata and master data management, centralised vs decentralised operating models for data governance & stewardship, etc. Neither are Domain-Driven Design ideas new, e.g. many of us already didn’t use a straw-man monolithic Enterprise Data Model, but a loosely coupled collection of subject area (domain) models, with controlled vocabularies and glossaries, concept models, business rules, cross-domain translation dictionaries and even upper ontologies to go with.
Some of the potential issues with Data Mesh I see are:
1. Convincing operational dev teams that they need to do their own data quality, stewardship, metadata and master data management, on top of their main day-to-day activities.
2. Getting product owners to prioritise data management tasks over customer-focused user stories on any sprint. Expect regular and unpredictable lags on data governance across the business.
3. Augmenting 2-pizza teams already staffed predominantly with “imperative-mindset” software developers (not being polemical) with a minority of “data-mindset” owners and stewards risks creating the known tensions between developer and DBA at distributed scale.
4. Metadata and master/reference data management cannot necessarily happen in isolation within each domain.
5. Proliferation of transformations across consuming domains will create inconsistent views in absence of precise data element semantics and enforcement thereof (may or may not be an issue).
6. Business capability-aligned domains are, in essence, silos. The value stream that serves an end customer combines capabilities and those need to work flawlessly to achieve successful and efficient execution of the value stream. In terms of data, this means that unless perfect interoperability is achieved between Data Mesh “data products”, the combination of those to deliver end customer value can easily be disrupted and require manual intervention.
Really great points Nikos, I appreciate the feedback!
Nikos, you said it, thanks! “Business capability-aligned domains are, in essence, silos.”
Yes, data mesh panders to the lowest common denominator, reinforcing an organization’s worst tendencies to silo data. However, data mesh technology (i.e., data virtualization) is a critical element of any data architecture, creating a data service that gives users and applications transparent access to distributed data. It federates data to support a holistic data environment.