
Lakehouse isn’t an architecture; it’s a way of life

Recently a tweet of mine was revealed to have been included in a Twilio board of directors presentation from 2011. The tweet was about the simplicity of the developer experience for both Twilio and Google App Engine. What’s this have to do with Lakehouses? Everything.

Apparently screen caps in 2011 weren’t high res

My entire career has been about enabling simplified experiences for technologists so they can focus on what matters — the key differentiators of their business or application.

Google App Engine, though released before its time, made it a lot easier to launch and maintain production-ready web applications.
Google Apps Marketplace made it easier to market B2B apps to a large audience.
Neo4j and the property graph data model make it easier to understand and query the relationships between data.

The same is true with Databricks and the Lakehouse architecture.

Nearly all large enterprises today have two-tier data architectures — combining a data lake and a data warehouse to power their data science, machine learning, data analytics, and business intelligence (BI).

Data Lake, storing all the fresh enterprise data, populated directly from key business applications and sources. By using popular open data formats, the data lake is great for compatibility with popular distributed computing frameworks and data science tools. However, a traditional data lake is typically slow to query and lacks schema validation, transactions, and other features needed to ensure data integrity.
Data Warehouse, with a subset of the enterprise data ETL’d from the data lake, storing mission-critical data “cleanly” in proprietary data formats. It’s fast to query this subset, but the underlying data is often stale due to the complex ETL processes used to move the data from the business apps -> lake -> warehouse. The proprietary data formats also make it difficult to work with the data in other systems and keep users locked into their warehouse engine.
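To make that two-tier flow concrete, here is a minimal PySpark sketch of it. The paths, column names, and JDBC warehouse connection are all hypothetical; this is an illustration of the pattern, not a recommended pipeline.

# Two-tier flow: raw data lands in the lake, a cleaned subset is ETL'd to the warehouse.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("two-tier-etl").getOrCreate()

# 1. Data lake: append raw application events in an open format (Parquet).
events = spark.read.json("s3://raw/app-events/2021-04-01/")
events.write.mode("append").parquet("s3://lake/events/")

# 2. Data warehouse: periodically clean and aggregate a subset, then load it over JDBC.
#    This second hop is where staleness and lock-in tend to creep in.
daily = (
    spark.read.parquet("s3://lake/events/")
    .where(F.col("event_date") == "2021-04-01")
    .groupBy("customer_id")
    .agg(F.count("*").alias("event_count"))
)
(daily.write.format("jdbc")
    .option("url", "jdbc:postgresql://warehouse:5432/analytics")
    .option("dbtable", "daily_customer_events")
    .mode("overwrite")
    .save())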

Simplicity is king

Why have a two-tier data architecture when a single tier will satisfy the performance and data integrity requirements, improve data freshness, and reduce cost?

It simply wasn’t possible before the advent of data technologies like Delta Lake, which enable highly performant access to data stored in open data formats (like Parquet) with the data integrity constraints and ACID transactions previously possible only in data warehouses.

bestOf(DataLake) + bestOf(DataWarehouse) => DataLakehouse
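For a feel of what that looks like in practice, here is a minimal sketch using Delta Lake from PySpark. It assumes a Spark session configured with the delta-spark package; the paths and column names are hypothetical.

# Single tier: write directly to the lake in Delta format. Parquet files under
# the hood, plus a transaction log that adds ACID transactions and schema enforcement.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lakehouse").getOrCreate()

events = spark.read.json("s3://raw/app-events/2021-04-01/")
events.write.format("delta").mode("append").save("s3://lake/events_delta/")

# Appends with a mismatched schema are rejected rather than silently corrupting
# the table, and readers always see a consistent snapshot, so BI and data science
# can query the same fresh copy of the data.
fresh = (
    spark.read.format("delta").load("s3://lake/events_delta/")
    .groupBy("customer_id")
    .agg(F.count("*").alias("event_count"))
)
fresh.show()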

Reduce your (mental, financial, ops) overhead

I’d encourage y’all to invest in simplicity and reduce the complexity of your data architecture. The first step is reading up on new technologies like Delta Lake. There is a great VLDB paper on the technology as well as a Getting Started with Delta Lake tech talk series by some of my esteemed colleagues.