The data lake was a promise: dump everything in S3, figure out schemas later, democratize analytics. A decade on, the swamp jokes write themselves. Most enterprise data lakes I've walked into are petabyte-scale write-only memory — full of data nobody can find, nobody trusts, and nobody owns.
The lakehouse is the architecture that takes the actual lesson from those ten years: storage and compute should be separate, but tables need table semantics. Schemas, ACID transactions, time travel, governance. That's what Iceberg, Delta Lake, and Hudi exist to provide. This is the first data architecture in a while that's honest about what it's doing.
Dumping is not ingestion
The original sin of data lakes: "land it raw, we'll clean it downstream." Teams ingest CDC streams, clickstreams, SaaS exports, partner feeds into a raw/ prefix and call it an ingestion pipeline. It isn't. It's a dumping ground with a cron.
The medallion pattern (bronze / silver / gold) isn't new terminology; it's a discipline that enforces what the lake failed to:
- Bronze: raw, immutable, append-only. The source of truth for replay. Nobody queries this directly for business use.
- Silver: cleaned, deduplicated, conformed. Joins resolved. Schema enforced. This is what data engineers build against.
- Gold: business-ready marts. Aggregated, denormalized for BI. This is what analysts query.
Skip silver and your gold layer is a duct-tape pile of one-off cleanups. Skip bronze and you can't reprocess when a transformation bug is found six months later.
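The discipline is easier to see in code. Here's a deliberately toy, in-memory Python sketch (field names and cleanup rules are invented for illustration; a real pipeline would run Spark or dbt against actual tables): bronze only appends, silver is always derivable from bronze, and gold is always derivable from silver, which is exactly what makes the six-months-later reprocessing possible.

```python
from collections import defaultdict

bronze = []  # append-only; never mutated, only replayed

def ingest(raw_event: dict) -> None:
    """Bronze: land everything as-is, even malformed records."""
    bronze.append(raw_event)

def build_silver(raw: list) -> list:
    """Silver: enforce schema, drop malformed rows, dedupe on event_id."""
    seen, silver = set(), []
    for e in raw:
        if not {"event_id", "customer", "amount"} <= e.keys():
            continue  # a real pipeline would quarantine, not silently drop
        if e["event_id"] in seen:
            continue  # CDC streams routinely deliver duplicates
        seen.add(e["event_id"])
        silver.append({"event_id": e["event_id"],
                       "customer": e["customer"],
                       "amount": float(e["amount"])})
    return silver

def build_gold(silver: list) -> dict:
    """Gold: a denormalized mart, here revenue per customer."""
    totals = defaultdict(float)
    for row in silver:
        totals[row["customer"]] += row["amount"]
    return dict(totals)
```

Because silver and gold are pure functions of bronze, fixing a transformation bug means editing `build_silver` and replaying; nothing upstream is lost.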
Table formats matter more than engines
Apache Iceberg vs. Delta Lake vs. Hudi is the argument data Twitter wants to have. The more important argument is that you need a table format at all. A prefix full of Parquet files is not a table. It's a filesystem with a schema you hope nobody broke.
What a modern table format gives you:
- ACID transactions (so your Spark job doesn't leave a half-written table visible to the query engine).
- Schema evolution without rewriting petabytes.
- Time travel — `SELECT * FROM orders VERSION AS OF '2026-03-01'` is an incident-response superpower.
- Partition evolution without downtime.
- Engine-agnostic reads: Spark writes, Trino reads, Snowflake reads, DuckDB reads locally.
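A toy model shows why these properties come cheap once you have real table metadata. The sketch below (pure Python, all names invented; not any format's actual layout) mimics the core trick Iceberg, Delta, and Hudi share: data files are immutable, and a commit is one atomic append to a snapshot log, so readers see a whole batch or none of it, and every old version stays queryable.

```python
import time

class ToyTable:
    """Toy model of a table-format core: immutable data files plus a
    snapshot log. Readers resolve a snapshot, never raw files, so a
    half-written batch is invisible until its snapshot is published."""

    def __init__(self):
        self._files = {}      # file_id -> immutable list of rows
        self._snapshots = []  # each entry: (version, timestamp, file_ids)
        self._next_file = 0

    def commit(self, rows) -> int:
        """Write a new data file, then publish it in one snapshot append."""
        fid = self._next_file
        self._next_file += 1
        self._files[fid] = list(rows)  # write data first...
        live = self._snapshots[-1][2] if self._snapshots else []
        version = len(self._snapshots) + 1
        self._snapshots.append((version, time.time(), live + [fid]))  # ...then publish
        return version

    def scan(self, version_as_of=None) -> list:
        """Read rows visible at a given version (default: latest)."""
        if not self._snapshots:
            return []
        snap = (self._snapshots[-1] if version_as_of is None else
                next(s for s in self._snapshots if s[0] == version_as_of))
        return [row for fid in snap[2] for row in self._files[fid]]
```

`scan(version_as_of=1)` is the time-travel query from the list above; and because old snapshots still point at the original files, schema and partition changes only have to rewrite metadata, not petabytes of data.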
Iceberg has the strongest multi-engine story in 2026 — AWS, Snowflake, Databricks, Google BigLake all support it natively or via the REST catalog spec. Delta is excellent if you're a Databricks shop. Hudi if you have heavy upsert workloads. Pick one and commit; running three is how you end up back at "files in a prefix."
Governance before scale, not after
The catalog is the most important decision in your data platform, and it's the one most teams make last. Unity Catalog, AWS Glue + Lake Formation, Polaris, Nessie — the options have diverged quickly. What matters:
If you can't answer "who owns this table, who can read it, and what PII is in it" in under 30 seconds, you have a compliance incident waiting to happen — not a data platform.
Column-level access control, row-level filters for tenancy, tag-based policies for PII/PHI, lineage that actually traces across engines. Do this before you scale to a thousand tables, not after.
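Mechanically, tag-based policy is a small idea: columns carry tags, roles carry denied tags, and the intersection gets masked. A minimal sketch (table names, tags, and roles below are all hypothetical, and a real catalog enforces this inside the query engine, not in application code):

```python
# Hypothetical column tags and role policies, invented for illustration.
COLUMN_TAGS = {
    "orders": {"email": {"pii"}, "ssn": {"pii", "phi"}, "amount": set()},
}

ROLE_POLICIES = {
    "analyst":    {"denied_tags": {"pii", "phi"}},  # sees masked PII/PHI
    "compliance": {"denied_tags": set()},           # full access
}

def apply_policy(table: str, row: dict, role: str) -> dict:
    """Mask any column whose tags intersect the role's denied tags."""
    denied = ROLE_POLICIES[role]["denied_tags"]
    out = {}
    for col, value in row.items():
        tags = COLUMN_TAGS[table].get(col, set())
        out[col] = "***MASKED***" if tags & denied else value
    return out
```

The payoff of the tag indirection: when a new PII column lands, you tag the column once instead of editing a grant for every role that touches the table.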
Most teams need less than they think
I've watched teams architect for a hundred-node Spark cluster and Airflow at scale before they had ten reliable pipelines. If your company has under 10 TB of meaningful data and fewer than 50 analysts, you probably don't need a lakehouse. You need Postgres, dbt, and a BI tool. That's not a failure — it's capacity matching.
The lakehouse earns its complexity at scale: multi-petabyte storage, ML feature stores with time-travel semantics, real-time CDC from dozens of source systems, compliance regimes that demand auditable lineage. If that isn't your problem, build the thing your problem actually needs.
The takeaway: The lakehouse works because it stopped pretending files in a bucket were a database. Pick a table format, enforce medallion discipline, and nail the catalog before you chase scale. And be honest about whether you need a lakehouse at all — the answer is more often "not yet" than the vendor slides suggest.