Designing Data-intensive Applications with Martin Kleppmann
Should I consider multi-zone,
multi-region or even a multi-cloud
setup? How much availability risk are
you willing to take on versus the
computational overheads, but also the
human overheads actually designing and
operating the system? MapReduce is
dead. Nobody uses it anymore. But other
areas where we've increased the coverage
are systems in support of AI like vector
indexes. Is there any risk as a software
engineer that you're no longer
incentivized to understand the
underlying layer? If you rely on a
higher level abstraction, you're no
longer thinking about the lower level
details. If you're building higher level
business logic, actually, I think it's
just fine. LLMs increase the need for
these formal proofs because we're vibe
coding a bunch of stuff. The reason I
think that formal verification could
become more important in the future. One
is that
Designing Data-Intensive Applications
has been the go-to book for anyone
building large backend systems. 9 years
after publishing this book, the second
edition is here. Martin Kleppmann is the
author of this generational book. I sat
down with him and today we cover how
working on Kafka at LinkedIn directly
shaped ideas that became the first
edition of the book, what's new in the
second edition, and why things like
MapReduce got removed from this updated
version. Formal methods, local first
software, decentralized access, and many
more. If you care about how large
systems work, where they're heading, and
what the fundamentals are that don't
change, this episode is for you. This
episode is presented by Statsig, the
unified platform for flags, analytics,
experiments, and more. This episode is
brought to you by Sonar. Sonar, the
makers of SonarQube, understands that
code quality is about more than just
avoiding syntax errors. It's about
long-term maintainability by protecting
the structural integrity of the system.
As agents generate code at massive
scale, they often ignore your system's
structural integrity. This creates
tangles, duplicated code, and other
maintainability issues. These issues
turn a modular design into a big ball of
mud, making it increasingly difficult to
extend. But here's something that's
really helpful: SonarQube's architecture
management. It moves architectural
governance out of static wikis and into
your automated workflow. It allows you
to visualize your current architecture,
define architectural boundaries, and
manage architectural issues in real
time. Whether it's a human or an AI
agent at the keyboard, Sonar acts as a
circuit breaker for structural decay. It
ensures every commit respects the
system's blueprint, protecting the
long-term health of your most complex
applications. Head to
sonarsource.com/pragmatic
to find out more. So Martin, welcome to
the podcast.
>> Hi Gergely, it's great to be here.
>> It's amazing to have you here. I don't
think you need an introduction to many
software engineers, including myself.
You're the author of this iconic book
that I've had on my bookshelf for
probably about 10 years, since not long
after it came out. Before we get
into this book, which we're going to
talk about, how did you get into the
technology field?
>> Yes. Well, I did an undergraduate degree
in computer science, like many others.
And then after that, I wasn't quite sure
what to do with my life, but I thought,
well, starting a startup seems
like an interesting thing to try. So, I
started a startup having no clue what I
was going to actually do and then spent
the first while searching around for
things that might be interesting. The
first startup didn't work out that well,
but through that I met some others who
then became my co-founders for the
second startup which worked better and
we sold that one to LinkedIn. And then
after that I started being interested in
teaching these distributed systems
concepts, so that's when I got into
writing the book. And then during the
writing of the book I also switched over
from industry back to academia.
>> Can we talk a little bit about your
first and second startup?
>> Yeah. Go Test It, this was
like 2008 or something like that. It was
the age where people were having real
difficulties getting their JavaScript
working cross browser. Internet Explorer
was still pretty big at the time. Chrome
had just come out. All the browsers
were incompatible with each other, and so
Go Test It was a cross-browser
automated testing service for websites.
It was based on Selenium, an open source
project that still exists. And the idea
is you would write like test scripts
that automate a user clicking
through the various interactions with
a website and then just check that the
right behavior happens. And so yeah, it
was based on Selenium, but
provided as a hosted service so people
wouldn't have to run various VMs with
various operating systems themselves. It
worked technically but um I found it
really hard to actually get adoption for
it. A lot of people building websites
in theory said, oh yeah, this is
great, we need to test cross-browser,
and in practice actually it was really
difficult to get them to integrate it
into their workflow and just get in the
habit of using it and investing in
writing the test scripts. So, so that
ended up not really going anywhere.
>> So there wasn't a business
to be done, or revenue to be
generated in a meaningful sense?
>> Yeah. Well, there's at least one other
maybe two other companies from that same
era that did manage to make a business.
Sauce Labs is one that managed to
actually succeed. But even for
them it was a pretty slow-running
business. I think it was not an easy
business to be in.
>> And for the startup,
were you in the UK building it?
>> I was in the UK at the time.
>> Was it bootstrapped? Did you
raise some some kind of funding? How big
was the team? How can we imagine this?
>> It was mostly bootstrapped. So I did a
bunch of consulting in order to fund
hiring some people and then hired some
like friends uh on the cheap to help
contribute to actually building the
product. And so it was all done very
cheaply. I had a very small amount of
angel money in there, but mostly
bootstrapped.
>> Mhm. And then when you decided to not
go forward with this, how did the
next startup come? Rapportive, right?
>> Yeah, the second one was Rapportive. That
went a lot better. So that was
putting social media inside Gmail
basically. So, the idea was that if you
get an email from someone you don't
know, we had a little browser extension
which manipulated the Gmail web
interface so that on the side next to
the email, we'd show you a summary
social profile with like a profile
picture and like a job title pulled from
LinkedIn and recent tweets pulled from
Twitter and maybe recent Facebook posts
or things like that, just whatever we
could find about that person, and put
that as a social summary next to
the email. We started in 2010 or
something like that. It then pretty
quickly became quite popular. And so
on the back of that we were then able to
raise some money from Y Combinator,
which was still fairly young at the
time.
>> That was very young. You must have
been one of the very early batches.
>> Yeah, I can't remember exactly when they
started, but it was certainly
in the early years. I think Y Combinator
had already built up quite a good
reputation at the time, but it was still
fairly small.
>> And then as part of Y Combinator, did
you have to fly from the UK to
San Francisco to attend that 10e program
if I remember?
>> Exactly. Yes. So we initially came
for the 3 months or whatever it was
of Y Combinator, but then we were
able to get US work visas for ourselves
and set up permanently in San
Francisco.
>> How was that shift from the UK,
where you spent university,
your first startup, the first part of
this, to coming to San Francisco?
>> It was very
exciting because, you know, it felt
like going to the
center of where it was all happening,
really. And we started out not
knowing anybody at all. We knew like one
or two people in the entire Bay Area,
but we like contacted them and they
introduced us to more people and they
introduced us to more people. And so we
were able to pretty quickly actually
build up a network, and that's
something that I really appreciated,
that it was actually so open to
outsiders like us who could just
basically turn up with an idea and an
early stage startup and we managed to
raise some money and managed to like
actually become somewhat established in
the in the Bay Area. And can you tell me
how the how the company grew and and at
what point did the LinkedIn acquisition
offer come and and how can we imagine
even you were a founder of this company.
It was about in 2012 that we sold it. Um
and we were five people at the time. So
it's all still pretty small. Um not vast
amounts of money involved but it it was
a success I would say uh for everybody
involved. The acquisition process it
itself was fine. is like as always with
these kinds of transactions, there was
like twists and turns and moments where
we thought it would all fall apart and
then we were almost running out of money
and uh hadn't really succeeded in
raising another round. So, we kind of
had to sell or shut down. So, we were
under quite a bit of pressure. We
couldn't reduce our own salaries because
to do so would have violated the
conditions of our visas. Yes. Um so, we
were in a slightly stuck situation given
our lack of leverage in that situation.
And actually I'm pretty happy how it all
turned out.
>> Yeah, it's nice that, you know, after
10-plus years we can talk about this
honestly, because oftentimes you see an
acquisition by LinkedIn and of course
you might ask the founders and they
would say this was either our
dream or our goal, or we will do so many
things together. But something that you
don't often hear is that there was
pressure involved as well. So, did you
go into this wanting to sell the company
because you saw that
either you needed to raise a
new round or you sell to someone, and
then you found LinkedIn to be the
best, or the only,
option to go into?
>> We tried a little
bit to see like what revenue generating
options we had and hadn't really managed
to make that work. So, we were just
burning money, and our user growth
was okay but not really enough to go and
raise a big round. Um, so we were like a
little bit stuck there and selling the
company seemed like the least bad option
there in a way. And I'm pretty happy how
it turned out because you know LinkedIn
was great actually. They were very
good to us. They allowed us to operate
as essentially an independent team
within the company.
>> So your team stayed together?
>> Our team stayed together. We continued
working on the product that we wanted to
make.
>> Oh, you got to keep working on
Rapportive.
>> Yes. Well, actually, so Rapportive, the
Gmail browser extension, sort of got
put on life support, but we were working
on a new product at the time, which did
eventually get released under the name
LinkedIn Intro. It kind of got a
slightly weird reception at the time and
it ended up getting shut down shortly
after we released it. There's kind of a
longer background story there, but
I'm still really happy with LinkedIn
like how they gave us the freedom to do
this and allowed us to launch this
product and even though it didn't
succeed, you know, they were very good
to us throughout that process and then
after that got shut down then our team
got disbanded. Um but we had a good run
within LinkedIn um building this
product.
>> What tech stack did you work with at
the time? What did you use?
>> Rapportive was fairly unexciting. It was a
Rails app with a Postgres database,
basically, and some Redis and some
similar things like that mixed in. So
actually you know nothing particularly
revolutionary. We essentially built a
graph database on top of Postgres. So
there was a little bit of technical
interest in there, but, you know, nothing
particularly outrageous.
>> And then
after LinkedIn Intro you
still worked inside LinkedIn. As I
understand, you worked on data
infrastructure, right?
>> Yes, data infrastructure. After our
team got disbanded, I switched over to
the stream processing team. So Kafka
had just been developed at LinkedIn and
had just...
>> Right. Oh, it was just being open
sourced.
>> Yeah, I think it had just been open
sourced, and then I got to work on
Samza, which was a stream processing
framework on top of Kafka.
>> I always
wanted to ask this question, so this
comes here. Why did LinkedIn build Kafka
or develop Kafka? It's now
such a foundational technology, and
I was always curious why
a company felt the necessity to build
this thing that seems pretty generic and
it seems everyone would have needed it.
>> Yes. So I think Jay Kreps has a pretty
good blog post from that era
called The Log, where he explains his
motivation behind Kafka and, you know,
why make it an append-only log rather than
a traditional message queue or
something like that. I think the
motivation was really about data
integration, because there were a whole
bunch of databases and event-generating
systems, you know, like
activity events from users, for example.
They were all generating data in a
sort of stream shape, and then a bunch of
downstream systems that wanted to
consume this like wanted to get it into
the data warehouse and wanted to be able
to get it into the Hadoop cluster at the
time in order to run like machine
learning and things over it and there
was just this data integration problem
of actually like how do you physically
get the data out of one system and into
another, and Jay designed Kafka as this
integration point, essentially
almost the kind of lowest common
denominator but still a general-purpose
abstraction for integrating various
data sources and downstream data
sinks.
>> Working at LinkedIn, at, you know,
Kafka and LinkedIn scale, what did
you learn or what surprised you about
working at this type of scale? As I
understand, this was the first
time that you worked hands-on on a really
large system, right?
>> That's right. Yes. Because
previously the biggest company I had
worked in was Rapportive, with five people.
We had a sizable database but it was
still like a single instance database
and not really that big in the grand
scheme of things. And then suddenly
I was at LinkedIn and, oh, we
get to use their big Hadoop cluster.
That was fun, like hand-coding
MapReduce jobs in Java at the time. And so
I learned a huge amount there,
especially when the stream processing
ideas came up and Jay was
evangelizing the use of Kafka and the
things you could do with it. That was
kind of a revelation for me really where
I suddenly felt, ah, this kind
of makes sense, like I start to
understand how these various data
systems fit together what they have in
common what the fundamental principles
are and so that experience then fed
directly into the writing of the book.
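
As an illustration of the append-only log idea discussed above, here is a minimal Python sketch. It is not Kafka's API or implementation: the names `AppendOnlyLog`, `Consumer`, `append`, `read_from` and `poll` are hypothetical, and the log lives in memory only. It just shows the core shape of the abstraction: producers append records to the end of the log, and each downstream system keeps its own offset so it can read at its own pace.

```python
from dataclasses import dataclass, field


@dataclass
class AppendOnlyLog:
    """Toy in-memory log: records are only ever appended, never modified."""
    records: list = field(default_factory=list)

    def append(self, record) -> int:
        """Add a record at the end and return its offset (position in the log)."""
        self.records.append(record)
        return len(self.records) - 1

    def read_from(self, offset: int, max_records: int = 100) -> list:
        """Return up to max_records records starting at the given offset."""
        return self.records[offset:offset + max_records]


@dataclass
class Consumer:
    """Each downstream system tracks its own position in the log."""
    name: str
    offset: int = 0

    def poll(self, log: AppendOnlyLog) -> list:
        """Read everything after our current offset and advance past it."""
        batch = log.read_from(self.offset)
        self.offset += len(batch)
        return batch


log = AppendOnlyLog()
log.append({"user": "alice", "event": "profile_view"})
log.append({"user": "bob", "event": "search"})

# Two hypothetical downstream systems consuming the same stream.
warehouse = Consumer("data_warehouse")
hadoop = Consumer("hadoop_loader")

first = warehouse.poll(log)       # the warehouse reads both existing events
log.append({"user": "carol", "event": "connect"})
second = warehouse.poll(log)      # the warehouse sees only the new event
all_three = hadoop.poll(log)      # the other consumer starts at offset 0
```

Because reading does not remove anything from the log, adding another downstream system in this sketch is just another `Consumer` with its own offset.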
>> At what point did you decide to leave
LinkedIn? To me, looking
through your career: start out in
the UK, do a startup, do a second
startup, Y Combinator, move to San
Francisco, get acquired by LinkedIn. The
arc that most people would draw would
be, okay, do something more in Silicon
Valley or maybe start a second startup,
etc. And instead you decided to
leave LinkedIn.
>> Yeah. So, first I
decided to move back to the UK actually
and I continued working for LinkedIn
remotely. Okay. That was mostly
because my girlfriend at the time, now
wife, was still in the UK and
long-distance relationship is not a lot
of fun and I didn't feel that at home in
the Bay Area. So, I wasn't really
encouraging her to move to the Bay Area
either. I thought it was better for me
to go back to Europe and I'm very happy
with that decision. Like, I still have a
lot of great friends in the Bay Area. I
love it as a place to visit, but I
wouldn't want to live here honestly.
Then I was still remotely working for
LinkedIn and that worked all right uh
for a while. When I then started writing
the book, LinkedIn even gave me 50% of
my time free to work on my book
alongside my software engineering
duties, which is really great.
>> Amazing. Yeah, that is so nice of them.
>> Absolutely. And they don't have to
do that. And LinkedIn didn't directly
get anything out of it in return, other
than a book that they could use for
internal training purposes.
>> Well, shout
out to LinkedIn for this.
>> Yeah, absolutely. Though then I did find
that actually trying to write a
book in parallel with doing a software
engineering job and being on call etc. I
just wasn't able to do it. It's just too
much context switching and it's very
easy for the urgent things from the on
call to dominate, and then not to
have, you know, the freedom
you need in order to write something
new. And so then after a while I
decided, okay, it's probably
better if I focus full-time on the book.
So I then left LinkedIn and just took a
sabbatical, an unpaid sabbatical, i.e.
unemployment, to just focus full-time
on the book for a while and then it's
only after that that I actually even
considered getting into academia.
>> So how
did the idea of the book come? What was
the point where you decided you would
write, and in your mind what were you
deciding to write? Was it
already, you know, this book with
this layout, or did you have an
early idea back then?
>> I had an idea. Of course the
final product ended up looking somewhat
different, but the overall goal I
think stayed the same. So I knew I
wanted to write something that was a
broad conceptual overview. So not about
how you use any one specific system or
tool but comparing the trade-offs
between many different types of tools.
And I knew that I wanted to be
practitioner focused like not a
theoretical textbook but something that
people could use to build real systems.
That was basically the goal
with which I approached it.
And this was exactly the book that I
wish I had had when I was starting out
and working at Rapportive, for example,
because we were all searching
around in the dark. We were having
performance problems with our database
and we had no idea what to do basically
because we were totally lacking the
foundations to actually understand what
was going on and how to diagnose the
issues. And so I felt that well if I had
had a bit more background on how these
data systems actually work internally
then I could have had an intuition about
how to debug these kinds of performance
issues. And then after a while after I'd
learned more about how data systems work
I thought, well, okay, it's time to
write this down so that others don't
have to learn it the hard way um but can
hopefully just get a better idea of how
these systems work and thus be better at
managing their own data systems.
>> To start with, how did you learn about,
for example, how databases work? Because
again, from your story, at Rapportive
you built systems, you had some
performance issues at a smaller scale, to
be fair, compared to LinkedIn. Then you
worked at LinkedIn and you saw a little
bit of how the sausage was made. But I
know a lot of software engineers who
have been on this path and they still
don't really know how the fundamental
systems work. They just know, okay, we have
a platform team inside our company and
they build it. I could read the RFCs, but
it's a lot of work, or the planning docs,
I could look at the source code. It
feels to me that even at that point you
just went down and tried to dig in.
What resources did you use? How did
you find out those basics which
you later put into the book?
>> A lot of
it was just kind of being curious and
talking to people actually and just
asking them lots of questions. And at
LinkedIn there were like a bunch of
senior data systems engineers who
understood their stuff very well but
hadn't maybe necessarily written it
down.
>> Mhm.
>> And so I just talked to a bunch of
them and quizzed them, and that way
started building an image in my own
mind of how this stuff works. And then
once I sort of got the basics from these
conversations, then I was able to go and
read research papers for example. They
go into much more detail of exactly how
and why things are designed in such a
way. But, you know, it is time-consuming
to read those things. So then what
I tried to do was pull out what
are really the essential ideas. I
just read a ton of blog posts as well.
Um and so the reason why you see so many
references at the end of each chapter in
the book is well that is actually the
material that I myself used in order to
uh understand what was going on. And
then I thought well okay well if I found
these things useful then I'll also cite
them in the book, so that for any
reader who wants to go beyond the basics
covered in the book, here are some
good sources for further reading.
>> Yeah,
the structure of the book, this
first book at least, it's foundational
data systems, distributed data, and
derived data. If I understood, these are
three big parts. Did you already have a
structure in mind when you started
writing the book or did it take shape as
you went?
>> This three-part structure is not
that critical in the
design of the book, really. That's
sort of more after the fact. I thought,
oh, well, it seems like we can group the
chapters into roughly this sort of
structure. But the topics of the
chapters were more or less what I had
envisaged. So I knew that I wanted
to talk about what a transaction
actually is. I knew that I wanted to talk
about replication. Knew that I wanted to
talk about sharding or partitioning.
Knew that I wanted to talk about
consistency and consensus. Those
sort of high-level topics, I think, were
clear from like my initial book proposal
to the publisher. The details within
each chapter, that is something that I
often figured out once I got to that
chapter. So, I wrote one chapter at a
time and started work on each chapter with
just a lot of background research to
actually get up to speed on the topic
myself. And it's often only then that,
say, for replication I decided, okay,
well it seems like the three major ways
of doing this are single leader,
multi-leader or leaderless. Okay.
>> Mhm.
>> I would decide on that structure
essentially when I started writing each
chapter and then try to fit the various
points I wanted to make into this
narrative structure.
>> As a fellow
author who also wrote a book, one thing
I've noticed is that there are parallels
between estimating a book and estimating
a software project, in that you come in
with an estimate and if you've never done
it before you tend to be wildly off. How
was this in your journey? And in
addition, you also had a publisher, and
publishers are a little bit like
project managers. You know, they
like to have a schedule. They
like to try to keep you on track. They
like to ask, when is it done?
How did you manage that part as well?
And and in the end, how long did you
estimate it would take when you started
and how long did it actually take?
>> As always, it takes vastly longer than
expected. It's the same for software
projects as it is for writing, I think.
So I think it took me about four years
to write the first edition and that was
not four years of full-time maybe like
two and a half years of full-time
equivalent or something like that but uh
written over the course of about four
years. So it definitely took a long
time. The uh publisher deadline I missed
by a ludicrous margin. I think I missed
it by about 2 and a half years or
something like that. Uh but fortunately
O'Reilly were pretty laid-back with the
first edition
and were happy for me to just take my
time and make it good. Uh when it came
to the second edition then actually
O'Reilly got a bit more aggressive and
pushy about uh sticking to deadlines. I
guess by that point the book had been
established and people were waiting
eagerly for the second edition. So, I
kind of understand the desire to
want to accelerate it, but at the same
time, I really appreciated the
freedom that I had for the first edition
to work on my own schedule. Um, and I
had a bit less of that with the second.
>> The tagline for the first edition, which
I believe is the same as the second edition:
the big ideas behind reliable, scalable,
and maintainable systems. Reliable,
scalable, and maintainable. What do
these objectives mean to you?
>> Yeah. So they're all slightly vaguely
defined, right? So there's not a
formal definition of those things. But
for me, reliability means fault
tolerance primarily. So meaning that a
system should on the whole continue
working even if like a network link is
interrupted or a node crashes or
something like that. So a lot of the
book is about techniques that support
fault tolerance, like replication, for
example. Um so that's reliability. Uh
scalability is one of those terms that
gets thrown around a lot,
and it's
fashionable and cool to make things
scalable, you know, because it
suggests success and millions of users.
So of course everyone wants
things to be scalable, because everyone
wants success. For this book, I tried
to take a bit more dispassionate
approach and said scalability is just
what mechanisms we have for dealing
with changes in load. If load increases,
how can we add computing capacity to a
system, for example, so that the system
still continues working? And then the
techniques that you use to achieve
scalability, well, they are things like
sharding, for example.
>> But in this case, with
your definition of scalability, do I
understand that you're mostly referring
to horizontal scalability so they cannot
compute
up or down pretty much.
>> Yeah, I guess because that's the
more interesting one. Like yes, you can
always buy a bigger machine and...
>> What's interesting about that?
>> Exactly, there's just not
that much to be said about it. I mean,
there are details of how you scale even
on a single machine, but I think
part of what has become interesting about
modern cloud services and just
backend services in general is how
they've introduced this idea of
horizontal scalability and
shared-nothing systems. So we can build
systems that, you know, are able to cope
with very high load even if the
individual components are just fairly
cheap commodity machines. But maybe sort
of part of the scalability story which I
wasn't thinking about as much at the
time but started thinking about more
recently is not just scaling up but
scaling down as well.
>> So actually, how do you run a service
in such a way that if it has a very
small amount of load it's really cheap
to run it? That's sort of, in a way, the
same question as how do you continue
running a service if it has very high
load. Generally you just want
the cost and the computing capacity
to be roughly proportional to the load
that you have. And at the low end that
means actually being able to scale down
to something that is extremely cheap to
run. And that's not necessarily
a given. That's something that is hard
with on-premises software, for example,
because if you've got a machine, a
physical machine, that's a unit of
deployment and yes you could carve it up
into two dozen virtual machines and make
those small virtual machines but um it
still requires like some sort of
resource allocation. So part of
what's interesting about some serverless
systems for example is actually their
ability to scale down and say like okay
if you're going to handle just three
requests per day, that's just fine as
well.
>> Can you tell me about the second
edition? When did the idea come about?
>> Yeah, it had been clear for a couple
of years that the second edition was
needed just because the first edition
was getting a bit dated. There were
changes in technology that just hadn't
been reflected in the first
edition. So, I wanted to update it,
but, you know, I now have an academic
job. Doing research
and teaching is my main thing, and
updating the book is just a sort of
sideline business in some
sense. So it actually took quite a while
to make progress with that because I was
always doing it alongside other projects
and essentially back to that context-switching
problem that I had while
writing the first edition, but just now
with an academic job that I didn't
want to just drop, because I actually
quite enjoy it. Initially, then, I made
very slow progress with the second
edition and also I kind of realized that
I had slightly lost touch with current
industry practices because you know I'd
switched over to the academic side.
I'd gone much deeper on the theory,
but I was no longer up to speed on
what people were doing with, say, data
lakes or things like that. So then at
some point I remembered Chris
Riccomini, an old colleague from LinkedIn.
I had worked with him on the stream
processing stuff.
>> You worked with him? He's the author
of The Missing README.
>> Exactly.
>> Wow. What a small world.
>> Yeah. And I had read Chris's book,
The Missing README, and thought, "Oh,
he's a great writer." And I had worked
with him as a software engineer and
found him a great colleague. And also
he had been writing this newsletter
called Materialized View on
the latest trends in data systems,
essentially, and become a startup
investor in that space. And so at
some point I thought, well, actually I
have to get in touch with Chris and ask
him whether he wants to help out with
the second edition. And he was keen to
do that. And that turned into such a
good collaboration because he was up to
date on like what the cutting edge was
in terms of uh technology in industry.
Um, I had strong opinions on how to
teach essentially. So how to explain
things in the book, make sure that we
were explaining everything in a way
that was very precise, very
carefully chosen words, but at the same
time very accessible so that it's
hopefully easy to read. And so we took
essentially like my writing style plus
Chris's knowledge of latest industry
trends to bring the book up to date and
that was a great collaboration.
>> What
are the big things that you added,
and which ones of these did you know
would be missing, and which ones did you
realize during the writing process that,
okay, this needs to be in here now?
>> Yeah, so the thing we knew from the start
that we wanted to reflect was
cloud-native systems architecture. It's
a bit of a vague term, but what I
mean by that is essentially building
data systems on top of cloud services
as the foundational abstraction. In the
first edition the assumption was
basically that you have some machines.
Each machine has some local disks. You
can run a database instance on a
machine. It will write its data to the
local disk. If you want to replicate it
to another machine, then well the
database software will replicate it at
the database level to another machine
which will also write the data to its
local disks. For a long time that was
exactly the way computers worked. And
now suddenly people are building
databases on top of object stores for
example. And now the replication happens
at the object store level, no longer
at the database level. Or maybe there's
still some replication at the database
level, but it really changes the
nature of things if you're building
on top of an object store and this is
different from say building on top of a
virtual block device like EBS or so
because these block devices although
they are cloud services but they still
offer the abstraction that is a sort of
single node operating system abstraction
of a block device on top of which you
run a file system whereas an object
store is just like a brand new
abstraction it just looks different from
a file system, it behaves differently.
And so then building on top of that as a
foundational abstraction is something
that like people were starting to do at
the time of the first edition, but since
the first edition that has really taken
off. A whole lot of systems have
been built in that style now. And so
that's an idea that we really wanted to
incorporate, and we wove that in
throughout the book. So it's not just
like one section here. Um but it's it's
sort of a an idea that we've integrated
throughout the entire narrative.
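To make the contrast between the two abstractions concrete, here is a toy sketch in Python. The class names `BlockDevice` and `ObjectStore` are hypothetical and purely in-memory; they do not model any real cloud API. They only illustrate the difference in interface described above: fixed-size blocks overwritten in place versus whole objects written and read by key.

```python
class BlockDevice:
    """Toy block device: fixed-size blocks, overwritten in place by index."""

    def __init__(self, num_blocks: int, block_size: int = 4096) -> None:
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]

    def write_block(self, index: int, data: bytes) -> None:
        if len(data) != self.block_size:
            raise ValueError("a block device only accepts whole blocks")
        self.blocks[index] = data

    def read_block(self, index: int) -> bytes:
        return self.blocks[index]


class ObjectStore:
    """Toy object store: whole objects addressed by key, replaced wholesale."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        # No partial in-place update: a put replaces the entire object.
        self.objects[key] = bytes(data)

    def get(self, key: str) -> bytes:
        return self.objects[key]


disk = BlockDevice(num_blocks=4, block_size=8)
disk.write_block(2, b"ABCDEFGH")  # update a single block in place

store = ObjectStore()
store.put("segments/000001", b"first segment")
store.put("segments/000002", b"second segment")  # new data goes to a new object
```

In this simplified picture, a file system can sit on the first interface, while a database built on the second one would typically write new objects rather than modify existing data in place, and leave replication to the store.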
>> There's now a lot of managed services as
well. There are the primitives that we use,
but there's also so many managed
services that all the cloud providers
use and a lot of engineers, they often
just use the managed services as is
because they they take care of
replication. They have SLAs for uptime
and so on. But when you build on top of
these things and you kind of use
those as primitives as well, is
there any risk as a software engineer
that you're no longer incentivized to
understand the underlying layer or are
we building better systems because of
that? How do you think about this? It
feels there's a move of abstraction
because of cloud, right? Yeah, it's
definitely a a shift to different and
higher level abstractions,
but you know that's been the story of
the entire computing industry since the
start. It's like building new
abstractions. So it is true that like if
you rely on a higher level abstraction,
you're no longer thinking about the
lower level details. And so if you're
using a programming language with a
garbage collector, you're no longer
thinking about memory allocation. And so
is that a loss? Well, maybe. Like if
you're building low-level systems,
you still have to care about
memory allocation. If you're building
higher level business logic, actually, I
think it's just fine for people not to
care about memory management. So I think
there's an analogous thing here with
data systems that if you're building the
higher level systems that don't need to
particularly care about the underlying
infrastructure, then that's fine. Just
use the higher level abstractions.
Nothing wrong with that. But somebody
still has to build those lower level
abstractions from lower level
components. Somebody's got to implement
the cloud services. Martin talked about
trade-offs that come with using cloud
services. And this is a good time to
talk about our season sponsor WorkOS.
If you've read Designing Data-Intensive
Applications, you know that building
systems at scale is all about trade-offs.
But one thing isn't a trade-off. That's
enterprise features. The moment you land
bigger customers, you need SSO,
directory sync, RBAC, audit logs, all
the things they expect out of the box.
Building that yourself can take months.
WorkOS gives you APIs to ship it in days
so you can stay focused on your core
product. That's why companies like
OpenAI and Anthropic run on WorkOS. Visit
workos.com to learn more. I'd also like to
mention our presenting sponsor Statsig.
Statsig built a unified platform that
enables both experimentation and
continuous shipping. Built-in
experimentation means that every roll
out automatically becomes a learning
opportunity with proper statistical
analysis showing you exactly how
features impact your metrics. Feature
flags let you ship continuously with
confidence. And because it's all in one
platform with the same product data,
teams across your organization can
collaborate and make data-driven
decisions. To learn more, head to
statsig.com/pragmatic.
With this, let's get back to Martin and
the trade-offs that come with using
cloud services.
And so those people will have to then
specialize even more in actually the
details of how you engineer those cloud
services, how you make them reliable,
how you operate them and so on. The
skills are still there. It's just a bit
of specialization happening, in that
some people can worry about the higher
level things without having to concern
themselves with the lower level things.
Some people focus on the lower level
things and treat that higher level
aspect as their customers.
>> Interesting. So it it sounds to me that
if you're an engineer who is utilizing a
lot of these services, you might not
need to know how they exactly work.
>> Yes. And I would say like the underlying
philosophy of the entire book is to give
people insights into just the sort of
essence of how the systems work
internally. So that if for example they
start having weird performance behavior,
you can have a bit of intuition for why
it's doing that and how you might solve
it. So for example, the storage
engine chapter tells you about how B-trees
work and how log-structured LSM-tree
storage engines work. And the book is
not intended for people who are going to
actually build their own databases and
implement their own storage engines. If
you want to do that, you have to go into
much greater depth than this
book covers. But the idea is that as an
app developer, if you know just a little
bit about how the storage engine works
internally, you'll be in a much better
place to use it in a way that
gives you good performance, for example,
and to diagnose any issues. That
philosophy we've kept also in the
context of cloud services where yes,
like cloud service hides some of the
operational details that app developers
don't need to think about anymore, but
they should still know a bit about how
they work internally just so that they
can use them effectively. I guess to
argue about the trade-offs, deciding on
which service to use, which
characteristics to look out for. Yeah.
For your use case, right? Exactly.
And you know, there are huge
differences of say if you're doing
analytics whether you're using row
oriented storage or column oriented
storage. That's a bit of a technical
distinction and it takes a little bit of
background reading to even understand
what that means, but it has a massive
performance implication in terms of the
final behavior of the system. And so
those are those places where I feel like
knowing a bit about the the internals is
actually like a superpower. Yeah. And I
guess engineers the one thing that we
always need to argue about or should
need to argue about is at the very least
cost versus performance. And by
performance I mean latency to the user
and of course resilience: if
something happens, you know, a machine
goes down, a zone goes down, a region
goes down, how
our product is affected and what's
acceptable. The basic idea there seems
to be: how much availability risk
are you willing to take on versus
the overheads, both in terms of
the system itself, like the
computational overheads, but also the
human overheads of actually designing and
operating the system, and the cost
overhead.
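A back-of-the-envelope way to look at this trade-off: if each zone or region is up a fraction `a` of the time and failures are independent, then at least one of `n` copies is up with probability `1 - (1 - a)^n`. The sketch below is illustrative only. The independence assumption and the 99.9% figure are assumptions made for the example, not numbers from the conversation, and real outages are often correlated.

```python
def combined_availability(single: float, copies: int) -> float:
    """Probability that at least one of `copies` independent replicas is up."""
    return 1.0 - (1.0 - single) ** copies


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of unavailability in a 365-day year."""
    return (1.0 - availability) * 365 * 24 * 60


# Assumed per-location availability of 99.9% (illustrative, not measured).
for copies in (1, 2, 3):
    a = combined_availability(0.999, copies)
    print(copies, f"{a:.9f}", f"{downtime_minutes_per_year(a):.4f} min/year")
```

Each extra copy shrinks the expected downtime sharply on paper, but it also brings the computational, human, and cost overheads mentioned above, which this arithmetic does not capture.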
>> Yeah, exactly. And so yes, you can have
a a system that is more able to tolerate
various types of faults, but which is
more expensive to design and
operate versus a simpler system that you
know might go down a bit more often but
which is cheaper. And there's no right
and wrong with that. You know,
everyone needs to figure out where they
sit in that trade-off space
themselves. And I would say that
multi-region is pushing in the
direction of like higher availability
because it means you could tolerate the
outage of an entire region. But then it
has implications on the consistency
model that you can get across different
regions for example. So that's a
trade-off that the book tries to make
very explicit to help people reason that
through of like what is the right choice
for them. In terms of multicloud, for
example, one thing that I've been uh
concerned about just in the last month
really is uh European dependence on US
cloud services.
>> Yes. So what if geopolitics was to go
horribly wrong and tensions escalate and
Europe finds itself suddenly locked out
of US cloud services? I hope that
doesn't happen. I still think it's
fairly unlikely, but it's no longer
unthinkable. And as a result I,
coming sort of from this European
perspective have been thinking a fair
bit about how can we engineer systems to
be resilient against that sort of thing
and that's you know not just like a
regional outage but it's like a a
business risk essentially and a
multicloud setup could help
mitigate against that sort of risk so
that at least for example if one company
locks you out then you could still have
systems on another company. Again,
that's very much towards the
expensive but high availability, risk
reduction end of the spectrum. But for
the people who have you know really
critical workloads where they think this
sort of geopolitical risk is a
significant enough risk I think it's
seriously worth considering that kind of
setup. I'm thinking that we do
have the responsibility, because
who else will do this? Yes,
totally. But I totally agree with you as
well that this um understanding what the
risks are and communicating what the
trade-offs are I think is is going to be
a core part of our role as engineers
moving forward as well. Maybe as AI
writes more and more of our code,
it's less about the details of how
you express logic in a particular
programming language and much more about
those kinds of high-level trade-offs. How
has the definition of scale changed in
this book? Because, as we talked about
with cloud: before cloud, building a
scalable system sounded pretty involved,
because building a horizontally scalable
system is complicated, with all the pieces
you need to put in. In the first book
you detail a lot of this. With cloud, a
lot of the services actually do
define how they allow horizontal scaling
and what the trade-offs are. Do you feel
that it's made it a lot easier to reason
about scalability when you are using
these primitives? So I think achieving
really high scale is still
challenging. We have
cloud services like object storage, for
example, which provide you this very
elastic storage model, so at least you don't
have to worry about capacity planning on
your disks anymore and running out of
disk space, because those kinds of
operational things are taken care
of. But if you need sharding, for example,
that's something that actually does
reflect on the application code as well.
You can't really make that entirely
transparent. And so if you're at a
sufficiently large scale that sharding is
required because a single machine is not
powerful enough to process your
workload, then I think even with cloud
systems you still have to do quite a bit
of engineering thinking about how to
realize that. Where I think the cloud has
helped quite a bit is actually at the
lower end, of scaling down. If you
want to have a very lightweight service
that processes only a small number of
requests, what we've got with serverless
systems, being able to very quickly spin
up and spin down a very lightweight
instance, that's quite a good
innovation that has enabled those
very low scale services. And that's
something that would be much harder to
do without cloud services because you
would have to statically allocate a
certain amount of memory and certain CPU
resources to a particular virtual
machine. I love serverless. I have a
small website that runs on serverless
and my bill is like 13 cents per month
because it has very little load.
>> Absolutely. It's just making more
efficient use of computational
resources. Let's talk about sharding.
In the first book, and when you wrote the
first book, when I was working at Uber,
we talked a lot about sharding and there
was a lot of internal implementations or
interviews involved asking about
sharding because we were designing
systems that were sharding. I did sense
that over time, again, as cloud systems
start to become available that give you
turnkey solutions that act more
like platforms, where you send the data and it
takes care of these things, fewer
engineers have to actually implement
sharding with cloud-native systems. In
your research, what have you seen? What
are the cases where putting
sharding in place is still important, and
where are the places where it might
have just disappeared as a concern?
I mean it's still nice to know but you
might not have to implement it. I think
it's probably less of an effect of cloud
and more of just hardware getting more
powerful: actually, a big
machine nowadays can do a lot,
and that means that
more and more workloads you can just run
on a single machine, and that is
sufficient actually to achieve quite
significant scale already. There are still
concerns of like how to actually
efficiently make use of hundreds of CPU
cores that you have on a single machine
so parallelism is still
a required thing to think about
there, and sharding is one way of
achieving parallelism. But at least this
sort of sharding across multiple
machines has maybe become less of a
pressing issue just because more and
more workloads can just run on a single
machine. Some people still have very
large scale workloads that do have to be
sharded across multiple machines, so
it's not going away entirely, and
replication is still relevant even at
smaller scales because that's for fault
tolerance, that's not for scalability.
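The distinction drawn here, sharding for scalability and replication for fault tolerance, can be sketched in a few lines of Python. The function names, node names, and the modulo-based placement are all assumptions made for illustration; real systems generally use more elaborate schemes.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard with a stable hash (not Python's salted hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def replicas_for(shard: int, nodes: list[str], replication_factor: int) -> list[str]:
    """Pick `replication_factor` distinct nodes for a shard, round-robin."""
    if replication_factor > len(nodes):
        raise ValueError("not enough nodes for the requested replication factor")
    return [nodes[(shard + i) % len(nodes)] for i in range(replication_factor)]


nodes = ["node-a", "node-b", "node-c"]  # hypothetical node names
shard = shard_for("user:42", num_shards=8)
print(shard, replicas_for(shard, nodes, replication_factor=2))
```

Sharding splits the keys so that each machine handles only part of the workload, while the replica list keeps a shard readable when one node fails. Note that with simple modulo placement, changing `num_shards` moves most keys to a different shard, which is one reason sharding tends to leak into application and operational concerns.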
You have a chapter called The Trouble
with Distributed Systems, which goes
through a lot of things that can go
wrong without going through the whole
chapter. Can you recall some of the
things that are memorable to you or some
of the things that you feel are are
important to remember? Yeah. The whole
idea of this chapter is that in
distributed system theory there are
certain things that we tend to assume.
Like for example, we just assume that
there's no upper bound on how long it
might take for a message to go over the
network. So you send a message, it might
arrive within 100 microseconds or it
might take 10 years, and distributed
systems theory just doesn't make any
assumptions about that sort of timing if
we can avoid it. Or rather, some
theory does make those assumptions, but
it's a dangerous assumption to make
because occasionally the network delay
does become much higher than than what
is typical. Another thing is about uh
crashes. For example, the distributed
system theory just says like nodes can
crash but what does that actually mean?
Like what in practice does it mean for a
node to become unavailable because it
might be a software crash, but it
might be a hardware failure. It might be
somebody unplugging the power cable. It
might be that the node is actually still
running but it's just become
disconnected from the network. The point
of this book chapter really is to defend
and justify those theoretical models
that we use for analyzing distributed
systems and just giving a lot of
stories and case studies that show that
you know actually tons of stuff does go
wrong, and like, don't believe anyone who
says, oh, failures are rare,
don't worry about it, it's fine. The
moral of this chapter is really
that actually, no, if you want to make
things reliable, you really do have to
worry about a whole bunch of weird
unusual but but certainly possible edge
cases. Timing is another one of those
things like you know it's very easy to
assume that your clocks are correct and
most of the times the clocks are pretty
correct but we just can't rely on it
because actually they're just not
precise enough uh on the whole and so a
lot of it is about it's very tempting to
make certain assumptions
um that things are well behaved and and
in distributed systems we just have to
try to get away from those assumptions
if we want the systems to work reliably
even in the face of things going wrong.
But it was a really fun chapter to
write, because, you know, it's
essentially a big collection of stuff
that has gone wrong. And so I went
through a bunch of postmortems published
by various tech companies, for example, in
order to see, okay, what was the root
cause of how things went wrong and what
kind of lessons can we draw from this
that apply to the book in general.
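One concrete consequence of the timing points made earlier in this answer: a timeout cannot tell a crashed node from a slow network. The sketch below is a hypothetical heartbeat monitor (the class name and its methods are invented for illustration). It measures elapsed time with Python's `time.monotonic`, which is not affected by wall-clock adjustments, and it accepts an injectable clock so the behavior can be shown deterministically.

```python
import time


class HeartbeatMonitor:
    """Suspects a node if no heartbeat arrived within `timeout_s` seconds."""

    def __init__(self, timeout_s: float, clock=time.monotonic) -> None:
        self.timeout_s = timeout_s
        self.clock = clock  # monotonic by default: immune to wall-clock jumps
        self.last_seen: dict[str, float] = {}

    def heartbeat(self, node: str) -> None:
        self.last_seen[node] = self.clock()

    def is_suspected(self, node: str) -> bool:
        last = self.last_seen.get(node)
        return last is None or self.clock() - last > self.timeout_s


# A fake clock makes the example deterministic.
now = [0.0]
monitor = HeartbeatMonitor(timeout_s=1.0, clock=lambda: now[0])
monitor.heartbeat("node-a")

now[0] = 0.5
early = monitor.is_suspected("node-a")  # still within the timeout

now[0] = 5.0
late = monitor.is_suspected("node-a")   # crashed, or just a delayed message?
```

When `late` comes back true, the monitor only knows that it has heard nothing for a while. The node may have crashed, lost power, or simply be cut off from the network while still running, which is exactly the ambiguity described above.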
And uh you know there's some fun stuff
like the the sharks biting undersea
cables and damaging them that just you
know makes for a great story. And then I
I hear that in recent years the
shielding of undersea cables has got
better and therefore the sharks are not
biting them anymore. But instead the
cows on land are stepping on cables and
occasionally causing network
interruptions that way. And you know
that sort of thing is just uh it makes
it a bit more fun. That chapter is so
interesting also because, depending
on what kind of teams you work on or
what kind of people you talk with: when I
talk with the S3 team, for them that
whole chapter is just their day-to-day.
It's not a
weird thing when, you know, a hard
drive fails, or, okay,
it might be a weird thing to have a fire
in a data center, but they're prepared
for all of those things. They're at the
scale where these things just happen on
a regular cadence because they're at one of
the largest scales, whereas at a
smaller company, even if you read this
chapter, you know, you will treat this
as like, well, this could happen, but when
it actually happens it will be
a once-in-10-years event and it will be a big
deal. Yeah. But I think there's
no like right answer. It's a it's a
trade-off between risk and cost broadly
speaking. And that means a business
decision has to be made in terms of
where the business wants to lie uh on
that trade-off. And so the goal of this
chapter is really just to give people
the information in order to make an
educated decision. But I don't want to
make that decision for people. That's
for businesses themselves to decide. Uh
that's very clear. Have you come across
some concepts or systems mentioned in
the book, in the first edition and now in
the second edition, that are becoming
either more popular or less popular over
time, more or less referenced by your
readers? Thinking about things like
streaming systems, batch processing, or
anything else? Yeah. So there are some
things that we've been able to take
out of the book compared to the first
edition. In particular, for example,
coverage of MapReduce was quite
detailed in the first edition, but
basically MapReduce is dead. Nobody
uses it anymore. Its successors, like in
the form of Spark and Flink, for example,
are used, and so we still reference
MapReduce in the second edition, but
more as a learning tool in order to
understand how these kinds of partitioned,
sharded batch processing systems work.
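In that learning-tool spirit, the shape of a MapReduce job fits in a short single-process Python sketch. The function names (`map_reduce`, `word_mapper`, `count_reducer`) and the byte-sum partitioner are made up for illustration and are not the API of Hadoop, Spark, or Flink. The point is only to show the three phases: map, shuffle by key into partitions, and reduce per partition.

```python
from collections import defaultdict
from typing import Callable, Iterable


def map_reduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterable[tuple[str, int]]],
    reducer: Callable[[str, list[int]], int],
    num_partitions: int = 2,
) -> dict[str, int]:
    # Map phase: each record yields (key, value) pairs.
    # Shuffle: pairs are routed to a partition by key, so all values for
    # one key end up in the same partition.
    partitions: list[dict[str, list[int]]] = [
        defaultdict(list) for _ in range(num_partitions)
    ]
    for record in records:
        for key, value in mapper(record):
            index = sum(key.encode("utf-8")) % num_partitions  # toy partitioner
            partitions[index][key].append(value)

    # Reduce phase: each partition could be processed on a different machine;
    # here they are simply processed one after another.
    result: dict[str, int] = {}
    for partition in partitions:
        for key, values in partition.items():
            result[key] = reducer(key, values)
    return result


def word_mapper(line: str):
    for word in line.lower().split():
        yield word, 1


def count_reducer(word: str, counts: list[int]) -> int:
    return sum(counts)


counts = map_reduce(["the quick fox", "the lazy dog"], word_mapper, count_reducer)
```

Because every occurrence of a key lands in the same partition, the reducers never need to coordinate with each other, which is what makes this style of batch processing straightforward to spread across machines.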
So that's one thing where we've been
able to reduce the coverage. Um, but
other areas where we've increased the
coverage are, for example, systems in
support of AI. And so, even though this
is not an AI book, but there are still
data systems concerns that arise when
needing to support AI applications, like
a classic one is vector indexes, for
example. And so, we've added some
coverage of vector indexes to the
storage engine chapter. It fits in really
well there because it already covers
various different indexing strategies
anyway. Uh and so vector indexes, you
know, it's just another indexing
strategy. We also added some coverage of
data frames, for example. That's not an
exclusively AI thing. Um but data frames
are quite a good data representation for
training data, for example. And that was
not one of the data models that we
discussed in the first edition, but we
decided to add to the second edition
because it has actually become a very
important data model that people are
using alongside all of the classic data
models like relational and graph and uh
JSON documents and so on. And so there
are these places where we've just
expanded the coverage a bit to to
reflect the kinds of systems people are
building for example to support AI
without it like changing the direction
of the book entirely. The final
subsection in the first edition, the
first few, I guess, subparts, were
titled Doing the Right Thing, and in the
second edition this has its own chapter.
The final chapter is Doing the Right
Thing, and I quote a little bit from
it. We the engineers building these
systems have a responsibility to
carefully consider those consequences
and consciously decide what kind of
world we want to live in. Can we talk a
little bit about this section and the
importance of it?
>> Absolutely. Yeah. So the motivation for
putting in an ethics section there in in
the first edition was that I just felt
it had been quite ignored as a concern
during my time in industry. um that like
especially in startups people were very
focused on like building a product that
their customers would love and really
like deprioritizing these sort of
ethical questions in the in the process.
And so, for example, with
consumer-facing products it might be that the
products are very much geared towards
essentially data harvesting collecting
behavioral data um because that's what
can be monetized in the form of
advertising and there seemed to be just
very little reflection on what was good
and bad about these sort of things. So I
really just wanted to encourage a bit of
thinking there. Um not really wanting to
prescribe too much like a a particular
approach there but at least to point out
you know there there is this thing such
as data protection legislation now which
we do have to think about in the
architecture of our data systems and
there is an ethical responsibility. You
know, people say that you get into
tech in order to change the world. If
you want to change the world, then
thinking about the impact that your
technologies have on the world is part
of your job. It's it's a really
essential part really and something that
engineers are often prone to ignoring as
we focus just on the technology and less
on the effects that that technology will
have out in the real world. And so this
chapter is really just an attempt to get
people thinking about it a bit. And it's
sort of a a reflection of my own process
as well because as I started working on
these systems, I didn't really think
about ethical things particularly
either. So I felt like um I had to put
that section in there for myself as well
as for the readers because it was my own
way of of grappling with these questions
a bit. Is it fair to say that as
engineers building these systems that
will have an impact on on a wide range
of things potentially societal wide
impact we are just in such a good
position to directly influence and maybe
even change course. So do I understand
that this section is a bit of reminder
that by building it we have a huge
opportunity to shape these we probably
have a lot stronger voices maybe as
strong voices as later on the regulator
might have years down the road. Right.
>> Exactly. I think engineers have a very
strong voice there and like we talked
about earlier um engineers need to
articulate trade-offs in such a way that
uh business leaders can then make
educated decisions about how to address
those trade-offs. And part of those
trade-offs is pointing out risks. And
risks include not just technical risks
like the data might get corrupted, but
they include societal risks as well. For
example, like um what negative uh
effects, what harms might arise from
this technology, what sort of unintended
consequences possibly or what like uh
risk for reputational damage if it turns
out that a technology has some harmful
effects. um you know that can reflect
badly on the company that made it and
that has to be part of the the trade-off
discussion and I just want people to
make intentional and deliberate
decisions about these kinds of things and
not just sweep it under the carpet. One
of the hot topics these days is of
course AI and you've written a very
interesting post about this just in
December about formal verification and
how your conviction that formal
verification might be more important
with AI. For those of
us who have heard of formal
verification, can we talk about what
this is and how you envision this
becoming more important? Yeah. So
there's a whole range of formal methods.
Um, one approach is to, for example, use a
specification language like FizzBee or
TLA+ or something like that to describe
the expected behavior of a system at a
at a high level and then use a model
checker which is essentially like a
randomized test case generator to just
play through a lot of scenarios and see
whether the the system has those desired
behaviors in in all the different
scenarios. That's like the sort of intro
level formal verification. I would say
the more advanced level is to use actual
formal proof and in that case you can
write a specification of some system in
a formal language, usually using
mathematical notation and then make a
mathematical proof that a certain
algorithm or certain implementation
always satisfies that specification. And
the distinction to testing there is that
well in testing you just try through a
couple of examples, give the algorithm
some example inputs and check whether
you get the expected output in those
particular examples. But a proof can
reason about potentially infinite state
spaces. So it can tell you things about
like every possible thing that could
possibly happen in the entire universe,
and show that, for example, a certain safety
property always holds.
Formal verification is a lot of work.
Um, I never used it in my time in
industry because it's just too
time-consuming, basically. Um, I only got
into formal verification when I was in
academia and I could afford to take the
time to spend a few months proving an
algorithm correct. But there I've
started finding this very useful
especially if I was working on very
subtle algorithms where it's very hard
to tell just from reading the
implementation whether this actually is
always correct under all possible cases.
But if it's an important algorithm where
for example uh it will corrupt data if
there's a mistake in it or it will have
a security vulnerability if there's a
mistake in it then when it's high stakes
uh things like that then I feel it's
worthwhile to have uh formal
verification and to really make sure
that the the code really is correct and
so I've done some formal proofs using
the Isabelle proof assistant, for example.
There are a couple of others as well,
like Rocq and Lean and so on.
These proofs are really hard to write.
It takes a long time to learn the
language of writing those proofs. And
then even once you know the language,
it's just really laborious in order to
actually write the individual proof
steps. And when you say it's hard to
write, just as someone who knows how to
code, you know, in so many
different languages: can you just
explain what it means to be hard to
write? Does it feel like a
strict programming language with all
sorts of rules or lots of math formulas?
What makes it hard for you
to learn it and get good at it?
Yeah. So, you're trying to make a proof
that a certain piece of code always
satisfies a certain property. In some
cases, that property might be quite easy
to to specify. Let's say as a really
simple example, you have two lists and
you want to concatenate them. And then
you want to prove that the length of the
concatenated list equals the sum of the
lengths of the two individual lists. You know, very
very simple property. How would you
prove something like this? Well, you
would have a function that concatenates
two lists and then you would probably do
a proof by induction over one of the
lists that shows that, okay, well, if
you have one list of length i and
another list of length zero, well then
the sum of the two is i. If you have a
list of length i appended with a list of
length one, well then it's i + one and
so on. And then by using a proof by
induction, you can then show that uh the
length of the concatenated list is i + j
where i and j are the lengths of the
two input lists, for every possible value
of i and j. And this is something that,
you know, in
tests you would maybe test for the
cases of j equals 0, j equals 1 and j
equals 5. And then you're done. And j
equals int max. Yes, the edge case.
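For readers who want to see the induction written out, here is one way the argument could go. It is a sketch under an assumed recursive definition of concatenation, with `++` for concatenation, `::` for prepending an element, and `|xs|` for length, and it inducts on the first list, which is the direction that matches that definition. The spoken version above varies the length of the second list instead; the idea is the same.

```latex
\begin{align*}
\text{Definition:}\quad
  & [\,] \mathbin{+\!\!+} ys = ys, \qquad
    (x :: xs) \mathbin{+\!\!+} ys = x :: (xs \mathbin{+\!\!+} ys) \\
\text{Claim:}\quad
  & |xs \mathbin{+\!\!+} ys| = |xs| + |ys|
    \quad \text{for all lists } xs,\ ys \\
\text{Base case:}\quad
  & |[\,] \mathbin{+\!\!+} ys| = |ys| = 0 + |ys| = |[\,]| + |ys| \\
\text{Inductive step:}\quad
  & |(x :: xs) \mathbin{+\!\!+} ys|
    = |x :: (xs \mathbin{+\!\!+} ys)|
    = 1 + |xs \mathbin{+\!\!+} ys| \\
  & = 1 + |xs| + |ys| \quad \text{(induction hypothesis)} \\
  & = |x :: xs| + |ys|
\end{align*}
```

Unlike a handful of test cases, these two steps together cover lists of every length.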
That's what we do. That's how I
write my unit tests. Exactly. And so this
is a trivial example like list
concatenation. You can easily just read
the code and convince yourself that it's
correct. But if it's a much more complex
algorithm, then our brains just
can't grok the algorithm well
enough to really convince ourselves that
it's correct if you don't prove it. And
that's where these these proofs then
become handy. If I'm an engineer and
I would be interested in getting
started with formal verification, for
example because I have the notion that
it will be more important with AI, and of
course it will be easier to write
these things, where would you point
engineers to get started, or how did
you get started in this field? I would
suggest starting with model checking. So
something like TLA+ or FizzBee is much
friendlier to get started with
compared to proof assistants like Isabelle,
Rocq, and Lean. These proof
assistants just require a whole lot of
additional knowledge, and the
resources for learning about writing
these formal proofs are to be honest not
particularly good. I haven't really
found really great books on it as well.
The way I learned it was by working with
some colleagues uh in my lab who who had
learned it through years of prior
experience and I just sat down with them
and paired with them at a desk where
like I described the thing I was trying
to prove and they showed me how to prove
it step by step how to break it down.
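To give a flavor of the model checking suggested above as a starting point, here is a toy explicit-state checker in plain Python. It is not TLA+ or FizzBee syntax, and the state encoding is an assumption made for this example. It models two processes that each read a shared counter and then write back the value plus one, and it explores every interleaving breadth-first, which is feasible only because the state space is tiny.

```python
from collections import deque

# State: (counter, (pc_0, tmp_0), (pc_1, tmp_1))
# pc: 0 = about to read, 1 = about to write, 2 = done.
INITIAL = (0, (0, 0), (0, 0))


def next_states(state):
    """Yield every state reachable by letting one process take one step."""
    counter, *procs = state
    for i, (pc, tmp) in enumerate(procs):
        if pc == 0:    # read the shared counter into a local variable
            new_counter, new_proc = counter, (1, counter)
        elif pc == 1:  # write back the local value plus one
            new_counter, new_proc = tmp + 1, (2, tmp)
        else:
            continue
        new_procs = list(procs)
        new_procs[i] = new_proc
        yield (new_counter, *new_procs)


def invariant(state) -> bool:
    """Once every process is done, the counter should equal their number."""
    counter, *procs = state
    all_done = all(pc == 2 for pc, _ in procs)
    return not all_done or counter == len(procs)


def check(initial):
    """Breadth-first search over all reachable states. Returns a
    counterexample trace, or None if the invariant always holds."""
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            trace = []
            while state is not None:
                trace.append(state)
                state = parent[state]
            return trace[::-1]
        for nxt in next_states(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None


trace = check(INITIAL)
for step in trace or []:
    print(step)
```

The checker returns a trace in which both processes read the counter before either writes, so one increment is lost. Finding that kind of interleaving automatically, from a short description of the system, is the appeal of starting with model checking before moving on to full proofs.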
I'm interested to see if what
you're thinking will be correct, which is that
this thing will go more mainstream, and
hopefully we'll have better books and
resources for it as well.
>> Yes, I do hope so. So the reason I
think that
formal verification could become more
important in the future has kind of
several aspects to it. One is that
LLMs are getting increasingly good at
writing these proofs and if we don't
have to write the proofs by hand as
humans, it just becomes feasible to do
them in situations where previously it
would have not been economical. But
also, LLMs increase the need for these
formal proofs because, you know, we're
vibe coding a bunch of stuff. If we have
to manually review all of that code,
then that will become the bottleneck.
So, we can't really have humans
reviewing all of the generated code
either if we really want to get the the
benefits of of AI. So, we need some
automated way of checking whether the
code is correct. And writing lots of
tests is a very good starting point. But
the thing that proof can do that tests
can't is to consider absolutely every
possible thing that could happen. And
that's really important in a security
context for example where it just takes
one little bug to create a
vulnerability that destroys the security
of the whole system. And so I feel for
those domains where like really we want
to ensure there's a complete absence of
bugs that's the kind of places where
formal verification can really shine.
And I'm hoping that LLMs will actually
make that a lot more accessible to to
people who would have previously not
considered using formal verification
because it was just too hard and too
expensive. You've worked in the industry
and then you went into academia. Can you
tell us what the difference is between
us? Myself and and most people watching
work in what you would call industry and
the tech industry or work at different
companies. We're bootstrapping our own
or we're just building our
things. How does academia contrast to to
this? What what do you and your
colleagues do inside of academia? Yeah,
within academia, there are lots of
different styles really. There's not not
one thing. Um, some people go full-on
theoretical, mathematical, don't care
about the real world at all, just want
to work on things that are
intellectually interesting. And that's
fine. And some people are very
much at the applied end of wanting to do
research that is likely to have a real
world impact. I'm more on the applied
end. And that's fine too. But a common
distinction there is that academia can
just think much longer term. So the you
know if you're doing a startup you have
to ship something within a few months.
You can't afford to think 10 years into
the future. Well, maybe you'll have sort
of a a sort of long-term vision that
you're gradually getting towards, but uh
you do have to really ship things on a
fairly short time scale. At a bigger
company, maybe if you're working on
infrastructure or so, you can think on a
bit of a longer time scale because the
the requirements of what is needed
are perhaps better understood. Um and in
that case, uh you know, they're making
sure that the system is like scalable,
operationally robust, and so on. it's
then fairly clear what the requirements
are and it's still a matter of
implementing it but in that case you can
think a bit longer term but in academia
what I really appreciate is the freedom
to work on things that are long-term and
which are not like immediately
commercially viable or which are not
aligned with the incentives of
commercial companies. Um so one
research area that I've been working on for
several years now is what we call local
first software which is this idea that
we want to take away a bit of the power
from cloud operators and give it back to
end users. So end users should be more
in control of their own data and less
dependent on cloud services for
providing the applications and the data
that that the users need. And that's
something that doesn't naturally come to
companies, right? Because uh software as
a service businesses, for example, the
whole reason why they can charge a
subscription is because they are able to
essentially hold a gun to the customer's
head and say, "Pay us your subscription,
otherwise we will delete all your data."
And I totally understand the the
commercial imperatives that lead to
that, but it also leads to this
situation where like the people have a
gun against their head all of the time.
That isn't really a healthy situation to
be in in my opinion. But changing that
in such a way to take away that gun from
customers' heads is difficult if you're
in a business whose revenue depends on
perpetuating that kind of lock-in
situation. And there I feel like in
academia I have the freedom to work on
things that go against this commercial
incentive of companies and say like
actually no I'm going to do what I think
is right for the users and that I'm
going to say the commercial model of the
companies making the software is second
priority and I can afford to do that
because I'm I'm not dependent on this
commercial model.
>> To add to this, it's very interesting
and challenging engineering problems.
Right.
>> Yes. And it's wonderful to get to work
on interesting engineering and computer
science problems while at the same time
like trying to pursue this uh this
higher level vision for local first
software. What are some of these
really interesting engineering
challenges that we we will need to solve
or or we need to solve to get to a more
viable local first software? May that be
like let's say note-taking. It's a very
popular one, right?
>> Yeah. So with our vision of local first
software, we're trying to get away from
this dependency on centralized cloud
services. There may still be cloud
services involved in syncing data
between your phone and your laptop say
um because often going via cloud service
is just the most convenient way of
establishing that kind of communication.
But we just don't want to have to trust
on a cloud service providing a
particular function. Then if you can get
away from assuming this one cloud
service, you could for example have
multiple cloud services on multiple
cloud providers side by side and you
just sync by whichever happens to
respond first or sync with all of them
and then if one of them disappears, no
problem because you've got the other
one. And so it gives us a huge amount of
freedom and flexibility if we get away
from this assumption of centralized
cloud services. But that introduces a
whole bunch of interesting research and
engineering challenges because uh so one
thing that we've been working on lately
say is access control. You know simple
problem you have a document you want to
be able to grant collaborators access
and you want to be able to revoke that
access. Again, totally obvious, should
be totally straightforward. In a
centralized cloud service model it is
totally straightforward because
>> you have the roles, you confirm
those sort of things and you check
for the right roles and that's it.
>> Yeah. But if you want to run your system
over multiple providers or even in a
peer-to-peer setting then well what
could happen is that uh a user gets
their edit permissions revoked and
concurrently that user makes an edit to
the document uh whose permissions have
just changed and now some devices may
see the edit to the document first and
the revocation second and so they would
accept the edit to the document and
another device may see it the other way
around. They may see the revocation
first and then the edit to the document
second and they'll drop the edit to the
document because they think it's not
authorized. And now those devices have
become inconsistent with each other
permanently inconsistent. So that means
if we actually want to ensure
consistency even for this fairly basic
setup we now have to somehow figure out
how to resolve this situation of an edit
that is concurrent with the revocation
of the user who made that edit, and
solve that problem in a
decentralized setting where we don't
have just a single server that can make
that decision. In a centralized setting,
you know, you just have one server. It
decides: did the edit to the document
come first or did the revocation come
first? And that one server makes
that decision but if you have multiple
servers they might make different
decisions so then you could have a
consensus protocol but then consensus is
messy because it requires like some
quorum votes and requires nodes to be
online um and so we've been trying to do
the whole thing without doing consensus.
But but while um so while preserving
high availability, while preserving the
ability for user to work offline,
preserving the ability to uh synchronize
peer-to-peer without any servers, for
example, that just makes the engineering
challenge a lot harder and it's solvable
and we are close to solving it uh for
Automerge, which is the CRDT library
that that I work on. Um, but it's uh
it's just much less straightforward than
it is in the in the centralized case.
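The divergence described above, where an edit is concurrent with the revocation of its author, can be illustrated with a small sketch. This is not how Automerge handles the problem. The `Replica` class, the operation tuples, and the naive "apply in arrival order" rule are all hypothetical, made up only to show why two devices end up disagreeing.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A hypothetical device that naively applies operations in arrival order."""
    editors: set = field(default_factory=lambda: {"alice", "bob"})
    text: list = field(default_factory=list)

    def apply(self, op):
        kind, user, payload = op
        if kind == "revoke":
            # payload is the user whose edit permission is removed
            self.editors.discard(payload)
        elif kind == "edit":
            # accept the edit only if the author is currently authorized
            if user in self.editors:
                self.text.append(payload)


# Two concurrent operations: Alice revokes Bob while Bob edits the document.
revoke_bob = ("revoke", "alice", "bob")
bob_edit = ("edit", "bob", "hello from bob")

device_a = Replica()
device_b = Replica()

# Device A happens to receive the edit first, then the revocation.
for op in (bob_edit, revoke_bob):
    device_a.apply(op)

# Device B receives the same two operations in the opposite order.
for op in (revoke_bob, bob_edit):
    device_b.apply(op)

print(device_a.text)  # ['hello from bob']
print(device_b.text)  # []
```

Both devices have received exactly the same two operations and agree on who the editors are, yet their documents differ. Nothing in this naive rule will ever reconcile them, which is the permanent inconsistency described above.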
But that's a nice example of where
interesting engineering challenges arise
from this desire to get away from
centralized services. And then we were
just talking about clocks earlier. But
an obvious thing that came to mind is
well if if all of them had the same
clock exactly to the microsecond, you
could just use a clock, you could use a
time stamp, but as you said in
distributed systems, we cannot always
trust the the clocks are always
synchronized. So I I assume like you
just have these a lot of the things that
you have been researching and writing
about are just coming back to
>> Absolutely. And in this particular
setting of like a user getting their
edit permissions revoked if a revoked
user still wants to say vandalize a
document they can just backdate their
edit give it an earlier time stamp. So
relying on clocks is absolutely useless
here because people can forge the time
stamps from those clocks and thereby
then potentially undermine the access
control mechanism. So in this kind of
system, we have to worry about
potentially maliciously uh generated uh
actions as well when the actions come
from end user devices. This is
fascinating because it feels to me that
you're solving a hard or maybe even
harder engineering challenge than some
startups would do because the startups
would go the easy route. They would take
on a constraint in this case a
centralized server which makes business
sense, makes revenue sense. But because
you are not doing this, you now need to
look for a solution for a harder
problem. And if you solve this harder
problem, you can give a building block
that can just move the industry forward.
Just give a an option for either a
business or an individual or an
institution to you know like have an
option not just to use centralized but
use this decentralized
local first approach and then of course
reason about the trade-off and decide
whichever makes sense.
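Going back to the point about backdated timestamps: the sketch below shows why ordering by a device-reported clock does not help. The `Op` record and the `resolve_by_timestamp` rule are assumptions for illustration only, not anything taken from Automerge or from the research discussed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    kind: str        # "edit" or "revoke"
    author: str
    target: str      # the revoked user, or the text of the edit
    timestamp: int   # self-reported by the sending device, so not trustworthy


def resolve_by_timestamp(ops):
    """Naive rule (hypothetical): replay operations sorted by claimed timestamp."""
    editors = {"alice", "bob"}
    text = []
    for op in sorted(ops, key=lambda o: o.timestamp):
        if op.kind == "revoke":
            editors.discard(op.target)
        elif op.kind == "edit" and op.author in editors:
            text.append(op.target)
    return text


revoke = Op("revoke", "alice", "bob", timestamp=100)

# An honest edit stamped after the revocation is rejected.
honest = Op("edit", "bob", "late edit", timestamp=105)
print(resolve_by_timestamp([revoke, honest]))  # []

# Bob simply claims an earlier time, and the vandalism is accepted.
forged = Op("edit", "bob", "vandalism", timestamp=99)
print(resolve_by_timestamp([revoke, forged]))  # ['vandalism']
```

Sorting by timestamp does make every device reach the same result, but the result is chosen by whoever is willing to lie about their clock.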
>> Exactly. And that's what I mean with
this long-term thinking. This is an
example of it where because it's
research we can afford to take this
idealistic principled stance and say yes
we're going to solve this harder
engineering problem because we think
decentralization is a valuable feature
and we know perfectly well that most
startups are not going to solve this
problem because they will just do the
easy pragmatic thing which is the right
thing for startups to do. Um, but we
have a different set of incentives and
we can afford to put in the time to try
and solve those hard problems. And as
you said, if we can solve them, then it
creates more optionality for anyone, any
users of this technology, they can if
they want to choose to use this
decentralized tech. And there's still
trade-offs around it, but at least if
they're not having to invent it from
scratch, it'll be a lot easier to adopt
this kind of uh decentralized tech for
for those who want to use it.
So in inside academia you're also
teaching. Uh what courses do you teach?
At the moment I have a concurrent and
distributed systems course for the
undergraduates and a cryptographic
protocol engineering course for the
master students. And then additionally
this year I have a uh a seminar course
on security and a uh and teaching also
the undergraduate operating systems
course. I've got quite a lot of teaching
this year. the distributed systems
course, it's available on on YouTube.
Can you summarize what people who would
go through this course which again is
freely available? Thank you for you and
the university for making it available.
What what what would they learn
throughout those courses? Yes. So that
distributed systems course, it's a bit
more theoretical than what is in the
book. So it's more focused on algorithms
and sort of the how we convince
ourselves that the algorithms behave
correctly under the assumptions of
distributed systems that we talked about
of like nodes may crash, communication
might be unreliable, uh clocks might be
wrong, etc. So that's really it. It's
it's not a very long course. It's just
uh eight lectures worth of of material.
But it's uh it goes into substantially
more detail on the algorithms than the
book. So for example, one of the
lectures goes through the entire Raft
consensus algorithm which is pretty
complex. Um but I really wanted to show
the students exactly how it works
because it's just such a nice
illustration of the challenges of
distributed systems and the various
measures we need to take in order to
handle the various types of edge cases
and failures um that can happen and
showing that those those problems can be
overcome. It's not easy and the
algorithms are very subtle and it's very
easy to have bugs in them but it is
possible to solve consensus in a in a
way that works pretty well and uh and so
that's really this the sort of message
I'm trying to uh get across with this
course and you mentioned that when
you're when you're writing the book
together with Chris, he brought a lot of
industry insight and being up to date
and you brought your experience of of
teaching and and what works I don't
think I have a particularly like unique
teaching style just uh in lectures I
will go through slides. I I like to
annotate the slides by hand uh during
the lectures. I just draw on an
iPad to make it a little bit more
interactive. But um other than that, it
it is fairly theoretical. That's partly
the way the Cambridge system works. It
kind of favors theoretical and pen and
paper courses over say implementation
practical courses. I think it it would
be possible certainly to do a practical
course on this and I may incorporate a
bit more practical exercise in the
future but right now it's mostly a
theoretical pen and paper course, and
that is fine. Uh the cryptography course
that I do is that's much more uh
hands-on. So that's about actually
getting the students to like implement
some elliptic curves from scratch for
example. And how have you seen it in
your time in in academia which has been
it's now a longer time period. How have
you seen computer science education
changing? How do you think it might
change further in in the future
especially as we're seeing AI be part
of industry and probably the world as
well? Yeah, I mean prior to the AI explosion
happening, actually the rate of change was
very slow in in computer science
teaching. Partly that might be
Cambridge, you know, Cambridge is over
800 years old like everyone thinks on
longer time scales. People don't tend to
rush into the latest fad and instead try
to focus on the fundamentals and the
ideas that a lot of the fundamentals of
computer science were developed in the
1930s already and are still true today.
and you know lambda calculus and those
types of things for example and so we
have quite a bit of a focus on those
sort of fundamentals rather than chasing
the latest uh fashionable thing. That
said, AI has totally changed the way we
can assess coursework, for example,
because of course now we we can try
banning AI, but it's impossible to
actually enforce such a ban. And also,
it's kind of counterproductive because
we do want students to engage with new
technologies and figure out how to use
them productively for themselves. But we
want to somehow do that in a way that
supports their own learning and doesn't
undermine it. So, how do we get the
students to use AI in in a responsible
way, in a way that's mature? And we
can't necessarily rely on the students
being mature enough to know for
themselves what is a helpful use of AI
and what is a form of use of AI that
undermines their own learning because
some of them are quite mature and able
to decide that for themselves, but many
are not and so we need to provide some
guardrails for them. Um and we do need
to make sure that when we have assessed
work for example it's fair and it's
perceived as fair by the students and if
the students feel that some of their uh
co-students are getting really good
marks without doing any work that
undermines the trust in the entire
system and so we have to be very careful
with how we approach this and to be
honest we don't really have good answers
yet. So we do uh now for example have a
boot camp right at the start of the
first year for the new students to
expose them to basic software
engineering skills which is like this is
version control, this is unit testing,
this is generative AI and the sort of
basics that really everyone should be
familiar with and then the hope is that
they will use that throughout their
degree in order to just improve the work
that they do. But how exactly we handle
things for assessment for example we're
we're still in the process of figuring
out. So it it sounds like the the the
pace of of change is going to be fast in
the industry and also in academia we'll
probably adopt it and we'll see you know
like what what comes after. Yes. There's
a difference though which is in the
desired outcomes. I think with industry
generally the desired outcome is like a
working product for example. In academia
the actual artifacts that the students
produce like an essay that the students
write that's not really the point. We
don't ask the students to write essays
because we love reading their amazing
essays. We ask them to write essays
because we want them to go through a
thought process which helps them learn
something. And it's that thought process
and that learning which is really the
the desired outcome here. And so that
means that we do have to approach it a
little differently because generally
in industry, you know, if you can use
AI to get a job done faster and you get
to an equivalent result, do it
absolutely because yes, that that is the
desired outcome. uh whereas in education
we do have to think about how we ensure
that the the learning outcomes and the
thought processes are still preserved
such that the the students benefit
intellectually. It's very relevant
especially as Anthropic had a recent study
where they looked at junior engineers.
One group used AI,
the other one did not and they found
unsurprisingly from what what you also
explained that the group who used AI
they had little to no learning whereas
the group that did not they actually
learned it. Yes, I saw that study as
well. I think the meth detailed methods
of that study we might be able to
quibble with a bit but I think the the
general principle seems true that yes so
sometimes in order to learn something
you just have to struggle with it a bit
not struggle too much so if people are
stuck on some technicality and they can
use AI to get unblocked and then be able
to focus really on the the main learning
outcome then I think uh it's good to use
these types of tools but if if the point
is to actually like grapple with some
difficult ideas and think them through
in their own minds, then we need to still
find ways to make sure the students are
doing that.
>> You work both in industry and academia.
What what do you think industry could
learn from academia and academia can
learn from industry? The two really
could be closer together because often
they regard each other with uh sort of
disrespect really like the the industry
people will say, "Ah, that's
theoretical, that's academic, it's got
nothing to do with the real world." and
they're really missing a trick there
because actually there's a lot of
interesting insights from research that
are very relevant to the real world. Um
but they're not necessarily making their
way across that chasm. In the other
direction, the academics will say, "Oh,
this industry stuff, you know, that's
just engineering." They're not actually
doing any interesting thinking. It's
just like writing routine stuff. I think
I see it as one of my goals to try and
build better respect across both in both
directions by bringing interesting
insights from research into industrial
practice but also by informing our
research uh by the problems that uh
arise in in real world and so that way
like joining those two things up a bit
better. What are your current research
topics that you're working on ones that
you're excited about? I have two main
areas I'm working on at the moment. Uh
one is local first software. So that's
this idea that we want collaborative
software like Google Docs, like Figma,
etc., but in a way that uh gives better
protection to users' data that's less
dependent on a single cloud provider who
can lock you out of your files and
that's therefore more resilient. Uh
gives users greater agency and greater
autonomy over their own data. So
that's an area that I've been working on
for the last 10 years or so through a
mixture of open source work and
algorithm development and formal
verification and so on. I'm now also
trying to set up a brand new research
area in a totally different topic um
which is on using cryptography to prove
things about the physical world. So I'm
interested there in especially
sustainability related things. So for
example, if you want to verify that the
carbon emissions involved in
manufacturing a particular product were
X and you want to be sure that that
number is correct because maybe you want
to include emissions as part of your
purchasing decision and choose the
product with the lower emissions. For
that to be meaningful, the emissions
number has to be correct. And
unfortunately at the moment the numbers
are generally not correct because the
incentives are to lie and cheat and to
use creative accounting techniques all
as a way of like greenwashing basically
or a related thing is happening in the
EU for example which is bringing in new
regulations on preventing deforestation
of tropical rainforests. So that's for
example coffee, cocoa, palm oil etc
imported into the EU. the importer needs
to prove exactly which plot of land it
actually came from and then check
against satellite imagery that that was
not recently deforested. And so I've
been looking into using cryptography as
a tool of proving things about the
supply chains of these physical products
but without revealing commercially
sensitive information. For example, a a
company will not want to reveal who its
suppliers were and which ingredient to
its process it purchased from which
supplier, for example, because that
might reveal something about its secret
recipe that it uses. And so the hope
here is that cryptography can allow us
to prove that for example the the
accounting has been done correctly
across supply chains but without having
to reveal publicly any of this sensitive
data about suppliers or other customers.
What is your view from your vantage
point on the impact that AI is having on
academia not not just for for students
studying beyond that and also industry
with your industry contacts? Yeah, I
mean I'm not not that deeply into um the
AI things really. I'm seeing it more
through my collaborators who are making
very good use of of AI tools uh for for
software development especially. I
personally write very little code these
days and so I haven't had that much need
or occasion to actually use AI agents
myself personally. When writing
prose, like working on the book for
example, I prefer to still do that the
old-fashioned way of just writing every
word by hand. So I I haven't let AI
anywhere near the text of the book for
example. And I don't know if that's
that's the right decision. It's not
really a principled thing that I
think it would be wrong to do so. It's
more that for myself the process of
writing is the way how I figure things
out and figuring things out is really my
goal here. So I'm I'm trying to figure
it out in my own head and for that I
just have to write it myself. There
doesn't seem to be any way around it.
But using AI as a way of like getting
feedback on ideas or exploring like
whether an idea really holds up to
scrutiny or things like that seems like
a very productive use of the technology
and that applies for for both industry
and academia I would say. So as as
closing for a student or a young
professional who is is still studying
and considering the route into either
industry or academia, what have you seen
uh who thrives in one or the other?
>> Yeah, my feeling is they're not really
that mutually exclusive or rather some
of the best PhD students uh I've worked
with for example actually have a few
years of industry experience. So they
might have done an undergraduate maybe
done a masters then spent a few years in
industry developing like actual doing
real software engineering learning about
the real world uh and then maybe at some
point got bored and thought oh actually
you know I want to work on maybe more
idealistic things or have more freedom
to choose uh their own research topics
and then start getting interested in
doing a PhD and that I find is is quite
a healthy route. You do get people who
go, you know, straight from their
undergraduate degree and masters into
doing a PhD, but sometimes those people
can just lack a bit of the breadth of
perspective. And so I think having seen
a bit of just real world engineering is
is actually really helpful for people
even if they then want to stay in
research. But in the opposite direction,
I think it can work very well too
because in in research research in
academia, we just get to think things
through a lot more carefully than people
often do in industry. Often people in
industry, I feel like sort of have short
circuit reasoning, like don't maybe
don't quite reason something through
from first principles, but just like uh
oh, I heard this from a conference talk.
I'm just going to go with that. And oh
yeah, what what academia can teach is
this sort of uh nuanced and and critical
thinking um to really reason through
trade-offs, for example, and to really
like justify why something is true. And
so I think it's really good actually if
people can weave in and out of industry
and academia a bit and not regard it as
like two totally mutually exclusive
career paths, but actually have a bit of
switching between the two.
>> Well, Martin, thank you very much. Uh I
expected us to talk a lot more about
your book which we did but I I have a
newfound curiosity and and respect for
all the important and interesting
academic work that you and everyone else
is doing. So thank you so much for this.
Thank you for the great interview. This
was really interesting.
>> I hope you enjoyed this rare
conversation with Martin Kleppmann. I
found it interesting to learn that the
first edition of the book assumed that
you have machines with local discs. But
actually today this is not how most
engineers build systems anymore.
Cloud-native primitives like S3 change
how you build systems and this is why
this book just needed a refresh. I also
appreciated Martin's take on whether
engineers still need to understand system
internals when they're using managed
services. If you're building business
logic on top of these services, you
probably don't need to know every
detail, but it can become useful to be
able to look deeper, especially when you
need to debug your system. By the end of
our conversation, I gained a lot of
appreciation for the academic research
that Martin is doing. the local first
software work, the access control
problem in decentralized systems, using
cryptography to verify supply chain
emissions. A lot of these are hard
engineering problems that few startups
would take on. It was nice to understand
how academia is in a good position to do
work that has a long-term focus. Do
check out the show notes below for
related The Pragmatic Engineer deep dives.
If you've enjoyed this podcast, please
do subscribe on your favorite podcast
platform and on YouTube. A special thank
you if you also leave a rating on the
show. Thanks and see you in the next
one.
This episode features Martin Kleppmann, author of 'Designing Data-Intensive Applications', discussing his journey from startup founder and LinkedIn engineer to academia. The conversation covers the evolution of his generational book, the necessity of the second edition due to shifts toward cloud-native architectures, and his current research in academia, specifically focusing on local-first software and the use of cryptography for physical-world verification.