Designing Data-intensive Applications with Martin Kleppmann

Transcript

0:00

Should I consider multi-zone,

0:02

multi-region or even a multi-cloud

0:04

setup? How much availability risk are

0:06

you willing to take on versus the

0:08

computational overheads, but also the

0:10

human overheads actually designing and

0:12

operating the system? MapReduce is

0:14

dead. Nobody uses it anymore. But other

0:16

areas where we've increased the coverage

0:18

are systems in support of AI like vector

0:21

indexes. Is there any risk as a software

0:23

engineer that you're no longer

0:24

incentivized to understand the

0:26

underlying layer? If you rely on a

0:28

higher level abstraction, you're no

0:29

longer thinking about the lower level

0:31

details. If you're building higher level

0:32

business logic, actually, I think it's

0:34

just fine. LLMs increase the need for

0:36

these formal proofs because we're vibe

0:38

coding a bunch of stuff. The reason I

0:40

think that formal verification could

0:42

become more important in the future. One

0:44

is that

0:49

Designing Data-Intensive Applications

0:50

has been the go-to book for anyone

0:52

building large backend systems. 9 years

0:54

after publishing this book, the second

0:55

edition is here. Martin Kleppmann is the

0:57

author of this generational book. I sat

0:59

down with him and today we cover how

1:01

working on Kafka at LinkedIn directly

1:03

shaped ideas that became the first

1:05

edition of the book, what's new in the

1:06

second edition, and why things like

1:08

MapReduce got removed from this updated

1:10

version. Formal methods, local first

1:12

software, decentralized access, and many

1:14

more. If you care about how large

1:16

systems work, where they're heading, and

1:17

what the fundamentals are that don't

1:18

change, this episode is for you. This

1:20

episode is presented by Statsig, the

1:22

unified platform for flags, analytics,

1:24

experiments, and more. This episode is

1:26

brought to you by Sonar. Sonar, the

1:28

makers of SonarQube, understands that

1:31

code quality is about more than just

1:33

avoiding syntax errors. It's about

1:35

long-term maintainability by protecting

1:37

the structural integrity of the system.

1:40

As agents generate code at massive

1:42

scale, they often ignore your system's

1:44

structural integrity. This creates

1:47

tangles, duplicated code, and other

1:50

maintainability issues. These issues

1:52

turn a modular design into a big ball of

1:55

mud, making it increasingly difficult to

1:57

extend. But here's something that's

1:58

really helpful: SonarQube's architecture

2:01

management. It moves architectural

2:03

governance out of static wikis and into

2:06

your automated workflow. It allows you

2:08

to visualize your current architecture,

2:10

define architectural boundaries, and

2:12

manage architectural issues in real

2:14

time. Whether it's a human or an AI

2:16

agent at the keyboard, Sonar acts as a

2:18

circuit breaker for structural decay. It

2:21

ensures every commit respects the

2:22

system's blueprint, protecting the

2:24

long-term health of your most complex

2:26

applications. Head to

2:27

sonarsource.com/pragmatic

2:29

to find out more. So Martin, welcome to

2:32

the podcast.

2:32

>> Hi Gergely, it's great to be here. It's

2:35

amazing to have you here. I don't

2:37

think you need an introduction to many

2:38

software engineers, including myself.

2:40

You're the author of this iconic book

2:42

that I've had on my bookshelf for

2:44

probably about 10 years, not much

2:46

longer after it came out. Before we get

2:47

into this book, which we're going to

2:49

talk about, how did you get into the

2:51

technology field?

2:52

>> Yes. Well, I did an undergraduate degree in

2:53

computer science, like many others.

2:55

And then after that, I wasn't quite sure

2:58

what to do with my life, but I thought,

2:59

well, starting a startup seems

3:01

like an interesting thing to try. So, I

3:03

started a startup having no clue what I

3:05

was going to actually do and then spent

3:07

the first while searching around for

3:09

things that might be interesting. The

3:11

first startup didn't work out that well

3:13

but through that I met some others who

3:16

then became my co-founders for the

3:18

second startup which worked better and

3:20

uh we sold that one to LinkedIn and then

3:23

after that I started being interested in

3:26

like teaching these distributed systems

3:28

concepts so that's when I got into

3:30

writing the book and then during the

3:33

writing of the book I also switched over

3:34

from industry back to academia can we

3:36

talk a little bit about your first and

3:38

second startup? Yeah, Go Test It. This was

3:40

like 2008 or something like that. It was

3:43

the age where people were having real

3:46

difficulties getting their JavaScript

3:47

working cross browser. Internet Explorer

3:49

was still pretty big at the time. Chrome

3:51

had just come out. Uh all the browsers

3:54

were incompatible with each other and so

3:56

Go Test It was a cross-browser

3:58

automated testing service for websites.

4:01

It was based on Selenium, an open source

4:03

project that still exists. And the idea

4:05

is you would write like test scripts

4:06

that automate a user clicking

4:08

through the various uh interactions with

4:11

a website and then just check that the

4:12

right behavior happens. And so yeah, it

4:15

was based on Selenium, but it was just

4:16

provided as a hosted service so people

4:18

wouldn't have to run various VMs with

4:20

various operating systems themselves. It

4:22

worked technically but um I found it

4:25

really hard to actually get adoption for

4:27

it. A lot of uh people building websites

4:30

like in theory said oh yeah this is

4:32

great, we need to test cross-browser,

4:34

and in practice actually it was really

4:36

difficult to get them to integrate it

4:38

into their workflow and just get in the

4:39

habit of using it and investing in

4:41

writing the test scripts. So that

4:44

ended up not really going anywhere.

4:45

>> So there wasn't like a business

4:47

to be done, or revenue to be

4:50

generated in a meaningful sense?

4:52

>> Yeah. Well, there's at least one other

4:54

maybe two other companies from that same

4:56

era that did manage to make a business.

4:59

Sauce Labs is one that managed to

5:01

actually succeed. Um, but even for

5:04

them it was a pretty slow running

5:06

business. I think it was not an easy

5:08

business to be in. And for the startup,

5:10

were you in the UK building it?

5:13

>> I was in the UK at the time.

5:14

>> Was it bootstrapped? Did you

5:16

raise some kind of funding? How big

5:18

was the team? How can we imagine this?

5:20

>> It was mostly bootstrapped. So I did a

5:22

bunch of consulting in order to fund

5:24

hiring some people and then hired some

5:26

like friends uh on the cheap to help

5:30

contribute to actually building the

5:32

product. And so it was done all very

5:33

cheaply. I had a very small amount of uh

5:36

of angel money in there but mostly

5:38

bootstrapped.

5:39

>> Mhm. And then when you decided to not

5:42

uh go forward with this, how did the

5:44

next startup come? Uh Rapportive, right?

5:46

>> Yeah, the second one was Rapportive. That

5:48

went a lot better. So, uh, that was

5:50

putting social media inside Gmail

5:52

basically. So, the idea was that if you

5:54

get an email from someone you don't

5:56

know, we had a little browser extension

5:58

which manipulated the Gmail web

5:59

interface so that on the side next to

6:02

the email, we'd show you a summary

6:03

social profile with like a profile

6:05

picture and like a job title pulled from

6:09

LinkedIn and recent tweets pulled from

6:11

Twitter and maybe recent Facebook posts

6:14

or things like that. just whatever we

6:16

could find about that person uh and put

6:18

that as a social summary next to

6:20

the email. We started in 2010 or

6:23

something like that. It then pretty

6:25

quickly became quite popular. Um and so

6:28

on the back of that we were then able to

6:30

raise some money from Y Combinator,

6:32

which was still fairly young at the

6:34

time.

6:34

>> That was very young. You must have

6:36

been one of the very early batches.

6:38

>> Yeah, I can't remember exactly when they

6:40

started but it was um it was certainly

6:42

in the early years. I think Y Combinator

6:44

had already built up quite a good

6:45

reputation at the time, but it was still

6:48

fairly small.

6:49

>> And then as part of Y Combinator, did

6:50

you have to fly from the UK to

6:54

San Francisco to attend that 10-week program,

6:56

if I remember?

6:57

>> Exactly. Yes. So we um initially came

7:01

for the 3 months or whatever it was

7:03

of Y Combinator, but then we were

7:06

able to get US work visas for ourselves

7:09

and uh set up permanently in San

7:12

Francisco.

7:13

>> How was that shift from the UK,

7:14

where you spent going to university your

7:16

first startup the first part of this to

7:18

coming to San Francisco? It was very

7:20

exciting because uh you know it felt

7:23

like, you know, going to the

7:25

center of where it was all happening

7:26

really. And we started out not

7:30

knowing anybody at all. we knew like one

7:32

or two people in the entire Bay Area,

7:34

but we like contacted them and they

7:36

introduced us to more people and they

7:37

introduced us to more people. And so we

7:40

were able to pretty quickly actually

7:41

build up a network, and that's

7:44

something that I really appreciated,

7:45

that it was actually so open to

7:47

outsiders like us who could just

7:48

basically turn up with an idea and an

7:50

early stage startup and we managed to

7:52

raise some money and managed to like

7:54

actually become somewhat established in

7:56

the Bay Area. And can you tell me

7:58

how the company grew, and at

8:00

what point did the LinkedIn acquisition

8:02

offer come, and how can we imagine

8:04

even you were a founder of this company.

8:06

It was about in 2012 that we sold it. Um

8:09

and we were five people at the time. So

8:12

it's all still pretty small. Um not vast

8:15

amounts of money involved, but it was

8:17

a success I would say uh for everybody

8:20

involved. The acquisition process

8:21

itself was fine. As always with

8:23

these kinds of transactions, there were

8:25

like twists and turns and moments where

8:28

we thought it would all fall apart and

8:29

then we were almost running out of money

8:32

and uh hadn't really succeeded in

8:34

raising another round. So, we kind of

8:36

had to sell or shut down. So, we were

8:38

under quite a bit of pressure. We

8:39

couldn't reduce our own salaries because

8:41

to do so would have violated the

8:43

conditions of our visas. Yes. Um so, we

8:45

were in a slightly stuck situation given

8:48

our lack of leverage in that situation.

8:50

And actually I'm pretty happy how it all

8:52

turned out.

8:52

>> Yeah, it's nice that you know like for

8:54

10 plus years we can talk about this

8:56

honestly because often times you see an

8:58

acquisition by LinkedIn and of course

8:59

you might ask the founders and they

9:01

would say, like, this was either our

9:03

dream or our goal or we will do so many

9:05

things together but some things that you

9:06

don't often hear is well that there was

9:08

a pressure involved as well. So, did you

9:11

go into this wanting to sell the company

9:13

because you saw that things were getting

9:15

a little either you needed to raise a

9:17

new round or you sell to someone and

9:18

then you found LinkedIn to be the

9:20

best, or the only, or the best

9:23

option to go into? We tried a little

9:26

bit to see like what revenue generating

9:28

options we had and hadn't really managed

9:30

to make that work. So, we were just

9:32

burning money, and our user growth

9:36

was okay but not really enough to go and

9:38

raise a big round. Um, so we were like a

9:42

little bit stuck there and selling the

9:44

company seemed like the least bad option

9:46

there in a way. And I'm pretty happy how

9:48

it turned out because you know LinkedIn

9:50

was great actually. They were very

9:51

good to us. They allowed us to operate

9:54

as essentially like an independent team

9:57

within the company.

9:58

>> So your team stayed together?

9:59

>> Our team stayed together. We continued

10:01

working on the product that we wanted to

10:03

make.

10:03

>> Oh, you got to keep working on

10:05

Rapportive.

10:06

>> Yes. Well, actually, so Rapportive, the

10:08

Gmail browser extension uh sort of got

10:11

put on life support, but we were working

10:13

on a new product at the time, which did

10:15

eventually get released under the name

10:17

LinkedIn Intro. It kind of got a

10:19

slightly weird reception at the time and

10:21

it ended up getting shut down shortly

10:23

after we released it. There's kind of a

10:25

longer background story there, but um

10:28

I'm still really happy with LinkedIn

10:30

like how they gave us the freedom to do

10:32

this and allowed us to launch this

10:34

product and even though it didn't

10:35

succeed, you know, they were very good

10:36

to us throughout that process and then

10:38

after that got shut down then our team

10:40

got disbanded. Um but we had a good run

10:43

within LinkedIn um building this

10:44

product. What tech stack did you work with at

10:47

the time? What did you use?

10:48

Rapportive was fairly unexciting. It was a

10:51

Rails app with a Postgres database

10:53

basically, and some Redis and some

10:55

similar things like that mixed in. So

10:57

actually you know nothing particularly

10:59

revolutionary. We essentially built a

11:01

graph database on top of Postgres. So

11:03

there was a a little bit of technical

11:05

interest in there but you know nothing

11:08

particularly outrageous. And then you

11:10

spent time, after LinkedIn Intro, you

11:13

still worked inside LinkedIn. As I

11:15

understand you worked on data

11:16

infrastructure right?

11:17

>> Yes data infrastructure. Um after our

11:20

team got disbanded, I switched over to

11:22

the uh stream processing team. So Kafka

11:25

had just been developed at LinkedIn and

11:27

had just

11:29

right. Oh, it was just being open

11:30

sourced.

11:31

>> Yeah, I think it had just been open

11:33

sourced and then uh I got to work on

11:35

Samza, which was a stream processing

11:37

framework on top of Kafka. I always

11:39

wanted to ask this question so this

11:41

comes here. Why did LinkedIn build Kafka

11:44

or develop Kafka? It's now

11:46

such a foundational technology.

11:49

I was always curious, like, why did

11:50

a company feel the necessity to build

11:53

this thing that seems pretty generic and

11:55

it seems everyone would have needed it.

11:57

Yes. So I think Jay Kreps has a pretty

12:00

good uh blog post from that era uh

12:03

called "The Log", where he explains his

12:06

motivation behind Kafka and, you know, why

12:09

make it an append-only log rather than

12:11

like a traditional message queue or

12:13

something like that. I think the

12:15

motivation was really about data

12:16

integration because there were a whole

12:18

bunch of databases and, like, event

12:21

generating systems you know like um

12:23

activity events from users for example

12:25

they were all generating data in a

12:28

sort of stream shape and then a bunch of

12:30

downstream systems that wanted to

12:32

consume this like wanted to get it into

12:33

the data warehouse and wanted to be able

12:35

to get it into the Hadoop cluster at the

12:37

time in order to run like machine

12:39

learning and things over it and there

12:43

was just this data integration problem

12:44

of actually like how do you physically

12:46

get the data out of one system and into

12:48

another, and uh Jay designed Kafka as this

12:52

integration point essentially like the

12:54

almost the kind of lowest common

12:55

denominator but still a general purpose

12:57

abstraction uh for integrating various

13:01

data sources and downstream data

13:04

sinks. Working at LinkedIn, you know,

13:06

like Kafka and at LinkedIn scale, what did

13:09

you learn or what surprised you about

13:11

working at this type of scale as I

13:13

understand, this was the first time

13:14

that you worked hands-on on a really

13:16

large system, right?

13:17

>> That's right. Yes. Because like

13:19

previously the biggest company I had

13:20

worked in was Rapportive with five people.

13:22

We had a sizable database but it was

13:24

still like a single instance database

13:26

and not really that big in the grand

13:28

scheme of things. And then suddenly

13:30

I was at LinkedIn and, oh, we

13:32

get to use their big Hadoop cluster.

13:34

That was fun, like hand coding

13:36

MapReduce jobs in Java at the time, and so

13:39

I learned a huge amount there. Um

13:41

especially when the stream processing

13:44

ideas uh came up and Jay was

13:46

evangelizing the use of Kafka and the

13:49

things you could do with it. That was

13:51

kind of a revelation for me really where

13:52

I suddenly felt, ah, this kind

13:55

of makes sense, like I start to

13:58

understand how these various data

13:59

systems fit together what they have in

14:01

common what the fundamental principles

14:03

are and so that experience then fed

14:05

directly into the writing of the book.

14:07

At what point did you decide to leave

14:09

LinkedIn? To me, in your career, I'm

14:12

looking through the career, start out in

14:13

the UK, do a startup, do a second

14:15

startup, Y Combinator, move to San

14:17

Francisco, get acquired by LinkedIn. The

14:19

arc that most people would draw would

14:21

be, okay, do something more in Silicon

14:23

Valley or maybe start a second startup,

14:25

etc. And instead you decided to

14:27

leave LinkedIn. Yeah. So, first I

14:29

decided to move back to the UK actually

14:31

and I continued working for LinkedIn

14:32

remotely. Okay. That was mostly

14:35

because my girlfriend at the time, now

14:37

wife, was still in the UK and

14:39

a long-distance relationship is not a lot

14:40

of fun and I didn't feel that at home in

14:43

the Bay Area. So, I wasn't really

14:46

encouraging her to move to the Bay Area

14:48

either. I thought it was better for me

14:49

to go back to Europe and I'm very happy

14:51

with that decision. Like, I still have a

14:52

lot of great friends in the Bay Area. I

14:54

love it as a place to visit, but I

14:56

wouldn't want to live here honestly.

14:58

Then I was still remotely working for

15:00

LinkedIn and that worked all right uh

15:03

for a while. When I then started writing

15:05

the book, LinkedIn even gave me 50% of

15:08

my time free to work on my book

15:10

alongside my software engineering

15:12

duties, which is really great.

15:14

>> Amazing. Yeah, that is so nice of them.

15:16

>> Absolutely. And they don't have to

15:18

do that. And LinkedIn didn't directly

15:20

get anything out of it in return other

15:22

than like a book that they could use for

15:24

internal training purposes. Well, shout

15:26

out to LinkedIn for this.

15:28

>> Yeah, absolutely. Though then I did find

15:30

that actually trying to write a

15:32

book in parallel with doing a software

15:34

engineering job and being on call etc. I

15:37

just wasn't able to do it. It's just too

15:40

much context switching and it's very

15:42

easy for the urgent things from the on

15:45

call to dominate, and then not to

15:48

have, you know, the freedom that

15:50

you need in order to write something

15:52

new. Um and so then after a while I

15:54

decided, okay, it's probably

15:56

better if I focus full-time on the book.

15:58

So I then left LinkedIn and just took a

16:00

sabbatical, unpaid sabbatical, i.e.

16:02

unemployment um to just focus full-time

16:05

on the book for a while and then it's

16:07

only after that that I actually even

16:08

considered getting into academia. So how

16:11

did the idea of the book come? What was

16:13

a point where you decided you would

16:14

write and in your mind what were you

16:16

deciding to write? Was it

16:18

already, you know, this book

16:21

with this layout, or did you have an

16:23

early idea back then?

16:24

>> I had an idea. Of course the

16:27

final product ended up looking somewhat

16:29

different, but the overall goal I

16:32

think stayed the same. So I knew I

16:34

wanted to write something that was a

16:36

broad conceptual overview. So not about

16:38

how you use any one specific system or

16:41

tool but comparing the trade-offs

16:43

between many different types of tools.

16:45

And I knew that I wanted it to be

16:47

practitioner focused like not a

16:50

theoretical textbook but something that

16:53

people could use to build real systems.

16:56

That was basically the goal

16:58

with which I approached it.

17:00

And this was exactly the book that I

17:02

wish I had had when I was starting out

17:04

and uh working at Rapportive, for example,

17:07

because we were all like searching

17:09

around in the dark where we were having

17:11

performance problems with our database

17:13

and we had no idea what to do basically

17:14

because we were totally lacking the

17:16

foundations to actually understand what

17:18

was going on and how to diagnose the

17:21

issues. And so I felt that well if I had

17:23

had a bit more background on how these

17:26

data systems actually work internally

17:27

then I could have had an intuition about

17:29

how to debug these kinds of performance

17:31

issues. And then after a while after I'd

17:34

learned more about how data systems work

17:37

I thought, well, okay, it's time to

17:39

write this down so that others don't

17:40

have to learn it the hard way um but can

17:43

hopefully just get a better idea of how

17:45

these systems work and thus be better at

17:47

managing their own data systems.

17:49

To start with, how did you learn about,

17:51

for example how databases work because

17:53

again, from your story at Rapportive,

17:55

you built systems, you had some

17:57

performance issues at a smaller scale to

17:59

be fair, compared to LinkedIn. Then you

18:00

worked at LinkedIn and you saw a little

18:02

bit of how the sausage was made but I

18:04

know a lot of software engineers who

18:05

have been in this path and they still

18:07

don't really know how the fundamental

18:09

systems work they just know okay we have

18:10

a platform team inside our company and

18:12

they build it. I could read the RFCs, but

18:15

it's a lot of work or the planning docs

18:16

I could look at the source code. It

18:18

feels to me that even at that point you

18:20

just went down and tried to dig in.

18:23

What resources did you use? How did

18:25

you find out those basics which

18:27

you later put into the book? A lot of

18:29

it was just kind of being curious and

18:32

talking to people actually and just

18:34

asking them lots of questions. And at

18:36

LinkedIn there were like a bunch of

18:38

senior data systems engineers who

18:41

understood their stuff very well but

18:42

hadn't maybe necessarily written it

18:44

down.

18:45

>> Mhm. And so I just talked to a bunch of

18:46

them and quizzed them, and that way

18:49

started building an image in my own

18:52

mind of how this stuff works. And then

18:54

once I sort of got the basics from these

18:56

conversations, then I was able to go and

18:57

read research papers for example. They

19:00

go into much more detail of exactly how

19:03

and why things are designed in such a

19:05

way. Um but you know it is time-consuming

19:07

to read those things. Um so then what

19:11

I tried to do was like pull out what

19:12

are really the essential ideas. I

19:15

just read a ton of blog posts as well.

19:18

Um and so the reason why you see so many

19:20

references at the end of each chapter in

19:22

the book is well that is actually the

19:24

material that I myself used in order to

19:27

uh understand what was going on. And

19:30

then I thought well okay well if I found

19:32

these things useful then I'll also cite

19:34

them in the book as a way for anyone any

19:37

reader who wants to go beyond the basics

19:39

covered in the book: here are some

19:42

good sources for further reading. Yeah,

19:44

the structure of the book, this

19:46

first book at least, it's foundational

19:47

data systems, distributed data, and

19:49

derived data. If I understood, these are

19:51

three big parts. Did you already have a

19:53

structure in mind when you started

19:54

writing the book or did it shape as you

19:56

went? This three-part structure is not

19:59

that critical in

20:01

the design of the book really. That's

20:03

sort of more after the fact I thought,

20:06

oh, well, it seems like we can group the

20:08

chapters into roughly this sort of

20:09

structure. But the topics of the

20:11

chapters were more or less what I had

20:14

envisaged. So I um I knew that I wanted

20:17

to talk about like what a transaction

20:19

actually is. I knew that I wanted to talk

20:20

about replication. Knew that I wanted to

20:22

talk about sharding or partitioning.

20:25

Knew that I wanted to talk about, like,

20:27

consistency and consensus. Those

20:29

sort of high-level topics, I think, uh were

20:34

clear from like my initial book proposal

20:37

to the publisher. The details within

20:39

each chapter. That is something that I

20:41

often figured out once I got to that

20:43

chapter. So, I wrote one chapter at a

20:45

time and started each chapter with

20:48

just a lot of background research to

20:50

actually get up to speed on the topic

20:52

myself. And it's often only then that

20:54

say, for replication, I decided, okay,

20:56

well it seems like the three major ways

20:59

of doing this are single leader,

21:01

multi-leader or leaderless. Okay.

21:02

>> Mhm.

21:03

>> I would decide on that structure

21:05

essentially when I started writing each

21:07

chapter and then try to fit the various

21:10

points I wanted to make into this

21:12

uh narrative structure. As a fellow

21:14

author who also wrote a book, one thing

21:17

I've noticed there's a bit of parallels

21:19

between estimating a book and estimating

21:21

a software project in that you come in

21:23

with an estimate, and if you've never done

21:25

it before you tend to be wildly off. How

21:27

was this in your journey? And in

21:30

addition, you also had a publisher and

21:32

publishers are a little bit like

21:33

project managers. They, you know, they

21:35

like to have a schedule. They

21:37

like to try to keep you on track. They

21:38

like to ask, when is it done?

21:40

How did you manage that part as well?

21:42

And and in the end, how long did you

21:44

estimate it would take when you started

21:45

and how long did it actually take?

21:47

>> As always, it takes vastly longer than

21:49

expected. It's the same for software

21:51

projects as it is for writing, I think.

21:53

So I think it took me about four years

21:55

to write the first edition and that was

21:57

not four years of full-time maybe like

22:00

two and a half years of full-time

22:02

equivalent or something like that but uh

22:04

written over the course of about four

22:06

years. So it definitely took a long

22:08

time. The uh publisher deadline I missed

22:12

by a ludicrous margin. I think I missed

22:15

it by about 2 and a half years or

22:16

something like that. Uh but fortunately

22:20

O'Reilly were pretty laid-back with the

22:22

first edition

22:24

and were happy for me to just take my

22:26

time and make it good. Uh when it came

22:29

to the second edition then actually

22:31

O'Reilly got a bit more aggressive and

22:33

pushy about uh sticking to deadlines. I

22:37

guess by that point the book had been

22:40

established and people were waiting

22:42

eagerly for the second edition. So, I

22:43

kind of understand the desire to

22:47

want to accelerate it, but at the same

22:49

time, I really appreciated the

22:51

freedom that I had for the first edition

22:53

to work on my own schedule. Um, and I

22:56

had a bit less of that with the second.

22:58

The tagline for the first edition, which

23:00

I believe is the same as the second edition, is

23:02

the big ideas behind reliable, scalable,

23:04

and maintainable systems. Reliable,

23:06

scalable, and maintainable. What do

23:08

these objectives mean to you?

23:10

>> Yeah. So they're all slightly vaguely

23:13

defined, right? So there's there's not a

23:15

a formal definition of those things. But

23:17

uh for me, reliability means fault

23:19

tolerance primarily. So meaning that a

23:22

system should on the whole continue

23:25

working even if like a network link is

23:27

interrupted or a node crashes or

23:30

something like that. So a lot of the

23:32

book is about techniques that support

23:34

fault tolerance, like replication, for

23:36

example. Um so that's reliability. Uh

23:39

scalability is one of those terms that

23:42

gets thrown around a lot and it's sort

23:44

of so much and it's it's like

23:46

fashionable and cool to make things

23:48

scalable, you know, because it

23:50

suggests success and millions of users

23:53

and so that's of course everyone wants

23:55

things to be scalable because everyone

23:57

wants success. For this book I tried

23:59

to take a bit more dispassionate kind of

24:01

approach and said scalability is just

24:03

like what mechanisms we have for dealing

24:07

with changes in load if load increases

24:10

how can we add computing capacity to a

24:12

system for example so that the system

24:14

still continues working and then the

24:16

techniques that you use to achieve

24:17

scalability well they are like sharding

24:19

for example. But in this case,

24:21

scalability your definition do I

24:24

understand that you're mostly referring

24:25

to horizontal scalability so they cannot

24:27

compute

24:28

up or down pretty much.

24:30

>> Yeah, I guess because that's the the

24:32

more interesting one like yes, you can

24:33

always buy a bigger machine and

24:35

>> what's interesting about that

24:36

>> and exactly there's just there's not

24:38

that much to be said about it. I mean

24:40

there are details of how you scale even

24:42

on a single machine but I think like

24:44

part of what has become interesting about

24:47

like modern cloud services and just uh

24:51

backend services in general is like how

24:54

they've introduced this idea of

24:56

horizontal scalability and uh shared

24:59

nothing systems. So we can build uh

25:03

systems that you know are able to cope

25:05

with very high load even if the

25:08

individual components are just fairly

25:09

cheap commodity machines. But maybe sort

25:12

of part of the scalability story which I

25:15

wasn't thinking about as much at the

25:16

time but started thinking about more

25:18

recently is not just scaling up but

25:19

scaling down as well.

25:21

>> So actually um how do you run a service

25:24

in such a way that if it has a very

25:27

small amount of load it's really cheap

25:28

to run it. That's sort of, in a way, the

25:31

same question as how do you continue

25:34

running a service if it has very high

25:36

load. Um generally like you just want

25:39

the the cost and the computing capacity

25:41

to be roughly proportional to the load

25:43

that you have. And at the low end that

25:45

means actually being able to scale down

25:47

to something that is extremely cheap to

25:49

run. And that's not necessarily

25:51

a given. That's something that is hard

25:53

with on-premises software, for example,

25:55

because like if you've got a machine a

25:57

physical machine that's like a a unit of

25:59

deployment and yes you could carve it up

26:01

into two dozen virtual machines and make

26:04

those small virtual machines but um it

26:07

still requires like some sort of

26:08

resource allocation. So so part of

26:11

what's interesting about some serverless

26:12

systems for example is actually their

26:14

ability to scale down and say like okay

26:16

if you're going to handle just three

26:17

requests per day that's just fine as

26:19

well. Can you tell me about the second

26:21

edition? When did the idea come about?

26:23

Yeah, it it had been clear for a couple

26:25

of years that the second edition was

26:27

needed just because the first edition

26:28

was getting a bit dated. There were

26:31

changes in technology that just hadn't

26:33

been reflected in the in the first

26:35

edition. So, I I wanted to update it,

26:37

but you know, I now have an academic

26:40

job. I'm actually like doing research

26:42

and teaching is my main thing, and

26:45

updating the book is just a sort of

26:46

sideline business on the side in some

26:48

sense. So it actually took quite a while

26:51

to make progress with that because I was

26:53

always doing it alongside other projects

26:55

and essentially back to that context

26:57

switching problem that that I had while

26:59

writing the first edition but just now

27:02

um with an academic job that I didn't

27:04

want to just drop, um, because I actually

27:06

quite enjoy it. Initially, then, I made

27:08

very slow progress with the second

27:10

edition and also I kind of realized that

27:12

I had slightly lost touch with current

27:14

industry practices because you know I'd

27:16

switched over to the the academic side.

27:18

I'd gone much deeper on the theory. Um,

27:21

but I was no longer up to speed on like

27:23

what people were doing with say data

27:25

lakes or things like that. So then at

27:28

some point I remembered Chris

27:30

Riccomini, an old colleague from LinkedIn.

27:32

I had worked with him um on the stream

27:35

processing stuff. Uh

27:37

>> You worked with him? He's the author

27:39

of The Missing README.

27:40

>> Exactly.

27:41

>> Wow. What a small world.

27:42

>> Yeah. And uh I I had read Chris's book,

27:45

The Missing README, and thought, "Oh,

27:46

he's a great writer." And I had worked

27:49

with him as a software engineer and

27:50

found him a great colleague and also

27:53

he had been writing this newsletter

27:55

called Materialized View on, uh, on like

27:59

latest trends in data systems

28:00

essentially uh and become a startup

28:02

investor in in that space. Um, and so at

28:05

some point I thought, well, actually I

28:07

have to get in touch with Chris and ask

28:08

him whether he wants to help out with

28:09

the second edition. And he was keen to

28:12

do that. And that turned into such a

28:13

good collaboration because he was up to

28:16

date on like what the cutting edge was

28:18

in terms of uh technology in industry.

28:22

Um, I had strong opinions on how to

28:25

teach essentially. So how to explain

28:27

things in the book, make sure that we

28:29

were explaining everything in a in a way

28:31

that was like very precise, very

28:33

carefully chosen words, but at the same

28:35

time very accessible so that it's

28:37

hopefully easy to read. And so we took

28:40

essentially like my writing style plus

28:43

Chris's knowledge of latest industry

28:45

trends to bring the book up to date and

28:48

that was a great collaboration. What

28:50

are the big things that you added that

28:51

and which ones of these you knew

28:53

would be missing and which ones did you

28:55

realize during the writing process that

28:56

okay this needs to be in here now

28:58

>> yeah so the thing we knew from the start

29:00

that we wanted to reflect was uh

29:02

cloud-native systems architecture. It's

29:05

it's a bit of a vague term um but what I

29:07

mean with that is essentially building

29:10

uh data systems on top of cloud services

29:14

as the foundational abstraction. In the

29:16

first edition the assumption was

29:18

basically that you have some machines.

29:20

Each machine has some local disks. You

29:22

can run a database instance on a

29:24

machine. It will write its data to the

29:26

local disk. If you want to replicate it

29:27

to another machine, then well the

29:29

database software will replicate it at

29:31

the database level to another machine

29:33

which will also write the data to its

29:34

local disks. For a long time that was

29:36

exactly the way computers worked. And

29:38

now suddenly people are building

29:39

databases on top of object stores for

29:41

example. And now the replication happens

29:43

at the object store level, no longer

29:45

at the database level. or maybe there's

29:48

still some replication at the database

29:49

level but it really changes the the

29:51

nature of things uh if you're building

29:53

on top of an object store and this is

29:55

different from say building on top of a

29:57

virtual block device like EBS or so

30:00

because these block devices although

30:02

they are cloud services but they still

30:04

offer the abstraction that is a sort of

30:06

single node operating system abstraction

30:08

of a block device on top of which you

30:10

run a file system whereas an object

30:13

store is just like a brand new

30:14

abstraction it just looks different from

30:15

a file system, it behaves differently.

30:18

And so then building on top of that as a

30:21

foundational abstraction is something

30:24

that like people were starting to do at

30:26

the time of the first edition, but since

30:29

the first edition that has really taken

30:31

off. A whole lot of systems have

30:33

been built in that style now. And so

30:35

that's an idea that we really wanted to

30:37

incorporate and we weaved that in

30:39

throughout the book. So it's not just

30:41

like one section here. Um but it's it's

30:44

sort of a an idea that we've integrated

30:46

throughout the entire narrative.

30:48

>> There's now a lot of managed services as

30:51

well. The primitives that we use,

30:53

but there's also so many managed

30:54

services that all the cloud providers

30:56

use and a lot of engineers, they often

30:58

just use the managed services as is

31:00

because they they take care of

31:02

replication. They have SLAs for uptime

31:05

and so on. But when you build on top of

31:06

these things and you you kind of use

31:08

those as primitives as well, is

31:10

there any risk as a software engineer

31:12

that you're no longer incentivized to

31:14

understand the underlying layer or are

31:17

we building better systems because of

31:18

that? How do you think about this? It it

31:20

feels there's a move of abstraction

31:22

because of cloud, right? Yeah, it's

31:24

definitely a shift to different and

31:27

higher level abstractions,

31:29

but you know that's been the story of

31:30

the entire computing industry since the

31:34

start. It's like building new

31:36

abstractions. So it is true that like if

31:38

you rely on a higher level abstraction,

31:39

you're no longer thinking about the

31:41

lower level details. And so if you're

31:43

using a a programming language with a

31:46

garbage collector, you're no longer

31:48

thinking about memory allocation. And so

31:50

is that a loss? Well, maybe. Like,

31:52

if you're building low-level systems,

31:54

you should still have to care about

31:55

memory allocation. If you're building

31:57

higher level business logic. Actually, I

31:59

think it's just fine for people not to

32:01

care about memory management. So I think

32:03

there's an analogous thing here with

32:06

data systems that if you're building the

32:08

higher level systems that don't need to

32:10

particularly care about the underlying

32:12

infrastructure, then that's fine. Just

32:15

use the higher level abstractions.

32:16

Nothing wrong with that. But somebody

32:18

still has to build those higher level

32:20

abstractions from lower level

32:21

components. Somebody's got to implement

32:24

the cloud services. Martin talked about

32:26

trade-offs that come with using cloud

32:27

services. And this is a good time to

32:29

talk about our season sponsor, WorkOS.

32:31

If you've read Designing Data-Intensive

32:33

Applications, you know that building

32:35

systems at scale is all about trade-offs.

32:37

But one thing isn't a trade-off. That's

32:39

enterprise features. The moment you land

32:41

bigger customers, you need SSO,

32:43

directory sync, RBAC, audit logs, all

32:46

the things they expect out of the box.

32:48

Building that yourself can take months.

32:49

WorkOS gives you APIs to ship it in days

32:52

so you can stay focused on your core

32:54

product. That's why companies like

32:55

OpenAI and Anthropic run on WorkOS. Visit

32:58

workos.com to learn more. I'd also like to

33:01

mention our presenting sponsor, Statsig.

33:03

Statsig built a unified platform that

33:05

enables both experimentation and

33:07

continuous shipping. Built-in

33:08

experimentation means that every roll

33:10

out automatically becomes a learning

33:11

opportunity with proper statistical

33:14

analysis showing you exactly how

33:16

features impact your metrics. Feature

33:18

flags let you ship continuously with

33:20

confidence. And because it's all in one

33:21

platform with the same product data,

33:23

teams across your organization can

33:25

collaborate and make data-driven

33:26

decisions. To learn more, head to

33:28

statsig.com/pragmatic.

33:30

With this, let's get back to Martin and

33:32

the trade-offs that come with using

33:33

cloud services.

33:35

And so those people will have to then

33:37

specialize even more in actually the

33:40

details of how you engineer those cloud

33:42

services, how you make them reliable,

33:44

how you operate them and so on. The

33:45

skills are still there. It's just a bit

33:47

of specialization happening that some

33:48

people can worry about the higher

33:51

level things without having to concern

33:52

themselves with the lower level things.

33:54

Some people focus on the lower level

33:55

things and treat that higher level

33:57

aspect as their customers.

33:59

>> Interesting. So it sounds to me that

34:00

if you're an engineer who is utilizing a

34:03

lot of these services, you might not

34:05

need to know how they exactly work.

34:07

>> Yes. And I would say like the underlying

34:09

philosophy of the entire book is to give

34:12

people insights into just the sort of

34:14

essence of how the systems work

34:16

internally. So that if for example they

34:19

start having weird performance behavior,

34:21

you can have a bit of intuition for why

34:23

it's doing that and how you might solve

34:25

it. So for example, say the storage

34:27

engine chapter tells you about how B-trees

34:29

work and how log-structured LSM-tree

34:32

storage engines work. And the book is

34:34

not intended for people who are going to

34:36

actually build their own databases and

34:38

implement their own storage engines. If

34:39

you want to do that, you have to go much

34:41

further, into much greater depth than this

34:43

book covers. But the idea is that as an

34:45

app developer, if you know just a little

34:48

bit about how the storage engine works

34:50

internally, you'll be in a much better

34:52

place to use it in a way that

34:54

gives you good performance for example

34:56

and to diagnose any issues. That

34:58

philosophy we've kept also in the

35:00

context of cloud services where yes,

35:02

like, a cloud service hides some of the

35:04

operational details that app developers

35:06

don't need to think about anymore, but

35:08

they should still know a bit about how

35:10

they work internally just so that they

35:11

can use them effectively. I guess I

35:13

argue about the trade-offs deciding on

35:15

which which service to use, which

35:17

characteristics to look out for. Yeah.

35:19

For for your use case, right? Exactly.

35:21

And, you know, there are huge

35:23

differences of say if you're doing

35:25

analytics whether you're using row

35:27

oriented storage or column oriented

35:29

storage. That's a bit of a technical

35:31

distinction and it takes a little bit of

35:34

background reading to even understand

35:35

what that means, but it has a massive

35:37

performance implication in terms of the

35:39

final behavior of the system. And so

35:42

those are those places where I feel like

35:44

knowing a bit about the internals is

35:47

actually like a superpower. Yeah. And I

35:49

guess engineers the one thing that we

35:51

always need to argue about or should

35:52

need to argue about is at the very least

35:55

cost versus performance. And by

35:57

performance I mean latency to the user

36:00

and of course resilience: if

36:02

something happens, you know, a machine

36:04

goes down, a zone goes down, a region

36:06

goes down, how

36:08

our product is affected and what's

36:10

acceptable. The basic idea there seems

36:12

to be like how much availability risk

36:15

are you willing to take on versus the

36:18

both like the overheads in terms of um

36:21

the the system itself like the

36:22

computational overheads but also the

36:24

human overheads actually designing and

36:26

operating the system and and the cost

36:28

overhead.

36:29

>> Yeah, exactly. And so yes, you can have

36:31

a a system that is more able to tolerate

36:33

various types of faults but which is

36:35

more expensive to uh to design and

36:38

operate versus a simpler system that you

36:41

know might go down a bit more often but

36:43

which is cheaper. And there's no right

36:44

and wrong with that. You know,

36:46

everyone needs to figure out where they

36:48

sit on that uh on that trade-off space

36:51

uh themselves. And I would say that like

36:53

multi-region is like pushing in the

36:57

direction of like higher availability

36:59

because it means you could tolerate the

37:01

outage of an entire region. But then it

37:03

has implications on the consistency

37:05

model that you can get across different

37:07

regions for example. So that's a

37:09

trade-off that the book tries to make

37:11

very explicit to help people reason that

37:13

through of like what is the right choice

37:15

for them. In terms of multicloud, for

37:18

example, one thing that I've been uh

37:21

concerned about just in the last month

37:23

really is uh European dependence on US

37:26

cloud services.

37:27

>> Yes. So what if geopolitics was to go

37:30

horribly wrong and tensions escalate and

37:33

Europe finds itself suddenly locked out

37:35

of US cloud services? I hope that

37:37

doesn't happen. I still think it's

37:38

fairly unlikely, but it's no longer

37:40

unthinkable. and and as a result I

37:44

coming sort of from this European

37:45

perspective have been thinking a fair

37:47

bit about how can we engineer systems to

37:49

be resilient against that sort of thing

37:52

and that's you know not just like a

37:53

regional outage but it's like a a

37:56

business risk essentially and a

37:58

multicloud, uh, setup could help

38:01

mitigate against that sort of risk so

38:04

that at least for example if one company

38:06

locks you out then you could still have

38:08

systems on on another company again that

38:11

that's very much towards the uh

38:14

expensive but uh high availability risk

38:17

reduction end of the spectrum. But for

38:20

the people who have you know really

38:22

critical workloads where they think this

38:24

sort of geopolitical risk is a

38:26

significant enough risk I think it's

38:28

seriously worth considering that kind of

38:29

setup. I'm thinking that, as engineers, we do

38:32

have the responsibility because

38:33

who else will do this? Yes,

38:35

totally. But I totally agree with you as

38:37

well that this um understanding what the

38:40

risks are and communicating what the

38:42

trade-offs are I think is is going to be

38:44

a core part of our role as engineers

38:46

moving forward as well. Maybe as AI

38:49

writes more and more of our code,

38:51

it's less about like the details of how

38:53

you express logic in a particular

38:55

programming language and much more about

38:57

those kinds of high-level trade-offs. How

38:59

has the definition of scale changed in

39:02

this book? Because, as we talked about,

39:04

before cloud, building a scalable system

39:07

it sounded pretty involved because

39:09

building a horizontally scalable system

39:11

it's complicated, all the pieces

39:13

you need to put in. In the first book

39:14

you detail a lot of this. With cloud, a

39:16

lot of the services actually they do

39:19

define how they allow horizontal scaling

39:22

what the trade-offs are. Do you feel that

39:25

it's made a lot easier to reason about

39:27

scalability when you are using

39:29

these primitives? So I think achieving

39:32

really high scale is still

39:33

challenging because even though we have

39:36

cloud services like object storage for

39:38

example which uh provide you this very

39:41

elastic storage model at least you don't

39:42

have to worry about capacity planning on

39:44

your discs anymore and running out of

39:46

disk space because those kinds of

39:48

operational things are taken care

39:50

of but if you need sharding for example

39:52

that's something that actually does

39:54

reflect on the application code as well

39:56

you can't really make that entirely

39:58

transparent. And so if you're at a

40:00

sufficiently large scale that sharding is

40:02

required because a single machine is not

40:04

powerful enough to process your

40:06

workload. Then I think even with cloud

40:09

systems you still have to do quite a bit

40:10

of engineering thinking of, uh, how to

40:13

realize that. Where I think the cloud has

40:17

helped quite a bit is actually at the

40:18

lower end of scaling down. Uh if you

40:21

want to have a very lightweight service

40:23

that processes only a small number of

40:25

requests. what we've got with serverless

40:28

systems being able to very quickly spin

40:30

up and spin down uh an instance very

40:33

lightweight that's quite a a good

40:35

innovation that has enabled those those

40:37

very low scale uh services and that's

40:40

something that would be much harder to

40:42

do without cloud services because you

40:44

would have to statically allocate a

40:46

certain amount of memory and certain CPU

40:48

resources to a particular virtual

40:50

machine. I love serverless. I have a

40:52

small website that runs on serverless

40:54

and my bill is like 13 cents per month

40:58

because it has very little load.

40:59

>> Absolutely. It's just making more

41:01

efficient use of computational

41:02

resources. Let's talk about sharding. In

41:05

the first book, and when you wrote the

41:07

first book when I was working at Uber,

41:09

we talked a lot about sharding and there

41:10

were a lot of internal implementations, or

41:13

interviews involved asking about

41:14

sharding because we were designing

41:16

systems that were sharding. I did sense

41:18

that over time again as cloud systems

41:22

start to become available that give you

41:23

turnkey solutions more that act more

41:26

like platforms. You send the data and it

41:28

takes care of of these things. Fewer

41:30

engineers have to actually implement

41:32

sharding with cloud-native systems. In

41:36

your research. What have you seen? What

41:37

are the cases where putting

41:39

sharding in place is still important and

41:41

where are the places where it might

41:42

have just disappeared as a concern?

41:44

I mean it's still nice to know but you

41:45

might not have to implement it. I think

41:47

it's probably less of an effect of cloud

41:49

and more of just hardware getting more

41:51

powerful that oh actually like a big

41:54

machine nowadays can do a lot on a big

41:56

machine, and that means that

41:59

more and more workloads you can just run

42:00

on a single machine and that is

42:02

sufficient actually to achieve quite

42:04

significant scale already there's still

42:07

concerns of like how to actually

42:08

efficiently make use of hundreds of CPU

42:11

cores that you have on a single machine

42:13

so parallelism is still

42:17

a required thing to think about

42:19

there and sharding is one way of

42:21

achieving parallelism. But at least this

42:23

sort of sharding across multiple

42:25

machines has maybe become less of a

42:27

pressing issue just because more and

42:30

more workloads can just run on a single

42:32

machine. Some people still have very

42:33

large scale workloads that do have to be

42:35

sharded across multiple machines, so

42:37

it's not going away entirely and uh

42:40

replication is still relevant even at

42:42

smaller scales because that's for fault

42:44

tolerance that's not for scalability.
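
To illustrate the distinction drawn here (sharding spreads load across machines, replication keeps data available when a machine fails), here is a minimal Python sketch. It does not come from the book or the conversation: the function names, the SHA-256-based routing, and the node names are all assumptions chosen for the example, and real systems usually use more elaborate placement schemes.

```python
import hashlib
from typing import List


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash (hypothetical scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_shards


def replicas_for(shard: int, nodes: List[str], replication_factor: int) -> List[str]:
    """Pick distinct nodes to hold copies of a shard, wrapping around the node list."""
    count = min(replication_factor, len(nodes))
    return [nodes[(shard + i) % len(nodes)] for i in range(count)]


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c"]  # hypothetical machine names
    shard = shard_for("user:42", num_shards=len(nodes))
    # Sharding decides WHICH partition owns the key (scalability);
    # replication decides HOW MANY machines hold that partition (fault tolerance).
    print(shard, replicas_for(shard, nodes, replication_factor=2))
```

Even with a single shard, a replication factor above one still makes sense in this sketch, which mirrors the point that replication stays relevant at small scale.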

42:45

You have a chapter called The Trouble

42:48

with distributed systems uh which goes

42:50

through a lot of things that can go

42:53

wrong without going through the whole

42:55

chapter. Can you recall some of the

42:57

things that are memorable to you or some

42:59

of the things that you feel are are

43:00

important to remember? Yeah. The whole

43:02

idea of this chapter is that in

43:04

distributed system theory there are

43:05

certain things that we tend to assume.

43:07

Like for example, we just assume that

43:09

there's no upper bound on how long it

43:11

might take for a message to go over the

43:13

network. So you send a message, it might

43:14

arrive within 100 microseconds or it

43:17

might take 10 years and distributed

43:20

system theory just doesn't make any

43:21

assumptions about that sort of timing if

43:23

we can avoid it or rather some

43:25

theory does make those assumptions but

43:27

it's a dangerous assumption to make

43:29

because occasionally the network delay

43:31

does become much higher than than what

43:33

is typical. Another thing is about uh

43:36

crashes. For example, the distributed

43:38

system theory just says like nodes can

43:42

crash but what does that actually mean?

43:43

Like what in practice does it mean for a

43:46

node to become unavailable because it

43:48

might be a software crash, but it

43:50

might be a hardware failure. It might be

43:51

somebody unplugging the power cable. It

43:53

might be that the node is actually still

43:55

running but it's just become

43:56

disconnected from the network. The point

43:58

of this book chapter really is to defend

44:02

and justify those theoretical models

44:04

that we use for analyzing distributed

44:06

systems and just giving a lot of

44:10

stories and case studies that show that

44:12

you know actually tons of stuff does go

44:14

wrong and like don't believe anyone who

44:16

says, oh, failures are rare,

44:18

don't worry about it, it's fine. Uh,

44:21

the the moral of this chapter is really

44:23

that actually, no, if you want to make

44:24

things reliable, you really do have to

44:27

worry about a whole bunch of weird

44:30

unusual but certainly possible edge

44:33

cases. Timing is another one of those

44:35

things like you know it's very easy to

44:37

assume that your clocks are correct and

44:38

most of the times the clocks are pretty

44:40

correct but we just can't rely on it

44:42

because actually they're just not

44:43

precise enough uh on the whole and so a

44:46

lot of it is about it's very tempting to

44:48

make certain assumptions

44:51

um that things are well behaved and and

44:53

in distributed systems we just have to

44:55

try to get away from those assumptions

44:57

if we want the systems to work reliably

45:00

even in the face of things going wrong

45:02

but it was a really fun chapter to

45:03

write, because, you know, it's

45:05

essentially a big collection of stuff

45:07

that has gone wrong. And so I went

45:09

through a bunch of postmortems published

45:11

by various tech companies for example in

45:13

order to see okay what was the root

45:14

cause of how things went wrong and what

45:17

kind of lessons can we draw from this

45:19

that apply to the the book in general.

45:20

And uh you know there's some fun stuff

45:22

like the sharks biting undersea

45:24

cables and damaging them that just you

45:27

know makes for a great story. And then I

45:29

I hear that in recent years the

45:31

shielding of undersea cables has got

45:33

better and therefore the sharks are not

45:35

biting them anymore. But instead the

45:36

cows on land are stepping on cables and

45:38

occasionally causing network

45:39

interruptions that way. And you know

45:42

that sort of thing is just uh it makes

45:44

it a bit more fun. That chapter is so

45:46

interesting also because, depending

45:49

on what kind of teams you work on or

45:50

what kind of people you talk with when I

45:53

talk with the S3 team for them that

45:56

whole chapter is just their day-to-day.

45:58

It's not a

46:01

weird thing when you know like a a hard

46:03

drive goes up or there might be, okay,

46:05

it might be a weird thing to have a fire

46:06

in a data center but they're prepared

46:07

for all of those things. They're at the

46:09

scale where these things just happen on

46:10

a regular cadence because they're one of

46:12

the largest scales whereas at a

46:14

smaller company even if you read this

46:16

chapter and you know you will treat this

46:18

as like well this could happen but when

46:21

when it actually happens it will be

46:23

a once-in-10-years event and it will be a big

46:25

deal. Yeah. But I think there's

46:29

no, like, right answer. It's a

46:30

trade-off between risk and cost broadly

46:33

speaking. And that means a business

46:36

decision has to be made in terms of

46:38

where the business wants to lie uh on

46:40

that trade-off. And so the goal of this

46:42

chapter is really just to give people

46:44

the information in order to make an

46:45

educated decision. But I don't want to

46:47

make that decision for people. That's

46:49

for businesses themselves to decide. Uh

46:51

that's very clear. Have you come across

46:53

some concepts or systems mentioned in

46:55

the book in the first edition and now in

46:57

the second edition that are becoming

46:58

either more popular or less popular over

47:01

time more or less referenced by your

47:02

readers? Thinking about things like

47:04

streaming systems, batch processing or

47:06

or anything else? Yeah. So there are some

47:09

things that we've been able to take out

47:11

of the book compared to the first

47:12

edition in particular for example

47:14

coverage of MapReduce was quite

47:16

detailed in the first edition but

47:18

basically MapReduce is dead. Nobody

47:20

uses it anymore. Its successors, like in

47:23

the form of Spark and Flink, for example,

47:24

they are used and so we still reference

47:28

MapReduce in the second edition but

47:30

more as a learning tool in order to

47:32

understand how these kinds of partitioned,

47:34

sharded batch processing systems work.
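
To make the "MapReduce as a learning tool" idea concrete, a toy single-process sketch of the map, shuffle, and reduce steps over partitioned data might look like the following. It is an illustration only, not code from the book and not the API of any real framework: `map_phase`, `partition_for`, and `run_word_count` are hypothetical names, and CRC32-based partitioning is an assumption made for the example.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Mapper: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def partition_for(key: str, num_partitions: int) -> int:
    """Decide which partition (shard) owns a key, using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def run_word_count(lines: Iterable[str], num_partitions: int = 2) -> Dict[str, int]:
    # Shuffle: route each mapper output to the partition that owns its key,
    # grouping values by key inside that partition.
    partitions: List[Dict[str, List[int]]] = [
        defaultdict(list) for _ in range(num_partitions)
    ]
    for line in lines:
        for key, value in map_phase(line):
            partitions[partition_for(key, num_partitions)][key].append(value)

    # Reduce: each partition can be processed independently of the others,
    # which is what lets a real system run reducers on separate machines.
    counts: Dict[str, int] = {}
    for part in partitions:
        for key, values in part.items():
            counts[key] = sum(values)
    return counts


if __name__ == "__main__":
    print(run_word_count(["a b a", "b c"]))
```

Here everything runs in one process; in an actual batch system each partition would live on a different worker, but the ownership of keys by partitions is the same idea.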

47:37

So that's one thing where we've been

47:38

able to reduce the coverage. Um, but

47:41

other areas where we've increased the

47:43

coverage are, for example, systems in

47:46

support of AI. And so, even though this

47:48

is not an AI book, but there are still

47:51

data systems concerns that arise when

47:54

needing to support AI applications, like

47:55

a classic one is vector indexes, for

47:57

example. And so, we've added some

48:00

coverage of vector indexes to the

48:01

storage engine chapter. It fit in really

48:03

well there because it already covers

48:05

various different indexing strategies

48:07

anyway. Uh and so vector indexes, you

48:09

know, it's just another indexing

48:11

strategy. We also added some coverage of

48:13

data frames, for example. That's not an

48:16

exclusively AI thing. Um but data frames

48:19

are quite a good data representation for

48:21

training data, for example. And that was

48:23

not one of the data models that we

48:25

discussed in the first edition, but we

48:26

decided to add to the second edition

48:28

because it has actually become a very

48:30

important data model that people are

48:32

using alongside all of the classic data

48:34

models like relational and graph and uh

48:36

JSON documents and so on. And so there are

48:40

these places where we've just

48:42

expanded the coverage a bit to

48:45

reflect the kinds of systems people are

48:48

building for example to support AI

48:51

without it like changing the direction

48:53

of the book entirely. The final

48:55

subsection in the first edition, the

48:57

first few I guess like sub parts were

49:00

titled doing the right thing and in the

49:02

second edition this has its own chapter.

49:04

The final chapter is doing the right

49:05

thing and I I quote a little bit from

49:07

it. We the engineers building these

49:09

systems have a responsibility to

49:10

carefully consider those consequences

49:12

and consciously decide what kind of

49:14

world we want to live in. Can we talk a

49:16

little bit about this section and the

49:18

importance of it?

49:19

>> Absolutely. Yeah. So the motivation for

49:21

putting in an ethics section there in

49:24

the first edition was that I just felt

49:27

it had been quite ignored as a concern

49:31

during my time in industry. um that like

49:35

especially in startups people were very

49:37

focused on like building a product that

49:40

their customers would love and really

49:42

like deprioritizing these sort of

49:45

ethical questions in the process.

49:47

And so for example with consumer-facing

49:49

products it might be that the

49:52

products are very much geared towards

49:54

essentially data harvesting collecting

49:56

behavioral data um because that's what

49:59

can be monetized in the form of

50:00

advertising and there seemed to be just

50:04

very little reflection on what was good

50:06

and bad about these sort of things. So I

50:08

really just wanted to encourage a bit of

50:11

thinking there. Um not really wanting to

50:14

prescribe too much like a a particular

50:17

approach there but at least to point out

50:20

you know, there is such a thing

50:21

as data protection legislation now which

50:24

we do have to think about in the

50:26

architecture of our data systems and

50:29

there is an ethical responsibility. You

50:31

know, people say that you get into

50:34

tech in order to change the world. If

50:36

you want to change the world, then

50:38

thinking about the impact that your

50:40

technologies have on the world is part

50:42

of your job. It's a really

50:44

essential part really and something that

50:47

engineers are often prone to ignoring as

50:49

we focus just on the technology and less

50:52

on the effects that that technology will

50:54

have out in the real world. And so this

50:56

chapter is really just an attempt to get

50:58

people thinking about it a bit. And it's

51:01

sort of a a reflection of my own process

51:04

as well because as I started working on

51:06

these systems, I didn't really think

51:08

about ethical things particularly

51:10

either. So I felt like um I had to put

51:13

that section in there for myself as well

51:15

as for the readers because it was my own

51:18

way of grappling with these questions

51:20

a bit. Is it fair to say that as

51:22

engineers building these systems that

51:24

will have an impact on a wide range

51:27

of things, potentially society-wide

51:29

impact we are just in such a good

51:31

position to directly influence and maybe

51:35

even change course. So do I understand

51:38

that this section is a bit of reminder

51:39

that by building it we have a huge

51:43

opportunity to shape these we probably

51:45

have a lot stronger voices maybe as

51:47

strong voices as later on the regulator

51:49

might have years down the road. Right.

51:51

>> Exactly. I think engineers have a very

51:53

strong voice there and like we talked

51:55

about earlier um engineers need to

51:57

articulate trade-offs in such a way that

52:00

uh business leaders can then make

52:01

educated decisions about how to address

52:03

those trade-offs. And part of those

52:05

trade-offs is pointing out risks. And

52:08

risks include not just technical risks

52:10

like the data might get corrupted, but

52:13

they include societal risks as well. For

52:15

example, like um what negative uh

52:19

effects, what harms might arise from

52:21

this technology, what sort of unintended

52:23

consequences possibly or what like uh

52:26

risk for reputational damage if it turns

52:28

out that a technology has some harmful

52:31

effects. um you know that can reflect

52:33

badly on the company that made it and

52:35

that has to be part of the trade-off

52:37

discussion and I just want people to

52:39

make intentional and deliberate

52:41

decisions about these kinds of things and

52:43

not just sweep it under the carpet. One

52:45

of the hot topics these days is of

52:48

course AI and you've written a very

52:50

interesting post about this just in

52:52

December about formal verification and

52:55

how your conviction that formal

52:57

verification might be more important

52:58

with AI. For those of

53:00

us who haven't heard of formal

53:02

verification, can we talk about what

53:03

this is and how you envision this

53:06

becoming more important? Yeah. So

53:08

there's a whole range of formal methods.

53:10

Um, one approach is to for example use a

53:13

specification language like FizzBee or

53:16

TLA+ or something like that to describe

53:19

the expected behavior of a system at a

53:21

high level and then use a model

53:24

checker which is essentially like a

53:26

randomized test case generator to just

53:29

play through a lot of scenarios and see

53:31

whether the system has those desired

53:33

behaviors in all the different

53:35

scenarios. That's like the sort of intro

53:38

level formal verification. I would say

53:40

the more advanced level is to use actual

53:43

formal proof and in that case you can

53:46

write a specification of some system in

53:49

a formal language, usually using

53:51

mathematical notation and then make a

53:54

mathematical proof that a certain

53:55

algorithm or certain implementation

53:57

always satisfies that specification. And

54:00

the distinction to testing there is that

54:02

well in testing you just try through a

54:04

couple of examples, give the algorithm

54:06

some example inputs and check whether

54:08

you get the expected output in those

54:10

particular examples. But a proof can

54:12

reason about potentially infinite state

54:13

spaces. So it can tell you things about

54:16

like every possible thing that could

54:18

possibly happen in the entire universe

54:20

and show that, for example, a certain safety

54:23

property always holds in all of those cases.

54:25

Formal verification is a lot of work.
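To make the contrast between trying a few examples and exploring every scenario concrete, here is a minimal Python sketch of an explicit-state model checker. It is not how TLA+ or FizzBee work internally; the two-process counter and the names `next_states`, `invariant`, and `check` are illustrative assumptions. Rather than sampling scenarios at random, this sketch enumerates every interleaving of a small, bounded system and returns the shortest trace that violates a safety property.

```python
from collections import deque

# Hypothetical system under test: each process increments a shared counter
# non-atomically (read the counter into a local, then write local + 1).
# A state is (counter, pcs, tmps), where pcs[i] is the program counter of
# process i: 0 = about to read, 1 = about to write, 2 = done.

def next_states(state):
    """Yield every state reachable by letting one process take one step."""
    counter, pcs, tmps = state
    for i, pc in enumerate(pcs):
        if pc == 0:    # process i reads the shared counter
            yield (counter,
                   pcs[:i] + (1,) + pcs[i + 1:],
                   tmps[:i] + (counter,) + tmps[i + 1:])
        elif pc == 1:  # process i writes back its stale local value + 1
            yield (tmps[i] + 1, pcs[:i] + (2,) + pcs[i + 1:], tmps)

def invariant(state):
    """Safety property: once all processes finish, no increment was lost."""
    counter, pcs, _ = state
    return not all(pc == 2 for pc in pcs) or counter == len(pcs)

def check(n_procs=2):
    """Breadth-first search over all reachable states.

    Returns the shortest trace that violates the invariant, or None.
    """
    init = (0, (0,) * n_procs, (0,) * n_procs)
    parent = {init: None}  # doubles as the visited set
    queue = deque([init])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            trace = []
            while state is not None:
                trace.append(state)
                state = parent[state]
            return trace[::-1]
        for nxt in next_states(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None

if __name__ == "__main__":
    for step in check(2) or []:
        print(step)
```

With two processes the checker reports a five-state trace in which both processes read 0 before either writes, so the final counter is 1 instead of 2; with a single process it finds no violation. A proof, by contrast, would cover any number of processes rather than one bounded configuration.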

54:27

Um, I never used it in my time in

54:30

industry because it's just too

54:32

time-consuming basically. I only got

54:34

into formal verification when I was in

54:36

academia and I could afford to take the

54:39

time to spend a few months proving an

54:41

algorithm correct. But there I've

54:43

started finding this very useful

54:44

especially if I was working on very

54:45

subtle algorithms where it's very hard

54:48

to tell just from reading the

54:50

implementation whether this actually is

54:52

always correct under all possible cases.

54:55

But if it's an important algorithm where

54:57

for example uh it will corrupt data if

54:59

there's a mistake in it or it will have

55:01

a security vulnerability if there's a

55:03

mistake in it then when it's high stakes

55:06

uh things like that then I feel it's

55:09

worthwhile to have uh formal

55:11

verification and to really make sure

55:14

that the code really is correct and

55:17

so I've done some uh formal proofs using

55:19

the Isabelle proof assistant for example

55:21

there are a couple of others as well

55:24

like Rocq and Lean and so on.

55:27

These proofs are really hard to write.

55:29

It takes a long time to learn the

55:32

language of writing those proofs. And

55:33

then even once you know the language,

55:35

it's just really laborious in order to

55:37

actually write the individual proof

55:39

steps. And when you say it's hard to

55:41

write, just as someone who knows how to

55:43

code in, you know, so many different

55:44

languages, can you just

55:46

explain what it means to be hard to

55:48

write? Does it feel like a

55:50

strict programming language with all

55:52

sorts of rules or lots of math formulas?

55:55

What makes it hard for you to

55:56

learn it and get good at it?

55:59

Yeah. So, you're trying to make a proof

56:01

that a certain piece of code always

56:03

satisfies a certain property. In some

56:06

cases, that property might be quite easy

56:08

to specify. Let's say, as a really

56:11

simple example, you have two lists and

56:13

you want to concatenate them. And then

56:15

you want to prove that the length of the

56:17

concatenated list equals the sum of the lengths of the

56:19

two individual lists. You know, very

56:21

very simple property. How would you

56:22

prove something like this? Well, you

56:24

would have a function that concatenates

56:26

two lists and then you would probably do

56:28

a proof by induction over one of the

56:30

lists uh that shows that okay, well, if

56:33

you have one list of length i and

56:37

another list of length zero, well then

56:39

the sum of the two is i. If you have a

56:41

list of length i appended with a list of

56:44

length one, well then it's i + one and

56:47

so on. And then by using a proof by

56:49

induction, you can then show that uh the

56:52

length of the concatenated list is i + j

56:54

where i and j are the lengths of the

56:56

two input lists for every possible value

56:59

of i and j. And this is something that

57:03

you know, in

57:05

tests you would maybe test it for the

57:06

cases of j equals 0, j equals 1 and j

57:09

equals 5. And then you're done. And j

57:11

equals int max. Yes, the edge case.

57:14

That's what we do. That's how I

57:15

write my unit test. Exactly. And so this

57:18

is a trivial example like list

57:20

concatenation. You can easily just read

57:21

the code and convince yourself that it's

57:23

correct. But if it's a much more complex

57:24

algorithm, then our brains just

57:27

can't, like, grok the algorithm well

57:29

enough to really convince ourselves that

57:31

it's correct if you don't prove it. And

57:33

that's where these proofs then

57:35

become handy. If I'm an engineer and

57:37

I would be interested in getting

57:39

started with formal verification for

57:41

example because I have the notion that

57:42

it will be more important with AI of

57:44

course it will be easier to write

57:46

these things. Where would you point

57:48

engineers to get started or how did

57:50

you get started in this field? I would

57:53

suggest starting with model checking. So

57:55

something like TLA+ or FizzBee is much

57:58

friendlier to getting started with

58:00

compared to proof assistants like Isabelle,

58:02

Rocq, and Lean. These proof

58:04

assistants just require a whole lot of

58:07

additional knowledge and the

58:10

resources for learning about writing

58:12

these formal proofs are to be honest not

58:14

particularly good. I haven't really

58:16

found really great books on it either.

58:18

The way I learned it was by working with

58:20

some colleagues in my lab who had

58:23

learned it through years of prior

58:25

experience and I just sat down with them

58:28

and paired with them at a desk where

58:30

like I described the thing I was trying

58:32

to prove and they showed me how to prove

58:34

it step by step how to break it down.

58:35

I'm interested to see if what

58:38

you're thinking will be correct, which is

58:40

this thing will go more mainstream and

58:42

hopefully we'll have better books and

58:43

resources for it as well.
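For a sense of what a machine-checked proof looks like, the list-concatenation property described earlier can be stated and proved by induction in Lean 4. This is a minimal sketch, not taken from the book or from any of the speaker's proofs: the function `concat` and the theorem name `length_concat` are defined here purely for illustration (Lean's standard library already provides `++` and a corresponding length lemma).

```lean
-- A hand-written concatenation function, defined by recursion on the first list.
def concat {α : Type} : List α → List α → List α
  | [],      ys => ys
  | x :: xs, ys => x :: concat xs ys

-- The length of the concatenation is the sum of the two lengths,
-- for every pair of lists, not just for a few tested examples.
theorem length_concat {α : Type} (xs ys : List α) :
    (concat xs ys).length = xs.length + ys.length := by
  induction xs with
  | nil =>
    -- Base case: concat [] ys = ys, and 0 + ys.length = ys.length.
    simp [concat]
  | cons x xs ih =>
    -- Inductive step: unfold one step of concat, apply the hypothesis `ih`,
    -- then let `omega` finish the remaining linear arithmetic.
    simp only [concat, List.length_cons, ih]
    omega
```

Even for a property this small, the proof has to name the induction, the base case, and the arithmetic step explicitly, which hints at why proofs of subtle algorithms become laborious.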

58:44

>> Yes, I do hope so. So the the reason I

58:47

believe that

58:50

formal verification could become more

58:52

important in the future has

58:54

several aspects to it. One is that the

58:57

LLMs are getting increasingly good at

58:59

writing these proofs and if we don't

59:00

have to write the proofs by hand as

59:02

humans, it just becomes feasible to do

59:04

them in situations where previously it

59:06

would have not been economical. But

59:08

also, LLMs increase the need for these

59:11

formal proofs because, you know, we're

59:13

vibe coding a bunch of stuff. If we have

59:15

to manually review all of that code,

59:17

then that will become the bottleneck.

59:19

So, we can't really have humans

59:20

reviewing all of the generated code

59:22

either if we really want to get the

59:24

benefits of AI. So, we need some

59:26

automated way of checking whether the

59:28

code is correct. And writing lots of

59:30

tests is a very good starting point. But

59:33

the thing that a proof can do that tests

59:35

can't is to consider absolutely every

59:37

possible thing that could happen. And

59:39

that's really important in a security

59:40

context for example where it just takes

59:42

one little bug to create a

59:44

vulnerability that destroys the security

59:45

of the whole system. And so I feel for

59:48

those domains where like really we want

59:51

to ensure there's a complete absence of

59:54

bugs, those are the kinds of places where

59:56

formal verification can really shine.

59:58

And I'm hoping that LLMs will actually

60:01

make that a lot more accessible to

60:04

people who would have previously not

60:05

considered using formal verification

60:07

because it was just too hard and too

60:08

expensive. You've worked in the industry

60:10

and then you went into academia. Can you

60:13

tell us what the difference is between

60:15

us? Myself and most people watching

60:18

work in what you would call industry and

60:20

the tech industry or work at different

60:22

companies. We're bootstrapping our own

60:24

or we're just building our

60:26

things. How does academia contrast to

60:29

this? What do you and your

60:31

colleagues do inside of academia? Yeah,

60:34

within academia, there are lots of

60:35

different styles really. There's not

60:37

one thing. Um, some people go full-on

60:41

theoretical, mathematical, don't care

60:43

about the real world at all, just want

60:46

to work on things that are

60:48

intellectually interesting. And that's

60:49

fine. And some people are very

60:52

much at the applied end of wanting to do

60:55

research that is likely to have a real

60:57

world impact. I'm more on the applied

60:59

end. And that's fine too. But a common

61:03

distinction there is that academia can

61:05

just think much longer term. So, you

61:09

know if you're doing a startup you have

61:10

to ship something within a few months.

61:11

You can't afford to think 10 years into

61:13

the future. Well, maybe you'll have sort

61:15

of a long-term vision that

61:17

you're gradually getting towards, but uh

61:19

you do have to really ship things on a

61:21

fairly short time scale. At a bigger

61:23

company, maybe if you're working on

61:25

infrastructure or so, you can think on a

61:27

bit of a longer time scale because the

61:29

requirements for what is needed

61:32

are perhaps better understood. Um and in

61:35

that case, uh you know, they're making

61:37

sure that the system is like scalable,

61:39

operationally robust, and so on. it's

61:40

then fairly clear what the requirements

61:42

are and it's still a matter of

61:44

implementing it but in that case you can

61:46

think a bit longer term but in academia

61:49

what I really appreciate is the freedom

61:51

to work on things that are long-term and

61:55

which are not like immediately

61:57

commercially viable or which are not

61:59

aligned with the incentives of

62:01

commercial companies. So one

62:03

research area that I've been working on for

62:06

several years now is what we call local

62:08

first software which is this idea that

62:12

we want to take away a bit of the power

62:15

from cloud operators and give it back to

62:18

end users. So end users should be more

62:20

in control of their own data and less

62:23

dependent on cloud services for

62:26

providing the applications and the data

62:28

that the users need. And that's

62:32

something that doesn't naturally come to

62:33

companies, right? Because uh software as

62:35

a service businesses, for example, the

62:38

whole reason why they can charge a

62:39

subscription is because they are able to

62:41

essentially hold a gun to the customer's

62:43

head and say, "Pay us your subscription,

62:45

otherwise we will delete all your data."

62:47

And I totally understand the

62:49

commercial imperatives that lead to

62:50

that, but it also leads to this

62:52

situation where like the people have a

62:54

gun against their head all of the time.

62:56

That isn't really a healthy situation to

62:58

be in in my opinion. But changing that

63:00

in such a way to take away that gun from

63:03

customers' heads is difficult if you're

63:05

in a business whose revenue depends on

63:08

perpetuating that kind of lock-in

63:10

situation. And there I feel like in

63:12

academia I have the freedom to work on

63:14

things that go against this commercial

63:16

incentive of companies and say like

63:18

actually no I'm going to do what I think

63:19

is right for the users and that I'm

63:22

going to say the commercial model of the

63:24

companies making the software is second

63:25

priority and I can afford to do that

63:28

because I'm not dependent on this

63:30

commercial model.

63:31

>> To add to this, it's very interesting

63:33

and challenging engineering problems.

63:35

Right.

63:36

>> Yes. And it's wonderful to get to work

63:38

on interesting engineering and computer

63:41

science problems while at the same time

63:43

like trying to pursue this

63:46

higher level vision for local first

63:48

software. What are some of these

63:51

really interesting engineering

63:52

challenges that we will need to solve

63:55

or we need to solve to get to a more

63:57

viable local first software? May that be

64:00

like let's say note-taking. It's a very

64:01

popular one, right?

64:02

>> Yeah. So with our vision of local first

64:05

software, we're trying to get away from

64:07

this dependency on centralized cloud

64:10

services. There may still be cloud

64:12

services involved in syncing data

64:15

between your phone and your laptop say

64:17

um because often going via cloud service

64:18

is just the most convenient way of

64:20

establishing that kind of communication.

64:22

But we just don't want to have to trust

64:25

a cloud service providing a

64:27

particular function. Then if you can get

64:29

away from assuming this one cloud

64:31

service, you could for example have

64:32

multiple cloud services on multiple

64:34

cloud providers side by side and you

64:36

just sync by whichever happens to

64:37

respond first or sync with all of them

64:40

and then if one of them disappears, no

64:42

problem because you've got the other

64:43

one. And so it gives us a huge amount of

64:46

freedom and flexibility if we get away

64:48

from this assumption of centralized

64:50

cloud services. But that introduces a

64:53

whole bunch of interesting research and

64:55

engineering challenges. So one

64:58

thing that we've been working on lately

65:00

say is access control. You know simple

65:02

problem you have a document you want to

65:04

be able to grant collaborators access

65:06

and you want to be able to revoke that

65:07

access. Again, totally obvious; it should

65:09

be totally straightforward. In a

65:11

centralized cloud service model it is

65:13

totally straightforward because

65:14

>> you have the rules you you you confirm

65:16

those sorts of things, and you check

65:17

for the right roles and that's it.

65:19

>> Yeah. But if you want to run your system

65:21

over multiple providers or even in a

65:23

peer-to-peer setting then well what

65:25

could happen is that uh a user gets

65:28

their edit permissions revoked and

65:30

concurrently that user makes an edit to

65:32

the document uh whose permissions have

65:34

just changed and now some devices may

65:37

see the edit to the document first and

65:39

the revocation second and so they would

65:41

accept the edit to the document and

65:43

another device may see it the other way

65:44

around. They may see the revocation

65:46

first and then the edit to the document

65:47

second and they'll drop the edit to the

65:49

document because they think it's not

65:50

authorized. And now those devices have

65:52

become inconsistent with each other

65:53

permanently inconsistent. So that means

65:56

if we actually want to ensure

65:58

consistency even for this fairly basic

66:00

setup we now have to somehow figure out

66:03

how to resolve this situation of an edit

66:06

that is concurrent with the revocation

66:08

of the user who made that edit, and solve

66:11

that problem in a

66:13

decentralized setting where we don't

66:14

have just a single server that can make

66:16

that decision. In a centralized setting,

66:18

you know you just have one server it

66:20

decides did the edit to the document

66:22

come first or did the revocation come

66:24

first, and that one server makes

66:26

that decision but if you have multiple

66:27

servers they might make different

66:29

decisions so then you could have a

66:31

consensus protocol but then consensus is

66:34

messy because it requires like some

66:36

quorum votes and requires nodes to be

66:38

online um and so we've been trying to do

66:40

the whole thing without doing consensus.

66:42

But doing that while preserving

66:45

high availability, while preserving the

66:47

ability for users to work offline,

66:48

preserving the ability to uh synchronize

66:51

peer-to-peer without any servers, for

66:53

example, that just makes the engineering

66:55

challenge a lot harder and it's solvable

66:57

and we are close to solving it uh for

67:00

Automerge, which is the CRDT library

67:02

that I work on. But

67:06

it's just much less straightforward than

67:08

it is in the centralized case.
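The divergence described above, where an edit is concurrent with the revocation of its author, can be reproduced in a few lines of Python. This is only a sketch of the problem, not of Automerge or of any solution to it; the `Edit`, `Revoke`, and `Replica` classes and the `run` helper are hypothetical names invented for illustration. Each replica naively applies operations in the order it happens to receive them.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Edit:
    user: str
    text: str

@dataclass(frozen=True)
class Revoke:
    user: str

@dataclass
class Replica:
    """A device that applies operations in whatever order they arrive."""
    editors: set
    doc: list = field(default_factory=list)

    def apply(self, op):
        if isinstance(op, Revoke):
            self.editors.discard(op.user)
        elif isinstance(op, Edit):
            # Accept the edit only if the author is still authorized
            # according to what this replica has seen so far.
            if op.user in self.editors:
                self.doc.append(op.text)
            # Otherwise the edit is silently dropped as unauthorized.

def run(ops):
    """Deliver ops, in the given order, to a fresh replica; return its document."""
    replica = Replica(editors={"alice", "bob"})
    for op in ops:
        replica.apply(op)
    return replica.doc

edit = Edit(user="bob", text="hello")
revoke = Revoke(user="bob")

doc_a = run([edit, revoke])  # sees the edit first, so it keeps the edit
doc_b = run([revoke, edit])  # sees the revocation first, so it drops the edit

if __name__ == "__main__":
    print(doc_a, doc_b)
```

Both replicas have received exactly the same two operations, yet `doc_a` contains Bob's edit and `doc_b` does not, and nothing in this naive scheme will ever reconcile them. A single server avoids the issue simply by picking one order for everyone.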

67:10

But that's a nice example of where

67:12

interesting engineering challenges arise

67:14

from this desire to get away from

67:16

centralized services. And then we were

67:18

just talking about clocks earlier. But

67:20

an obvious thing that came to mind is

67:22

well, if all of them had the same

67:23

clock exactly to the microsecond, you

67:25

could just use a clock, you could use a

67:27

time stamp, but as you said in

67:28

distributed systems, we cannot always

67:30

trust that the clocks are always

67:32

synchronized. So I assume

67:34

a lot of the things that

67:36

you have been researching and writing

67:38

about are just coming back to

67:40

>> Absolutely. And in this particular

67:42

setting of like a user getting their

67:44

edit permissions revoked if a revoked

67:46

user still wants to say vandalize a

67:48

document they can just backdate their

67:51

edit give it an earlier time stamp. So

67:53

relying on clocks is absolutely useless

67:55

here because people can forge the time

67:57

stamps from those clocks and thereby

67:59

then potentially undermine the access

68:01

control mechanism. So in this kind of

68:03

system, we have to worry about

68:05

potentially maliciously generated

68:08

actions as well when the actions come

68:10

from end user devices. This is

68:11

fascinating because it feels to me that

68:13

you're solving a hard or maybe even

68:16

harder engineering challenge than some

68:19

startups would do because the startups

68:20

would go the easy route. They would take

68:22

on a constraint in this case a

68:23

centralized server which makes business

68:25

sense, makes revenue sense. But because

68:27

you are not doing this, you now need to

68:30

look for a solution for a harder

68:32

problem. And if you solve this harder

68:34

problem, you can give a building block

68:36

that can just move the industry forward.

68:38

Just give an option for either a

68:40

business or an individual or an

68:42

institution to you know like have an

68:44

option not just to use centralized but

68:46

use this decentralized

68:48

local first approach and then of course

68:50

reason about the trade-off and decide

68:51

whichever makes sense.

68:53

>> Exactly. And that's what I mean with

68:54

this long-term thinking. This is an

68:55

example of it where because it's

68:57

research we can afford to take this

69:00

idealistic principled stance and say: yes,

69:03

we're going to solve this harder

69:04

engineering problem because we think

69:06

decentralization is a valuable feature

69:09

and we know perfectly well that most

69:11

startups are not going to solve this

69:12

problem because they will just do the

69:14

easy pragmatic thing which is the right

69:15

thing for startups to do. Um, but we

69:18

have a different set of incentives and

69:20

we can afford to put in the time to try

69:22

and solve those hard problems. And as

69:24

you said, if we can solve them, then it

69:26

creates more optionality for anyone, any

69:28

users of this technology, they can if

69:30

they want to choose to use this

69:32

decentralized tech. And there's still

69:34

trade-offs around it, but at least if

69:36

they're not having to invent it from

69:37

scratch, it'll be a lot easier to adopt

69:39

this kind of decentralized tech

69:42

for those who want to use it.

69:44

So inside academia you're also

69:46

teaching. Uh what courses do you teach?

69:49

At the moment I have a concurrent and

69:51

distributed systems course for the

69:53

undergraduates and a cryptographic

69:55

protocol engineering course for the

69:57

master's students. And then additionally

69:59

this year I have a seminar course

70:02

on security, and I'm also teaching

70:06

the undergraduate operating systems

70:08

course. I've got quite a lot of teaching

70:10

this year. The distributed systems

70:12

course, it's available on YouTube.

70:15

Can you summarize what people who would

70:17

go through this course which again is

70:19

freely available? Thank you to you and

70:21

the university for making it available.

70:22

What would they learn

70:23

throughout those courses? Yes. So that

70:25

distributed systems course, it's a bit

70:27

more theoretical than what is in the

70:29

book. So it's more focused on algorithms

70:32

and sort of how we convince

70:33

ourselves that the algorithms behave

70:35

correctly under the assumptions of

70:38

distributed systems that we talked about

70:40

of like nodes may crash, communication

70:42

might be unreliable, uh clocks might be

70:45

wrong, etc. So that's really it. It's

70:47

not a very long course. It's just

70:49

eight lectures' worth of material.

70:52

But it goes into substantially

70:56

more detail on the algorithms than the

70:58

book. So for example, one of the

70:59

lectures goes through the entire Raft

71:01

consensus algorithm which is pretty

71:03

complex. Um but I really wanted to show

71:05

the students exactly how it works

71:08

because it's just such a nice

71:10

illustration of the challenges of

71:12

distributed systems and the various

71:15

measures we need to take in order to

71:17

handle the various types of edge cases

71:19

and failures um that can happen and

71:22

showing that those problems can be

71:24

overcome. It's not easy and the

71:26

algorithms are very subtle and it's very

71:27

easy to have bugs in them but it is

71:29

possible to solve consensus in a

71:32

way that works pretty well and uh and so

71:35

that's really the sort of message

71:36

I'm trying to uh get across with this

71:38

course and you mentioned that when

71:40

you were writing the book

71:42

together with Chris, he brought a lot of

71:43

industry insight and being up to date

71:45

and you brought your experience of

71:48

teaching and what works. I don't

71:50

think I have a particularly like unique

71:52

teaching style just uh in lectures I

71:55

will go through slides. I I like to

71:57

annotate the slides by hand uh during

71:59

the lectures. I just draw on an

72:01

iPad to make it a little bit more

72:02

interactive. But um other than that, it

72:05

is fairly theoretical. That's partly

72:08

the way the Cambridge system works. It

72:11

kind of favors theoretical and pen and

72:13

paper courses over say implementation

72:16

practical courses. I think it would

72:19

be possible certainly to do a practical

72:21

course on this and I may incorporate a

72:23

bit more practical exercise in the

72:24

future but right now it's mostly a

72:26

theoretical pen and paper course, and

72:28

that is fine. The cryptography course

72:30

that I do is much more

72:32

hands-on. So that's about actually

72:34

getting the students to like implement

72:36

some elliptic curves from scratch for

72:37

example. And how have you seen it in

72:39

your time in academia, which has been

72:42

a longer time period now. How have

72:44

you seen computer science education

72:46

changing? How do you think it might

72:47

change further in the future

72:49

especially as we're seeing AI be part

72:52

of industry and probably the world as

72:55

well? Yeah, I mean prior to the AI explosion

72:59

happening, actually the rate of change is

73:01

very slow in computer science

73:04

teaching. Partly that might be

73:06

Cambridge, you know, Cambridge is over

73:07

800 years old like everyone thinks on

73:10

longer time scales. People don't tend to

73:12

rush into the latest fad and instead try

73:15

to focus on the fundamentals and the

73:17

ideas. A lot of the fundamentals of

73:19

computer science were developed in the

73:21

1930s already and are still true today.

73:24

and you know lambda calculus and those

73:26

types of things for example and so we

73:29

have quite a bit of a focus on those

73:30

sort of fundamentals rather than chasing

73:33

the latest uh fashionable thing. That

73:36

said, AI has totally changed the way we

73:39

can assess coursework, for example,

73:42

because of course now we can try

73:45

banning AI, but it's impossible to

73:47

actually enforce such a ban. And also,

73:49

it's kind of counterproductive because

73:51

we do want students to engage with new

73:54

technologies and figure out how to use

73:56

them productively for themselves. But we

73:58

want to somehow do that in a way that

74:01

supports their own learning and doesn't

74:02

undermine it. So, how do we get the

74:05

students to use AI in a responsible

74:09

way, in a way that's mature? And we

74:12

can't necessarily rely on the students

74:14

being mature enough to know for

74:16

themselves what is a helpful use of AI

74:19

and what is a form of use of AI that

74:22

undermines their own learning because

74:24

some of them are quite mature and able

74:26

to decide that for themselves, but many

74:28

are not and so we need to provide some

74:30

guardrails for them. Um and we do need

74:33

to make sure that when we have assessed

74:34

work for example it's fair and it's

74:36

perceived as fair by the students and if

74:39

the students feel that some of their

74:42

fellow students are getting really good

74:44

marks without doing any work that

74:46

undermines the trust in the entire

74:48

system and so we have to be very careful

74:50

with how we approach this and to be

74:53

honest we don't really have good answers

74:55

yet. So we do uh now for example have a

74:58

boot camp right at the start of the

74:59

first year for the new students to

75:01

expose them to basic software

75:03

engineering skills which is like this is

75:05

version control, this is unit testing,

75:07

this is generative AI and the sort of

75:10

basics that really everyone should be

75:12

familiar with and then the hope is that

75:13

they will use that throughout their

75:15

degree in order to just improve the work

75:18

that they do. But how exactly we handle

75:20

things for assessment for example we're

75:22

still in the process of figuring

75:23

out. So it sounds like the

75:26

pace of change is going to be fast in

75:29

the industry and also in academia we'll

75:31

probably adopt it and we'll see you know

75:33

like what comes after. Yes. There's

75:35

a difference though which is in the

75:37

desired outcomes. I think with industry

75:39

generally the desired outcome is like a

75:41

working product for example. In academia

75:44

the actual artifacts that the students

75:47

produce like an essay that the students

75:49

write that's not really the point. We

75:51

don't ask the students to write essays

75:52

because we love reading their amazing

75:54

essays. We ask them to write essays

75:55

because we want them to go through a

75:57

thought process which helps them learn

75:58

something. And it's that thought process

76:00

and that learning which is really the

76:02

desired outcome here. And so that

76:05

means that we do have to approach it a

76:07

little differently because generally

76:09

in industry, you know, if you can use

76:11

AI to get a job done faster and you get

76:13

to an equivalent result, do it

76:15

absolutely because yes, that is the

76:18

desired outcome. uh whereas in education

76:22

we do have to think about how we ensure

76:23

that the learning outcomes and the

76:25

thought processes are still preserved

76:27

such that the students benefit

76:30

intellectually. It's very relevant

76:32

especially as Anthropic had a recent study

76:34

where they looked at junior engineers

76:36

one group used AI,

76:38

the other one did not and they found

76:41

unsurprisingly from what you also

76:44

explained that the group who used AI

76:46

they had little to no learning whereas

76:48

the group that did not they actually

76:50

learned it. Yes, I saw that study as

76:52

well. I think the detailed methods

76:55

of that study we might be able to

76:56

quibble with a bit, but I think the

76:59

general principle seems true that yes so

77:02

sometimes in order to learn something

77:03

you just have to struggle with it a bit

77:05

not struggle too much so if people are

77:07

stuck on some technicality and they can

77:10

use AI to get unblocked and then be able

77:12

to focus really on the main learning

77:14

outcome then I think uh it's good to use

77:17

these types of tools, but if the point

77:20

is to actually like grapple with some

77:21

difficult ideas and think them through

77:24

in their own minds, then we need to still

77:25

find ways to make sure the students are

77:28

doing that.

77:28

>> You work both in industry and academia.

77:30

What do you think industry could

77:32

learn from academia and academia can

77:34

learn from industry? The two really

77:36

could be closer together because often

77:38

they regard each other with uh sort of

77:40

disrespect, really. Like, the industry

77:43

people will say, "Ah, that's

77:44

theoretical, that's academic, it's got

77:47

nothing to do with the real world." and

77:48

they're really missing a trick there

77:49

because actually there's a lot of

77:51

interesting insights from research that

77:53

are very relevant to the real world. Um

77:55

but they're not necessarily making their

77:57

way across that chasm. In the other

77:59

direction, the academics will say, "Oh,

78:01

this industry stuff, you know, that's

78:02

just engineering." They're not actually

78:03

doing any interesting thinking. It's

78:06

just like writing routine stuff. I think

78:08

I see it as one of my goals to try and

78:10

build better respect in both

78:13

directions by bringing interesting

78:16

insights from research into industrial

78:18

practice but also by informing our

78:21

research uh by the problems that uh

78:24

arise in the real world, and so that way

78:28

like joining those two things up a bit

78:30

better. What are your current research

78:32

topics that you're working on, ones that

78:34

you're excited about? I have two main

78:36

areas I'm working on at the moment. Uh

78:38

one is local first software. So that's

78:41

this idea that we want collaborative

78:43

software like Google Docs, like Figma,

78:46

etc., but in a way that uh gives better

78:50

protection to users' data, that's less

78:52

dependent on a single cloud provider who

78:54

can lock you out of your files and

78:56

that's therefore more resilient. Uh

78:59

gives users greater agency and greater

79:01

autonomy over their own data. So

79:04

that's an area that I've been working on

79:05

for the last 10 years or so through a

79:08

mixture of open source work and

79:10

algorithm development and formal

79:12

verification and so on. I'm now also

79:14

trying to set up a brand new research

79:17

area in a totally different topic um

79:20

which is on using cryptography to prove

79:22

things about the physical world. So I'm

79:24

interested there in especially

79:26

sustainability related things. So for

79:28

example, if you want to verify that the

79:30

carbon emissions involved in

79:32

manufacturing a particular product were

79:34

X and you want to be sure that that

79:36

number is correct because maybe you want

79:38

to include emissions as part of your

79:40

purchasing decision and choose the

79:42

product with the lower emissions. For

79:43

that to be meaningful, the emissions

79:45

number has to be correct. And

79:47

unfortunately at the moment the numbers

79:49

are generally not correct because the

79:50

incentives are to lie and cheat and to

79:53

use creative accounting techniques all

79:55

as a way of, like, greenwashing, basically.

79:58

Or a related thing is happening in the

80:01

EU for example which is bringing in new

80:03

regulations on preventing deforestation

80:06

of tropical rainforests. So that's for

80:08

example coffee, cocoa, palm oil etc

80:11

imported into the EU. The importer needs

80:13

to prove exactly which plot of land it

80:15

actually came from and then check

80:16

against satellite imagery that that was

80:18

not recently deforested. And so I've

80:20

been looking into using cryptography as

80:23

a tool of proving things about the

80:26

supply chains of these physical products

80:28

but without revealing commercially

80:29

sensitive information. For example, a

80:31

company will not want to reveal who its

80:33

suppliers were and which ingredient to

80:36

its process it purchased from which

80:38

supplier, for example, because that

80:40

might reveal something about its secret

80:41

recipe that it uses. And so the hope

80:45

here is that cryptography can allow us

80:48

to prove that, for example, the

80:49

accounting has been done correctly

80:51

across supply chains but without having

80:53

to reveal publicly any of this sensitive

80:56

data about suppliers or other customers.

80:59

What is your view from your vantage

81:02

point on the impact that AI is having on

81:04

academia, not just for students

81:07

studying beyond that and also industry

81:09

with your industry contacts? Yeah, I

81:12

mean, I'm not that deeply into the

81:15

AI things really. I'm seeing it more

81:17

through my collaborators who are making

81:20

very good use of AI tools for

81:24

software development especially. I

81:26

personally write very little code these

81:28

days and so I haven't had that much need

81:31

or occasion to actually use AI agents

81:33

myself personally. When writing

81:36

prose, like working on the book, for

81:37

example, I prefer to still do that the

81:39

old-fashioned way of just writing every

81:40

word by hand. So I haven't let AI

81:43

anywhere near the text of the book for

81:44

example. And I don't know if that's

81:46

the right decision. It's not

81:48

really a principled thing, that I

81:51

think it would be wrong to do so. It's

81:52

more that for myself the process of

81:54

writing is the way how I figure things

81:56

out and figuring things out is really my

81:59

goal here. So I'm trying to figure

82:01

it out in my own head and for that I

82:03

just have to write it myself. There

82:05

doesn't seem to be any way around it.

82:06

But using AI as a way of like getting

82:08

feedback on ideas or exploring like

82:12

whether an idea really holds up to

82:13

scrutiny or things like that seems like

82:16

a very productive use of the technology

82:18

and that applies for both industry

82:20

and academia, I would say. So, as

82:22

closing for a student or a young

82:25

professional who is still studying

82:28

and considering the route into either

82:29

industry or academia, what have you seen

82:32

uh who thrives in one or the other?

82:36

>> Yeah, my feeling is they're not really

82:38

that mutually exclusive or rather some

82:40

of the best PhD students uh I've worked

82:43

with for example actually have a few

82:45

years of industry experience. So they

82:47

might have done an undergraduate maybe

82:49

done a masters then spent a few years in

82:51

industry, actually doing

82:54

real software engineering learning about

82:55

the real world uh and then maybe at some

82:58

point got bored and thought oh actually

82:59

you know I want to work on maybe more

83:02

idealistic things or have more freedom

83:04

to choose uh their own research topics

83:06

and then start getting interested in

83:08

doing a PhD, and that I find is quite

83:11

a healthy route. You do get people who

83:13

go, you know, straight from their

83:15

undergraduate degree and masters into

83:17

doing a PhD, but sometimes those people

83:20

can just lack a bit of the breadth of

83:22

perspective. And so I think having seen

83:24

a bit of just real world engineering is

83:27

actually really helpful for people

83:29

even if they then want to stay in

83:31

research. But in the opposite direction,

83:33

I think it can work very well too

83:34

because in research in

83:37

academia, we just get to think things

83:39

through a lot more carefully than people

83:42

often do in industry. Often people in

83:44

industry, I feel like sort of have short

83:46

circuit reasoning, like don't maybe

83:48

don't quite reason something through

83:50

from first principles, but just like uh

83:52

oh, I heard this from a conference talk.

83:54

I'm just going to go with that. And oh

83:56

yeah, what academia can teach is

83:58

this sort of nuanced and critical

84:02

thinking um to really reason through

84:05

trade-offs, for example, and to really

84:08

like justify why something is true. And

84:11

so I think it's really good actually if

84:13

people can weave in and out of industry

84:16

and academia a bit and not regard it as

84:18

like two totally mutually exclusive

84:19

career paths, but actually have a bit of

84:22

switching between the two.

84:23

>> Well, Martin, thank you very much. Uh I

84:25

expected us to talk a lot more about

84:26

your book, which we did, but I have a

84:29

newfound curiosity and respect for

84:31

all the important and interesting

84:33

academic work that you and everyone else

84:35

is doing. So thank you so much for this.

84:37

Thank you for the great interview. This

84:38

was really interesting.

84:39

>> I hope you enjoyed this rare

84:40

conversation with Martin Kleppmann. I

84:43

found it interesting to learn that the

84:44

first edition of the book assumed that

84:45

you have machines with local discs. But

84:48

actually today this is not how most

84:50

engineers build systems anymore.

84:51

Cloud-native primitives like S3 change

84:53

how you build systems and this is why

84:55

this book just needed a refresh. I also

84:57

appreciated Martin's take on whether

84:59

engineers still need to understand system

85:01

internals when they're using managed

85:03

services. If you're building business

85:04

logic on top of these services, you

85:06

probably don't need to know every

85:08

detail, but it can become useful to be

85:11

able to look deeper, especially when you

85:13

need to debug your system. By the end of

85:15

our conversation, I gained a lot of

85:17

appreciation for the academic research

85:18

that Martin is doing: the local-first

85:20

software work, the access control

85:22

problem in decentralized systems, using

85:24

cryptography to verify supply chain

85:26

emissions. A lot of these are hard

85:28

engineering problems that few startups

85:30

would take on. It was nice to understand

85:31

how academia is in a good position to do

85:33

work that has a long-term focus. Do

85:36

check out the show notes below for

85:37

related Pragmatic Engineer deep dives.

85:39

If you've enjoyed this podcast, please

85:40

do subscribe on your favorite podcast

85:42

platform and on YouTube. A special thank

85:44

you if you also leave a rating on the

85:45

show. Thanks and see you in the next

85:48

one.

Interactive Summary

This episode features Martin Kleppmann, author of 'Designing Data-Intensive Applications', discussing his journey from startup founder and LinkedIn engineer to academia. The conversation covers the evolution of his generational book, the necessity of the second edition due to shifts toward cloud-native architectures, and his current research in academia, specifically focusing on local-first software and the use of cryptography for physical-world verification.