
The Infinite Software Crisis – Jake Nations, Netflix


Transcript


0:13

[music]

0:20

Hey everyone, good afternoon. Um, I'm

0:23

going to start my talk with a bit of a

0:24

confession. Uh, I've shipped code I

0:28

didn't quite understand. Generated it,

0:30

tested it, deployed it. Couldn't explain

0:32

how it worked. And here's the thing,

0:34

though. I'm willing to bet every one of

0:36

you has, too. [applause]

0:40

So, now that we've admitted that we all

0:41

ship code that we don't understand

0:42

anymore, I want to take a bit of a

0:44

journey to see how this has come

0:45

to be. First, look back in history. We

0:48

see that history tends to repeat itself.

0:50

Second, we've fallen into a bit of a

0:52

trap. We've confused easy with simple.

0:55

Lastly, there is a fix, but it requires

0:57

us not to outsource our thinking.

1:01

So, I spent the last few years at

1:03

Netflix helping drive adoption of AI

1:04

tools, and I have to say the

1:05

acceleration is absolutely real. Backlog

1:08

items that used to take days now take

1:09

hours, and large refactors that have

1:12

been on the books for years are finally

1:13

being done. Here's the thing, though.

1:16

Large production systems always fail in

1:18

unexpected ways. Like, look what

1:19

happened with Cloudflare recently. When

1:21

they do, you better understand the code

1:23

you're debugging. And the problem is now

1:25

we're generating code at such speed and

1:26

such volume our understanding is having

1:28

a hard time keeping up.

1:32

Hell, I know I've done it myself. I've

1:34

generated a bunch of code, looked at it,

1:36

thought, I have no idea what

1:38

this does. But, you know, the tests pass,

1:40

it works. So, I shipped it. The thing

1:43

here is this isn't really new. Every

1:44

generation of software engineers has

1:46

eventually hit a wall where software

1:48

complexity has exceeded their ability to

1:49

manage it. We're not the first to

1:51

face a software crisis. We're the first

1:53

to face it at this infinite scale of

1:54

generation. So let's take a step back to

1:57

see where this all started.

1:59

In the late '60s, early '70s, a bunch of

2:02

smart computer scientists at the time

2:03

came together and said, "Hey, we're in a

2:05

software crisis. We have this huge

2:08

demand for software and yet we're not

2:10

really able to keep up and like projects

2:12

are taking too long and it's just really

2:14

slow. We're not doing a good job."

2:16

So Dijkstra came up with a really

2:19

great quote, and to paraphrase a

2:21

longer quote, he said that when we

2:22

had a few weak computers,

2:23

programming was a

2:26

mild problem and now we have gigantic

2:27

computers, programming has become a

2:29

gigantic problem. He was explaining that as

2:31

hardware power grew by a factor of a

2:33

thousand, society's want for software

2:35

grew in proportion and so it left us the

2:38

programmers to figure out between the

2:40

ways and the means, how to support

2:42

this much more software.

2:44

So this kind of keeps happening in a

2:46

cycle. In the 70s we get the C

2:48

programming language so we could write

2:49

bigger systems. The 80s we have personal

2:51

computers. Now everyone can write

2:52

software. In the '90s we get

2:54

object-oriented programming, inheritance

2:57

hierarchies from hell, and, you know,

2:58

thanks Java for that. In the 2000s we

3:01

get agile, and we have sprints and scrum

3:03

masters telling us what to do. There's

3:05

no more waterfall. In the 2010s we had

3:07

cloud, mobile, DevOps, you know, everything.

3:09

Software truly ate the world.

3:12

And today, we have AI. You know,

3:14

Copilot, Cursor, Claude, Codex,

3:16

Gemini, you name it. We can generate

3:17

code as fast as we can describe it. The

3:20

pattern continues, but the scale has

3:21

really changed. It's infinite now.

3:25

So, uh, Fred Brooks, you might know him

3:27

from writing The Mythical Man-Month. He

3:29

also wrote a paper in 1986 called No

3:31

Silver Bullet. And in this, he argued

3:33

that there'd be no single innovation

3:35

that would give us an order of magnitude

3:37

improvement in software productivity.

3:39

Why? Because he said the hard part

3:41

wasn't ever the mechanics of coding: the

3:44

syntax, the typing, the boilerplate. It

3:46

was about understanding the actual

3:47

problem and designing the solution. And

3:49

no tool can eliminate that fundamental

3:51

difficulty. Every tool and technique

3:53

we've created up to this point makes the

3:54

mechanics easier. The core challenge

3:56

though, understanding what to build, how

3:58

it should work, remains just as hard.

4:03

So, if the problem isn't in the

4:04

mechanics, why do we keep optimizing for

4:06

it? How do experienced engineers end up

4:07

with code they don't understand? Now,

4:09

the answer, I think, comes down to two

4:11

words we tend to confuse: simple and

4:13

easy. We tend to use them

4:15

interchangeably, but they really mean

4:16

completely different things. Uh I was

4:19

outed at the speaker dinner as being a

4:20

Clojure guy, so this is kind of clear

4:22

here. But Rich Hickey, the creator of

4:24

the Clojure programming language,

4:25

explained this in his talk from 2011

4:27

called Simple Made Easy. He defined

4:30

simple meaning one fold, one braid, and

4:32

no entanglement. Each piece does one

4:34

thing and doesn't intertwine with

4:35

others. He defined easy as meaning

4:38

adjacent. What's within reach? What can

4:39

you access without effort? Copy, paste,

4:42

ship. Simple is about structure. Easy is

4:45

about proximity.

4:48

The thing is we can't make something

4:49

simple by wishing it. So simplicity

4:51

requires thought, design and untangling.

4:54

But we can always make something easier.

4:56

You just put it closer. Install a

4:58

package, generate it with AI, you know,

5:00

copy a solution off of Stack Overflow.

5:03

It's human nature to take the easy

5:05

path. We're wired for it. You know, as I

5:08

said, copy something from Stack

5:09

Overflow. It's right there. A framework

5:10

that handles everything for you with

5:12

magic. Install and go. But easy doesn't

5:15

mean simple. Easy means you can add to

5:16

your system quickly. Simple means you

5:18

can understand the work that you've

5:20

done. Every time we choose easy, we're

5:22

choosing speed now. Complexity later.
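
To make the distinction concrete, here is a minimal, hypothetical sketch (not from the talk; every name is invented): the "easy" version ships fast but braids the authorization limit and approval policy into order creation, while the "simple" version keeps each concern as one strand you can read and test on its own.

```typescript
// Hypothetical illustration of "easy" vs. "simple" (not from the talk).
type User = { id: string; role: "admin" | "member" };
type Order = { userId: string; items: string[]; approved: boolean };

// Easy: quick to paste in, but the limit check and the approval policy
// are entangled with order creation in one place.
function placeOrderEasy(user: User, items: string[]): Order {
  if (user.role !== "admin" && items.length > 10) {
    throw new Error("members may only order 10 items");
  }
  const approved = user.role === "admin" || items.length <= 3; // policy hidden mid-function
  return { userId: user.id, items, approved };
}

// Simple: each rule is one strand, visible and testable on its own.
const withinMemberLimit = (items: string[]): boolean => items.length <= 10;
const autoApprove = (user: User, items: string[]): boolean =>
  user.role === "admin" || items.length <= 3;

function placeOrderSimple(user: User, items: string[]): Order {
  if (user.role !== "admin" && !withinMemberLimit(items)) {
    throw new Error("members may only order 10 items");
  }
  return { userId: user.id, items, approved: autoApprove(user, items) };
}
```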

5:24

And honestly,

5:25

that trade-off really used to work. The

5:28

complexity accumulated in our codebases

5:30

slowly enough that we could refactor,

5:32

rethink, and rebuild when needed. I

5:34

think AI has destroyed that balance

5:36

because it's the ultimate easy button. And

5:37

it makes the easy path so frictionless

5:39

that we don't even consider the simple

5:40

one anymore. Why think about

5:42

architecture when code appears

5:43

instantly?

5:46

So let me show you how this happens. How

5:47

a simple task evolves into a mess of

5:49

complexity through a conversational

5:50

interface that we've all come to love.

5:53

You know this is a contrived example but

5:54

you know say we have our app. We want to

5:56

add uh some authentication to it. We say

5:58

add auth. So we get a nice clean auth.js file.

6:01

Iterate on it a few times, it gets a message

6:02

file. You're like okay cool. We're going

6:03

to add OAuth now too, and now

6:06

we've got an auth.js and an oauth.js. We keep

6:08

iterating, and then we find

6:09

that sessions are broken and we got a

6:11

bunch of conflicts and by the time you

6:12

get to turn 20, you're not really having

6:14

a discussion anymore. You're managing

6:15

context that has become so complex that even

6:18

you don't remember all the constraints

6:19

that you've added to it. Dead code from

6:21

abandoned approaches. Uh tests that got

6:23

fixed by just making them work. You

6:25

know, fragments of three different

6:26

solutions because you kept saying, "wait,

6:28

actually." Each new instruction is

6:30

overwriting architectural patterns. We

6:32

said, make the auth work here. It did.

6:33

When we said fix this error, it did.

6:35

There's no resistance to bad

6:37

architectural decisions. The code just

6:38

morphs to satisfy your latest request.

6:40

Each interaction is choosing easy over

6:42

simple. And easy always means more

6:45

complexity. We know better. But when the

6:48

easy path is just this easy, we take it.

6:50

And complexity is going to compound

6:51

until it's too late.
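
As a purely hypothetical sketch of what that turn-20 state can look like (none of this is from the talk; the file and function names are invented), the leftovers of each abandoned approach stay behind and quietly conflict:

```typescript
// auth.ts after ~20 conversational turns (hypothetical illustration).

// Turn 3: cookie-session approach; nothing calls this anymore.
export function createSession(userId: string): string {
  return `session-${userId}-${Date.now()}`;
}

// Turn 9: token approach, added when "sessions were broken".
export function issueToken(userId: string): string {
  return Buffer.from(JSON.stringify({ sub: userId, iat: Date.now() })).toString("base64");
}

// Turn 14: OAuth callback that still assumes the turn-3 sessions exist,
// so the two flows now disagree about where identity lives.
export function oauthCallback(userId: string): { redirect: string; session: string } {
  return { redirect: "/home", session: createSession(userId) };
}

// Turn 19: failing test "fixed by just making it work".
export function verifyToken(_token: string): boolean {
  return true; // relaxed to unblock the build
}
```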

6:55

AI really takes easy to its logical

6:57

extreme. Decide what you want. Get code

7:00

instantly. But here's the danger in

7:02

that. The generated code treats every

7:04

pattern in your codebase the same. You

7:07

know, when an agent analyzes your

7:08

codebase, every line becomes a pattern

7:10

to preserve. The authentication check on

7:11

line 47, that's a pattern. That weird

7:14

gRPC code that's acting like GraphQL

7:16

that I may have had in 2019, that's also

7:18

a pattern. Technical debt doesn't

7:20

register as debt. It's just more code.

7:22

The real problem here is complexity. I

7:25

know I've been saying that word a bunch

7:27

in this talk without really defining it,

7:29

but the best way to think about it is

7:30

it's the opposite of simplicity. It just

7:32

means intertwined. And when things are

7:33

complex, everything touches everything

7:35

else. You can't change one thing without

7:37

affecting 10 others.

7:41

So, back to Fred Brooks's No Silver Bullet

7:43

paper. In it, he identified that there are

7:45

two main types of complexity in every

7:46

system. There's the essential

7:48

complexity, which is really the

7:50

fundamental difficulty of the actual

7:52

problem you're trying to solve. Users

7:53

need to pay for things, orders must be

7:55

fulfilled. This is the complexity of why

7:57

your software system exists in the first

7:58

place. And then second, there's this

8:01

idea of accidental complexity.

8:03

Everything else we've added along the

8:04

way, workarounds, defensive code,

8:06

frameworks, abstractions that made sense

8:08

a while ago, it's all the stuff that we

8:10

put together to make the code itself

8:11

work.

8:13

In a real codebase, these two types of

8:14

complexity are everywhere and they get

8:17

so tangled together that separating them

8:18

requires context, history, and

8:19

experience.

8:21

The generated output makes no such

8:23

distinction, and so every pattern just

8:25

keeps getting preserved.
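
Here is a small, hypothetical sketch (not from the talk; all names invented) of the two kinds of complexity sitting side by side. To a generated diff, the one-charge-per-order rule and the years-old gateway workaround both look like patterns to preserve.

```typescript
// Hypothetical illustration of essential vs. accidental complexity.
type Payment = { orderId: string; amountCents: number };

// Stand-in gateways so the sketch is self-contained.
const gateway = { charge: async (_orderId: string, _cents: number): Promise<void> => {} };
const legacyGateway = { charge: async (_orderId: string, _dollars: string): Promise<void> => {} };
const alreadyCharged = new Set<string>();

async function chargeCustomer(payment: Payment): Promise<void> {
  // Essential: why the system exists — customers must pay, and only once per order.
  if (payment.amountCents <= 0) throw new Error("nothing to charge");
  if (alreadyCharged.has(payment.orderId)) return;

  // Accidental: a retry loop for a flaky gateway, plus a legacy fallback left
  // over from an old migration. It makes the code work; it is not part of the
  // problem being solved.
  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      await gateway.charge(payment.orderId, payment.amountCents);
      alreadyCharged.add(payment.orderId);
      return;
    } catch {
      if (attempt === 2) {
        await legacyGateway.charge(payment.orderId, (payment.amountCents / 100).toFixed(2));
        alreadyCharged.add(payment.orderId);
      }
    }
  }
}
```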

8:29

So here's a real example from uh some

8:31

work we're doing at Netflix. I have a

8:32

system that has an abstraction layer

8:34

sitting between our old authorization

8:36

code we wrote say five or so years ago

8:38

and a new centralized auth system. We

8:41

didn't have time to rebuild our whole

8:42

app. So we just kind of put a shim in

8:44

between. So now that we have AI, this is a

8:46

great opportunity to refactor our code

8:47

to use the new system directly. Seems

8:49

like a simple request, right?
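
For a sense of what such a shim can look like, here is a minimal hypothetical sketch (not the actual Netflix code; the interfaces and names are invented): an adapter that lets five-year-old synchronous call sites keep working against a new asynchronous centralized service.

```typescript
// Hypothetical shim between old authorization call sites and a new central service.

// What hundreds of old call sites expect: a synchronous yes/no.
interface LegacyAuthorizer {
  canAccess(userId: string, resource: string): boolean;
}

// What the new centralized system offers: an asynchronous check.
interface CentralAuthClient {
  check(req: { principal: string; action: string; resource: string }): Promise<boolean>;
}

// The shim answers synchronously from a cache and refreshes in the background —
// exactly the kind of accidental complexity a later refactor has to understand.
class CentralAuthShim implements LegacyAuthorizer {
  private decisions = new Map<string, boolean>();

  constructor(private readonly client: CentralAuthClient) {}

  canAccess(userId: string, resource: string): boolean {
    const key = `${userId}:${resource}`;
    void this.client
      .check({ principal: userId, action: "read", resource })
      .then((allowed) => this.decisions.set(key, allowed))
      .catch(() => undefined); // fail closed on errors
    return this.decisions.get(key) ?? false;
  }
}
```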

8:52

And no, it's like the old code was just

8:54

so tightly coupled to its authorization

8:56

patterns. Like we had permission checks

8:58

woven through business logic, role

8:59

assumptions baked into data models and

9:01

auth calls scattered across hundreds of

9:03

files. The agent would start

9:05

refactoring, get a few files in and hit

9:07

a dependency it couldn't untangle and just

9:09

spiral out of control and give up or

9:11

worse, it would try to preserve some

9:13

existing logic from the old system

9:16

and recreate it using the new system,

9:17

which I think is not great either.

9:21

The thing is, it couldn't see the seams.

9:23

It couldn't identify where the business

9:24

logic ended and the auth logic began.

9:26

Everything was so tangled together that

9:28

even with perfect information, the AI

9:31

couldn't find a clean path through. When

9:33

your accidental complexity gets this

9:34

tangled, AI is not the best help to

9:37

actually make it any better. I found it

9:39

only adds more layers on top.
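
To show what "no seams" means in practice, here is a small hypothetical sketch (not the real codebase; every name is invented) where the permission decision, a role assumption, and the business rule live in one function, so neither a human nor an agent can lift the auth logic out cleanly without tracing every caller.

```typescript
// Hypothetical illustration of auth woven through business logic.
type Title = { id: string; rating: string; ownerTeam: string };
type Employee = { id: string; team: string; roles: string[] };

function updateTitleRating(actor: Employee, title: Title, newRating: string): Title {
  // Authorization decision in the middle of the business rule.
  if (!actor.roles.includes("metadata-editor") && actor.team !== title.ownerTeam) {
    throw new Error("not allowed");
  }
  // Role assumption baked into the data rule itself: admins skip validation.
  if (!actor.roles.includes("admin") && !["G", "PG", "PG-13", "R"].includes(newRating)) {
    throw new Error("invalid rating");
  }
  return { ...title, rating: newRating };
}
```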

9:42

We can tell the difference, or at least

9:43

we can when we slow down enough to

9:45

think. We know which patterns are

9:47

essential and which are just how someone

9:48

solved it a few years ago. We carry the

9:51

context that the AI can infer, but only

9:53

if we take time to make

9:54

these distinctions before we start.

9:59

So how do you actually do it? How do you

10:02

separate the accidental and essential

10:04

complexity when you're staring at a huge

10:05

codebase? The codebase I work on at Netflix has

10:08

around a million lines of Java and the

10:10

main service in it is about 5 million

10:11

tokens, last time I checked. No context

10:14

window I have access to uh can hold it.

10:17

So when I wanted to work with it, I

10:18

first thought, hey, maybe I could just

10:19

copy large swaths of this codebase into

10:21

the context and see if the

10:23

patterns emerged, see if it would

10:24

just be able to figure out what's

10:25

happening. And just like the

10:27

authorization refactor from previously,

10:28

[clears throat] the output just got lost

10:30

in its own complexity. So with this, I

10:32

was forced to do something different. I

10:34

had to select what to include. Design

10:36

docs, architecture diagrams, key

10:37

interfaces, you name it, and take time

10:39

writing out the requirements of how

10:40

components should interact and what

10:42

patterns to follow.

10:44

See, I was writing a spec. Uh 5 million

10:47

tokens became 2,000 words of

10:48

specification. And then to take it even

10:50

further, take that spec and create an

10:52

exact set of steps of code to

10:54

execute. No vague instructions, just a

10:56

precise sequence of operations. I found

10:58

this produced much cleaner and more

11:00

focused code that I could understand, as

11:02

I defined it first and planned its own

11:04

execution.

11:08

This became the approach which I called

11:10

context compression a while ago. But you can

11:11

call it context engineering or

11:12

spec-driven development, whatever you

11:14

want. The name doesn't matter. What

11:17

matters here is that thinking and

11:18

planning become a majority of the work.

11:20

So let me walk you through how this

11:22

works in practice.

11:24

So we have step one, phase one,

11:25

research. You know, I go and feed

11:27

everything to it up front. Architecture

11:29

diagrams, documentation, Slack threads.

11:31

I've been over this a bunch, but really

11:33

just bring as much context as you can

11:34

that's going to be relevant to the

11:35

changes you're making. And then use the

11:38

agent to analyze the codebase and map

11:39

out the components and dependencies.

11:42

This shouldn't be a one-shot process. I

11:43

like to probe, say, what about the

11:45

caching? How does this handle failures?

11:47

And when its analysis is wrong, I'll

11:48

correct it. And if it's missing context,

11:50

I provide it. Each iteration refines

11:53

its analysis.

11:55

The output here is a single research

11:56

document. Here's what exists. Here's

11:58

what connects to what. And here's what

12:00

your change will affect. Hours of

12:01

exploration are compressed into minutes

12:03

of reading.

12:05

[snorts] I know Dex mentioned it this

12:07

morning, but the human checkpoint here

12:08

is critical. This is where you validate

12:10

the analysis against reality. The

12:12

highest leverage moment in the entire

12:14

process. Catch errors here. Prevent

12:16

disasters later.

12:19

Onto phase two. Now that you have some

12:21

valid research in hand, we create a

12:22

detailed implementation plan. Real

12:24

code structure, function signatures,

12:26

type definitions, data flow. You want

12:28

this to be so any developer can follow

12:30

it. I kind of liken it to paint by

12:32

numbers. You should be able to hand it

12:33

to your most junior engineer and say,

12:34

"Go do this." And if they copy it line

12:37

by line, it should just work.
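
As a rough, hypothetical illustration of that level of detail (not an actual Netflix plan; every name here is invented), a plan fragment might pin down the types and signatures before any code is generated, so review means checking conformance rather than guessing intent:

```typescript
// Plan fragment (hypothetical): interfaces the implementation must conform to.

// Step 1 — define the seam between business logic and authorization.
export interface AuthorizationDecision {
  allowed: boolean;
  reason: string;
}

export interface Authorizer {
  authorize(principal: string, action: string, resource: string): Promise<AuthorizationDecision>;
}

// Step 2 — business logic depends only on the interface, never on the old shim.
export declare function fulfillOrder(
  orderId: string,
  authorizer: Authorizer
): Promise<{ fulfilled: boolean }>;

// Step 3 — one adapter implements Authorizer against the centralized service;
// the plan then lists each call site to migrate and the test that covers it.
```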

12:40

This step is where we make a lot of the

12:41

important architectural decisions. You

12:43

know, make sure complex logic is

12:45

correct. Make sure business requirements

12:47

are, you know, following good practice.

12:50

Make sure there are good service

12:51

boundaries, clean separation, and

12:52

no unnecessary coupling. We

12:54

spot the problems before they happen

12:56

because we've lived through them. AI

12:58

doesn't have that option. It treats

12:59

every pattern as a requirement.

13:02

The real magic in this step is the

13:04

review speed. We can validate this plan

13:06

in minutes and know exactly what's going

13:08

to be built. And in order to keep up

13:10

with the speed at which we want to

13:12

generate code, we need to be able to

13:13

comprehend what we're doing just as

13:15

fast.

13:17

Lastly, we have implementation. And now

13:20

that we have a clear plan,

13:22

backed by clear research, this phase

13:24

should be pretty simple. And that's the

13:27

point. You know, when AI has a clear

13:29

specification to follow, the context

13:31

remains clean and focused. We've

13:33

prevented the complexity spiral of long

13:34

conversations. And instead of 50

13:36

messages of evolutionary code, we have

13:38

three focused outputs, each validated

13:40

before proceeding. No abandoned

13:42

approaches, no conflicting patterns, no

13:44

"wait, actually" moments that leave dead

13:46

code everywhere.

13:48

To me, the real payoff of

13:50

this is that you can use a background

13:51

agent to do a lot of this work because

13:53

you've done all the thinking and hard

13:55

work ahead of time. It can just start

13:57

the implementation. You can go work on

13:59

something else and come back to review

14:01

and you can review this quickly because

14:03

you're just verifying it's conforming to

14:04

your plan, not trying to understand if

14:06

anything got invented.

14:10

The thing here is we're not using AI to

14:12

think for us. We're using it to

14:13

accelerate the mechanical parts while

14:15

maintaining our ability to understand

14:16

it. Research is faster, planning is more

14:19

thorough, and the implementation is

14:20

cleaner. The thinking, the synthesis,

14:23

and the judgment though that remains

14:25

with us.

14:29

So remember that uh authorization

14:32

refactor I said that AI couldn't handle?

14:34

The thing is, we're actually, you

14:36

know, working on it now, starting to make

14:37

some good progress on it. The thing is

14:40

it's not because we found better

14:41

prompts. We found we couldn't even jump

14:43

into doing any sort of research,

14:45

planning, implementation. We actually

14:46

had to go make this change ourselves by

14:48

hand. No AI, just reading the code,

14:51

understanding dependencies, and making

14:52

changes to see what broke. That manual

14:55

migration, I'll be honest, was a

14:57

pain, but it was crucial. It revealed

14:59

all the hidden constraints, which

15:01

invariants had to hold true, and which

15:02

services would break if the auth changed.

15:05

Things no amount of code analysis

15:07

would have surfaced for us. And then we

15:09

fed that pull request of the actual

15:12

manual migration into our research

15:14

process and had it use that as the seed

15:16

for any sort of research going forward.

15:18

The AI could then see what a clean

15:22

migration looks like. The thing is each

15:24

of these entities is slightly

15:26

different. So we have to go and

15:27

interrogate it and say, hey, what do we

15:28

do about this? Some things are

15:30

encrypted, some things are not. We had to

15:32

provide that extra context each time uh

15:34

through a bunch of iteration.

15:37

Then and only then could we generate a

15:39

plan that might work in one shot. And

15:42

the key, and "might" is the key word here, is

15:44

we're still validating, still adjusting,

15:46

and still discovering edge cases.

15:52

The three-phase approach is not magic.

15:54

It only works because we did this one

15:56

migration by hand. We had to earn the

15:58

understanding before we could encode it into

15:59

our process. I still think there's no

16:02

silver bullet. I don't think it's

16:03

better prompts, better models, or even

16:04

writing better specs. It's just the work of

16:07

understanding your system deeply enough

16:08

that you can make changes to it safely.

16:14

So why go through with all this? Like

16:16

why not just iterate with AI until it

16:17

works? Like eventually won't models get

16:19

strong enough that it just works? The

16:21

thing to me is, "it works" isn't enough.

16:24

There's a difference between code that

16:26

passes tests and code that survives in

16:27

production. Between systems that

16:30

function today and systems that can

16:32

be changed by someone else in the

16:34

future. The real problem here is a

16:37

knowledge gap. When AI can generate

16:39

thousands of lines of code in seconds,

16:41

understanding it could take you hours,

16:43

maybe days if it's complex. Who knows,

16:46

maybe never if it's really that tangled.

16:50

And here's something that I don't think

16:51

many people are even talking about at this

16:52

point. Every time we skip thinking to

16:54

keep up with generation speed, we're not

16:56

just adding code that we don't

16:57

understand. We're losing our ability to

16:59

recognize problems. That instinct that

17:01

says, "Hey, this is getting complex." It

17:04

atrophies when you don't understand your

17:05

own system.

17:07

[snorts]

17:09

Pattern recognition comes from

17:10

experience. When I spot a dangerous

17:12

architecture, it's because I've been the one

17:13

up at 3:00 in the morning dealing with

17:15

it. When I push for simpler solutions,

17:17

it's because I've had to maintain the

17:19

alternative from someone else. AI

17:22

generates what you ask it for. It

17:23

doesn't encode lessons from past

17:25

failures.

17:27

The three-phase approach bridges this

17:28

gap. It compresses understanding into

17:30

artifacts we can review at the speed of

17:32

generation. Without it, we're just

17:34

accumulating complexity faster than we

17:36

can comprehend it.

17:40

AI changes everything about how we write

17:42

code. But honestly, I don't think it

17:44

changes anything about why software

17:46

itself fails. Every generation has faced

17:48

its own software crisis. Dijkstra's

17:51

generation faced it by creating the

17:52

discipline of software engineering. And

17:54

now we face ours with infinite code

17:56

generation.

17:58

I don't think the solution is another

17:59

tool or methodology. It's remembering

18:01

what we've always known. That software

18:03

is a human endeavor. The hard part was

18:06

never typing the code. It was knowing

18:07

what to type in the first place. The

18:09

developers who thrive won't just be the

18:11

ones who generate the most code, but

18:13

they'll be the ones who understand what

18:15

they're building, who can still see the

18:16

seams, who can recognize that they're

18:18

solving the wrong problem. That's still

18:20

us. That will only be us.

18:23

I want to leave you with a question, and I

18:24

don't think the question is whether or

18:25

not we will use AI. That's a foregone

18:27

conclusion. The ship has already sailed.

18:30

To me, the question is going to be

18:31

whether we will still understand our own

18:33

systems when AI is writing most of our

18:34

code.

18:37

Thank you. [applause]

18:40

[music]


Interactive Summary

The speaker confesses to shipping code they didn't fully understand, a practice they believe is common. They trace the historical cycles of software crises, starting from the late 1960s, where increasing hardware power led to greater software demands and complexity. The speaker highlights the historical trend of creating new tools and methodologies (like C, OOP, Agile, DevOps, and now AI) to manage this complexity, but argues that these often make the *mechanics* of coding easier without addressing the core difficulty: understanding the problem and designing the solution. This leads to a confusion between 'simple' and 'easy', where we opt for the easier, quicker path (e.g., copy-pasting from Stack Overflow, using AI) which accumulates complexity. The speaker introduces Fred Brooks' concept of essential vs. accidental complexity and explains how AI, by treating all code as patterns, can exacerbate accidental complexity. They propose a solution called "context compression" (or "context engineering"/"spec-driven development") involving a three-phase approach: research (gathering and analyzing context), planning (creating a detailed implementation spec), and implementation (generating code based on the spec). This process emphasizes human thinking and planning to manage AI's code generation capabilities effectively, ensuring that the focus remains on understanding and building the right system, not just generating code quickly. The ultimate question posed is whether we will still understand our own systems as AI takes over more of the coding process.
