Vision: Zero Bugs — Johann Schleier-Smith, Temporal

Transcript

0:00

Please join me in envisioning a world

0:02

where software has zero bugs. Not just a

0:05

few bugs, but actually literally zero

0:08

bugs. Okay. Okay. Just bear with me now.

0:12

So for most people, let's just say

0:16

people who aren't software engineers,

0:19

bugs are actually just not a very big

0:21

part of their life. Period. Most of the

0:24

apps that we use on our phones, our

0:27

social media, our news, that stuff

0:30

pretty much works most of the time. The

0:32

camera works most of the time. Any of

0:34

those most popular apps, banking, they

0:37

work really well most of the time. So,

0:40

bugs are really not top of mind for most

0:43

people.

0:45

Now, anybody who makes software

0:48

is very familiar with a different world.

0:52

A world of constant stress about the

0:57

possibility of software errors creeping

1:00

into critical applications

1:03

on-call responses to pagers, cloud

1:07

provider outages, the list goes on and

1:10

on. So there's a disconnect between what

1:14

most people are experiencing every day

1:16

in the world and the reality of making

1:20

software. Now I will say that even for

1:25

those of us who are not engineers,

1:29

the perils of broken software do crop up

1:33

from time to time.

1:35

Just yesterday, I took my seven-year-old

1:39

son to the mini golf place, and there

1:42

was just one reservation left.

1:45

Reservations were required.

1:48

And I dutifully whipped out my

1:51

smartphone,

1:53

snapped the QR code, went through the

1:55

process to grab the last reservation

1:58

spot, only to be told that it had been

2:03

grabbed by somebody else.

2:08

Well, I got to say I was very proud of

2:10

my son because most kids most of the

2:12

time would have probably melted and he

2:15

actually didn't. He handled it great.

2:17

And then can you imagine my surprise

2:20

when I checked my messages about 10

2:22

minutes later to find out that in fact

2:27

that last reservation slot had gone to

2:30

us. So we were thrilled. That roller

2:34

coaster journey still reinforces the

2:36

fact that bugs are real in the world and

2:39

they have real impact on real people

2:41

every day. Even if it is just a

2:43

momentary emotional swing for a

2:46

seven-year-old.

2:48

I'm Johann Schleier-Smith, and today I'm going

2:51

to be talking to you about a vision of

2:53

zero bugs. Now I work at Temporal

2:57

Technologies. Temporal makes software

2:59

for durable execution, and it makes

3:01

software that's deployed to the cloud do

3:03

what it's supposed to do. But this talk

3:05

is not going to be about Temporal. There

3:09

are several other talks at the AI

3:11

Engineer Summit that do talk about

3:14

Temporal. My colleague Cornelia Davis

3:16

will be doing a workshop on Sunday. In

3:19

addition, Samuel Colvin from Pydantic

3:21

will be talking about building agents

3:25

that combine Temporal with Pydantic.

3:28

The push to build reliable software and

3:32

the vision of giving engineers time back

3:35

for innovation is tightly aligned with our

3:38

products at Temporal. However,

3:40

everything in this presentation is going

3:42

to be outside of the scope of our

3:44

current products. Let's return to the

3:46

vision of zero bugs.

3:49

There are quite a few objections, really

3:51

reasonable objections to this vision.

3:55

So, let's talk through them. First of

3:57

all, as we've started out saying,

4:01

incidents happen. Incidents happen

4:03

whether it's because of cloud outages or

4:06

problems with orders. They happen and

4:09

generally speaking, we pick ourselves up

4:11

and get through them. More broadly, the

4:13

world is imperfect and so a few software

4:16

bugs here and there might be okay. And

4:19

in fact, we already are solving for

4:22

reliability pretty well in many of the

4:25

situations where it matters. So maybe

4:28

software is good enough. Maybe we don't

4:31

need to push towards a vision of zero

4:33

bugs.

4:35

Here's another objection. You could give

4:38

perhaps good reasons, good theoretical

4:40

reasons even why eliminating all of the

4:43

bugs is just simply impossible. Why?

4:45

It's a preposterous idea. So you could

4:48

say there are millions of lines of code.

4:50

The code is just too big. We have too

4:52

much code as we know as agents generate

4:55

more and more code that exacerbates the

4:58

problem and it's all just simply too

5:00

complicated.

5:02

Furthermore, if we look at the

5:04

definition of a bug, it seems that the

5:07

specifications unavoidably have some

5:10

degree of ambiguity. I would say that

5:12

it's a bug whenever the way the program

5:14

works

5:16

does not match the end user's

5:18

expectations.

5:20

They don't care whether it was a problem

5:22

with a product specification or whether

5:24

the programmer forgot to check for a

5:27

null. It just doesn't matter, right? And

5:30

furthermore, unexpected things happen in

5:32

the real world. If we think about

5:34

control systems for example, if there is

5:38

some aspect of the world that hasn't

5:40

been modeled correctly, you could see

5:43

this frequently for example in the fears

5:46

around the capabilities of self-driving

5:48

vehicles. Then that you could say just

5:53

simply can't be handled. It's hopeless.

5:56

Furthermore, we're going to talk about

5:58

some of the powerful techniques in

6:00

software verification, but we also know

6:02

and we can prove theoretically that

6:04

those have limits. There are problems

6:06

that are computationally intractable in

6:09

some cases.

6:11

Reason number three is economics. If you

6:13

have competitors who don't care much

6:15

about software quality and who will win

6:17

in the marketplace if you spend time on

6:20

it, then that reliable software may

6:24

never see the light of day. Also, you

6:26

might just say that the ROI just simply

6:29

isn't there for fixing every single bug.

6:31

Some of them maybe are just not so bad.

6:34

Maybe they have easy workarounds.

6:36

And finally, perhaps cynically,

6:40

some people think that there are

6:41

companies that are okay with shipping

6:45

buggy software because it helps them

6:47

sell support.

6:49

In this vision, this cynical and sad

6:52

vision of the world, the bugs win and

6:56

we'll never have bug-free software. Not

6:59

even close.

7:01

Now, I contend that there is hope.

7:06

And if we look, there are practices,

7:11

a whole slew of techniques that really

7:16

allow very reliable software. Let's look

7:19

at this example, which is the Airbus

7:22

A320. The control software for this

7:25

airplane was developed in the 1980s and

7:30

has been held up as a showcase for

7:33

reliability. There are in fact to this

7:36

date no serious incidents with Airbus

7:40

A320 aircraft that have been attributed

7:44

to problems with the software.

7:47

So what is their approach?

7:52

There are a bunch of ideas here that are

7:55

really pretty neat. So one of them is

7:58

N-version programming. So the most critical

8:01

elements of the Airbus control system

8:04

were actually built with

8:06

different processors, say one x86

8:10

from Intel, one Motorola processor,

8:12

different operating systems on that,

8:14

separate teams writing the software

8:17

providing a tremendous level of

8:18

redundancy against unexpected issues.
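
The voting idea behind N-version programming can be sketched in a few lines. This is only an illustration and not the actual Airbus design: `version_a`, `version_b`, `version_c`, and `vote` are hypothetical names standing in for implementations written by separate teams, and one version carries a deliberately injected fault.

```python
from collections import Counter
from typing import Callable, Sequence


def version_a(x: int) -> int:
    # Hypothetical team A: plain multiplication.
    return x * 2


def version_b(x: int) -> int:
    # Hypothetical team B: addition.
    return x + x


def version_c(x: int) -> int:
    # Hypothetical team C: bit shift, with a fault injected for inputs above 100.
    return (x << 1) + (1 if x > 100 else 0)


def vote(versions: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every version and return the strict-majority answer, or raise."""
    results = Counter(v(x) for v in versions)
    answer, count = results.most_common(1)[0]
    if count * 2 <= len(versions):
        raise RuntimeError(f"no majority among results: {dict(results)}")
    return answer


# The faulty version is outvoted by the two that agree.
print(vote([version_a, version_b, version_c], 200))
```

The design choice is that a fault has to appear in a majority of independently built versions before it reaches the output, which is why the versions are kept as different as possible.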

8:22

They also use something called

8:23

specification-based design: tremendous

8:26

amounts of documentation, but also

8:29

documentation that could be analyzed in

8:33

order to understand

8:35

and make provable guarantees about the

8:38

behavior of the system and what the

8:40

software would do under a whole variety

8:42

of scenarios. They use independent

8:44

verification teams where the people

8:46

writing the code and the people checking

8:48

to make sure that the code had the

8:50

desired behavior were completely

8:52

separate teams. And they also used a

8:54

slew of defensive programming

8:56

techniques. So for example, not

8:58

allocating any memory at runtime. That's

9:00

all done statically. Not having

9:03

sophisticated exception handling. Just

9:05

keeping it really simple, very explicit

9:07

in the code, how any error conditions

9:11

are handled. And finally, static

9:13

analysis and verification. We'll talk

9:15

about those techniques more in just a

9:17

few minutes. So the mindset here is also

9:21

really important. The Airbus engineering

9:23

team had this idea of zero defect

9:26

tolerance of thinking of software as a

9:29

certified component that was engineered

9:32

to meet a certain specification just

9:33

like a turbine fan blade might be.

9:37

And they also had a system level

9:39

approach to reliability because when you

9:41

think about it with an airplane there

9:43

are all sorts of things that could go

9:44

wrong that need to be protected against.

9:48

It stands to reason that the decades of

9:53

experience engineering mission critical

9:56

mechanical systems crossed over into the

9:59

software development process and there's

10:01

a lot that we can learn from that.

10:04

10:06

Core to the A320 was quality through

10:10

process. Now, I know for folks who are

10:11

banging out code, process is often times

10:14

the last thing that they want to think

10:16

about.

10:17

But as we're thinking about how agentic

10:19

coding works, thinking about how we keep

10:21

agents on the rails and doing what we

10:23

want them to do, process really is

10:26

something that we do want to think

10:28

about. There are quite a few steps to

10:31

the quality process. Many of these are

10:34

familiar to people who are writing

10:36

software today, say planning and

10:38

requirements. But there are also some

10:40

others that are a little bit different

10:41

like certification by an external

10:45

agency, maybe a regulator or the

10:47

government. The integration testing

10:49

becomes particularly important for an

10:51

airplane where that software needs to

10:54

interact with a physical system. And as

10:55

we look ahead and think about where

10:58

things are going in terms of the

10:59

software that we are going to have in

11:01

the future that's interfacing more and

11:03

more with the physical world. So this is

11:06

something that is probably going to come

11:08

back. And the key thing too is that

11:11

there's a feedback process in refining

11:14

each of these processes and making sure

11:16

that it interfaces well with the steps

11:19

that come before and after.

11:22

The aerospace industry is particularly

11:24

rich in these examples of super super

11:27

reliable software being built. So the

11:29

Space Shuttle is one, and really it's

11:31

quite stunning. So in the last three

11:33

versions of that software 420,000

11:36

lines of code in each of those and the

11:41

result of that after sort of inspecting

11:43

was one error per version. Sadly,

11:46

some of the space shuttles have been

11:48

lost, but space shuttles have never been

11:51

lost to software problems. Over the last

11:54

11 versions, there were a total of 17

11:58

errors. And so, this is probably a

12:02

thousand times fewer bugs per line of

12:06

code than is typical in commercial

12:09

software. Another aerospace example is the

12:11

Curiosity rover. With a mission that

12:13

costs millions and with very little

12:15

ability to intervene once the system is

12:18

on Mars, it was critical to have a high

12:21

level of reliability. Now that said,

12:24

this software developed in the 2000s did

12:28

take a bit of a different approach that

12:30

really shows the evolution of reliable

12:32

systems. So, for example, while

12:33

redundant systems were used, they're

12:36

actually identical systems and a

12:37

commercial off-the-shelf real-time

12:40

operating system was used

12:42

rather than a custom operating system.

12:44

Now, aerospace isn't the only industry

12:47

where high assurance software, high

12:49

quality software, software with

12:52

effectively zero bugs, has been

12:54

critical. So whether it's in the

12:56

chemical industry or the automotive

12:58

industry, medical software, nuclear

13:01

power industry or security systems, each

13:05

of these provides us with an opportunity

13:08

to learn something.

13:11

Let's take a moment to shift gears a

13:12

little bit. Let's look at the advances

13:16

in computer science that really set the

13:19

foundation for how reliable software is

13:23

built today. And in fact, as we look at

13:25

these, we'll find that they really are

13:27

the foundation for really all software

13:31

that's built today.

13:33

The biggest of these is high-level

13:36

languages. Here we go back to the 1950s,

13:40

1960s.

13:41

And from that period

13:44

where people were mostly writing with

13:46

assembly language up through the 1980s

13:50

when really assembly language more or

13:52

less went out of favor as a language

13:55

that people would use. It was a language

13:58

that was replaced by machine code

14:01

generated by machines for machines.

14:04

There was about a 5 to 10x productivity

14:07

gain.

14:09

And

14:11

the core idea with high-level languages is

14:14

around abstraction.

14:17

It's around data abstraction so that

14:20

instead of poking at memory locations,

14:22

you work with data structures that have

14:26

some relevance in the problem domain.

14:29

And it's about structured programming

14:31

which we'll talk about in a minute. At

14:32

the end of the day though, what is sort

14:35

of a unifying concept here is preserving

14:39

the essential complexity, which is those

14:41

aspects of the problem that are directly

14:44

relevant to whatever it is that the

14:48

software is supposed to do and removing

14:51

as much as possible from the code those

14:54

aspects of the problem that have

14:57

something to do with the implementation

14:59

that have something to do with the

15:01

machine underlying that runs the code

15:04

like what its registers are or how you

15:06

lay out or access the memory or even

15:08

many aspects of the performance of that

15:10

machine. Structured programming as

15:12

espoused by Edsger Dijkstra was one of the

15:16

really big advances coming in the 1960s

15:20

and being broadly accepted in the 1970s.

15:24

Today, programmers can be excused for

15:27

having forgotten about the debates about

15:30

whether go-to statements were a useful

15:34

programming tool or something that

15:36

should be avoided at all costs. Our

15:39

programming languages that we use today

15:42

clearly don't have go-to statements.

15:45

What is structured programming all

15:47

about? It's really quite simple. You

15:49

have a set of basic control structures.

15:51

So these are things like sequences,

15:52

statements that come one after the

15:54

other, selection (if-then-else),

15:56

iteration concepts that are completely

15:58

familiar to any programmer today. But

16:01

what's really important about structured

16:04

programming versus what came before

16:06

where people were modeling applications

16:08

in terms of flowcharts and having these

16:12

nonstructured concepts like go-tos where

16:14

you could really jump around throughout

16:16

a program was enabling this sort of

16:18

compositional reasoning and eliminating

16:21

spaghetti code in many cases. You could

16:23

still write spaghetti code of course

16:24

with structured programs but if you look

16:27

at Fortran code and if you try to

16:28

understand that go for it. It's a fun

16:30

time. You'll find that it's really

16:34

very different. So this hierarchical

16:36

decomposition of programs, it really

16:38

mitigates complexity. It allows

16:40

programmers to focus on one piece of the

16:42

code at a time. When you have LLMs

16:44

generating the code, this is just as

16:47

valuable as it was for the programmers

16:51

who are writing code decades ago.

16:53

Another

16:54

key idea that traces back to the 1970s

16:59

is David Parnas's

17:02

push to think about software systems in

17:06

terms of modules. What does modularity

17:08

mean? It's perhaps best known in the

17:10

context of object-oriented programming,

17:12

but it applies in a whole bunch of

17:14

situations.

17:16

It's perhaps best known as an aspect of

17:19

object-oriented programming, but you can

17:21

have modularity without object-oriented

17:23

programming. Libraries are one of the

17:26

obvious examples. And so when we think

17:29

about verifying a program, when we think

17:32

about making sure that that program does

17:34

what it's supposed to do, whether we're

17:35

verifying it as a person or as an LLM or

17:38

using some sort of formal verification

17:40

technique, modularity is a massive

17:44

boost. As you chain modules together,

17:47

you get a subexponential scaling,

17:49

perhaps even a linear scaling rather

17:51

than an exponential scaling where you

17:53

can apply local reasoning at every

17:56

level.
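
A back-of-the-envelope sketch of that scaling claim, under a simplifying assumption that is not from the talk: each module has the same number of internal states, and without modular boundaries a verifier has to consider every combination of them. The function names are illustrative only.

```python
def whole_system_states(modules: int, states_per_module: int) -> int:
    # No modular boundaries: every combination of module states is in play,
    # so the count grows exponentially with the number of modules.
    return states_per_module ** modules


def local_reasoning_states(modules: int, states_per_module: int) -> int:
    # Local reasoning: each module is checked on its own against its interface,
    # so the count grows linearly with the number of modules.
    return modules * states_per_module


for n in (1, 5, 10, 20):
    print(n, whole_system_states(n, 10), local_reasoning_states(n, 10))
```

Real systems fall somewhere between these two extremes, since interfaces leak and modules interact, but the gap is what makes verification of large systems plausible at all.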

17:57

And the upshot of that is that you have

18:00

manageable complexity

18:02

regardless of the size of the system.

18:03

You take that spaghetti and you turn it

18:05

into something that is very nicely

18:07

organized. I want to take a moment here

18:10

to reflect on why LLMs are not simply

18:12

generating machine code rather than

18:15

high-level language code. It's certainly a

18:18

reasonable question and I think that the

18:21

reasons that applied to human

18:24

programmers decades ago are just as

18:26

applicable to LLMs today. So for one

18:28

thing, we know that context is limited.

18:33

The context for an LLM, the context

18:35

window might be a lot larger than what a

18:38

human is able to hold in their head. It

18:39

depends a little bit on how you count

18:41

that context. Certainly, we have a lot

18:44

of awareness of background facts that

18:45

we've sort of compressed into our brain.

18:48

But

18:51

context is definitely a scarce resource

18:54

for LLMs just like attention and ability

18:58

to reason perhaps call it working memory

19:01

is a scarce resource for people. The

19:04

argument for libraries is as strong

19:06

today as it ever was. So while you could

19:08

make the argument, oh why don't we just

19:11

let the AI generate all the code for the

19:14

libraries since it's fast and cheap.

19:15

Maybe we can customize it to the needs

19:18

of our specific application.

19:21

Getting that code properly tested,

19:23

properly verified is going to be a huge

19:26

challenge. And so we really want the

19:30

ability to use reliable, trusted

19:33

components and modules to build our

19:35

systems. On that note, I do need to put

19:38

in a little pitch for Temporal. What

19:40

Temporal allows you to do is it allows

19:42

you to abstract away the reliability of

19:46

your software in the cloud. It provides

19:48

durable execution, which means that it's

19:50

shipping that reliability problem to a

19:53

separate piece of code that's outside of

19:55

your application that your application

19:57

doesn't need to worry about. Let's now

20:00

go ahead and dive in on the fun part,

20:02

which is formal methods. And I want to

20:04

shoot straight to a few demos. Now, in

20:08

these demos, I'm going to be using the

20:10

Dafny language. What Dafny allows you

20:12

to do is it allows you to use a custom

20:16

programming language that generates

20:18

output to a whole variety of other

20:20

languages, whether it's JavaScript,

20:21

Python, or C, you name it. What

20:25

Dafny allows you to do is it allows

20:28

you to put proofs in line with your

20:30

code,

20:32

allowing theorem proving software to

20:35

come along and verify that that code

20:39

does exactly what you said you wanted to

20:42

do. Okay. So I have a program here that

20:45

is written in the Dafny language and it

20:48

has one function. It's called a method

20:50

here and it does something very simple.

20:52

So it does index-of. So what it's going

20:53

to do is it's going to search an array

20:56

to find the index of a particular number

20:59

and I can write a number of assertions

21:02

about this. The array length is greater

21:05

than zero. The number returned in that

21:09

result is either negative 1 if it's not

21:11

found or some number that is less

21:15

than the length of the array and so

21:17

forth. What I can now do is I can just

21:20

go ahead and I can run the Dafny

21:22

verifier on that program.

21:27

Great, no bugs.
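
The demo code itself is not part of the transcript. As a rough Python analogue of the method described, the conditions can be written as assertions. The name `index_of` and the exact conditions are assumptions based on the description above. Note the important difference: Dafny proves such conditions statically for all inputs, while these `assert` statements are only checked at run time for the inputs actually tried.

```python
from typing import Sequence


def index_of(items: Sequence[int], target: int) -> int:
    # Precondition (as described in the talk): the array is non-empty.
    assert len(items) > 0, "precondition: array must be non-empty"

    result = -1
    for i, value in enumerate(items):
        if value == target:
            result = i
            break

    # Postconditions: -1 means "not found"; otherwise result is a valid
    # index and the element stored there is the target.
    assert result == -1 or 0 <= result < len(items)
    if result == -1:
        assert target not in items
    else:
        assert items[result] == target
    return result


print(index_of([3, 1, 4], 4))
print(index_of([3, 1, 4], 9))
```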

21:30

Let's go ahead and generate a Python

21:34

program that exercises this

21:37

functionality. And we can see that the

21:40

program first verifies before it runs.

21:43

So I know that all of those assertions

21:45

that are proven about the program have

21:48

been checked before that program runs.

21:51

This is an extremely powerful technique

21:53

because it spits out a Python library.

21:56

It's something that can be integrated

21:58

into your code. Now suppose I come over

22:02

here and I make a small change to the

22:05

algorithm, which is to say I've

22:07

introduced a bug.

22:09

If I now go back and I try to run that

22:12

again, the verifier steps in and throws

22:17

an error and we are saved from seeing

22:20

that bug.

22:23

All right, let's return to the

22:24

presentation here. So, one thing to keep

22:26

in mind is that verification is only as

22:28

good as the specification. If I leave

22:29

out anything that needs to be checked,

22:31

that creates an opportunity for bugs.
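
To make that point concrete, here is a hypothetical sketch in Python rather than Dafny. A specification that only constrains the range of the result accepts an implementation that never finds anything, while a fuller specification rejects it. `broken_index_of`, `weak_spec`, and `strong_spec` are illustrative names, not code from the demo.

```python
from typing import Sequence


def broken_index_of(items: Sequence[int], target: int) -> int:
    # Buggy on purpose: always reports "not found".
    return -1


def weak_spec(items: Sequence[int], target: int, result: int) -> bool:
    # Only says the result is -1 or a valid index; says nothing about target.
    return result == -1 or 0 <= result < len(items)


def strong_spec(items: Sequence[int], target: int, result: int) -> bool:
    # Also ties the result to the contents of the array.
    if result == -1:
        return target not in items
    return 0 <= result < len(items) and items[result] == target


data = [3, 1, 4]
result = broken_index_of(data, 4)
print("weak spec satisfied:", weak_spec(data, 4, result))
print("strong spec satisfied:", strong_spec(data, 4, result))
```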

22:34

So I want to emphasize that in the last

22:36

few decades formal methods have become

22:38

commercially relevant on a really

22:42

impressive scale. For example, the seL4

22:47

microkernel is a fully verified

22:50

operating system. It's a simple

22:51

operating system typically used for

22:53

embedded systems and security critical

22:55

applications, but it is an operating

22:57

system. The CompCert C compiler, again

23:00

often used in security critical

23:02

applications as well as in the aviation

23:05

industry that is a fully verified

23:08

compiler. That is to say that formal

23:10

methods have been used to ensure that

23:13

the code that that compiler emits given

23:16

a C program does exactly what that C

23:20

program is supposed to do. Project

23:23

Everest works on libraries for

23:25

cryptography, including libraries that

23:27

are widely deployed today, protecting

23:31

internet traffic. And really

23:33

impressively in the microprocessor space

23:36

now for several decades, formal methods

23:38

have been used to ensure the

23:41

correctness of those designs. There has

23:44

been just a huge motivation to make sure

23:47

that these systems are performing as

23:51

expected. And one of the things that's

23:53

really really cool is that there has

23:55

been just tremendous progress in terms

23:57

of the size and speed with which

24:01

verification can be performed over the

24:04

last sort of 20 plus years. And this

24:06

really coincides with the rise of

24:09

benchmarks. Benchmarks can have a

24:11

tremendous role in shaping an industry.

24:13

It gives folks something to focus on.

24:16

And so we can see that success rates for

24:19

the benchmarks have gone from the 30%ish

24:24

range up to nearly 100% while at the

24:27

same time the runtime on those

24:30

benchmarks has gone down by a factor of

24:33

50 or more. So there are a handful of

24:36

verification tools that you can use

24:37

today and I want to break them down in a

24:39

few different categories. So static

24:41

verification is probably that which you

24:43

are most familiar with. I'm starting

24:44

from the bottom here. If you are using

24:48

type systems that is a simple form of

24:50

static verification but there are ways

24:52

to attach more checks to the type

24:56

system. Jumping up to the top we just

24:58

saw Dafny run. SPARK Ada is another

25:01

example of tight coupling between those

25:06

theorems and the code and then there are

25:09

other systems that are also well-known,

25:10

Lean for example, that provide

25:14

theorem proving separate from the code.

25:16

The problem there, while those tools are

25:18

super super powerful is that you do need

25:21

to make sure that what the code does and

25:26

what you have written in terms of the

25:28

proof are the same. Model checking deals

25:31

with finite state machines and proving

25:34

properties about those finite state

25:36

machines. Theorem proving on the other

25:38

hand doesn't have that limitation

25:40

because it is able to take advantage of

25:42

more powerful reasoning techniques,

25:46

automated reasoning techniques.
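
A minimal sketch of explicit-state model checking, assuming a toy system that is not from the talk: two processes share a lock, and the property to prove is that they are never both in the critical section. The checker explores every reachable state of the finite state machine with breadth-first search. All names here are illustrative.

```python
from collections import deque

# A state is (pc0, pc1, lock_held), with each pc either "idle" or "critical".
INITIAL = ("idle", "idle", False)


def successors(state):
    # Correct protocol: a process may enter only when the lock is free.
    pcs, lock = list(state[:2]), state[2]
    for i in (0, 1):
        nxt = pcs.copy()
        if pcs[i] == "idle" and not lock:
            nxt[i] = "critical"
            yield (nxt[0], nxt[1], True)
        elif pcs[i] == "critical":
            nxt[i] = "idle"
            yield (nxt[0], nxt[1], False)


def successors_without_lock(state):
    # Faulty variant: processes enter without checking the lock.
    pcs = list(state[:2])
    for i in (0, 1):
        nxt = pcs.copy()
        nxt[i] = "critical" if pcs[i] == "idle" else "idle"
        yield (nxt[0], nxt[1], any(pc == "critical" for pc in nxt))


def mutual_exclusion(state) -> bool:
    return not (state[0] == "critical" and state[1] == "critical")


def check(initial, next_states, invariant):
    """Return a reachable state that violates the invariant, or None."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            return state
        for nxt in next_states(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None


print(check(INITIAL, successors, mutual_exclusion))
print(check(INITIAL, successors_without_lock, mutual_exclusion))
```

Because the state space is finite and every reachable state is visited, a result of `None` is a proof of the invariant for this model, not just a passed test. The limitation mentioned above is visible too: the approach only works while the set of states stays small enough to enumerate.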

25:57

All right, let's get to the good stuff.

25:59

Agentic coding. Now, I wanted to give

26:01

you a set of really practical things

26:05

that you can try in your day-to-day work

26:09

to see what sorts of benefits you can

26:11

get. These are probably not things that

26:13

you're going to apply across the code

26:14

base, but when you're struggling to get

26:17

the agent to do what you want it to do

26:19

on a very specific piece of code, these

26:21

could all be pretty valuable. So some of

26:24

these are things that we are probably

26:26

reasonably well versed in. So

26:28

detailed specifications, using typed

26:30

languages, doing modular code, these are

26:32

all sort of things that we pretty much

26:34

do anyway. But some things that we might

26:37

not do are interacting with the LLM and

26:40

asking it to do explicit risk analysis,

26:43

asking it to write safety cases, which

26:46

are statements about things that could

26:49

go wrong and how that thing that could

26:53

go wrong is being mitigated in the code.

26:55

So this is separate from formal methods.

26:57

This is sort of a more qualitative

27:03

reasoning which is something that we

27:04

know that LLMs can do. Another

27:07

inspiration that you can take is from

27:10

the design of high assurance systems

27:13

where they have separate teams do the

27:15

coding and the verification. That means

27:18

that you can have separate prompts to

27:21

the LLM for testing versus for

27:27

writing the code in the first place. And

27:28

if you want to take that to another

27:30

level, you can use multiple model

27:32

providers. So you can use one foundation

27:34

model for the tests and one foundation

27:36

model to write the code. You can bring

27:39

in those formal methods techniques to

27:41

give proofs around sections of critical

27:43

code. And lastly, this is sort of the

27:47

timeless advice, keeping your code

27:50

small, outsourcing those things that can

27:54

be to libraries which can be separately

27:57

tested, validated, developed, and now

28:02

your code doesn't need to worry about

28:04

it.

28:07

All right, let's talk for a minute about

28:10

Software 3.0. So this is the idea

28:16

promoted by Andrej Karpathy that prompts

28:20

can really function as programs that

28:22

what we're doing today is we are

28:24

programming through AI through LLMs

28:28

and it's a new world of coding whether

28:31

that means that the LLM directly solves

28:34

whatever problem you need solved or

28:36

whether it generates code or perhaps

28:39

loops and uses tools or any combination

28:42

thereof in order to get to whatever

28:45

behavior you want for the system. This

28:47

opens up a tremendous need for new

28:50

assurance techniques. Right? Because LLMs

28:52

are fundamentally non-deterministic

28:54

and because the state space is

28:58

absolutely huge, all of the verification

29:00

techniques that we have discussed have

29:03

basically no bearing on this form of

29:08

software.

29:10

That said,

29:12

it's not all gloom and doom. And I am

29:15

really excited by the idea that despite

29:18

having new and different failure modes,

29:21

there are also potentially new forms of

29:23

resilience. LLMs can respond to

29:28

unanticipated inputs. They have that

29:31

ability to deal with ambiguity. And you

29:34

can imagine lots of architectures

29:36

whether they are pure agentic

29:39

architectures as we often have today to

29:42

ones that maybe invoke LLMs once certain

29:45

error conditions are encountered that

29:47

are actually getting ahead of and

29:50

protecting the world from all kinds of

29:54

software faults and perhaps doing it in

29:57

really simple and interesting ways. So I

29:59

think this is just a tremendously

30:00

interesting idea. All right,

30:03

let's get to cost. This is one of the

30:04

big topics. So, what does agentic code

30:08

cost? I vibe coded up this very simple

30:13

game. I spent about 2 minutes prompting.

30:17

We can set that aside. I'm going to not

30:19

count my time towards the cost. GPT-5

30:21

Codex. It's creating 600,000 input

30:25

tokens and it has 3.5 million cached

30:28

input tokens, 48,000 reasoning tokens,

30:31

and then is returning 28,000 tokens. The

30:34

cost to generate this game was about $2.

30:37

And the thing that's interesting here is

30:38

that the cost to generate the output

30:42

tokens

30:43

is only about 15%

30:47

of the overall cost. The rest of which

30:49

is going into the repeated use of input

30:53

tokens as tests are being run and the

30:57

reasoning tokens as well. As with

31:00

human written code, the amount of time

31:03

that you spend actually writing the code

31:07

is a small fraction of the overall time

31:10

that's spent to build the software. All

31:13

right, so let's bring it down and let's

31:16

look at the cost of code. So for high

31:17

assurance code, if we look at something

31:19

like the space shuttle or the Airbus

31:21

example, the numbers there, if you take

31:24

1990 dollars from the Space Shuttle, it was

31:26

about $1,000 per line of code. If you

31:29

translate that into 2025 dollars, it's probably

31:32

more like $2,500. And in some cases, so

31:35

for example, for security high assurance

31:38

software, numbers as high as $3,000 per

31:41

line of code have been quoted. For

31:43

typical software development, it's more

31:45

like $10 to $100 for real production

31:49

software, but nothing that is developed

31:52

with the high assurance techniques.

32:02

If you have low-cost

32:05

contractors, you may be able to bring

32:07

that number down as low as $1 to $10.

32:10

This is all without considering any AI

32:14

or agentic codegen. For the agentic

32:16

coding, I've put a pretty broad range

32:19

that includes just cheap models spinning

32:23

out code. It could probably go even

32:25

lower than this if they're not iterating

32:26

on it very much up to more expensive

32:30

models that are working harder to

32:34

generate that code. Regardless of how

32:37

you slice the numbers, you're looking at

32:40

a factor of at least a thousand,

32:44

probably about 10,000. If you set aside

32:47

the cost of the people involved in the

32:49

agentic coding, if you just look at that

32:50

agentic coding piece, that code is being

32:54

generated far more cheaply than typical

32:58

software. And this is interesting

33:02

because the gap between the cost of high

33:04

assurance code and typical software is

33:07

only about 100x.

33:10

33:12

If we extrapolate

33:17

we could conclude that agentic coding

33:21

has the potential to produce high

33:24

assurance software

33:26

100 times more cheaply than typical

33:29

software is produced today.
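
The extrapolation can be written out as arithmetic using the round numbers quoted above. This is a sketch of the reasoning only: it assumes, as the talk does, that the roughly 100x premium for high assurance would carry over unchanged to agent-generated code. The function and parameter names are hypothetical.

```python
def savings_factor(agentic_discount: int, high_assurance_premium: int) -> float:
    """How many times cheaper agentic high-assurance code would be than
    typical software today, given the two ratios quoted in the talk."""
    # Agentic code costs typical / agentic_discount per line; adding the
    # high-assurance premium multiplies that by high_assurance_premium.
    return agentic_discount / high_assurance_premium


# "Probably about 10,000" times cheaper, with a premium of "about 100x".
print(savings_factor(10_000, 100))

# The conservative end of the range: "a factor of at least a thousand".
print(savings_factor(1_000, 100))
```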

33:33

That leads us to the vision of zero

33:36

bugs.

33:38

Software reliability is a solved

33:40

problem. It's solved in aerospace. It's

33:43

solved in other critical industries.

33:46

And with the deployment of agents geared

33:48

towards achieving high assurance code,

33:51

whether that's because they're using

33:52

formal methods, because they have

33:54

extensive processes, because they're

33:55

using adversarial testing, the list goes

33:58

on and on. We can believe that agents

34:02

will make high assurance code 100 times

34:04

cheaper and that in this context we will

34:07

see a proliferation of bug-free

34:10

experiences. I also want to emphasize

34:12

that this push towards a vision of zero

34:14

bugs serves to address many of the

34:18

limitations that agentic coding has

34:21

today, notably around the quality of the

34:24

software that's written. When developers

34:27

choose not to use the agentic coding tools

34:29

that are at their disposal, the reason

34:31

for doing so typically is that it's just

34:34

going to take them more time to fix the

34:36

bugs in that software than it would

34:39

take them to write the software

34:40

correctly in the first place.

34:42

As soon as we can get to the point where

34:44

agentic coding is routinely generating

34:48

software that has fewer defects than

34:51

software written by humans,

34:53

we can expect absolute takeoff in its

34:57

adoption.

34:59

We know how to do that. We've known how

35:02

to do that for decades.

35:07

Before we close, I want to emphasize

35:11

that tardigrades are not bugs.

35:15

This is Ziggy. Ziggy is Temporal's

35:18

mascot and Ziggy belongs to the phylum

35:22

Tardigrada,

35:25

not an insect.

35:27

Tardigrades are some of the most

35:30

resilient animals in the world. They

35:33

have even been known to survive in outer

35:36

space. And earlier this year, we

35:39

actually took Ziggy to space just to

35:42

prove that point. We are having a lot of

35:44

fun here at Temporal building durable

35:47

execution as the reliable foundation for

35:50

modern software. If anything that we

35:53

discussed here today resonates with you,

35:55

please reach out. We'd love to chat and

35:58

explore how to work together in any

36:00

possible way.


Johann Schleier-Smith explores the vision of 'zero-bug' software, drawing on established reliability practices from high-stakes industries like aerospace. He discusses how formal verification techniques, modular design, and rigorous engineering processes can be integrated with AI agentic coding to significantly lower the cost of creating high-assurance, reliable software.