How METR measures Long Tasks and Experienced Open Source Dev Productivity - Joel Becker, METR
Here's the very simple argument. If you look at some notion of compute over time, this could be R&D spending on compute, experimental compute, training compute, whatever some particular lab is using, it goes like this: no surprise. And if you have another chart of, say, log time horizon over time, this METR measure from the figure that many of you will have seen on Twitter, it looks similar. Now say that this was not merely a coincidence but that these things were causally proportional, in the sense that if compute growth were to halve then time-horizon growth would halve. So for the sake of argument, say that starting from 2028 or so the compute curve begins to bend like that, where this would be no growth and this would be the original growth, something like half. If they were causally related, and in particular causally proportional to one another, then you'd expect this curve to bend too, and then for some milestone that you care about, let's say a time horizon of one work-month up there, the delay implied in AI capabilities is potentially enormous.
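To make that "delay is potentially enormous" point concrete, here is a toy calculation, assuming for illustration a 2-hour current time horizon, a 6-month doubling time, and a one-work-month (roughly 167 hour) milestone; none of these numbers come from the talk, and the proportionality itself is the contestable assumption under discussion.

```python
# Toy illustration of the proportionality argument: if halving compute growth
# doubles the time-horizon doubling time, a distant milestone arrives much later.
# All numbers are illustrative placeholders, not METR estimates or forecasts.
import math

current_horizon_hours = 2.0     # assumed current 50% time horizon
milestone_hours = 167.0         # roughly one work-month
baseline_doubling_years = 0.5   # assumed baseline doubling time

def years_to_milestone(doubling_time_years: float) -> float:
    doublings_needed = math.log2(milestone_hours / current_horizon_hours)
    return doublings_needed * doubling_time_years

print(f"baseline growth:       {years_to_milestone(baseline_doubling_years):.1f} years")
print(f"compute growth halved: {years_to_milestone(2 * baseline_doubling_years):.1f} years")
```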
Now, lots of people have stipulated that there might be some slowdown in compute growth. I'm not an expert in those forecasts, but the prior reasons do seem somewhat strong to me. One is physical constraints: we might hit power constraints, as you mentioned, or various other ones that Epoch have a report on, all of which seem not to bite through 2030 but potentially could bite sometime after 2030. I think the more likely one is just dollars as a constraint: large tech companies can only spend so much at a certain point, large nation states can only spend so much. I guess there are some scenarios in which you can continue going, but that seems to naturally imply this slowing down. And then the additional point that this paper is trying to make is that, under a very contestable but standard assumption from economics, you should in fact expect these two to be causally proportional, in particular to the extent that, or for the period that, a software-only singularity is not possible. That's a whole other discussion we can talk about. But at least in this kind of somewhat business-as-usual scenario, or until that scenario no longer applies, I think this is maybe a reasonable model, and it does imply some slowing of AI capabilities in the near future.
I have no plan for this session whatsoever.
>> That also assumes we don't have a technological advance that dramatically improves capabilities, like some unpredictable technological advance, right?
>> Yeah. I mean, all predictions assume no unpredictable advances. Time horizon, or in general in AI, straight lines on log-linear plots, have been, I think, a very highly underrated forecasting tool. They've done extremely well over now many orders of magnitude. I think it's reasonable to have the default expectation that the log-linear lines continue through approximately the same number of orders of magnitude, except maybe if there's some significant break in the inputs. Of course, on the upside there could be something quite dramatic. A software singularity is the first thing that comes to my mind, but another Transformer-style moment seems like another natural candidate.
>> Of course, one of the problems with testing this will be that at some point the models will eclipse the maximum possible length of the tasks you're able to put in the evaluation set.
>> Yeah. So there are some ways around this that we're working on; I'd be excited to talk about that, though they all feel pretty early. But I think it's right that if time horizons are doubling, eventually the doubling time is such that you can't possibly make long enough tasks in the relevant window.
>> It's possible also that we actually hit a place where time horizon is no longer a useful measure, because now you want total time to decrease: what you want is the same result at a lower time.
>> Oh, one...
>> You want higher reliability at a lower time horizon.
>> One thing to say about time horizon is that there are two notions of time here: a human, calendar-time axis, and the time that the model is working for, which I think you should approximate as zero. It's not actually zero, they are taking actions, but they largely do their successful work pretty early on, to the extent they're going to be successful on tasks. So my guess would be that it will continue to be the case that there's not much extra juice on the margin of making the models complete tasks more quickly, although reliability, very much so, obviously.
>> So most of the time is spent in the human-machine iteration loop?
>> The humans are working without AIs and the AIs are working without humans, so for the humans it's all human time.
>> Yeah. Yeah.
>> Cool. Any questions on METR's work? I can go through some upcoming things that we're excited about, if people are excited about those things.
>> Yeah, I just have one question, about the perceived time, the time-perception piece.
>> Yeah. Yeah.
>> One thing I thought of, and you brought it up a little bit in the paper, is whether or not familiarity is a confounding factor.
>> With tools, you mean?
>> Yeah, tool familiarity is one factor, and of course you also brought up that tool capability has dramatically changed. But there was an interesting presentation from Meta at the developer productivity engineering summit this year. They probably have the best infrastructure for quantitative measurement of developer experience of any company in the world, and they're able to tell you basically how long it actually takes to make a PR (they call it something different at Meta), how much actual human time and effort it took to make a PR. What they saw when they gave people agents was a J-curve, and that J-curve was, I don't remember how long, three months or six months. So one of the things I wonder is whether there's a cutoff in how much familiarity the person has, like whether they've been using this as their full-time daily driver for a period of months, and whether there's a cutoff that occurs once a certain level of familiarity is reached.
>> Yeah, I'm totally on board with J-curve-style explanations being a real thing, not just in this case but in many economically relevant cases outside of software engineering too. Developers, and not just developers, experiment with tools. You tend to be slower the first time you're experimenting with a tool, but if you're doing it so that you get some investment benefit later on, you might become more proficient with the tools, or in the case of AI, maybe you just expect the models to get better, so even if you don't become more proficient it's still the kind of thing you want to do. Those explanations broadly make sense to me. I can give you some reasons why I'm skeptical here, though. One thing to say, as background, is that we're continuing with this work, and we'll see. Another thing to say is just that, quantitatively, the difference between this and this is very large. So I ask, how much is the J-curve explaining? I think it's not explaining that much.
>> Let me explain why, because we see this over and over in software engineering studies: the one question you can't ask people in a survey is how long a task took. You can ask people how much more productive they felt, and they will give you an accurate response that correlates with quantitative feedback. But ask anybody the amount of time that something takes and they are almost always wrong. So when I shared this with my colleagues, I said I'm not surprised about that at all; what is interesting is how much of it is the slowdown aspect.
>> Yeah, point well taken, that makes a lot of sense. Despite this, we're interested in time estimates because we're interested in providing...
>> Yeah, I mean, the perceptual side, I do think that's relevant too, also because of the hype aspect, right? Developers will tell you that they were faster when they weren't, and I think that is worth knowing.
>> And to the extent that we're interested in measuring the possibility, timing, and nature of capabilities explosions, or of AI R&D being automated, one commonly proposed measure is just to ask developers or researchers how much they're being sped up, and for exactly the reasons you're pointing at, I don't put a lot of faith in those estimates. So it's nice to see it measured like this. Yeah. Some more J-curve things. The forecasters are not predicting time to complete, right, they are just predicting this effect size. The expert forecasters are told the degree of experience these developers have, and some of the forecasters, in thinking about how this population might be different from other populations, point out various facts about the study: they're more experienced, and I expect experienced people to get less speedup; or the repositories are larger, and I think AIs are less capable at working on large repositories, so I expect less speedup. They never mention familiarity with tools. My sense is that they share the sense I had ahead of time, which is that most of the action is in understanding what kinds of things AIs are good at or bad at in the first place. And all of these developers have experience with LLMs in their core development workflow; it's just Cursor that three-quarters of them are totally unfamiliar with at the start of the study. So I just wasn't seeing much margin there. I think it is an open question. I'll also say, we watched so many hours of screen recordings of these developers working, and I just do not see it: I think they're prompting very reasonably, in some cases worse than me and my colleagues, in some cases better. I'm not seeing these advanced workflows that they're failing to access.
>> Yeah. And my experience is not that far off from this: there are times when I am dramatically slowed down and times when I am accelerated.
>> Yep.
>> Although, as my familiarity with the tool increases...
>> Yeah.
>> ...I definitely improve a lot, because I learn over time what I can tell it to do and what I can't tell it to do.
>> Yeah.
>> In addition to it just getting better, it's understanding that, okay, now I need to plan, and so on. That's why, before you make a high-level architectural decision that you know is going to blow up in your face ten conversation turns down the line, you really try to think about it.
>> Yeah, exactly.
>> And also scope it down to a smaller problem. At first I would try problems that were too large, and it can't handle that.
>> But just for the future, if you ever do... I mean, I think it's obviously really hard with the 16-person sample size. But in the future, I think trying to figure out whether there is a familiarity cutoff where the number changes would be interesting, to see if that Meta result generalizes outside of Meta.
>> We are on it, I think. The AIs have been getting better during this period, which is going to compound a lot of what's going on, obviously, but yeah.
>> The thing is, the projects themselves are very optimized for people coming onto new projects and figuring out how to navigate them. The ones that struggle to be organized well enough for humans to come on board and navigate them quickly don't survive very long in the open source ecosystem, and these are fairly mature open source projects. They're a little bit different from enterprise settings, where things survive because they make money even if they're a pain to develop on, right? So the context is a bit different.
>> These are the repos, yeah. That is a really interesting point, because actually some of the repos that I was helped the most with were ones that I was completely unfamiliar with, which had no decent documentation of any kind, where I had to come in to a legacy code base that had existed for years and make a change, and the developer who owned it was only partially available to answer my questions. In that case Claude Code was a huge help.
>> Yeah. Legacy code bases don't exist because they work well. It's because they make money.
>> Yeah.
>> The question I had was: did all the developers have the same level of AI familiarity with Cursor, or was there some variance? Is there a plot of each of their familiarity...
>> There's always a plot.
>> ...a plot that could dig into the question of whether there is a J-curve?
>> Yeah, so here's some evidence.
Okay, I can show you some plots. I think the sample size is just small enough that you shouldn't really believe any of them. I think the plots aren't going to show much, but I don't want to say that's strong evidence that this isn't something that's going on; I just think the evidence is kind of weak. The thing that really convinces me is watching the videos. Watching them work, often they're better at using Cursor than me, and I'm working on this project using Cursor. But here are some graphs. This one is split by whether they have various types of AI experience coming into the study, and basically you see no movement in point estimates: people for whom Cursor was a primary IDE before versus people for whom it was not, there's not a huge amount of difference. Then the next one: you might think maybe some J-curve cutoff comes after that point, but still, within the study there's some variation in how experienced people are with AI, because they have multiple issues; after the second AI issue they're slightly more exposed than after the first. So you might try excluding those earlier data points over time and seeing what pops up, and they don't seem to get better at using AI over time.
>> Although I think there's probably a statistical issue with that.
>> You think there's probably what? Sorry.
>> There's probably a statistical issue with that plot right there. Those error bars are very, very wide.
>> Oh, I think all of the plots outside of the main ones, all of these subset analyses, are things you should not put a lot of stock in. Yeah, I totally agree. Okay, and then a lot has been made of this next one. This graph is the reason we filed it under unclear evidence, because things point in different directions. A lot has been made of this plot suggesting something J-shaped, in particular that at the end, once people have more experience, they do see some speedup. Here are some issues. First, the other plots don't show it, and I think that's important to include. Second, these hours are coded very conservatively. For instance, someone in the 30-to-50-hours bucket had Cursor as their primary IDE in 2024, had recorded themselves on their time-tracking software as having spent 140 hours using Cursor, and conservatively estimated that they'd spent 50 hours using Cursor, so they end up in our 30-to-50-hours bin. This is someone whose primary IDE was Cursor last year. And people have been commenting that these developers had been using Cursor for less than a week; I think that's not a very fair assessment. If you were to move that developer from the penultimate effect-size estimate to the last one (and again, you shouldn't believe this because of the statistics), then you see some balancing out, where you get back to essentially zero in the last bucket. So, again, I think J-curve explanations are still very much on the table.
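A toy numerical illustration of that bin-shifting point, with invented per-developer numbers rather than the study's data: at this sample size, recoding a single heavy Cursor user from the penultimate bucket into the last one moves the last bucket's average substantially.

```python
# Invented per-developer slowdowns, where slowdown = time_with_AI / time_without_AI - 1.
# This only illustrates how fragile tiny per-bucket averages are; it is not study data.
import statistics

penultimate_bucket = [0.25, 0.30, 0.20]   # the 30-50 hour bucket (three developers)
last_bucket = [-0.15]                     # the >50 hour bucket (one developer)

print("last bucket before:", statistics.mean(last_bucket))

# Recode the conservatively-binned heavy Cursor user into the last bucket.
last_bucket.append(penultimate_bucket.pop(0))

print("last bucket after: ", round(statistics.mean(last_bucket), 3))  # back near zero
```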
>> Is it not likely, though, that the 50-hour group is also similarly underestimating the time they've spent using Cursor, and that if you just had a longer scale you would still see a degree of...
>> Oh, that is an interesting point. That seems plausible to me. Then I guess I'm not sure it's an underestimate, because we're using this very conservative coding...
>> Yeah. Yeah. Totally.
>> Yeah, I think that seems plausible to me. And then, for this not to be strong evidence, I'd retreat back to: I think you shouldn't really believe any of these.
>> Yeah. I think the biggest thing is the small sample size, and there's also a lot of bias in the data set, effectively, right? It's a certain kind of data set. It's open source.
>> You mean the kinds of developers?
>> Yeah, open source developers, and also working on open source projects that are pretty mature.
>> Yeah. Those two things mean that if you're working with open source developers on projects that are pretty mature, this is probably reasonably indicative, maybe, though the sample size is pretty small; but outside of that it gets a little harder.
>> Yeah, and we talk about this. I think this group is really weird, and it's really interesting; it's interesting for the same reason it's weird. We were interested, again, in studying possible effects of AI on R&D speedup or automation. If any type of developer is not being greatly sped up, that implies the whole pipeline isn't being sped up. So it is kind of interesting to look even at particular weird populations. You might imagine that large production inference code bases, say, have a bit more of this shape than scrappy experiment scripts do.
>> Yeah. Yeah.
>> But yeah, it's totally...
>> No, I think it's very interesting. It's just hard to generalize. We just don't know.
>> Yeah.
Yeah, we are doing this larger study, and I think, unfortunately, even after the larger study, which includes more greenfield projects, it's still going to be hard to generalize, for similar reasons. Yeah.
>> Although I don't feel like your results are particularly contradictory with any actual independent research that's been conducted. The only research that I've seen that I would say is contradictory to yours is research that has been funded by model shops or agent shops.
>> What can I say about that? I do think that most of the research that's put out is associated with large tech companies, and I think there are other methodological concerns with those studies as well.
>> I have methodological concerns with that research as well, and I know people who work at some of those places who have methodological concerns with the work that was put out.
>> I mean, I think there are concerns about ours also.
>> Sure, sure. But I actually feel like... I remember somebody sent me your paper, and when I saw the headline I was like, no way.
>> Well, me too.
>> I was like, that sounds like BS. Then I read the paper and I was like, oh, this doesn't suck at all.
>> Maybe a little bit.
>> Well, no. At least your high-level conclusion is both intuitive, for a person who's read a lot of software engineering research, and well justified. I have had people argue with me about the 16-developer thing, but I don't think that actually matters in this particular case, because they're actually a fairly good control set, more or less, for an experiment: they remove a lot of validity concerns by being experts. So it's true that they don't represent the broad spectrum of developers, but they also remove a lot of the variance you would expect from the population, and they serve a sort of epistemological function of, hey, let's isolate that factor and see what happens. I like that, and I thought the way the study was conducted was completely sufficient to draw the high-level conclusion that it draws.
>> Thank you very much. Here's a curiosity. We haven't published this, for organizational reasons that I won't go into, but we did conduct it. People would throw out their various explanations for what's going on here, many of which have lots of merit, some of which I'm more skeptical of, and a natural one is brownfield versus greenfield projects. So we ran this kind of enormous hackathon where we randomized half of the teams to use AI versus not, about as maximally greenfield as you can get. Then we had a bunch of judges score them, many judge scores per project, to try to even out that noise, and we'd see: is it the case that the bottom 50% are all in the AI-disallowed group and the top are all in the AI-allowed groups, or something like that? Now, unfortunately, it was even smaller; that's part of the reason we're not publishing it. I think the evidence is really quite weak; the degree of overlap is enormous. The point estimate, and I'm a bit nervous about saying this because it hasn't gone through the kind of review processes that something like this goes through, so maybe I messed something up, is something like four percentile points higher if AI is allowed versus if it's not, after controlling for everything else. That is extremely noisy and you shouldn't draw any conclusions, but seemingly maybe a kind of small effect from AI. Yeah.
>> So one question I have, and I guess this is related to the study and also to other research that you guys have done: have you explored the effect of AI in other domains, outside software engineering specifically? And if so, have you also found this kind of surprising result where maybe there's no speedup?
>> No, I mean, those would be new directions, stuff that we have not done. We're interested in understanding the possibility of accelerating R&D, and coding is not the only kind of thing that happens at major AI companies; much more conceptual work happens too. I'd be very excited about working with math PhD students, or very different types of software developers, or running these kinds of studies inside major AI companies or large tech companies, or something like that. We're very interested in things that are, not necessarily directly, but somewhat close analogies to the large AI company case. To the extent that something really deviates from that, we're probably less interested.
>> Interesting. So I guess it sounds like you're interested in measuring capabilities for math research and some other research.
>> Yeah, I'd say I'm interested in what the hell is going on in AI, and how I'm going to learn the most about what the hell is going on in AI. Something a bit more conceptual, something where fewer humans are currently working on it, so it appears less in training data, will help me better triangulate the truth about what's going on in AI. Even if I don't care about math research in particular, it'll still let me draw helpful qualitative lessons, is the sense I have.
>> Yeah. I mean, if I was going to pick an area where I would expect it to be more successful, but where I think it is actually being less successful, I would pick data science as an interesting one. Like, how are a bunch of data scientists being helped by AI today?
>> Say more about what you expected to be less successful.
>> So let me give you an example from a real setting.
>> Yeah.
>> Google, LinkedIn: at LinkedIn there are 5,000 tables with the name "impressions" in the table name, right? So if an analyst wants to understand how many impressions happened on a page, where the hell do they go? A human being can't figure that out.
>> Yeah.
>> Today, there is no existing AI system that we have that can be hooked into a corporate environment like that and process through it; there are trillions of rows in those tables. So what a data scientist needs to do is say, I need to analyze a bunch of data and come to a conclusion, right? And I hear lots of thoughts about building systems; people talk about natural-language-to-SQL, and the models are much better at writing SQL than they used to be. But I believe that the state of the underlying data is so bad that the actual data scientist is going to get way less value out of the AI than software engineers are.
>> That is interesting. That's very curious. So one view that some more bearish people have, looking at the future of AI, is that there's so much knowledge embedded inside of companies that you're not going to pick it up from these RL training-environment startups or anything like that. Maybe it's not the natural state of things that there need to be many specialized AIs; much of the lesson of the past few years is that one big general AI seems to be more performant. But at some point in the future, when data is locked up inside of companies, we may see more of a proliferation of many more specialist models, like GPT-N fine-tuned on LinkedIn data in particular, something like that. I have one reaction that's kind of like that.
>> Yeah, I don't know.
>> And I do have a disbelief reaction too, like, ah, you know.
>> But also, there are contradictory facts. One of the problems is that all these data sets contain contradictory facts. The name of the field will be something like date_started, or time_started, right, and it will contain only a date, except it will only contain the date up until November of last year, and after that it will contain only the month, and then after that it'll contain maybe the seconds at which the thing finished. And in order to actually successfully query the data set, you, the data analyst or the data scientist, have to know what those cutoff dates were, and they're not written anywhere. Although what you could do, theoretically, is import a bunch of the SQL that other analysts have written, to try to figure out how they triangulated these things and work backwards from those reports. But today, I think, for example...
>> Sorry, I just haven't worked at a large company. People don't fix this at the source?
>> Oh, no.
>> I feel like the lesson I've learned over and over again is that data specs really matter, really matter. I've also been working in data analysis and developer research...
>> And yeah.
>> ...and the problem is that their job is to produce this report for this executive, right, not to go build infrastructure for producing this report.
>> But I'm like, if I... okay.
>> I'm with you, I live that dream every day. What you end up having to do, right, is build out the infrastructure for it; that has to be part of the job description. And the other part is you have to fix the problem at the source. I still remember having a conversation where someone said it's too difficult to fix it at the source, because there's too much complexity in all the systems that depend on the source. And I said, okay, wait a minute: you're saying it's too complicated to solve at the source, a problem that is too big for the entire organization to solve.
>> And it's somehow easier to solve downstream? Come on. That doesn't make any sense. I just think there's so much potential here, and I have not seen a lot of studies on how people who are working in that data space are experiencing AI. What's fascinating about that is that real ML is mostly data work: especially outside of LLMs, the majority of ML engineers spend most of their time doing feature curation, and trying to clean up bad data for feature curation, rather than direct model training. So, theoretically, the potential even for improving ML by enabling AI to be a better data scientist is huge. And my hypothesis is that if you went into this space, you would discover it is great at telling me how to write SQL, or how to write pandas or polars or whatever you're using; it is okay at doing very trivial things; and it fails completely at all complex tasks. I don't even know; I haven't even set up a benchmark for it.
>> Can you give me an example of a complex task?
>> Sure. Let's say a complex task is: give me the P90 of time between deployments, for all deployments that happened at Capital One.
>> It struggled at that?
>> Yeah.
>> That doesn't seem surprising to me.
>> That seems surprising, right?
>> Well, if it has reasonable context about where it would find this kind of data...
>> Right, sure, makes sense. And then, okay, so fine: give me that number, and then also make sure that you can break it down by team hierarchy, so give me it in a table so I can break it down by team hierarchy.
>> Where is the team hierarchy data? Oh, and here's a funny thing: what PRs were in those deployments? How do I actually determine when the deployment started and ended? Because it turns out that's not clear in the base telemetry, and you have to know magic to figure out when the deployments started and ended. Oh, and also, for my ability to analyze it, tell me how many PRs were in each of those deployments and which PRs went into each of those deployments. Well, guess what, the deployment system only... this isn't being recorded, right?
>> I think it is being recorded.
>> Okay.
>> Yes. But before you...
>> So then imagine the deployment system doesn't contain sufficient information about that data. Then where do I get that data? It doesn't exist in any other system, so maybe I have to go to GitHub and call the GitHub API, and the chance of the LLM or any agent figuring that out today is pretty minimal.
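For reference, here is a minimal pandas sketch of what the "simple" core of that task looks like once you assume a single clean deployments table with one unambiguous start timestamp per deployment; the table, column names, and data are hypothetical, and the point of the anecdote is that this clean starting state is exactly what real telemetry does not give you.

```python
# Hypothetical, cleaned-up version of the task: P90 of time between consecutive
# deployments, broken down by team. Real telemetry lacks clear start/end
# timestamps, team hierarchy, and PR-to-deployment mapping, which is where
# agents currently fall over.
import pandas as pd

deployments = pd.DataFrame({
    "team": ["payments", "payments", "payments", "auth", "auth", "auth"],
    "started_at": pd.to_datetime([
        "2024-01-01", "2024-01-03", "2024-01-10",
        "2024-01-02", "2024-01-04", "2024-01-09",
    ]),
})

deployments = deployments.sort_values(["team", "started_at"])
# Gap (in hours) between each deployment and the previous one on the same team.
deployments["gap_hours"] = (
    deployments.groupby("team")["started_at"].diff().dt.total_seconds() / 3600
)

p90_by_team = deployments.groupby("team")["gap_hours"].quantile(0.9)
print(p90_by_team)
```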
Hm. Yeah, relative to my colleagues I'm pretty bearish on AI progress, but I do still have some reaction that's like, ah, can't you spend a day getting this into a Cursor rules file, you know, where the hierarchy lives?
>> That's why I think it's interesting. I think it would be worth studying. I have not seen any real comprehensive study on the experience that data scientists are having.
>> If you have any ins to us running studies at large tech companies, then I am all in.
>> The only... there is a fellow at OpenAI that I was talking to, one of the speakers here, who does internal evals, and he has mentioned that he's done some work with data scientists. So he might know some people who have that data, but it's all been internal, between him and, you know, Anthropic or whoever, right. And one of the other groups I'm curious about is lawyers. I'm curious about more traditional, older professions: lawyers, doctors, and mathematicians are all really interesting to me, just because lawyers and doctors are both so constrained by a legacy history of the constraints around them and how they work.
>> Yeah, and legal issues, I'm imagining, continue to be a significant bar.
>> Yeah. And they're stodgy. I'm also interested in what the...
>> Stodginess I feel like I'm less bought into as a long-term explanation for economic outcomes. The legal restrictions continue to hold through time, whereas with stodginess, I can set up a new law firm that's less stodgy and then take the previous law firm's business, or so it seems.
>> I agree, I don't think it's persistent. I just think it's interesting. One thing that would be interesting to see is whether that affects the mental model they have today, like how they've been talked to about it, or how their trust in it affects how they use it.
>> It would be interesting to know. I don't know that it's a worthwhile study; it's more one of those things I wonder about idly. You take a lawyer who just got out of school and has spent a lot more time using ChatGPT, and you take a lawyer who's been in the business for 50 years, who has a giant folder full of Word docs containing all the briefs that all their junior associates have written for decades and decades, and he just opens up those briefs, changes a few words, and sends them out to the judge, and he's known those judges for 30 or 40 years, he knows exactly what they want. Is he going to get any value? But is there value he should get? Is there some way he would be helped by AI? I certainly know discovery in law is a huge, huge problem, and I know there's Harvey; I don't know anything about what success they've had, and a lot of people are working in that space specifically, it's an ongoing thing. There's always technology for it, but the adoption of it is a very different thing.
>> That's the thing, right? Because, and I have a little bit of a legal background, one of the first things that I thought of the first time, when GPT-3 came out, was: oh, this could totally change discovery. Discovery is the most painful and most difficult and most expensive part; that is the expensive part of having a lawsuit. You could have serious social consequences by making discovery less expensive. So you could actually have a significant impact on society if you could make discovery cheap and instantaneous and reliable.
>> Yeah.
>> I have a question.
>> Yeah.
>> Sorry, the scatter plot, right? The first one, the 50-hours one.
>> Oh, I see. Yeah. This one?
>> Yes, that one. So you're saying that there was no difference for the developers using Cursor. We're talking about the IDE for vibe coding, and they used it for 50 hours. I was very intrigued by that, because everyone talks about vibe coding and how Cursor is instrumental to it. How did you get to 50 hours? Just curious.
>> So, how they arrive at 50 hours: this includes time that developers have spent in the experiment, plus their past experience. For some developers, working on some issues as part of the experiment, they have gotten to more than 50 hours of Cursor experience, and that's just coded up in that bucket at the end.
>> And was it the same task for each group?
>> No. These are natural tasks that pop up on the GitHub repositories, which, as mentioned, are kind of... I'm a little bit nervous about saying they're weird because of what that implies; I want to say it's very interesting and it's very weird, and it's interesting for the same reasons it's weird. These are projects in which the developers have an enormous amount of mental context built up that the AIs might not have, projects they've worked on for many, many years. I'm not sure this is always the case, but I imagine that they basically know how to execute on the particular task in front of them before they even go about attempting it, because they're so expert in the project.
>> The positive speedup, is it like 5%? How do you quantify the positive speedup?
>> So, let's go to this one instead. On the left-hand side we have the averages for what the developers say will happen, in terms of their time to complete, if their issue or task gets assigned to the AI-disallowed or the AI-allowed group. They think that if AI is disallowed it will take them a bit more time, closer to two hours, and more like an hour and a half, or a little less, if AI is allowed. But then we randomize the particular task to allow AI or not allow AI, and it turns out that if we randomize to AI-allowed, the times are more like a bit above two hours rather than a bit below two hours. And then you can think of the change in time estimate as, roughly, one divided by the other. It's not quite that, for reasons I can go into, but effectively, what is the transformation? It's something like AI-disallowed over AI-allowed, minus one.
To draw that out: I might ask, what's the speedup? Is it 1.1x, as in, these developers are going 1.1 times faster (we're actually on a time-to-complete scale, not a speed scale, but ignoring that detail)? Is it 1.5x? Is it 0.5x, so they're actually going twice as slow? How would we get that information? We'd do something like take the AI-disallowed times divided by the AI-allowed times: if the disallowed times were, say, 1.1 times as long as the allowed times, then we'd get a 1.1x speedup. It's something like that that's going on. And in fact, we find a slowdown.
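A worked version of that arithmetic, with made-up numbers in the spirit of the figures mentioned above (roughly two hours observed without AI, a bit more with it); the study's actual estimate comes from a regression on completion times, not this simple ratio, so treat this purely as the definition being described.

```python
# "Speedup" as described above: AI-disallowed time over AI-allowed time, minus one.
# Positive means AI helps; negative means a slowdown. Numbers are illustrative only.
ai_disallowed_hours = 2.0   # roughly the observed time without AI
ai_allowed_hours = 2.4      # roughly the observed time with AI

speedup = ai_disallowed_hours / ai_allowed_hours - 1
print(f"implied speedup: {speedup:+.0%}")   # about -17%, i.e. a slowdown
```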
>> I just read a fascinating article, I can't remember the company, where a journalist was allowed to do a pull request using vibe coding. There was some feature, and AI was used to assist with building out the requirements, and according to the article she practically just did a couple of tweaks and then signed off on it. It was really fascinating; that was the whole vibe coding thing.
>> Yeah, vibe coding.
>> Right, that was the whole thing: she didn't have any software development background. I was just curious whether you've tried to do a study on that.
>> So I definitely share that: if you've got no idea what's going on, then probably there is going to be some significant speedup there. I'll say, number one, it's not a priori obvious. We went out and did this hackathon with very experienced people and much less experienced people and tried to see what happened, and what we found is that the judge scores were extremely noisy, and I think you shouldn't believe them, but the judge scores were not that much higher when AI was allowed versus when it was not; people weren't actually making that much more progress. And then another thing, and I think there's going to be more expertise in this room than I have: my understanding, from sitting with these open source developers for a while, and not being a very capable developer myself, is that the quality bar on the repositories in this study is typically just very high. So I would be very surprised if a journalist, frankly even a good software engineer without lots of experience on the repository, but certainly someone who wasn't a software engineer, was able to get up a clean PR on these repositories first time. In fact, I think that's a lot of the story for what's going on here: the AIs actually do make progress in the right direction some good fraction of the time, but for various reasons, sometimes correctness, sometimes how they've tried to solve the problem, whether that's the typical way of solving it, how various parts of the project speak to one another, these kinds of considerations, they haven't properly accounted for them. And so the humans not only need to spend expensive time verifying; they also need to clean up all that stuff. My sense is that someone who didn't have all that experience basically wouldn't know how to do that step, and so wouldn't be able to submit a clean PR to these repositories. Relative to these people, at least, I suck at software development, and I'm getting up PRs internally all the time, and I think they're worse quality, though they're getting better over time. I do believe that people are coding when they otherwise wouldn't be able to code, and submitting PRs at a lower quality standard when they wouldn't have been able to do it at all. But getting up these expert-level PRs, I do feel kind of skeptical.
>> And that's actually part of what I was getting at: PRs from more novice folks often get rejected on these bigger, high-quality projects for no reason other than the developer-ergonomics impact of the PR, right? The fact that it makes it harder for me to maintain in the future. For an open source project, almost all the incentive is biased towards making it easier for me to maintain the project, so every time a PR comes in, if it doesn't make it easier for me to maintain the project, I have a tendency to reject it; if it does make it easier to maintain, then yay, I'm into it. That is unlike a typical business context, where the most important thing is actually to get something done, because the fact that someone's going to have to spend a lot of time maintaining it is almost job security. For open source it's the opposite: what causes people to leave projects is when they're difficult to maintain. So it is a different bias on what you accept in pull requests.
>> Can you remind me of the name of the English gentleman who maintains the Haskell compiler?
>> Um...
>> Simon...
>> No, I can't remember; no name recall.
>> So here's one story that might be relevant. The repositories in the study all broadly have these characteristics, and one of them is the Haskell compiler. Famously, on the Haskell compiler, there's some chance, I don't know if it's 50% or 30% or what, but there's some chance that if you submit a PR, the...
>> I'm being recorded...
>> Simon, Simon...
>> Marlow, maybe.
>> I'm not sure. The creator of the Haskell compiler will come into the comments and argue with you for many, many hours, much longer than you spent working on the pull request, until the PR hits exactly his specifications. Combine that with the remarkable fact, I think, that for the median PR in the study, the time the developers spend working on the code post-review is zero minutes. That is, the median PR is essentially perfect the first time around, because the professional incentives of these developers are like that. Now, there's a very long tail: on one of them, I think this gentleman, Simon, literally pops up and argues in the comments for many hours, and that one takes a lot longer. But yes, they are maintaining an extremely high bar.
>> I'm interested in the other upcoming stuff that you had in your talk.
>> Yeah, let's do it. So, I guess let's go in order. As I think you mentioned, if capabilities as measured by time horizon keep doubling, it does seem very challenging to keep up with that. In the short term we have a number of directions for getting on top of it, and I think that will last through the year; through two years seems challenging but I think still possible; through three years it starts to get harder and harder. Anyway, in the short term: building these much longer tasks, and ways in which we might get around the problem entirely. For instance, here's one thing that might be somewhat...
>> You could also raise the accuracy bar.
>> You could raise the accuracy bar, although the reason we're interested in this in the first place is that we're asking: is GPT-5 existentially dangerous? And the answer is no, I think.
>> But why do we think the answer is no?
>> Okay, I think there are multiple reasons, but at least we can say GPT-5 is just not that good at stuff. You're trying to get it to do data science on these columns that have very similar names, where it's not clear exactly what logic led to the columns, and it's not doing that sort of thing. And wouldn't you need to do that sort of thing, not exactly that, but be capable of stuff, to be existentially dangerous? It's not capable of stuff. So when I see the AIs failing at these hard tasks, I think, great, whatever. But if it's succeeding 90% of the time rather than 99% of the time on very challenging long tasks, then there are other arguments we can retreat to, and I'm prepared to retreat to them, but at least the naive one, that the reason it's not existentially dangerous is that it's not capable of stuff, starts to get pushed on as an explanation. So here's another possibility for something we could do.
here we had uh time horizon uh whilst
uh being
watched
very
closely
and then time. Okay. And then you know
maybe the the time horizon whilst not
being watched very closely is like this
but the time horizon whilst being
watched very closely is like this. That
is like if um uh uh you know I I I hope
that major AI companies have um
monitoring setups uh going where you
know for example AI are looking over
transcripts of of AI performing work to
see if anything suspicious is going on.
And you know it's not it won't be
impossible perhaps I conjecture to to
fool those monitors but you know be
challenging to to to fool those
monitors. So you know maybe you can do
you know 40hour tasks without 40hour you
know dangerous capability relevant tasks
um without any monitoring but only like
2-hour long tasks if you're if you're
being watched very very closely
>> being watched like if there's another
model evaluating the output. This is the
kind of thing. Maybe you escalate to
humans and
>> with them without safety controls. You
mean
>> Yeah. Yeah. Yeah. Roughly. Yeah. Um and
and then you know now now we've bought a
load more doublings. We can we can sort
of keep keep doing the capability
extrapolation thing. That's one of the
kinds of things I'm about in addition to
creating longer tasks.
>> Yeah. I imagine some of the model shops do have evaluations of capability with and without safety measures, because I'm sure there's an argument between their researchers and their safety teams.
>> Yeah. Yeah.
>> I feel like I have seen something about this, but not a lot.
>> Yeah. I guess I think this might be an especially quantitatively important consideration; I expect that it would reduce the effective time horizon by maybe an order of magnitude or two. But I agree that there are some important senses in which it's not really a difference in kind.
>> Yeah, of course. Then I would also worry that publishing that encourages people to focus less on safety, or to argue against safety measures because of how they impact capability.
>> Yeah, I think there are lots of landmines in all sorts of safety work, not just this.
>> Oh, of course.
>> Okay, next thing. We have this trend, I spoke about it at the beginning, but is it going to continue forever? Is it a fact of the universe, or does it somehow depend on inputs, or on what you think about intelligence explosions, or something like that? Trying to think about where this line is actually going is a pretty active area of work. So are the ways in which this line, or the particular points, don't quite correspond to the thing I care about. One obvious way is how these models are being judged: I think the algorithmic scoring that we use on METR tasks is importantly more robust, or covers the relevant concerns better, than might be the case with just SWE-bench-style unit tests, but it still has a lot of the same character. There are considerations, like being able to build on this work in the future, beyond the immediate problem facing you, that aren't being captured by METR's scoring. And maybe if you did capture that, you'd get something a little bit like going from 50% success to 80% success: you can do hour-long tasks if it doesn't matter whether you can build on the work, but only 30-minute tasks if it does matter. So, bringing these numbers closer to something I care about, and then projecting out, both if there are compute slowdowns and if we're going to enter some regime where AIs are building AIs and that leads to some steepening of the curve, these kinds of considerations are another thing I'm thinking about.
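One way to picture the 50%-versus-80% point: fit success probability against log task length and read off the length at each threshold. This is a hedged sketch with invented data and a generic logistic fit, not METR's actual pipeline or results; it just shows why raising the success threshold shrinks the reported horizon.

```python
# Sketch: estimate a "time horizon" as the task length at which predicted
# success probability crosses a threshold, from (task_length, success) data.
# Data and fit are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

task_minutes = np.array([2, 4, 8, 15, 30, 60, 120, 240, 480, 960])
success      = np.array([1, 1, 1,  1,  1,  0,   1,   0,   0,   0])

model = LogisticRegression().fit(np.log(task_minutes).reshape(-1, 1), success)

def horizon_at(p: float) -> float:
    """Task length (minutes) where predicted success probability equals p."""
    b0, b1 = model.intercept_[0], model.coef_[0][0]
    return float(np.exp((np.log(p / (1 - p)) - b0) / b1))  # solve b0 + b1*log(t) = logit(p)

print(f"50% horizon: {horizon_at(0.5):.0f} min")
print(f"80% horizon: {horizon_at(0.8):.0f} min")  # shorter, since the bar is stricter
```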
And then, capabilities measurement from new angles. Here's one history of METR that I think is not the accepted history, and probably not a very accurate history, certainly not the most accurate history, but here's one possible telling. Near the beginning, METR has early access to GPT-4 (I wasn't there, and I have no internal knowledge of this), and there are just Q&A data sets going on everywhere, LSAT data sets and the like, and you think: GPT-4 seems so smart relative to what went before, can it do stuff? So you try out some tasks. Can it do stuff? And the answer is, it can do some stuff and it can't do other stuff. And people say, oh, that's cool, you've tried this neat new kind of thing, getting models to do stuff instead of answering questions. Then later you think: well, different models come out over time, this model in January, that model in February; can they do different kinds of stuff if we test them on the same stuff? So we'll think of the most obvious summary statistic of whether they can do stuff, a single number that reflects whether they can do stuff, the time horizon, plot it over time, and see what happens. And that's kind of interesting. And then you ask, what's the next, in some sense dumbest or most obvious, thing you can do? Well, we'll run the most obvious RCT design: we'll allow AI or not allow AI, and we'll see what happens, and it'll be messy. There are lots of methodological problems that people point out, as there are with this work, but they are different kinds of problems, with different pros and different cons, and maybe these two different things give two different answers with two different sets of pros and cons, and we can kind of triangulate the truth from that.
And then now I'm like, well, can can we
pull that rabbit out of the hat one more
one more time? Are there or multiple
more times? Are there other sources of
evidence that have, you know, different
pros and cons that I that I won't
believe in fully, but they're different
pros and cons and they might give
different answers and so on and so
forth. Um, here are two suggestions, the
things I'm curious about at the moment.
The first is in-the-wild transcripts. Agents in Cursor, in Claude Code, and in whatever other products or services leave behind traces: the diffs they've contributed to code, records of their actions, their reasoning chains, and so on. The traces they leave in the wild are importantly different from this, where things are more contained and the task is neatly packaged. In the wild it's going to be like the example with the many confusing columns: whatever real crap shows up, and how do models handle that. There are important reasons not to believe that kind of information. It's not very experimental, and it's hard to know exactly what to make of it. But it has these important pros: it's more real, and the data on transcripts is enormous. Perhaps there's a lot you can learn there. That's one thing.
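As a sketch of what "learning from in-the-wild traces" could look like in practice, here is a toy summarizer over a hypothetical JSONL trace format; the field names (`session_id`, `action`, `diff_lines`) are invented for illustration and don't correspond to any real product's export format.

```python
import json
from collections import defaultdict

def summarize_traces(path):
    """Aggregate per-session action counts and diff sizes from a JSONL file
    where each line is one agent event (hypothetical schema)."""
    sessions = defaultdict(lambda: {"actions": 0, "diff_lines": 0})
    with open(path) as f:
        for line in f:
            event = json.loads(line)
            s = sessions[event["session_id"]]
            s["actions"] += 1
            s["diff_lines"] += event.get("diff_lines", 0)
    return dict(sessions)

# Example (assuming a file in the hypothetical format exists):
# print(summarize_traces("agent_traces.jsonl"))
```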
And then here's another one. There's this group you guys should check out called AI Village, where they have a lot of different models, or agents, living in a village, occasionally talking to humans, trying to accomplish fuzzy goals that are set for them, basically using computer use. They try to do stuff like organize an event at the park, run a human-subjects experiment, or run a merch store, stuff like that which isn't so clearly specified. And basically all the time they find that the models fall on their faces and suck. There are lots of reasons not to believe this evidence.
Here are some of the reasons. Number one, it uses computer use, and computer-use capabilities are considerably worse than CLI-based stuff, or text-based things in general, at the moment. Maybe you care more about text-based things anyway, because they're more relevant to the things you care about, and lots of GUI-based things can be converted into text things. Number two, there are all these different models hanging around in the village, and I'm like, why are there so many models? Why is there a village instead of just one big agent orchestration setup? I don't really understand what's going on there. Anyway, lots of reasons not to believe it. But on the other hand, it is models doing stuff in the world. It's not benchmark-style tasks; it's trying to accomplish some goal, and they can't accomplish even very basic subsets of the goal. I feel like that's extremely interesting. And I wonder if you could get rid of some of the most obvious cons: make it text-only, give them relevant text-based tools, work a bunch on elicitation to make the models more performant, get rid of the less performant models in the village, and so on, but then try to get them to do these fuzzy goals and just observe where they mess up. Like, step one went great, but then they became incoherent, or went into a strange psychological basin with one of the other models, or weren't able to interact with external services appropriately, or couldn't figure out their resource use. I'd be very interested, just qualitatively, in what goes on when you do that. Again, keeping in mind that, at least at the moment, I'm most interested in the ability of AIs to automate R&D, or in speaking to why that's not possible at the moment and why it might not be in the near future. Something shaped like this seems like it might usefully point to why that's not the case. Not sure exactly what's there, but yeah.
observation is that they
they are effectively neurode divergent
individuals, right? And none of our
world was not built for that. There's
everything that we have they're defined
for a human to do. They're shaped and
size to humans. Just like you know the
military, like you know how big are
packs? Well, it's based on how much they
think a person can reasonably carry,
right? And how much we expect someone to
handle for their taxes, that's based on
what we think a human can do. And
>> wow. and and they're and if you think
about neurode diverent individuals they
struggle with challenges with the way
the world's expectations don't align
with them and compared to a neurode
diverent individual these you know these
intelligence are really really different
right and so all of the rough edges
where they don't align with our world
that's why they needed assistant human
assistant in order to accomplish
anything real in our world is just too
hard uh for them
>> currently currently
>> they are I think Yeah. Yeah. Someday
change, but now they're just hopeless,
right?
>> I have to get really, really good. Our
world will have to change. One of those
two things,
>> You know, I agree, I so strongly share this sense. But if you ask me to really pin down why exactly that's the case, when they're beating GPQA experts on these extremely hard science questions and so on, exactly why are they not able to accomplish things in the world? Have you ever met a neurodivergent individual who was terribly good at something and yet completely useless at getting through life?
>> Yeah, yeah. They're all very good at reading books.
>> There are a lot of those people in the world.
>> It's not that surprising.
>> Although my only feeling about AI Village is, it's like saying, well, today is the 200th day my car didn't rocket off the Earth at escape velocity and fly to the moon.
>> Like, that's because you didn't build a rocket yet.
>> Yeah. Maybe I'm mischaracterizing, but I thought there was a lot of talk a year ago about computer-use capabilities being impressive today.
>> There was a lot of talk about it, and yet I've talked to almost nobody who has used them for anything practical.
>> Totally, totally. But if we move this to text only, and it seems reasonable to make it completely text only, would you still have the rocket concern?
>> No, I would... well, it depends on what the task was.
>> Sure.
>> Yeah, the kind of thing that a human could do over a CLI.
So I think this relates to the talk earlier today, where they talked about how one way to use these models effectively is, if you have a task, to figure out a way to present or transform it into something that is in-distribution for the model. I feel like this conversation ties into that: interacting with Chrome is less in-distribution than a CLI. So I think that could be an interesting area of research. If you're interested in exploring how well they can perform these really open-ended tasks, first create harnesses and interfaces that are much more in-distribution for them, so that that's less of a concern.
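As a toy illustration of that idea, here is a minimal sketch of a text-only harness: instead of asking an agent to drive a GUI, the environment is exposed as a small set of named text commands that the model can call through plain strings. Everything here (the tool names, the dispatch loop, the fake `ask_model` function) is hypothetical scaffolding, not any particular product's API.

```python
# A minimal text-only "harness" sketch: the task environment is exposed as
# named text tools, so the model only ever sees and emits plain text.

def list_files() -> str:
    # Stand-in for a real filesystem tool.
    return "notes.md\nbudget.csv"

def read_file(name: str) -> str:
    # Stand-in; a real harness would read from disk.
    return f"(contents of {name})"

TOOLS = {"list_files": lambda arg: list_files(),
         "read_file": lambda arg: read_file(arg)}

def ask_model(observation: str) -> str:
    """Placeholder for a call to an LLM; returns a 'tool: argument' string.
    Hard-coded here so the sketch runs without any API."""
    return "read_file: budget.csv" if "budget.csv" in observation else "list_files:"

def run_episode(steps: int = 2) -> None:
    observation = "Goal: summarize this project's budget."
    for _ in range(steps):
        command = ask_model(observation)
        tool_name, _, argument = command.partition(":")
        result = TOOLS[tool_name.strip()](argument.strip())
        print(f"{command!r} -> {result!r}")
        observation = result

run_episode()
```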
Yeah, I think that also speaks to the point about the quote-unquote neurodivergence of models. It's not so different from management skill: giving appropriately scoped tasks to your very talented interns, or your very talented neurodivergent interns, something like that. I do think that's right. Sorry to be so repetitive, but from the perspective of capability explosions and automating R&D, maybe the models will get extremely good at scoping tasks for themselves so that everything looks benchmark-style. But if they can't do that, well, there are a lot of things that don't look like benchmarks that crop up in the real world, and you need to be able to work with them flexibly if you're going to do something as complicated as automate a major AI company. So I think it can both be the case that the AIs are incredibly performant on some particular type of problem, or on problems you've made similar in scope or shape to the type they're best at, and also that they can't flexibly substitute for human workers, because that substitution requires setting up the problem appropriately yourself, or not having those constraints at all.
>> Yeah, it is interesting, though, to your point about new capabilities, to think of almost another axis on the graph that you have. I wonder if there's not just a time-horizon issue but also a task-category, or type-of-work, axis. Computer use is one of those examples, right? Think about a capability that would require computer use versus a capability that can be accomplished entirely in text.
>> Yeah. So, yeah, sure.
>> But almost all these benchmarks are basically text.
>> Yes, yes. And indeed, the ones that aren't, the ones that require vision capabilities, are notably lacking. I'm not sure exactly what to make of this graph. One thing I make of it is that there's probably not so much variation in slope, or doubling time, across task distributions, though I think there's only weak evidence for that. But in intercepts, the base of where we are now, there's possibly a great deal of variety, especially on image-like capabilities versus not, and physical abilities even more so.
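A minimal sketch of how you might check that "similar slope, different intercept" hypothesis (the categories and values are entirely made up, not METR data): fit a separate log-linear trend of time horizon against release date for each task category, then compare the fitted doubling times and levels.

```python
import numpy as np

# Hypothetical (release_year, time_horizon_minutes) points per task category.
data = {
    "cli_coding":   [(2023.0, 8), (2023.5, 16), (2024.0, 30), (2024.5, 65)],
    "computer_use": [(2023.0, 0.2), (2023.5, 0.4), (2024.0, 0.9), (2024.5, 1.7)],
}

for category, points in data.items():
    years = np.array([p[0] for p in points])
    log2_horizon = np.log2([p[1] for p in points])
    slope, intercept = np.polyfit(years, log2_horizon, 1)   # doublings per year
    doubling_months = 12.0 / slope
    level_2024 = 2 ** (slope * 2024.0 + intercept)          # fitted horizon in 2024
    print(f"{category:13s} doubling time ≈ {doubling_months:.1f} months, "
          f"fitted 2024 horizon ≈ {level_2024:.1f} min")
```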
>> Yeah, right, exactly. You could even go through the senses, right? You could go through something tactile: today they would all score zero. Nothing is tactile, so the graph can't tell you anything about anything tactile.
>> Well, in producing this graph we try to make the models as performant as possible on some held-out set, so, you know, we do try to give them some tactile-adjacent stuff.
>> I'm not sure they'd score zero.
>> Sure, sure. We do have some examples.
>> Yeah, spatial judgments, things like that. But we've obviously seen fine motor control and stuff like that with other robotics work.
It's just, I don't even know if anybody, maybe somebody has, listed out all of the capabilities we would expect in the future. If we actually wanted AGI, what is the entire list?
>> That's a way to start a debate that doesn't end. I think Basil Halperin and Arjun Ramani hopefully have a paper on this out in a small number of months.
>> Yeah. And that would let you think about where we're at, and whether all of the capabilities, all the capabilities we currently measure, follow the same log-linear trend.
>> Yeah, it does seem like a reasonable null hypothesis, to you as well as to me. Not a certainty, I mean, who knows. Yeah.
Oh, there was something I wanted to add there. Here's another thing I'm thinking about. It's not super research-y, although kind of. So, some people like me are somewhat skeptical of a software-only singularity, that is, the idea that you could automate AI research without also automating chip design, and maybe chip production as well. The worry is that you'd quickly get bottlenecked by compute, because for fixed hardware there are only so many experiments you can run that will be sufficiently productive to push progress upwards. But even for people like me who are skeptical of that, you might think that chip production is in fact going to get automated: the robots are coming, they can do the stuff humans do, and then maybe you really do have a fully self-sustaining robots-plus-AI economy. So you'd have some slow trend from compute slowing down, but then a surge back upwards once the whole thing is in a tight loop. One interesting debate I heard about recently, and would like to think about more: in the public discussion there's some sense that robotics capabilities are lagging LLM-like capabilities so much because of training data, something like that, or maybe because of hardware constraints.
>> I'm curious whether it really is to do with hardware constraints; like, what exactly are these hardware constraints? If we put a superintelligence inside, hypothetically, the robot hardware that exists today, could it build chip production facilities? I have no idea, because I'm beyond novice here, but it's not obvious to me what the answer is. I think it's kind of plausible. I'm not sure you need very flexible fine motor control in order to do it, and also I think maybe the fine motor control is there, subject to having a superintelligence controlling it.
>> I mean, to be fair, the key aspects of chip production are already done by machines.
>> Oh, but I'm also thinking about building the robots, and...
>> Yeah, the whole thing, you know.
>> And I'll tell you, I have a friend who spent most of his career doing software development but during COVID started working on manufacturing things to help people, and he found out how hard the manufacturing world is and how slow the iteration process is. He put it like this: he knew it was going to be worse, but he didn't understand that it was next level, like an order of magnitude worse. From our perspective, people who don't do it, it seems like, how bad can it be, right? But the feedback I've had from everybody who actually works in that space is that it's way, way different.
That's what I've heard as well. I've only talked a little with people who work in fabs, but I was surprised when I did talk to them at the level of human expertise required to work at the fabs. A lot of those jobs are fairly high-paying engineering jobs, because that's what it takes to succeed there.
>> Also, the rate of improvement is glacial compared to software, right?
>> I think that's partly because it costs billions of dollars to build a fab.
>> Iteration is a huge cost in time and money. It's brutal, right? So I think that's why it's been hard to get it all the way there. Give them a couple more centuries, maybe they can get it done.
>> Is that really your view? Centuries?
>> Centuries. I do, I do. I think I'm skeptical, like you, about how easy some of these tasks are.
>> Yeah.
>> We think they're easy, but in my experience... I remember when the self-driving thing came out, when people were pushing that. I actually worked in that space for a while, and I get that we can get really close, but getting all the way to something that's acceptable is extremely difficult, right? We underestimate how much work is involved in getting that last little bit done. The first 90%, I knew we could do that with computers ten years ago, pretty much, but getting the last bit, so that everyone's happy with it...
>> Yeah, that's the work.
>> I feel this myself, you know. I didn't get a driver's license when I was younger because I expected self-driving cars to come. So yeah, totally, but it hasn't been that long, you know, and they're expanding to the entire Bay Area.
>> They're going to get there. I don't think it's going to take...
>> But is the robot economy, building out chip production, going to take centuries?
>> I don't know. Well, I could see that it might. Part of the trick with self-driving is that the economic incentive is moving it along faster, right? And probably the same for the robots-building-robots kind of thing, but also...
>> Yeah.
>> You know, where we're at right now, RepRap is about as far along as we've got with robots building robots, right? Which is...
>> Oh, but I feel like, you know, is that paying sufficient attention to the chart? GPT-2: 2019. It's so recent, you know. This is somewhat nonsensical, but I'm like, maybe we're in a sort of GPT-2 moment.
>> No, it's a fair point. I could be wrong. It's just my guess that it's going to take a lot longer than we think.
>> At least to be able to do real mass production.
>> Yeah, at a scale that causes the kind of global impact you're talking about. Right. I think they can already do a great job building one-offs; robots are very good at one-off builds at a small scale, but it's totally impractical at a large scale.
There is one fact I think is kind of remarkable, maybe it's this one, yeah: the rate of growth of compute put into robotics models is about the same as for LLMs, but the levels are about two orders of magnitude lower. I'm curious what we'd see if that gap got closed. It does seem like considerably more capable robots are very much on the table as something that could happen quite soon if it did. I'm not saying all the way, I'm certainly not saying chip production, it just does seem like there's some sort of data and compute overhang there.
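For intuition, a toy calculation (all growth rates hypothetical, not figures from the talk): if robotics training compute grows at the same rate as the frontier, a 100x level gap never closes; it only closes if robotics compute grows faster, and the closing time depends on the ratio of the two growth rates.

```python
import math

gap = 100.0                 # assumed level gap: two orders of magnitude
frontier_growth = 4.0       # hypothetical frontier compute growth per year (x)
robotics_growth = 10.0      # hypothetical faster robotics compute growth per year (x)

# The gap shrinks by (robotics_growth / frontier_growth) each year.
relative = robotics_growth / frontier_growth
years_to_close = math.log(gap) / math.log(relative) if relative > 1 else float("inf")
print(f"years to close a {gap:.0f}x gap: {years_to_close:.1f}")
```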
>> Yeah, intuitively.
>> That's interesting.
>> I'm also thinking that you don't just need to scale data. You can also scale parameters on the same amount of data; there are flexible ways to use compute to close some of that gap.
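A minimal sketch of that parameters-versus-data tradeoff, using the common rule-of-thumb approximation that training compute scales like 6 × parameters × tokens (the specific budgets below are hypothetical): for a fixed data budget, spending more compute means training a bigger model.

```python
def training_flops(params: float, tokens: float) -> float:
    """Common rule-of-thumb approximation: FLOPs ≈ 6 * N * D."""
    return 6.0 * params * tokens

# Hypothetical fixed robotics data budget (tokens-equivalent) and compute budgets.
tokens = 1e10
for compute in (1e21, 1e22, 1e23):
    params = compute / (6.0 * tokens)   # largest model trainable on that data budget
    print(f"compute {compute:.0e} FLOPs -> ~{params:.1e} parameters on {tokens:.0e} tokens")
```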
>> Interesting.
>> Yeah.
>> It just gave me a very interesting overview of where AI is going in fabrication.
>> And what does it say?
>> So it says there are a lot of areas where it's probably going to help pretty dramatically in the near future, and a lot of it is the computational aspects. There are a lot of computational aspects that are extremely expensive, like designing the mask, basically the holes you're using for the laser to get the transistors. Calculating that, how to build it, and ensuring it conforms to the spec you've written is extremely computationally expensive, and there's a lot of opportunity for AI to help there. There's also, theoretically, the possibility, since chip manufacturing is extremely precise but also fragile, for an AI to detect parameters that are out of whack and leading to potential failure in, say, imaging a wafer. That could theoretically dramatically improve yield, and yield is a big problem in chip manufacturing. The reason you get different speeds out of your CPUs is that there's actually just one line producing all those CPUs, and some of them come out better and some come out worse. That's why the higher-gigahertz models are more expensive. Like your Nvidia home GPUs: your 5040, your 5050, your 5060, your 5090 are all the same chip, right?
>> They just had different quality, different tolerances, essentially.
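To make the yield point concrete, here is a toy calculation using the standard textbook Poisson die-yield model (not something discussed in the talk; the die area and defect densities are made up): yield falls off exponentially with die area times defect density, which is why keeping process parameters in check matters so much.

```python
import math

def poisson_yield(die_area_cm2: float, defects_per_cm2: float) -> float:
    """Classic Poisson die-yield model: Y = exp(-A * D0)."""
    return math.exp(-die_area_cm2 * defects_per_cm2)

die_area = 6.0  # cm^2, hypothetical large die
for d0 in (0.05, 0.10, 0.20):  # hypothetical defect densities per cm^2
    print(f"D0={d0:.2f}/cm^2 -> yield ≈ {poisson_yield(die_area, d0):.0%}")
```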
>> Yeah. But the problem is that...
>> We have to cut the recording. They're going to kick us out soon, but feel free to continue the discussion.
>> Yeah, cool.
>> You can also hang out here, but I'm just going to...
The discussion centers on the relationship between compute growth and AI capabilities, noting that a slowdown in compute (due to physical or financial constraints) could significantly delay AI milestones. Studies on AI's impact on developer productivity, particularly for experts on complex open-source projects, show mixed results, with skepticism about self-reported speed-ups and observations of initial slowdowns. AI also struggles with messy, internal corporate data, limiting its immediate value for data scientists. The conversation explores new methods for measuring AI capabilities, including "in-the-wild" transcripts and "AI Village" simulations, often revealing current AI limitations in open-ended, real-world tasks. An interesting analogy posits AI as neuro-divergent individuals struggling to operate in a human-designed world. Finally, the automation of complex manufacturing processes, such as chip production, is viewed as an immense challenge that could take centuries, despite AI's potential to assist with computational design and yield improvement.