
METR's Benchmarks vs Economics: The AI capability measurement gap – Joel Becker, METR


Transcript


0:13

[music]

0:20

Hey guys, thank you so much for having me. My name is Joel Becker, and I work as a researcher, or member of technical staff, at METR, which stands for Model Evaluation and Threat Research. As we'll see in a second, I'm going to be talking about AI capabilities: how do we know how performant AIs are today, and how performant they might be in the near future, based on two different sources of evidence that seem to give somewhat conflicting answers? I could have done this whole talk without reference to METR papers in particular, but we'll look at two papers I've been involved with as examples of benchmark-style evidence and then more economic-style evidence. On the benchmark side, "Measuring AI Ability to Complete Long Tasks": this is the paper that comes with the charts many of you will have seen on Twitter and so on, and that METR is well known for. The second is an RCT measuring how allowing AI affects developer productivity. And then we'll talk about how to reconcile the gap that's implied between these two different kinds of measurements.

1:19

As I mentioned, METR stands for Model Evaluation and Threat Research. We are an independent research nonprofit that seeks to inform the public, policymakers, and labs about the degree to which AIs might pose catastrophic risks to society. The "model evaluation" part means that we seek to understand AI capabilities and propensities, and the "threat research" part means we try to connect those capabilities and propensities to potential catastrophic risks.

1:47

Okay. The first paper we're going to talk about is associated with this chart that many of you, I think, might have seen. Taking a step back before we dive into the paper: how do we usually think about measuring AI capabilities using benchmarks, on SWE-bench or GPQA and so on? There's some notion of 0% performance, or random performance; for GPQA that's 25%, which corresponds to the floor, the worst you can possibly do. Perhaps there's a human baseline that's below 100%; for GPQA I think it's something like 75%, representing expert human performance. And then of course you can potentially go all the way up to 100% on these kinds of benchmarks. But what does it mean if I'm getting 50% on GPQA, if I'm roughly halfway from the floor to the expert baseline? What does that really mean about how performant the AIs are? If I meet the human baseline, does that mean the AIs are now as performant as, or even more performant than, expert humans in a sense that I care about? It's hard to interpret.
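To make the "halfway from the floor to the expert baseline" arithmetic concrete, here is a minimal sketch of that normalization. The 25% floor and 75% expert baseline are the GPQA-style figures mentioned above; the function name and example numbers are illustrative, not from the talk, and as the talk notes, even the rescaled number is hard to interpret.

```python
def normalized_score(score: float, floor: float = 0.25, expert: float = 0.75) -> float:
    """Rescale a raw benchmark score so 0.0 = random guessing and 1.0 = expert baseline."""
    return (score - floor) / (expert - floor)

print(normalized_score(0.50))  # 0.5: "halfway" from the floor to the expert baseline
print(normalized_score(0.75))  # 1.0: matches the expert baseline
print(normalized_score(0.85))  # 1.2: above the baseline, but still hard to interpret
```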

2:50

Another thing you see from this graph is that benchmarks seem to have less and less time between coming online, giving any signal at all, and being fully saturated. It's harder and harder to create benchmarks that have plenty of signal, and that might be informative about how capable models are, for an extended period of time. So we're going to go about this a different way.

3:16

First, we're going to gather human baseline data for diverse tasks spanning a range of difficulties. You should think of these humans as experienced experts, but on their first day or first week on the job. These are not people with context on the tasks in particular; it's not exactly the kind of thing that's come up in their work before, but if it's a software engineering task, they're relevantly skilled general software engineers, and the same goes for the machine learning and cybersecurity tasks we'll talk about. The tasks come from three buckets, or task distributions. HCAST is a collection of software-based tasks that seem to require autonomy: interacting with tools, interacting with environments, thinking through the problem, not just Q&A-style datasets. The SWAA suite consists of atomic problems, problems that maybe GPT-2 can do and maybe it can't, like "here are four files, one of them is called passwords.txt; which file contains the passwords?" And at the other end of difficulty we have RE-Bench, which consists of challenging, novel, open-ended machine learning research engineering challenges that are very difficult even for top human experts.

4:32

In addition to gathering the human baseline data, we'll also measure AI performance, under conditions as close to identical as possible, for the AIs we're interested in on the same set of tasks. And then we're going to convert the time it takes humans to complete these tasks into an estimate of AI autonomous capabilities, as I'll show you in a second.

4:55

Here's an illustrative diagram, in this case for Claude 3.7 Sonnet, which was the frontier model at the time this paper came out. You can see that for the very short tasks, something like 4 minutes or below, Sonnet is getting the answers correct essentially 100% of the time, or maybe even literally 100% of the time here. For the very hardest tasks it's struggling, and then there's some range in the middle where we're somewhere between 10% and 90%. I'll say that this empirical pattern, where models are less performant at tasks that take humans longer, is not a fact of nature, but it's something we see pretty commonly and pretty robustly across models, at least on this task distribution, and I'd conjecture for other task distributions as well. So we try to fit this dark purple line to something like this data on how long it took humans to complete the tasks the models are attempting. We then call the point on the x-axis, the human time-to-complete axis, at which we predict the models will succeed 50% of the time the "time horizon" of those models. There's much to debate in the 50% number; I can talk later about the reasons why we chose it. Then we do the same exercise for the other models.

6:06

exercise for the other models. So here I

6:08

have uh claw 3 opus has a time horizon

6:11

of something like 4 minutes. That's

6:12

where we're predicting that it has a

6:14

success probability on this task

6:16

distribution of 50%. For 01 preview I'm

6:19

seeing something like 15 minutes so on

6:21

and so forth. And then of course all

6:22

these models you know they they come out

6:24

over um calendar time. So if we plot the

6:27

time horizon, the x-coordinate on uh on

6:31

on this set of plots against um against

6:33

calendar's time, we find something like

6:34

this. It looks, you know, kind of like

6:36

um kind of like an exponential trend

6:38

that's that's going up at some constant

6:40

rate. In fact, it doesn't just look like

6:42

an exponential trend. If we had a

6:43

perfectly straight line here, it would

6:45

indicate um a perfectly exponential

6:47

trend. um we we see something really

6:49

remarkably steady actually much more

6:51

steady than we were anticipating when we

6:53

uh went about doing this research

6:55

project

6:57

and that's continued to be the case. So

7:00

many of you will have seen updates that

7:01

we've made of of this graph on on on

7:03

Twitter. This is going all the way up to

7:05

GPT 5.1 CEX max. So extremely recent um

7:08

the predictions from this you know

7:10

shockingly straight line have have held

7:12

up very well I think.
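For intuition about what a steady exponential trend implies, here is a small back-of-the-envelope sketch. The roughly seven-month doubling time is taken from the "six to seven months" figure mentioned later in the talk, and the starting horizon and dates are made-up inputs; treat the output as an illustration of the arithmetic, not a forecast.

```python
from datetime import date

def implied_horizon(h0_minutes: float, start: date, target: date,
                    doubling_months: float = 7.0) -> float:
    """Time horizon implied at `target` if it doubles every `doubling_months` months."""
    months = (target.year - start.year) * 12 + (target.month - start.month)
    return h0_minutes * 2 ** (months / doubling_months)

# e.g. a 60-minute horizon, extrapolated two years forward:
print(implied_horizon(60, date(2025, 3, 1), date(2027, 3, 1)))  # ~646 minutes (~10.8 hours)
```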

7:16

Taking a quick step back, what are benchmarks, or this kind of benchmark-like evidence, telling us? One thing is that AIs can succeed at what for humans would be exceedingly difficult tasks. The tasks in RE-Bench are really far beyond my capabilities personally, and the AIs are having a good crack at them some decent percentage of the time. And the second, which is kind of obvious, is that progress is rapid.

7:42

On the other hand, how much stock should we put in the evidence suggested by benchmarks? What limitations might they have? Lots, but here are three that I'll note. One is, as I mentioned, that these baseliners are humans who are expert in some relevant sense but low context. It's something like their first week on the job: they haven't seen tasks exactly like this previously, they just have some relevant experience. Presumably people who not only have the relevant experience but are also highly familiar with the set of tasks would complete the tasks even sooner, and it's relative to those lower-context baseliners that the AIs come out looking more performant.

8:23

The second is that benchmarks can be low ceiling. To use GPQA as the example again, we're getting to the point where that benchmark is totally saturated, providing no additional information for marginal models, whereas time horizon provides a nice way to chain benchmarks together over time, in some sense. Nonetheless, it's still very hard to create these ever harder tasks when the time horizon of models is doubling every six to seven months or so, so even time horizon, or rather the benchmarks underlying time horizon, might be saturated before too long.

9:04

And the next one is not a concern limited to the METR tasks behind time horizon; it's also true of SWE-bench, and of many of your favorite agentic benchmarks, that the problems aren't very messy in some sense. They don't require a ton of coordination with humans. They're often set in relatively small, contained environments where not much can go wrong: not massive open-source codebases, or other settings in which the problems involve more interaction with the real world, or are messy in some sense.

9:36

So we did this project, and then early this year we were trying to think about how we could attack some of these limitations. What's a different source of evidence that might have its own pros and cons but, importantly, be more externally valid, in the scientific jargon?

9:56

Perhaps field experiments are the answer: more economic-style evidence. Here we might be interested in very high context developers who are expert on the kind of tasks they're already doing. Speedup, or some notion of productivity boost, seems to retain more signal even through a range that's superhuman according to benchmarks: perhaps GPQA is fully saturated when you're getting a 1.5x or 2x speedup, but you can still achieve a 3x, 4x, or 5x speedup even after that, so we maintain more signal. And the last point is that the tasks are messier. They are tasks that come up in people's real work; they're not synthetic, not small and contained. This is a real deployment scenario.

10:40

Here's what we're going to do for this paper. We're going to gather 16 experienced developers on large, mature open-source projects that we'll go through in a second. Each of these developers will, on average, complete about 16 tasks from their real work. These are issues on the relevant GitHub repositories, the kind of thing they might otherwise have completed, with the caveat that we're not going to include the very longest issues.

11:04

The tasks will be randomly assigned to AI-disallowed or AI-allowed. AI-disallowed means what you think it means: software development in 2019. No AI-powered tab autocomplete, no Cursor agentic coding tools, no LLMs via the web UI. Or a task can be randomly assigned to AI-allowed, in which case everything's on the table: any of the AI tools I just mentioned, or not using AI tools at all. If you're in the AI-allowed condition, you're not compelled to use AI; you just have the option. We buy these developers Cursor Pro, so for the most part that's the tool they're using, typically with Claude 3.6 or 3.7 Sonnet, the frontier models at the time we conducted this work. And then we're going to record the time it takes the developers to complete each task and see the degree to which they might save time when AI is allowed versus when it's not.
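As a minimal sketch of the comparison being set up here, the code below contrasts completion times under the two conditions using a ratio of geometric means, a natural choice when task times are highly skewed. The numbers are made up, and the study's actual estimator, which the talk doesn't spell out, may well differ; a ratio above 1.0 would indicate tasks taking longer when AI is allowed.

```python
import numpy as np

def time_ratio(times_ai_allowed, times_ai_disallowed):
    """Ratio of geometric-mean completion times (AI allowed / AI disallowed).
    Values above 1.0 mean tasks took longer, on average, when AI was allowed."""
    g_allowed = np.exp(np.mean(np.log(times_ai_allowed)))
    g_disallowed = np.exp(np.mean(np.log(times_ai_disallowed)))
    return g_allowed / g_disallowed

# Made-up completion times in hours, purely for illustration:
print(time_ratio([2.5, 4.0, 1.2, 6.0], [2.0, 3.5, 1.0, 5.0]))  # ~1.20
```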

11:58

These are some of the repositories; many of you will be familiar with them. We've got the Haskell compiler represented, we have scikit-learn, we have Hugging Face Transformers. These are, on average, a million-plus lines of code, and they've been around for 10-plus years. The developers working on these repositories as part of this study are, on average, the third top contributor out of hundreds, or in some cases thousands, of contributors, and they have personally been contributing to their repository for something like five years on average. These are top experts.

12:29

Some of you might have seen this graph too, and so the punch line has been spoiled; for the rest of you: we asked economics experts and machine learning experts, people at major AI companies and labs, top academics, some graduate students, and so on, how much they expect developers to save time when using AI. They say something like 40%, or a little less. We ask the developers themselves, the study participants, how much they expect to be sped up ahead of time, and they say something like 24 or 25%. Then, after the study has been completed, we ask the developers how much they think they were sped up by AI being allowed on the issues they completed as part of the study, and they say it will have sped them up by something like 20%. And the punch line is that we find developers are slowed down by 19%: they take 19% more time when AI is allowed relative to when AI is not allowed.

13:24

When I first saw the data coming in, and saw early versions of this plot, I thought presumably the same thing that many of you might be thinking right now: that we'd messed something up, that something had gone wrong, that there was some issue in how we'd set up the experiment. How could it possibly be the case? At the very least, these developers have access to the zero point, because they can choose not to use AI at any time. So we pored over many, many hours of screen recordings from these developers working on issues as part of the study. We dove into a bunch of hypotheses that might explain what's going on, and tried to categorize the things we think are contributing versus not. Much of this is listed in the paper; I'll just quickly go through some of the factors we think are contributing.

14:14

First, over-optimism about AI usefulness. That seems like an obvious one: even after the study is completed, the developers think that AI is going to be helpful to their work, so it makes sense that they might overuse AI on that basis. Two more: implicit repository context and high developer familiarity. These developers are coming to these problems already knowing the solution; they're so expert in this work that I imagine them not spending a bunch of time thinking through a solution that the AI could work through. Instead, they're mostly limited by how fast they can type, which means that using AI, and instructing the AI to do the work, comes with a significant time cost versus how they might otherwise have spent their time.

15:00

I think many of us have the sense that AIs might be less performant on large and complex repositories, which is a difference from the benchmark-style evidence, or from some previous work. And then, low AI reliability: maybe the AIs are quite performant on these kinds of tasks, but they're only performant 50% of the time, or 80%, or 20% of the time. So at the very least you need to check their work afterwards, and perhaps you even need to spend time correcting it, which is something we see quite a lot on these issues.

15:34

One factor with an unclear effect that I'll mention briefly, and can talk to people about later, is below-average use of AI tools, which came up in the public discussion. This is in the unclear column because there's evidence both for and against, and that's true for many of the factors here. We don't have anything very conclusive to say; we're still working on this line of work.

15:56

Here are some caveats, all important. First, we obviously do not provide evidence about all software developers or tasks. These are extremely experienced developers working on extremely complex, long-lived open-source repositories. In my own work, where I'm not as expert in the relevant sense as these people are and I'm working on much smaller repositories, I feel more comfortable saying that even at that time I was sped up by AI tools, even if the developers in the study weren't. This setting is weird, and it's weird for the same reasons it's interesting: this is an unusual developer population.

16:31

Second, the experiment is concentrated in March 2025. As I mentioned, we know that AI progress is rapid; perhaps this result will already have changed by the time I'm giving you this talk.

16:45

So there's a kind of puzzle suggested here: the benchmark-style evidence gives a very impressive sense of what AI capabilities look like today, whereas the more economic-style evidence, and I include labor market impacts here in addition to our field experiments, looks somewhat more bearish, or unimpressive. Why is the former not translating into the latter? At least naively, there seems to be a clash. How might we go about resolving this puzzle?

17:15

One possibility is that, in fact, we messed something up. This is still live and on the table. Maybe the developers really are not very capable at using AI, and if we continue to run this experiment, as in fact we are, they'll gain more familiarity with the tools and get productivity benefits they weren't getting at the time. I'm a little skeptical of that story, but it's one possibility.

17:38

Another, which economists like to bring up, is that we're not incentivizing these developers to finish quickly: we're paying them per hour, which we do for external validity reasons. Looking through their videos, I really do not think they're developing differently in accordance with these incentives, but that certainly is one possibility that's on the table.

17:58

Another possibility, more statistical in nature, is that this is a small study, and you shouldn't over-update from small studies. We are doing bigger things that I'm excited to release at some point. Okay, but let's assume we haven't messed something up, and that this is a result we think does hold up. How could we resolve the puzzle?

18:22

One possibility, as I alluded to briefly, is that reliability needs to be very high to save time. You need the answers to the problems the developers are putting in to be correct something like 95 or 99% of the time in order for developers to just tab-tab-tab through and not spend lots of time verifying the AI's work, which of course is pretty costly from a time perspective.
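Here is a small illustrative model (not from the paper) of why reliability has to be high before delegation saves time: assume the developer always reviews the AI's output and redoes the task from scratch whenever the attempt is wrong. All the times and probabilities below are made-up inputs.

```python
def expected_ai_path_minutes(p_correct: float, t_prompt: float,
                             t_review: float, t_redo: float) -> float:
    """Expected time when delegating: prompt + review, plus a redo if the AI fails."""
    return t_prompt + t_review + (1 - p_correct) * t_redo

t_by_hand = 30.0  # minutes to just do the task yourself
for p in (0.5, 0.8, 0.95):
    t_ai = expected_ai_path_minutes(p, t_prompt=5, t_review=10, t_redo=30)
    print(f"reliability {p:.0%}: {t_ai:.1f} min with AI vs {t_by_hand:.0f} min by hand")
# 50%: 30.0 min, 80%: 21.0 min, 95%: 16.5 min. Delegation only clearly wins
# when reliability is high and review is cheap.
```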

18:47

Another possibility is the difference between SWE-bench-like, algorithmic scoring, which is costless at the margin, and mergeability-like scoring. SWE-bench scores are not trying to account for whether the code will be maintainable by other people in future, or whether it matches quality considerations that aren't covered by the unit tests. Perhaps AIs really are performant according to SWE-bench-like scoring but not according to the kind of more holistic scoring we might care about.

19:18

Low- versus high-context baseliners: as I mentioned previously, these study developers are just much more skilled, higher-context humans, and relative to those humans perhaps the AIs are less capable. Task distribution: maybe these are just different kinds of tasks; in particular, the benchmark-style tasks are less messy than the real-world ones, and maybe that's what's explaining what's going on here.

19:36

Suboptimal capability elicitation: a huge amount of work has gone in at METR to making the agents as performant as possible, given the underlying models, on our kinds of tasks, and that involves churning through a load of AI tokens. Perhaps that was less the case for Cursor in particular at the time we completed the study.

19:55

And then, interdependence across tasks. Maybe it's the case that humans can complete task A and task B, while the AIs can only complete task A but not task B, though they can of course do task A faster. Then it can still make sense for humans to do both task A and task B, and not delegate task A, because they need to know the outputs, they need to know how task A was completed, in order to reliably complete task B. I think that's part of what's going on: you need to maintain context as you're working through these subtasks.

20:25

Lastly, I will say that we are hiring, not just for the kind of work you've seen here being extended, ever longer tasks, ever more ambitious RCTs, even more sources of evidence from which we can triangulate the truth about AI capabilities, but for much more besides. You can find this at metr.org/careers. In particular, I'm excited about research engineers and research scientists who might be hiding in the current audience. We're excited not just for research types with academic experience, but very much for scrappy startup people as well. And we're also hiring for a director of operations.

21:02

And with that, thank you very much for listening.


