METR's Benchmarks vs Economics: The AI capability measurement gap – Joel Becker, METR
Hey guys, thank you so much for having me. My name is Joel Becker. I work as a researcher, or member of technical staff, at METR, which stands for Model Evaluation and Threat Research. As we'll see in a second, I'm going to be talking about AI capabilities: how do we know how performant AIs are today, and how performant they might be in the near future, from two different sources of evidence that seem to give somewhat conflicting answers? I could have done this whole talk without reference to METR papers in particular, but we'll look at two papers I've been involved with as examples of benchmark-style evidence and then more economic-style evidence. On the benchmark side, "Measuring AI Ability to Complete Long Tasks" is the paper that comes with the charts many of you will have seen on Twitter and so on, the ones METR is well known for. The second is an RCT measuring how allowing AI affects developer productivity. And then we'll talk about how to reconcile the gap that's implied between these two different kinds of measurements.
As I mentioned, METR stands for Model Evaluation and Threat Research. We are an independent research nonprofit that seeks to inform the public, policymakers, and labs about the degree to which AIs might pose catastrophic risks to society. The model evaluation part means that we seek to understand AI capabilities and propensities, and the threat research part means we try to connect those capabilities and propensities to potential catastrophic risks.
Okay. The first paper we're going to talk about is associated with this chart that many of you, I think, might have seen.
Taking a step back before we dive into the paper: how do we usually think about measuring AI capabilities using benchmarks, on SWE-bench or GPQA and so on? There's some notion of 0% performance, or random performance. For GPQA that's 25%, which corresponds to the floor, the worst you can possibly do. Perhaps there's a human baseline that's below 100%; for GPQA I think this is something like 75%, representing expert human performance. And then of course you can go all the way up to 100% on these kinds of benchmarks. But what does it mean? If I'm getting 50% on GPQA, if I'm halfway from the floor to the expert baseline, what does that really tell me about how performant the AIs are? If I meet the human baseline, does that mean the AIs are now as performant as, or even more performant than, expert humans in a relevant sense that I care about? It's hard to interpret.
Another thing you see from this graph is that benchmarks seem to have less and less time between coming online, giving any signal at all, and being fully saturated. It's harder and harder to create benchmarks that have plenty of signal, that might be informative to us about how capable models are, for an extended period of time. So we're going to go about this a different way.
First, we're going to gather human baseline data for diverse tasks spanning a range of difficulties. You should think of these humans as experienced experts, but on their first day or first week on the job. These are not people with context on the tasks in particular; it's not exactly the kind of thing that's come up in their work before. But if it's a software engineering task, they're relevantly skilled general software engineers, and the same goes for the machine learning tasks and the cybersecurity tasks we'll talk about. The tasks come from three buckets or task distributions. HCAST is a collection of software-based tasks seemingly requiring autonomy: interacting with tools, interacting with the environment, thinking through the problem, not just a Q&A-style dataset. The SWAA suite is a set of atomic problems, problems that maybe GPT-2 can do and maybe it can't, like "here are four files, one of them is called passwords.txt, which file contains the passwords?" And then on the other end of difficulty we have RE-Bench, which consists of challenging, novel, open-ended machine learning research engineering challenges that are very difficult even for top human experts.
In addition to gathering the human baseline data, we'll also, under as close to identical conditions as possible, measure AI performance for the AIs we're interested in on the same set of tasks. And then we're going to convert the time it takes for humans to complete these tasks into an estimate of AI autonomous capabilities, as I'll show you in a second.
Here's an illustrative diagram, in this case for Claude 3.7 Sonnet, which was the frontier model at the time this paper came out. You can see that for the very short tasks, something like 4 minutes or below, Sonnet is getting the answers correct essentially 100% of the time, or maybe even here literally 100% of the time. For the very hardest tasks it's struggling, and then there's some range where we're somewhere in the middle, between 10% and 90%. I'll say that this empirical pattern, where models are less performant at tasks that take humans longer, is not a fact of nature, but it's something we see pretty commonly and pretty robustly across models, at least on this task distribution, and I'd conjecture for other task distributions as well. So we try to fit this dark purple line to the data on how long it took humans to complete the tasks that the models are attempting. And then we call the point on the x-axis, this human-time-to-complete axis, at which we predict the model will succeed 50% of the time, the time horizon of that model. There's much to debate in the 50% number; I can talk later about the reasons why we chose it. And then we'll do the same exercise for the other models.
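As a minimal sketch of that fit (made-up data, not METR's actual code or estimator): fit a logistic curve in log human-time to the per-task success outcomes and read off where predicted success crosses 50%.

```python
# Hypothetical sketch of a time-horizon fit. Assumption: for each task we have
# the human completion time (minutes) and whether the model succeeded; we fit a
# logistic curve in log-time and solve for the 50% crossing point.
import numpy as np
from sklearn.linear_model import LogisticRegression

human_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])   # made-up data
model_succeeded = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0])          # made-up data

X = np.log(human_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, model_succeeded)

# p = sigmoid(a * log(t) + b) = 0.5  =>  a * log(t) + b = 0  =>  t = exp(-b / a)
a, b = clf.coef_[0][0], clf.intercept_[0]
time_horizon_minutes = np.exp(-b / a)
print(f"50% time horizon = {time_horizon_minutes:.1f} human-minutes")
```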
Here, Claude 3 Opus has a time horizon of something like 4 minutes; that's where we predict it has a success probability of 50% on this task distribution. For o1-preview I'm seeing something like 15 minutes, and so on. And then of course all these models come out over calendar time. So if we plot the time horizon, the x-coordinate on this set of plots, against calendar time, we find something like this. It looks kind of like an exponential trend going up at some constant rate. In fact, it doesn't just look like an exponential trend: a perfectly straight line here would indicate a perfectly exponential trend, and we see something really remarkably steady, actually much more steady than we were anticipating when we went about doing this research project.
And that's continued to be the case. Many of you will have seen updates that we've made to this graph on Twitter. This goes all the way up to GPT-5.1-Codex-Max, so it's extremely recent, and the predictions from this shockingly straight line have held up very well, I think.
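A minimal sketch of the doubling-time calculation behind that straight line, with made-up numbers: fit a line to log2(time horizon) against release date, and the inverse slope is the doubling time in months.

```python
# Hypothetical sketch with illustrative numbers, not the paper's data.
import numpy as np

months_since_start = np.array([0, 12, 24, 36, 48])   # made-up release dates
horizon_minutes = np.array([1, 4, 15, 60, 240])      # made-up time horizons

slope, intercept = np.polyfit(months_since_start, np.log2(horizon_minutes), 1)
doubling_time_months = 1 / slope
print(f"Doubling time = {doubling_time_months:.1f} months")
```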
Taking a quick step back, what are benchmarks, or this kind of benchmark-like evidence, telling us? Well, one thing is that AIs can succeed at what for humans would be exceedingly difficult tasks. The tasks in RE-Bench are really far beyond my capabilities personally, and the AIs are having a good crack at them some decent percentage of the time. And the second, which is kind of obvious, is that progress is rapid.
On the other hand, how much stock should we put in the evidence suggested by benchmarks? What limitations might they have? Lots, but here are three that I'll note. One is, as I mentioned, that these are humans who are expert in some relevant sense, but low context. It's something like their first week on the job: they haven't seen tasks exactly like this previously, they just have some relevant experience. Presumably people who not only have the relevant experience but are also highly familiar with the set of tasks would complete the tasks even sooner, and relative to those people we'd think the AIs look less performant.
The second is that benchmarks can have a low ceiling. Even GPQA, to use that example again, is getting to the point where it's totally saturated, providing no additional information about marginal models, whereas time horizon provides a nice way to chain benchmarks together, in some sense, over time. But nonetheless, it's still very hard to create ever harder tasks when the time horizon of models is doubling every six to seven months or so. So even time horizon, or the benchmarks underlying time horizon, might be saturated before too long.
And the next one is not a concern limited to the METR tasks behind time horizon; it's also true for SWE-bench, and for many of your favorite agentic benchmarks: the problems aren't very messy, in some sense. They don't require a ton of coordination with humans. They're often in relatively small, contained environments where not much can go wrong; they're not these massive open-source codebases, or other settings where the problems involve more interaction with the real world or are otherwise messy.
So we did this project, and then early this year we were trying to think about how we could attack some of these limitations. What's a different source of evidence that might have its own pros and cons but, importantly, be more externally valid, in the scientific jargon?
Perhaps field experiments are the answer. So, more economic-style evidence. Here we might be interested in very high context developers who are expert on the kind of tasks they're already doing, and in speedup, or some notion of productivity boost, which seems to retain more signal even through the range that benchmarks would call superhuman. Perhaps GPQA is fully saturated and you're getting a 1.5x or 2x speedup, but you can still achieve a 3x, 4x, 5x speedup after that, so we maintain more signal. And the last point is that the tasks are messier. They are tasks that come up in people's real work. They're not synthetic, they're not small and contained. This is a real deployment scenario.
Here's what we're going to do for this paper. We're going to gather 16 experienced developers on large, mature open-source projects that we'll go through in a second. Each of these developers will, on average, complete about 16 tasks from their real work. These are issues on the relevant GitHub repositories, the kind of thing they might otherwise have completed, with the caveat that we're not going to include the very longest issues.
The tasks will be randomly assigned to AI-disallowed or AI-allowed. AI-disallowed means what you think it means: software development in 2019. No AI-powered tab autocomplete, no Cursor agentic coding tools, no LLMs via the web UI. Or the task can be randomly assigned to AI-allowed, in which case everything's on the table: any of the AI tools I just mentioned, or not using AI tools at all. If you're in the AI-allowed condition, you're not compelled to use AI; you just have the option. And we buy these developers Cursor Pro, so for the most part that's the tool they're using, typically with Claude 3.6 or 3.7 Sonnet, which was the frontier model when we conducted this work. And then we're going to record the time it takes for the developers to complete each task and see the degree to which they might save time when AI is allowed versus when it's not.
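A minimal sketch of that comparison (made-up data; the paper's actual estimator is more involved): regress log completion time on the treatment indicator and exponentiate the coefficient to get the multiplicative effect on time.

```python
# Hypothetical sketch: a multiplier above 1 means AI-allowed tasks took longer.
import numpy as np
import statsmodels.api as sm

ai_allowed = np.array([1, 0, 1, 0, 1, 0, 1, 0])              # made-up assignment
hours = np.array([2.5, 2.0, 3.1, 2.4, 1.8, 1.6, 4.0, 3.2])   # made-up completion times

X = sm.add_constant(ai_allowed)
fit = sm.OLS(np.log(hours), X).fit()
effect = np.exp(fit.params[1])
print(f"AI-allowed tasks take {effect:.2f}x as long (values above 1 mean a slowdown)")
```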
These are some of the repositories; many of you will be familiar with them. We've got the Haskell compiler represented, we have scikit-learn, we have Hugging Face Transformers. These are, on average, a million-plus lines of code, and they've been around for 10-plus years. The developers who are going to be working on these repositories as part of this study are, on average, the third-top contributor out of hundreds or, in some cases, thousands of contributors, and they personally have been contributing to the repository for something like five years on average. These are top experts.
Some of you might have seen this graph too, so the punch line's already been spoiled for you; for the rest, here it is. We asked economics experts and machine learning experts (people at major AI companies and labs, top academics, some graduate students, and so on) how much they expect developers to save time when using AI. They say something like 40%, or a little less. We ask the developers themselves, the study participants, how much they expect to be sped up ahead of time, and they say something like 24 or 25%. Then we ask the developers, after the study has been completed, how much they think they were sped up by AI being allowed on the issues they completed as part of this study, and they say it will have sped them up by something like 20%. And the punch line is that we find developers are slowed down by 19%: they take 19% more time when AI is allowed relative to when AI is not allowed.
When I first saw the data coming in, saw early versions of this plot, I presumably thought the same thing many of you might be thinking right now: that we've messed something up, that something's gone wrong, that there's some issue in how we've set up the experiment. How could it possibly be the case? At the very least, these developers have access to the zero point, because they can simply choose not to use AI at any time. So we pored over many, many hours of screen recordings from these developers working on issues as part of the study. We dove into a bunch of hypotheses that might explain what's going on and tried to categorize the things we think are contributing versus not. Much of this is listed in the paper; I'll just quickly go through some of the things we think are contributing.
First, overoptimism about AI usefulness. That seems like an obvious one: even after the study is completed, the developers think AI is going to be helpful to their work, so it makes sense that they might overuse AI on that basis. Two more: implicit repository context and high developer familiarity. These developers come to these problems already knowing the solution; they're so expert in this work that I imagine them not needing to spend a bunch of time thinking through the solution that the AI could work through. Instead, they're just limited by how fast they can type, which means that using AI, instructing AIs to do it, comes with a significant time cost versus how they might otherwise have spent their time.
I think many of us have the sense that AIs might be less performant on large and complex repositories, which is a difference from the benchmark-style evidence, or from some previous work. And then low AI reliability. Maybe the AIs are very performant on these kinds of tasks, but only 50% of the time, or 80% of the time, or 20% of the time. So at the very least you need to check their work afterwards, and perhaps you even need to spend time correcting their work afterwards, which is something we see quite a lot on these issues.
One thing from the factors with an unclear effect that I'll mention briefly, and am happy to talk to people about later, is below-average use of AI tools, which came up in the public discussion. This is in the unclear column because there's evidence both for and against, and that's true for many of the things here. We don't have anything so conclusive to say; we're still working on this line of work.
Here are some caveats, all important. First, obviously, we do not provide evidence about all software developers or tasks. These are extremely experienced developers working on extremely complex, long-lived open-source repositories. In my own work, I'm not as expert in the relevant sense as these people are, and I'm working on much smaller repositories. I feel more comfortable saying that, even at this point in time, I was sped up by AI tools, even if these developers weren't. This setting is weird, and it's weird for the same reasons that it's interesting: this unusual developer population.
Second, the experiment is concentrated in March 2025. As I mentioned, we know that AI progress is rapid; perhaps this result will already have changed by the time I'm giving you this talk.
So there's a kind of puzzle suggested, right? The benchmark-style evidence gives a very impressive sense of what AI capabilities look like today, whereas the more economic-style evidence, and I include labor market impacts here too in addition to our field experiments, looks somewhat more bearish, or unimpressive. Why is the former not translating into the latter? At least naively, there seems to be a clash. How might we go about resolving this puzzle?
One possibility is that, in fact, we messed something up. This is still live and on the table. Maybe the developers really are not very capable at using AI, and if we continue to run this experiment, as in fact we are, they'll gain more familiarity with the tools and so get productivity benefits they weren't getting at the time. I'm a little skeptical of that story, but that's one possibility.
Another, which economists like to bring up, is that we're not incentivizing these developers to finish quickly: we're paying them by the hour, which we do for external validity reasons. Looking through their videos, I really do not think they're developing differently in accordance with these incentives, but that certainly is one possibility that's on the table.
Another possibility, more statistical in nature, is that this is a small study, and you shouldn't over-update so much from small studies. We are doing bigger things that I'm excited to release at some point. Okay, but let's assume we haven't messed something up and this is a result that we think does hold up. How could we resolve the puzzle?
One possibility, as I alluded to briefly, is that reliability needs to be very high to save time. You need to be getting the answers to the problems that developers are putting in correct something like 95 to 99% of the time in order for developers to tab-tab-tab through and not spend lots of time verifying the AI's work, which of course is pretty costly from a time perspective.
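A back-of-the-envelope sketch of that argument, with entirely made-up numbers and my own simplified model: below the reliability level where you can skip review, the review and rework costs can eat the time saved.

```python
# Made-up numbers (minutes per task). Below some reliability threshold you must
# review every output and rework the failures; above it you can accept outputs
# without careful review.
def minutes_with_ai(reliability, prompt=3, review=8, rework=25, skip_review_at=0.98):
    if reliability >= skip_review_at:     # trusted enough to "tab tab tab" through
        return prompt + (1 - reliability) * rework
    return prompt + review + (1 - reliability) * rework

minutes_solo = 20  # made-up cost of doing the task unaided
for p in (0.5, 0.8, 0.95, 0.99):
    print(f"reliability {p:.2f}: {minutes_with_ai(p):.1f} min with AI vs {minutes_solo} min solo")
```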
Another possibility is the difference between SWE-bench-like, or algorithmic, scoring and something more like mergeability-based scoring. SWE-bench scores are not trying to account for whether the code is maintainable by other people in future, or whether it meets quality considerations that aren't covered by the unit tests. Perhaps AIs really are performant according to SWE-bench-like scoring but not according to the kind of more holistic scoring we might care about. Low- versus high-context baseliners: as I mentioned previously, these developers are just much more skilled, high-context humans, and relative to those humans, perhaps the AIs are less capable. Task distribution: maybe these are just different kinds of tasks, in particular messier than the benchmark-style tasks. Maybe that's part of what's going on here. Suboptimal capability elicitation: a huge amount of work has gone in at METR to making the agents as performant as possible, given the underlying models, on our kinds of tasks, and that involves churning through a load of AI tokens. Perhaps that was less the case for Cursor in particular at the time we completed the study.
And then, interdependence across tasks. Maybe humans can complete task A and task B, while AIs can only complete task A, not task B, though of course they can do task A faster. It can still make sense for humans to do both task A and task B, and not delegate task A, because they need to know the outputs, they need to know how task A was completed, in order to reliably complete task B. I think that's part of what's going on: you need to maintain context as you work through these subtasks.
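A toy illustration of that argument, with made-up numbers: delegating the subtask the AI is fast at can still lose time once you price in recovering the context you would otherwise have picked up by doing it yourself.

```python
# Toy illustration with made-up numbers (minutes). Doing task A yourself is
# slower, but the context you gain makes task B cheap; delegating A means
# paying a context-recovery cost before you can reliably do task B.
human_A, human_B = 30, 60          # human does both, picking up context along the way
ai_A, context_recovery = 5, 45     # AI finishes A fast, but B now needs extra ramp-up

do_both_yourself = human_A + human_B                  # 90 minutes
delegate_A_to_ai = ai_A + context_recovery + human_B  # 110 minutes
print(do_both_yourself, delegate_A_to_ai)
```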
Lastly, I will say that we are hiring, not just for the kind of work you've seen being extended here, ever longer tasks, ever more ambitious RCTs, even more sources of evidence from which we can triangulate the truth about AI capabilities, but for much more besides. You can find this at metr.org/careers. In particular, I'm excited about research engineers and research scientists who might be hiding in the current audience. We're excited not just for research types with academic experience, but very much for scrappy startup people as well. And we're also hiring for a director of operations. And with that, thank you very much for listening.