How METR measures Long Tasks and Experienced Open Source Dev Productivity - Joel Becker, METR
Here's the very simple argument. If you look at some notion of compute over time, this could be R&D spending on compute, experimental compute, training compute, whatever some particular lab is using, it goes like this: no surprise. And if you have another chart of, say, log time horizon over time, this METR measure from the figure that many of you will have seen on Twitter, it looks similar. Now say that this was not merely a coincidence but that these things were causally proportional, in the sense that if compute growth were to halve then time-horizon growth would halve. So for the sake of argument, say that starting from 2028 or so the compute curve begins to bend like that, where this would be no growth and this would be the original growth, something like half. If they were causally related, and in particular causally proportional to one another, then you'd expect this curve to bend too, and then for some milestone that you care about, let's say a time horizon of one work-month up there, the delay implied in AI capabilities is potentially enormous.
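To make that "delay is potentially enormous" point concrete, here is a toy calculation, assuming for illustration a 2-hour current time horizon, a 6-month doubling time, and a one-work-month (roughly 167 hour) milestone; none of these numbers come from the talk, and the proportionality itself is the contestable assumption under discussion.

```python
# Toy illustration of the proportionality argument: if halving compute growth
# doubles the time-horizon doubling time, a distant milestone arrives much later.
# All numbers are illustrative placeholders, not METR estimates or forecasts.
import math

current_horizon_hours = 2.0     # assumed current 50% time horizon
milestone_hours = 167.0         # roughly one work-month
baseline_doubling_years = 0.5   # assumed baseline doubling time

def years_to_milestone(doubling_time_years: float) -> float:
    doublings_needed = math.log2(milestone_hours / current_horizon_hours)
    return doublings_needed * doubling_time_years

print(f"baseline growth:       {years_to_milestone(baseline_doubling_years):.1f} years")
print(f"compute growth halved: {years_to_milestone(2 * baseline_doubling_years):.1f} years")
```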
Now, lots of people have stipulated that there might be some slowdown in compute growth. I'm not an expert in those forecasts, but the prior reasons do seem somewhat strong to me. One is physical constraints: we might hit power constraints, as you mentioned, or various other ones that Epoch have a report on, all of which seem not to bite through 2030 but potentially could bite sometime after 2030. I think the more likely one is just dollars as a constraint: large tech companies can only spend so much at a certain point, large nation states can only spend so much. I guess there are some scenarios in which you can continue going, but that seems to naturally imply this slowing down. And then the additional point that this paper is trying to make is that, under a very contestable but standard assumption from economics, you should in fact expect these two to be causally proportional, in particular to the extent that, or for the period that, a software-only singularity is not possible. That's a whole other discussion we can talk about. But at least in this kind of somewhat business-as-usual scenario, or until that scenario no longer applies, I think this is maybe a reasonable model, and it does imply some slowing of AI capabilities in the near future.
I have no plan for this session whatsoever.
>> That also assumes we don't have a technological advance that dramatically improves capabilities, like some unpredictable technological advance, right?
>> Yeah. I mean, all predictions assume no unpredictable advances. Time horizon, or in general in AI, straight lines on log-linear plots, have been, I think, a very highly underrated forecasting tool. They've done extremely well over now many orders of magnitude. I think it's reasonable to have the default expectation that the log-linear lines continue through approximately the same number of orders of magnitude, except maybe if there's some significant break in the inputs. Of course, on the upside there could be something quite dramatic. A software singularity is the first thing that comes to my mind, but another Transformer-style moment seems like another natural candidate.
>> Of course, one of the problems with testing this will be that at some point the models will eclipse the maximum possible length of the tasks you're able to put in the evaluation set.
>> Yeah. So there are some ways around this that we're working on; I'd be excited to talk about that, though they all feel pretty early. But I think it's right that if time horizons are doubling, eventually the doubling time is such that you can't possibly make long enough tasks in the relevant window.
>> It's possible also that we actually hit a place where time horizon is no longer a useful measure, because now you want total time to decrease: what you want is the same result at a lower time.
>> Oh, one...
>> You want higher reliability at a lower time horizon.
>> One thing to say about time horizon is that there are two notions of time here: a human, calendar-time axis, and the time that the model is working for, which I think you should approximate as zero. It's not actually zero, they are taking actions, but they largely do their successful work pretty early on, to the extent they're going to be successful on tasks. So my guess would be that it will continue to be the case that there's not much extra juice on the margin of making the models complete tasks more quickly, although reliability, very much so, obviously.
>> So most of the time is spent in the human-machine iteration loop?
>> The humans are working without AIs and the AIs are working without humans, so for the humans it's all human time.
>> Yeah. Yeah.
>> Cool. Any questions on METR's work? I can go through some upcoming things that we're excited about, if people are excited about those things.
>> Yeah, I just have one question, about the perceived time, the time-perception piece.
>> Yeah. Yeah.
>> One thing I thought of, and you brought it up a little bit in the paper, is whether or not familiarity is a confounding factor.
>> With tools, you mean?
>> Yeah, tool familiarity is one factor, and of course you also brought up that tool capability has dramatically changed. But there was an interesting presentation from Meta at the developer productivity engineering summit this year. They probably have the best infrastructure for quantitative measurement of developer experience of any company in the world, and they're able to tell you basically how long it actually takes to make a PR (they call it something different at Meta), how much actual human time and effort it took to make a PR. What they saw when they gave people agents was a J-curve, and that J-curve was, I don't remember how long, three months or six months. So one of the things I wonder is whether there's a cutoff in how much familiarity the person has, like whether they've been using this as their full-time daily driver for a period of months, and whether there's a cutoff that occurs once a certain level of familiarity is reached.
>> Yeah, I'm totally on board with J-curve-style explanations being a real thing, not just in this case but in many economically relevant cases outside of software engineering too. Developers, and not just developers, experiment with tools. You tend to be slower the first time you're experimenting with a tool, but if you're doing it so that you get some investment benefit later on, you might become more proficient with the tools, or in the case of AI, maybe you just expect the models to get better, so even if you don't become more proficient it's still the kind of thing you want to do. Those explanations broadly make sense to me. I can give you some reasons why I'm skeptical here, though. One thing to say, as background, is that we're continuing with this work, and we'll see. Another thing to say is just that, quantitatively, the difference between this and this is very large. So I ask, how much is the J-curve explaining? I think it's not explaining that much.
>> Let me explain why, because we see this over and over in software engineering studies: the one question you can't ask people in a survey is how long a task took. You can ask people how much more productive they felt, and they will give you an accurate response that correlates with quantitative feedback. But ask anybody the amount of time that something takes and they are almost always wrong. So when I shared this with my colleagues, I said I'm not surprised about that at all; what is interesting is how much of it is the slowdown aspect.
>> Yeah, point well taken, that makes a lot of sense. Despite this, we're interested in time estimates because we're interested in providing...
>> Yeah, I mean, the perceptual side, I do think that's relevant too, also because of the hype aspect, right? Developers will tell you that they were faster when they weren't, and I think that is worth knowing.
>> And to the extent that we're interested in measuring the possibility, timing, and nature of capabilities explosions, or of AI R&D being automated, one commonly proposed measure is just to ask developers or researchers how much they're being sped up, and for exactly the reasons you're pointing at, I don't put a lot of faith in those estimates. So it's nice to see it measured like this. Yeah. Some more J-curve things. The forecasters are not predicting time to complete, right, they are just predicting this effect size. The expert forecasters are told the degree of experience these developers have, and some of the forecasters, in thinking about how this population might be different from other populations, point out various facts about the study: they're more experienced, and I expect experienced people to get less speedup; or the repositories are larger, and I think AIs are less capable at working on large repositories, so I expect less speedup. They never mention familiarity with tools. My sense is that they share the sense I had ahead of time, which is that most of the action is in understanding what kinds of things AIs are good at or bad at in the first place. And all of these developers have experience with LLMs in their core development workflow; it's just Cursor that three-quarters of them are totally unfamiliar with at the start of the study. So I just wasn't seeing much margin there. I think it is an open question. I'll also say, we watched so many hours of screen recordings of these developers working, and I just do not see it: I think they're prompting very reasonably, in some cases worse than me and my colleagues, in some cases better. I'm not seeing these advanced workflows that they're failing to access.
>> Yeah. And my experience is not that far off from this: there are times when I am dramatically slowed down and times when I am accelerated.
>> Yep.
>> Although, as my familiarity with the tool increases...
>> Yeah.
>> ...I definitely improve a lot, because I learn over time what I can tell it to do and what I can't tell it to do.
>> Yeah.
>> In addition to it just getting better, it's understanding that, okay, now I need to plan, and so on. That's why, before you make a high-level architectural decision that you know is going to blow up in your face ten conversation turns down the line, you really try to think about it.
>> Yeah, exactly.
>> And also scope it down to a smaller problem. At first I would try problems that were too large, and it can't handle that.
>> But just for the future, if you ever do... I mean, I think it's obviously really hard with the 16-person sample size. But in the future, I think trying to figure out whether there is a familiarity cutoff where the number changes would be interesting, to see if that Meta result generalizes outside of Meta.
>> We are on it, I think. The AIs have been getting better during this period, which is going to compound a lot of what's going on, obviously, but yeah.
>> The thing is, the projects themselves are very optimized for people coming onto new projects and figuring out how to navigate them. The ones that struggle to be organized well enough for humans to come on board and navigate them quickly don't survive very long in the open source ecosystem, and these are fairly mature open source projects. They're a little bit different from enterprise settings, where things survive because they make money even if they're a pain to develop on, right? So the context is a bit different.
>> These are the repos, yeah. That is a really interesting point, because actually some of the repos that I was helped the most with were ones that I was completely unfamiliar with, which had no decent documentation of any kind, where I had to come in to a legacy code base that had existed for years and make a change, and the developer who owned it was only partially available to answer my questions. In that case Claude Code was a huge help.
>> Yeah. Legacy code bases don't exist because they work well. It's because they make money.
>> Yeah.
>> The question I had was: did all the developers have the same level of AI familiarity with Cursor, or was there some variance? Is there a plot of each of their familiarity...
>> There's always a plot.
>> ...a plot that could dig into the question of whether there is a J-curve?
>> Yeah, so here's some evidence.
Okay, I can show you some plots. I think the sample size is just small enough that you shouldn't really believe any of them. I think the plots aren't going to show much, but I don't want to say that's strong evidence that this isn't something that's going on; I just think the evidence is kind of weak. The thing that really convinces me is watching the videos. Watching them work, often they're better at using Cursor than me, and I'm working on this project using Cursor. But here are some graphs. This one is split by whether they have various types of AI experience coming into the study, and basically you see no movement in point estimates: people for whom Cursor was a primary IDE before versus people for whom it was not, there's not a huge amount of difference. Then the next one: you might think maybe some J-curve cutoff comes after that point, but still, within the study there's some variation in how experienced people are with AI, because they have multiple issues; after the second AI issue they're slightly more exposed than after the first. So you might try excluding those earlier data points over time and seeing what pops up, and they don't seem to get better at using AI over time.
>> Although I think there's probably a statistical issue with that.
>> You think there's probably what? Sorry.
>> There's probably a statistical issue with that plot right there. Those error bars are very, very wide.
>> Oh, I think all of the plots outside of the main ones, all of these subset analyses, are things you should not put a lot of stock in. Yeah, I totally agree. Okay, and then a lot has been made of this next one. This graph is the reason we filed it under unclear evidence, because things point in different directions. A lot has been made of this plot suggesting something J-shaped, in particular that at the end, once people have more experience, they do see some speedup. Here are some issues. First, the other plots don't show it, and I think that's important to include. Second, these hours are coded very conservatively. For instance, someone in the 30-to-50-hours bucket had Cursor as their primary IDE in 2024, had recorded themselves on their time-tracking software as having spent 140 hours using Cursor, and conservatively estimated that they'd spent 50 hours using Cursor, so they end up in our 30-to-50-hours bin. This is someone whose primary IDE was Cursor last year. And people have been commenting that these developers had been using Cursor for less than a week; I think that's not a very fair assessment. If you were to move that developer from the penultimate effect-size estimate to the last one (and again, you shouldn't believe this because of the statistics), then you see some balancing out, where you get back to essentially zero in the last bucket. So, again, I think J-curve explanations are still very much on the table.
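A toy numerical illustration of that bin-shifting point, with invented per-developer numbers rather than the study's data: at this sample size, recoding a single heavy Cursor user from the penultimate bucket into the last one moves the last bucket's average substantially.

```python
# Invented per-developer slowdowns, where slowdown = time_with_AI / time_without_AI - 1.
# This only illustrates how fragile tiny per-bucket averages are; it is not study data.
import statistics

penultimate_bucket = [0.25, 0.30, 0.20]   # the 30-50 hour bucket (three developers)
last_bucket = [-0.15]                     # the >50 hour bucket (one developer)

print("last bucket before:", statistics.mean(last_bucket))

# Recode the conservatively-binned heavy Cursor user into the last bucket.
last_bucket.append(penultimate_bucket.pop(0))

print("last bucket after: ", round(statistics.mean(last_bucket), 3))  # back near zero
```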
>> Is it not likely, though, that the 50-hour group is also similarly underestimating the time they've spent using Cursor, and that if you just had a longer scale you would still see a degree of...
>> Oh, that is an interesting point. That seems plausible to me. Then I guess I'm not sure it's an underestimate, because we're using this very conservative coding...
>> Yeah. Yeah. Totally.
>> Yeah, I think that seems plausible to me. And then, for this not to be strong evidence, I'd retreat back to: I think you shouldn't really believe any of these.
>> Yeah. I think the biggest thing is the small sample size, and there's also a lot of bias in the data set, effectively, right? It's a certain kind of data set. It's open source.
>> You mean the kinds of developers?
>> Yeah, open source developers, and also working on open source projects that are pretty mature.
>> Yeah. Those two things mean that if you're working with open source developers on projects that are pretty mature, this is probably reasonably indicative, maybe, though the sample size is pretty small; but outside of that it gets a little harder.
>> Yeah, and we talk about this. I think this group is really weird, and it's really interesting; it's interesting for the same reason it's weird. We were interested, again, in studying possible effects of AI on R&D speedup or automation. If any type of developer is not being greatly sped up, that implies the whole pipeline isn't being sped up. So it is kind of interesting to look even at particular weird populations. You might imagine that large production inference code bases, say, have a bit more of this shape than scrappy experiment scripts do.
>> Yeah. Yeah.
>> But yeah, it's totally...
>> No, I think it's very interesting. It's just hard to generalize. We just don't know.
>> Yeah.
Yeah, we are doing this larger study, and I think, unfortunately, even after the larger study, which includes more greenfield projects, it's still going to be hard to generalize, for similar reasons. Yeah.
>> Although I don't feel like your results are particularly contradictory with any actual independent research that's been conducted. The only research that I've seen that I would say is contradictory to yours is research that has been funded by model shops or agent shops.
>> What can I say about that? I do think that most of the research that's put out is associated with large tech companies, and I think there are other methodological concerns with those studies as well.
>> I have methodological concerns with that research as well, and I know people who work at some of those places who have methodological concerns with the work that was put out.
>> I mean, I think there are concerns about ours also.
>> Sure, sure. But I actually feel like... I remember somebody sent me your paper, and when I saw the headline I was like, no way.
>> Well, me too.
>> I was like, that sounds like BS. Then I read the paper and I was like, oh, this doesn't suck at all.
>> Maybe a little bit.
>> Well, no. At least your high-level conclusion is both intuitive, for a person who's read a lot of software engineering research, and well justified. I have had people argue with me about the 16-developer thing, but I don't think that actually matters in this particular case, because they're actually a fairly good control set, more or less, for an experiment: they remove a lot of validity concerns by being experts. So it's true that they don't represent the broad spectrum of developers, but they also remove a lot of the variance you would expect from the population, and they serve a sort of epistemological function of, hey, let's isolate that factor and see what happens. I like that, and I thought the way the study was conducted was completely sufficient to draw the high-level conclusion that it draws.
>> Thank you very much. Here's a curiosity. We haven't published this, for organizational reasons that I won't go into, but we did conduct it. People would throw out their various explanations for what's going on here, many of which have lots of merit, some of which I'm more skeptical of, and a natural one is brownfield versus greenfield projects. So we ran this kind of enormous hackathon where we randomized half of the teams to use AI versus not, about as maximally greenfield as you can get. Then we had a bunch of judges score them, many judge scores per project, to try to even out that noise, and we'd see: is it the case that the bottom 50% are all in the AI-disallowed group and the top are all in the AI-allowed groups, or something like that? Now, unfortunately, it was even smaller; that's part of the reason we're not publishing it. I think the evidence is really quite weak; the degree of overlap is enormous. The point estimate, and I'm a bit nervous about saying this because it hasn't gone through the kind of review processes that something like this goes through, so maybe I messed something up, is something like four percentile points higher if AI is allowed versus if it's not, after controlling for everything else. That is extremely noisy and you shouldn't draw any conclusions, but seemingly maybe a kind of small effect from AI. Yeah.
>> So one question I have, and I guess this is related to the study and also to other research that you guys have done: have you explored the effect of AI in other domains, outside software engineering specifically? And if so, have you also found this kind of surprising result where maybe there's no speedup?
>> No, I mean, those would be new directions, stuff that we have not done. We're interested in understanding the possibility of accelerating R&D, and coding is not the only kind of thing that happens at major AI companies; much more conceptual work happens too. I'd be very excited about working with math PhD students, or very different types of software developers, or running these kinds of studies inside major AI companies or large tech companies, or something like that. We're very interested in things that are, not necessarily directly, but somewhat close analogies to the large AI company case. To the extent that something really deviates from that, we're probably less interested.
>> Interesting. So I guess it sounds like you're interested in measuring capabilities for math research and some other research.
>> Yeah, I'd say I'm interested in what the hell is going on in AI, and how I'm going to learn the most about what the hell is going on in AI. Something a bit more conceptual, something where fewer humans are currently working on it, so it appears less in training data, will help me better triangulate the truth about what's going on in AI. Even if I don't care about math research in particular, it'll still let me draw helpful qualitative lessons, is the sense I have.
>> Yeah. I mean, if I was going to pick an area where I would expect it to be more successful, but where I think it is actually being less successful, I would pick data science as an interesting one. Like, how are a bunch of data scientists being helped by AI today?
>> Say more about what you expected to be less successful.
>> So let me give you an example from a real setting.
>> Yeah.
>> Google, LinkedIn: at LinkedIn there are 5,000 tables with the name "impressions" in the table name, right? So if an analyst wants to understand how many impressions happened on a page, where the hell do they go? A human being can't figure that out.
>> Yeah.
>> Today, there is no existing AI system that we have that can be hooked into a corporate environment like that and process through it; there are trillions of rows in those tables. So what a data scientist needs to do is say, I need to analyze a bunch of data and come to a conclusion, right? And I hear lots of thoughts about building systems; people talk about natural-language-to-SQL, and the models are much better at writing SQL than they used to be. But I believe that the state of the underlying data is so bad that the actual data scientist is going to get way less value out of the AI than software engineers are.
>> That is interesting. That's very curious. So one view that some more bearish people have, looking at the future of AI, is that there's so much knowledge embedded inside of companies that you're not going to pick it up from these RL training-environment startups or anything like that. Maybe it's not the natural state of things that there need to be many specialized AIs; much of the lesson of the past few years is that one big general AI seems to be more performant. But at some point in the future, when data is locked up inside of companies, we may see more of a proliferation of many more specialist models, like GPT-N fine-tuned on LinkedIn data in particular, something like that. I have one reaction that's kind of like that.
>> Yeah, I don't know.
>> And I do have a disbelief reaction too, like, ah, you know.
>> But also, there are contradictory facts. One of the problems is that all these data sets contain contradictory facts. The name of the field will be something like date_started, or time_started, right, and it will contain only a date, except it will only contain the date up until November of last year, and after that it will contain only the month, and then after that it'll contain maybe the seconds at which the thing finished. And in order to actually successfully query the data set, you, the data analyst or the data scientist, have to know what those cutoff dates were, and they're not written anywhere. Although what you could do, theoretically, is import a bunch of the SQL that other analysts have written, to try to figure out how they triangulated these things and work backwards from those reports. But today, I think, for example...
>> Sorry, I just haven't worked at a large company. People don't fix this at the source?
>> Oh, no.
>> I feel like the lesson I've learned over and over again is that data specs really matter, really matter. I've also been working in data analysis and developer research...
>> And yeah.
>> ...and the problem is that their job is to produce this report for this executive, right, not to go build infrastructure for producing this report.
>> But I'm like, if I... okay.
>> I'm with you, I live that dream every day. What you end up having to do, right, is build out the infrastructure for it; that has to be part of the job description. And the other part is you have to fix the problem at the source. I still remember having a conversation where someone said it's too difficult to fix it at the source, because there's too much complexity in all the systems that depend on the source. And I said, okay, wait a minute: you're saying it's too complicated to solve at the source, a problem that is too big for the entire organization to solve.
>> And it's somehow easier to solve downstream? Come on. That doesn't make any sense. I just think there's so much potential here, and I have not seen a lot of studies on how people who are working in that data space are experiencing AI. What's fascinating about that is that real ML is mostly data work: especially outside of LLMs, the majority of ML engineers spend most of their time doing feature curation, and trying to clean up bad data for feature curation, rather than direct model training. So, theoretically, the potential even for improving ML by enabling AI to be a better data scientist is huge. And my hypothesis is that if you went into this space, you would discover it is great at telling me how to write SQL, or how to write pandas or polars or whatever you're using; it is okay at doing very trivial things; and it fails completely at all complex tasks. I don't even know; I haven't even set up a benchmark for it.
>> Can you give me an example of a complex task?
>> Sure. Let's say a complex task is: give me the P90 of time between deployments, for all deployments that happened at Capital One.
>> It struggled at that?
>> Yeah.
>> That doesn't seem surprising to me.
>> That seems surprising, right?
>> Well, if it has reasonable context about where it would find this kind of data...
>> Right, sure, makes sense. And then, okay, so fine: give me that number, and then also make sure that you can break it down by team hierarchy, so give me it in a table so I can break it down by team hierarchy.
>> Where is the team hierarchy data? Oh, and here's a funny thing: what PRs were in those deployments? How do I actually determine when the deployment started and ended? Because it turns out that's not clear in the base telemetry, and you have to know magic to figure out when the deployments started and ended. Oh, and also, for my ability to analyze it, tell me how many PRs were in each of those deployments and which PRs went into each of those deployments. Well, guess what, the deployment system only... this isn't being recorded, right?
>> I think it is being recorded.
>> Okay.
>> Yes. But before you...
>> So then imagine the deployment system doesn't contain sufficient information about that data. Then where do I get that data? It doesn't exist in any other system, so maybe I have to go to GitHub and call the GitHub API, and the chance of the LLM or any agent figuring that out today is pretty minimal.
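For reference, here is a minimal pandas sketch of what the "simple" core of that task looks like once you assume a single clean deployments table with one unambiguous start timestamp per deployment; the table, column names, and data are hypothetical, and the point of the anecdote is that this clean starting state is exactly what real telemetry does not give you.

```python
# Hypothetical, cleaned-up version of the task: P90 of time between consecutive
# deployments, broken down by team. Real telemetry lacks clear start/end
# timestamps, team hierarchy, and PR-to-deployment mapping, which is where
# agents currently fall over.
import pandas as pd

deployments = pd.DataFrame({
    "team": ["payments", "payments", "payments", "auth", "auth", "auth"],
    "started_at": pd.to_datetime([
        "2024-01-01", "2024-01-03", "2024-01-10",
        "2024-01-02", "2024-01-04", "2024-01-09",
    ]),
})

deployments = deployments.sort_values(["team", "started_at"])
# Gap (in hours) between each deployment and the previous one on the same team.
deployments["gap_hours"] = (
    deployments.groupby("team")["started_at"].diff().dt.total_seconds() / 3600
)

p90_by_team = deployments.groupby("team")["gap_hours"].quantile(0.9)
print(p90_by_team)
```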
Hm. Yeah, relative to my colleagues I'm pretty bearish on AI progress, but I do still have some reaction that's like, ah, can't you spend a day getting this into a Cursor rules file, you know, where the hierarchy lives?
>> That's why I think it's interesting. I think it would be worth studying. I have not seen any real comprehensive study on the experience that data scientists are having.
>> If you have any ins to us running studies at large tech companies, then I am all in.
>> The only... there is a fellow at OpenAI that I was talking to, one of the speakers here, who does internal evals, and he has mentioned that he's done some work with data scientists. So he might know some people who have that data, but it's all been internal, between him and, you know, Anthropic or whoever, right. And one of the other groups I'm curious about is lawyers. I'm curious about more traditional, older professions: lawyers, doctors, and mathematicians are all really interesting to me, just because lawyers and doctors are both so constrained by a legacy history of the constraints around them and how they work.
>> Yeah, and legal issues, I'm imagining, continue to be a significant bar.
>> Yeah. And they're stodgy. I'm also interested in what the...
>> Stodginess I feel like I'm less bought into as a long-term explanation for economic outcomes. The legal restrictions continue to hold through time, whereas with stodginess, I can set up a new law firm that's less stodgy and then take the previous law firm's business, or so it seems.
>> I agree, I don't think it's persistent. I just think it's interesting. One thing that would be interesting to see is whether that affects the mental model they have today, like how they've been talked to about it, or how their trust in it affects how they use it.
>> It would be interesting to know. I don't know that it's a worthwhile study; it's more one of those things I wonder about idly. You take a lawyer who just got out of school and has spent a lot more time using ChatGPT, and you take a lawyer who's been in the business for 50 years, who has a giant folder full of Word docs containing all the briefs that all their junior associates have written for decades and decades, and he just opens up those briefs, changes a few words, and sends them out to the judge, and he's known those judges for 30 or 40 years, he knows exactly what they want. Is he going to get any value? But is there value he should get? Is there some way he would be helped by AI? I certainly know discovery in law is a huge, huge problem, and I know there's Harvey; I don't know anything about what success they've had, and a lot of people are working in that space specifically, it's an ongoing thing. There's always technology for it, but the adoption of it is a very different thing.
>> That's the thing, right? Because, and I have a little bit of a legal background, one of the first things that I thought of the first time, when GPT-3 came out, was: oh, this could totally change discovery. Discovery is the most painful and most difficult and most expensive part; that is the expensive part of having a lawsuit. You could have serious social consequences by making discovery less expensive. So you could actually have a significant impact on society if you could make discovery cheap and instantaneous and reliable.
>> Yeah.
>> I have a question.
>> Yeah.
>> Sorry, the scatter plot, right? The first one, the 50-hours one.
>> Oh, I see. Yeah. This one?
>> Yes, that one. So you're saying that there was no difference for the developers using Cursor. We're talking about the IDE for vibe coding, and they used it for 50 hours. I was very intrigued by that, because everyone talks about vibe coding and how Cursor is instrumental to it. How did you get to 50 hours? Just curious.
>> So, how they arrive at 50 hours: this includes time that developers have spent in the experiment, plus their past experience. For some developers, working on some issues as part of the experiment, they have gotten to more than 50 hours of Cursor experience, and that's just coded up in that bucket at the end.
>> And was it the same task for each group?
>> No. These are natural tasks that pop up on the GitHub repositories, which, as mentioned, are kind of... I'm a little bit nervous about saying they're weird because of what that implies; I want to say it's very interesting and it's very weird, and it's interesting for the same reasons it's weird. These are projects in which the developers have an enormous amount of mental context built up that the AIs might not have, projects they've worked on for many, many years. I'm not sure this is always the case, but I imagine that they basically know how to execute on the particular task in front of them before they even go about attempting it, because they're so expert in the project.
>> The positive speedup, is it like 5%? How do you quantify the positive speedup?
>> So, let's go to this one instead. On the left-hand side we have the averages for what the developers say will happen, in terms of their time to complete, if their issue or task gets assigned to the AI-disallowed or the AI-allowed group. They think that if AI is disallowed it will take them a bit more time, closer to two hours, and more like an hour and a half, or a little less, if AI is allowed. But then we randomize the particular task to allow AI or not allow AI, and it turns out that if we randomize to AI-allowed, the times are more like a bit above two hours rather than a bit below two hours. And then you can think of the change in time estimate as, roughly, one divided by the other. It's not quite that, for reasons I can go into, but effectively, what is the transformation? It's something like AI-disallowed over AI-allowed, minus one.
To draw that out: I might ask, what's the speedup? Is it 1.1x, as in, these developers are going 1.1 times faster (we're actually on a time-to-complete scale, not a speed scale, but ignoring that detail)? Is it 1.5x? Is it 0.5x, so they're actually going twice as slow? How would we get that information? We'd do something like take the AI-disallowed times divided by the AI-allowed times: if the disallowed times were, say, 1.1 times as long as the allowed times, then we'd get a 1.1x speedup. It's something like that that's going on. And in fact, we find a slowdown.
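A worked version of that arithmetic, with made-up numbers in the spirit of the figures mentioned above (roughly two hours observed without AI, a bit more with it); the study's actual estimate comes from a regression on completion times, not this simple ratio, so treat this purely as the definition being described.

```python
# "Speedup" as described above: AI-disallowed time over AI-allowed time, minus one.
# Positive means AI helps; negative means a slowdown. Numbers are illustrative only.
ai_disallowed_hours = 2.0   # roughly the observed time without AI
ai_allowed_hours = 2.4      # roughly the observed time with AI

speedup = ai_disallowed_hours / ai_allowed_hours - 1
print(f"implied speedup: {speedup:+.0%}")   # about -17%, i.e. a slowdown
```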
>> I just read a fascinating article, I can't remember the company, where a journalist was allowed to do a pull request using vibe coding. There was some feature, and AI was used to assist with building out the requirements, and according to the article she practically just did a couple of tweaks and then signed off on it. It was really fascinating; that was the whole vibe coding thing.
>> Yeah, vibe coding.
>> Right, that was the whole thing: she didn't have any software development background. I was just curious whether you've tried to do a study on that.
>> So I definitely share that: if you've got no idea what's going on, then probably there is going to be some significant speedup there. I'll say, number one, it's not a priori obvious. We went out and did this hackathon with very experienced people and much less experienced people and tried to see what happened, and what we found is that the judge scores were extremely noisy, and I think you shouldn't believe them, but the judge scores were not that much higher when AI was allowed versus when it was not; people weren't actually making that much more progress. And then another thing, and I think there's going to be more expertise in this room than I have: my understanding, from sitting with these open source developers for a while, and not being a very capable developer myself, is that the quality bar on the repositories in this study is typically just very high. So I would be very surprised if a journalist, frankly even a good software engineer without lots of experience on the repository, but certainly someone who wasn't a software engineer, was able to get up a clean PR on these repositories first time. In fact, I think that's a lot of the story for what's going on here: the AIs actually do make progress in the right direction some good fraction of the time, but for various reasons, sometimes correctness, sometimes how they've tried to solve the problem, whether that's the typical way of solving it, how various parts of the project speak to one another, these kinds of considerations, they haven't properly accounted for them. And so the humans not only need to spend expensive time verifying; they also need to clean up all that stuff. My sense is that someone who didn't have all that experience basically wouldn't know how to do that step, and so wouldn't be able to submit a clean PR to these repositories. Relative to these people, at least, I suck at software development, and I'm getting up PRs internally all the time, and I think they're worse quality, though they're getting better over time. I do believe that people are coding when they otherwise wouldn't be able to code, and submitting PRs at a lower quality standard when they wouldn't have been able to do it at all. But getting up these expert-level PRs, I do feel kind of skeptical.
>> And that's actually part of what I was getting at: PRs from more novice folks often get rejected on these bigger, high-quality projects for no reason other than the developer-ergonomics impact of the PR, right? The fact that it makes it harder for me to maintain in the future. For an open source project, almost all the incentive is biased towards making it easier for me to maintain the project, so every time a PR comes in, if it doesn't make it easier for me to maintain the project, I have a tendency to reject it; if it does make it easier to maintain, then yay, I'm into it. That is unlike a typical business context, where the most important thing is actually to get something done, because the fact that someone's going to have to spend a lot of time maintaining it is almost job security. For open source it's the opposite: what causes people to leave projects is when they're difficult to maintain. So it is a different bias on what you accept in pull requests.
>> Can you remind me of the name of the English gentleman who maintains the Haskell compiler?
>> Um...
>> Simon...
>> No, I can't remember; no name recall.
>> So here's one story that might be relevant. The repositories in the study all broadly have these characteristics, and one of them is the Haskell compiler. Famously, on the Haskell compiler, there's some chance, I don't know if it's 50% or 30% or what, but there's some chance that if you submit a PR, the...
>> I'm being recorded...
>> Simon, Simon...
>> Marlow, maybe.
>> I'm not sure. The creator of the Haskell compiler will come into the comments and argue with you for many, many hours, much longer than you spent working on the pull request, until the PR hits exactly his specifications. Combine that with the remarkable fact, I think, that for the median PR in the study, the time the developers spend working on the code post-review is zero minutes. That is, the median PR is essentially perfect the first time around, because the professional incentives of these developers are like that. Now, there's a very long tail: on one of them, I think this gentleman, Simon, literally pops up and argues in the comments for many hours, and that one takes a lot longer. But yes, they are maintaining an extremely high bar.
>> I'm interested in the other upcoming stuff that you had in your talk.
>> Yeah, let's do it. So, I guess let's go in order. As I think you mentioned, if capabilities as measured by time horizon keep doubling, it does seem very challenging to keep up with that. In the short term we have a number of directions for getting on top of it, and I think that will last through the year; through two years seems challenging but I think still possible; through three years it starts to get harder and harder. Anyway, in the short term: building these much longer tasks, and ways in which we might get around the problem entirely. For instance, here's one thing that might be somewhat...
>> You could also raise the accuracy bar.
>> You could raise the accuracy bar, although the reason we're interested in this in the first place is that we're asking: is GPT-5 existentially dangerous? And the answer is no, I think.
>> But why do we think the answer is no?
>> Okay, I think there are multiple reasons, but at least we can say GPT-5 is just not that good at stuff. You're trying to get it to do data science on these columns that have very similar names, where it's not clear exactly what logic led to the columns, and it's not doing that sort of thing. And wouldn't you need to do that sort of thing, not exactly that, but be capable of stuff, to be existentially dangerous? It's not capable of stuff. So when I see the AIs failing at these hard tasks, I think, great, whatever. But if it's succeeding 90% of the time rather than 99% of the time on very challenging long tasks, then there are other arguments we can retreat to, and I'm prepared to retreat to them, but at least the naive one, that the reason it's not existentially dangerous is that it's not capable of stuff, starts to get pushed on as an explanation. So here's another possibility for something we could do.
here we had uh time horizon uh whilst
uh being
watched
very
closely
and then time. Okay. And then you know
maybe the the time horizon whilst not
being watched very closely is like this
but the time horizon whilst being
watched very closely is like this. That
is like if um uh uh you know I I I hope
that major AI companies have um
monitoring setups uh going where you
know for example AI are looking over
transcripts of of AI performing work to
see if anything suspicious is going on.
And you know it's not it won't be
impossible perhaps I conjecture to to
fool those monitors but you know be
challenging to to to fool those
monitors. So you know maybe you can do
you know 40hour tasks without 40hour you
know dangerous capability relevant tasks
um without any monitoring but only like
2-hour long tasks if you're if you're
being watched very very closely
>> being watched like if there's another
model evaluating the output. This is the
kind of thing. Maybe you escalate to
humans and
>> with them without safety controls. You
mean
>> Yeah. Yeah. Yeah. Roughly. Yeah. Um and
and then you know now now we've bought a
load more doublings. We can we can sort
of keep keep doing the capability
extrapolation thing. That's one of the
kinds of things I'm about in addition to
creating longer tasks.
>> Yeah. I imagine some of the model shops do have evaluations of capability with and without safety measures, because I'm sure there's an argument between their researchers and their safety teams.
>> Yeah. Yeah.
>> I feel like I have seen something about this, but not a lot.
>> Yeah. I guess I think this might be an especially quantitatively important consideration; I expect that it would reduce the effective time horizon by maybe an order of magnitude or two. But I agree that there are some important senses in which it's not really a difference in kind.
>> Yeah, of course. Then I would also worry that publishing that encourages people to focus less on safety, or to argue against safety measures because of how they impact capability.
>> Yeah, I think there are lots of landmines in all sorts of safety work, not just this.
>> Oh, of course.
>> Okay, next thing. We have this trend, I spoke about it at the beginning, but is it going to continue forever? Is it a fact of the universe, or does it somehow depend on inputs, or on what you think about intelligence explosions, or something like that? Trying to think about where this line is actually going is a pretty active area of work. So are the ways in which this line, or the particular points, don't quite correspond to the thing I care about. One obvious way is how these models are being judged: I think the algorithmic scoring that we use on METR tasks is importantly more robust, or covers the relevant concerns better, than might be the case with just SWE-bench-style unit tests, but it still has a lot of the same character. There are considerations, like being able to build on this work in the future, beyond the immediate problem facing you, that aren't being captured by METR's scoring. And maybe if you did capture that, you'd get something a little bit like going from 50% success to 80% success: you can do hour-long tasks if it doesn't matter whether you can build on the work, but only 30-minute tasks if it does matter. So, bringing these numbers closer to something I care about, and then projecting out, both if there are compute slowdowns and if we're going to enter some regime where AIs are building AIs and that leads to some steepening of the curve, these kinds of considerations are another thing I'm thinking about.
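One way to picture the 50%-versus-80% point: fit success probability against log task length and read off the length at each threshold. This is a hedged sketch with invented data and a generic logistic fit, not METR's actual pipeline or results; it just shows why raising the success threshold shrinks the reported horizon.

```python
# Sketch: estimate a "time horizon" as the task length at which predicted
# success probability crosses a threshold, from (task_length, success) data.
# Data and fit are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

task_minutes = np.array([2, 4, 8, 15, 30, 60, 120, 240, 480, 960])
success      = np.array([1, 1, 1,  1,  1,  0,   1,   0,   0,   0])

model = LogisticRegression().fit(np.log(task_minutes).reshape(-1, 1), success)

def horizon_at(p: float) -> float:
    """Task length (minutes) where predicted success probability equals p."""
    b0, b1 = model.intercept_[0], model.coef_[0][0]
    return float(np.exp((np.log(p / (1 - p)) - b0) / b1))  # solve b0 + b1*log(t) = logit(p)

print(f"50% horizon: {horizon_at(0.5):.0f} min")
print(f"80% horizon: {horizon_at(0.8):.0f} min")  # shorter, since the bar is stricter
```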
And then, capabilities measurement from new angles. Here's one history of METR that I think is not the accepted history, and probably not a very accurate history, certainly not the most accurate history, but here's one possible telling. Near the beginning, METR has early access to GPT-4 (I wasn't there, and I have no internal knowledge of this), and there are just Q&A data sets going on everywhere, LSAT data sets and the like, and you think: GPT-4 seems so smart relative to what went before, can it do stuff? So you try out some tasks. Can it do stuff? And the answer is, it can do some stuff and it can't do other stuff. And people say, oh, that's cool, you've tried this neat new kind of thing, getting models to do stuff instead of answering questions. Then later you think: well, different models come out over time, this model in January, that model in February; can they do different kinds of stuff if we test them on the same stuff? So we'll think of the most obvious summary statistic of whether they can do stuff, a single number that reflects whether they can do stuff, the time horizon, plot it over time, and see what happens. And that's kind of interesting. And then you ask, what's the next, in some sense dumbest or most obvious, thing you can do? Well, we'll run the most obvious RCT design: we'll allow AI or not allow AI, and we'll see what happens, and it'll be messy. There are lots of methodological problems that people point out, as there are with this work, but they are different kinds of problems, with different pros and different cons, and maybe these two different things give two different answers with two different sets of pros and cons, and we can kind of triangulate the truth from that.
And then now I'm like, well, can can we
pull that rabbit out of the hat one more
one more time? Are there or multiple
more times? Are there other sources of
evidence that have, you know, different
pros and cons that I that I won't
believe in fully, but they're different
pros and cons and they might give
different answers and so on and so
forth. Um, here are two suggestions, the
things I'm curious about at the moment.
The first is in-the-wild transcripts. Agents in Cursor, in Claude Code, and in whatever other products or services leave behind traces: the diffs they've contributed to code, records of their actions, their reasoning chains, and so on. The traces they leave in the wild are importantly different from this, where things are more contained and the task is neatly packaged. In the wild it's going to be like the example with the many confusing columns: whatever real crap shows up, and how do models handle that. There are important reasons not to believe that kind of information. It's not very experimental, and it's hard to know exactly what to make of it. But it has these important pros: it's more real, and the data on transcripts is enormous. Perhaps there's a lot you can learn there. That's one thing.
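As a sketch of what "learning from in-the-wild traces" could look like in practice, here is a toy summarizer over a hypothetical JSONL trace format; the field names (`session_id`, `action`, `diff_lines`) are invented for illustration and don't correspond to any real product's export format.

```python
import json
from collections import defaultdict

def summarize_traces(path):
    """Aggregate per-session action counts and diff sizes from a JSONL file
    where each line is one agent event (hypothetical schema)."""
    sessions = defaultdict(lambda: {"actions": 0, "diff_lines": 0})
    with open(path) as f:
        for line in f:
            event = json.loads(line)
            s = sessions[event["session_id"]]
            s["actions"] += 1
            s["diff_lines"] += event.get("diff_lines", 0)
    return dict(sessions)

# Example (assuming a file in the hypothetical format exists):
# print(summarize_traces("agent_traces.jsonl"))
```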
And then here's another one. There's this group you guys should check out called AI Village, where they have a lot of different models, or agents, living in a village, occasionally talking to humans, trying to accomplish fuzzy goals that are set for them, basically using computer use. They try to do stuff like organize an event at the park, run a human-subjects experiment, or run a merch store, stuff like that which isn't so clearly specified. And basically all the time they find that the models fall on their faces and suck. There are lots of reasons not to believe this evidence.
Here are some of the reasons. Number one, it uses computer use, and computer-use capabilities are considerably worse than CLI-based stuff, or text-based things in general, at the moment. Maybe you care more about text-based things anyway, because they're more relevant to the things you care about, and lots of GUI-based things can be converted into text things. Number two, there are all these different models hanging around in the village, and I'm like, why are there so many models? Why is there a village instead of just one big agent orchestration setup? I don't really understand what's going on there. Anyway, lots of reasons not to believe it. But on the other hand, it is models doing stuff in the world. It's not benchmark-style tasks; it's trying to accomplish some goal, and they can't accomplish even very basic subsets of the goal. I feel like that's extremely interesting. And I wonder if you could get rid of some of the most obvious cons: make it text-only, give them relevant text-based tools, work a bunch on elicitation to make the models more performant, get rid of the less performant models in the village, and so on, but then try to get them to do these fuzzy goals and just observe where they mess up. Like, step one went great, but then they became incoherent, or went into a strange psychological basin with one of the other models, or weren't able to interact with external services appropriately, or couldn't figure out their resource use. I'd be very interested, just qualitatively, in what goes on when you do that. Again, keeping in mind that, at least at the moment, I'm most interested in the ability of AIs to automate R&D, or in speaking to why that's not possible at the moment and why it might not be in the near future. Something shaped like this seems like it might usefully point to why that's not the case. Not sure exactly what's there, but yeah.
observation is that they
they are effectively neurode divergent
individuals, right? And none of our
world was not built for that. There's
everything that we have they're defined
for a human to do. They're shaped and
size to humans. Just like you know the
military, like you know how big are
packs? Well, it's based on how much they
think a person can reasonably carry,
right? And how much we expect someone to
handle for their taxes, that's based on
what we think a human can do. And
>> wow. and and they're and if you think
about neurode diverent individuals they
struggle with challenges with the way
the world's expectations don't align
with them and compared to a neurode
diverent individual these you know these
intelligence are really really different
right and so all of the rough edges
where they don't align with our world
that's why they needed assistant human
assistant in order to accomplish
anything real in our world is just too
hard uh for them
>> currently currently
>> they are I think Yeah. Yeah. Someday
change, but now they're just hopeless,
right?
>> I have to get really, really good. Our
world will have to change. One of those
two things,
>> You know, I agree, I so strongly share this sense. But if you ask me to really pin down why exactly that's the case, when they're beating GPQA experts on these extremely hard science questions and so on, exactly why are they not able to accomplish things in the world? Have you ever met a neurodivergent individual who was terribly good at something and yet completely useless at getting through life?
>> Yeah, yeah. They're all very good at reading books.
>> There are a lot of those people in the world.
>> It's not that surprising.
>> Although my only feeling about AI Village is, it's like saying, well, today is the 200th day my car didn't rocket off the Earth at escape velocity and fly to the moon.
>> Like, that's because you didn't build a rocket yet.
>> Yeah. Maybe I'm mischaracterizing, but I thought there was a lot of talk a year ago about computer-use capabilities being impressive today.
>> There was a lot of talk about it, and yet I've talked to almost nobody who has used them for anything practical.
>> Totally, totally. But if we move this to text only, and it seems reasonable to make it completely text only, would you still have the rocket concern?
>> No, I would... well, it depends on what the task was.
>> Sure.
>> Yeah, the kind of thing that a human could do over a CLI.
So I think this relates to the talk earlier today, where they talked about how one way to use these models effectively is, if you have a task, to figure out a way to present or transform it into something that is in-distribution for the model. I feel like this conversation ties into that: interacting with Chrome is less in-distribution than a CLI. So I think that could be an interesting area of research. If you're interested in exploring how well they can perform these really open-ended tasks, first create harnesses and interfaces that are much more in-distribution for them, so that that's less of a concern.
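As a toy illustration of that idea, here is a minimal sketch of a text-only harness: instead of asking an agent to drive a GUI, the environment is exposed as a small set of named text commands that the model can call through plain strings. Everything here (the tool names, the dispatch loop, the fake `ask_model` function) is hypothetical scaffolding, not any particular product's API.

```python
# A minimal text-only "harness" sketch: the task environment is exposed as
# named text tools, so the model only ever sees and emits plain text.

def list_files() -> str:
    # Stand-in for a real filesystem tool.
    return "notes.md\nbudget.csv"

def read_file(name: str) -> str:
    # Stand-in; a real harness would read from disk.
    return f"(contents of {name})"

TOOLS = {"list_files": lambda arg: list_files(),
         "read_file": lambda arg: read_file(arg)}

def ask_model(observation: str) -> str:
    """Placeholder for a call to an LLM; returns a 'tool: argument' string.
    Hard-coded here so the sketch runs without any API."""
    return "read_file: budget.csv" if "budget.csv" in observation else "list_files:"

def run_episode(steps: int = 2) -> None:
    observation = "Goal: summarize this project's budget."
    for _ in range(steps):
        command = ask_model(observation)
        tool_name, _, argument = command.partition(":")
        result = TOOLS[tool_name.strip()](argument.strip())
        print(f"{command!r} -> {result!r}")
        observation = result

run_episode()
```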
Yeah, I think that also speaks to the point about the quote-unquote neurodivergence of models. It's not so different from management skill: giving appropriately scoped tasks to your very talented interns, or your very talented neurodivergent interns, something like that. I do think that's right. Sorry to be so repetitive, but from the perspective of capability explosions and automating R&D, maybe the models will get extremely good at scoping tasks for themselves so that everything looks benchmark-style. But if they can't do that, well, there are a lot of things that don't look like benchmarks that crop up in the real world, and you need to be able to work with them flexibly if you're going to do something as complicated as automate a major AI company. So I think it can both be the case that the AIs are incredibly performant on some particular type of problem, or on problems you've made similar in scope or shape to the type they're best at, and also that they can't flexibly substitute for human workers, because that substitution requires setting up the problem appropriately yourself, or not having those constraints at all.
>> Yeah, it is interesting, though, to your point about new capabilities, to think of almost another axis on the graph that you have. I wonder if there's not just a time-horizon issue but also a task-category, or type-of-work, axis. Computer use is one of those examples, right? Think about a capability that would require computer use versus a capability that can be accomplished entirely in text.
>> Yeah. So, yeah, sure.
>> But almost all these benchmarks are basically text.
>> Yes, yes. And indeed, the ones that aren't, the ones that require vision capabilities, are notably lacking. I'm not sure exactly what to make of this graph. One thing I make of it is that there's probably not so much variation in slope, or doubling time, across task distributions, though I think there's only weak evidence for that. But in intercepts, the base of where we are now, there's possibly a great deal of variety, especially on image-like capabilities versus not, and physical abilities even more so.
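A minimal sketch of how you might check that "similar slope, different intercept" hypothesis (the categories and values are entirely made up, not METR data): fit a separate log-linear trend of time horizon against release date for each task category, then compare the fitted doubling times and levels.

```python
import numpy as np

# Hypothetical (release_year, time_horizon_minutes) points per task category.
data = {
    "cli_coding":   [(2023.0, 8), (2023.5, 16), (2024.0, 30), (2024.5, 65)],
    "computer_use": [(2023.0, 0.2), (2023.5, 0.4), (2024.0, 0.9), (2024.5, 1.7)],
}

for category, points in data.items():
    years = np.array([p[0] for p in points])
    log2_horizon = np.log2([p[1] for p in points])
    slope, intercept = np.polyfit(years, log2_horizon, 1)   # doublings per year
    doubling_months = 12.0 / slope
    level_2024 = 2 ** (slope * 2024.0 + intercept)          # fitted horizon in 2024
    print(f"{category:13s} doubling time ≈ {doubling_months:.1f} months, "
          f"fitted 2024 horizon ≈ {level_2024:.1f} min")
```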
>> Yeah, right, exactly. You could even go through the senses, right? You could go through something tactile: today they would all score zero. Nothing is tactile, so the graph can't tell you anything about anything tactile.
>> Well, in producing this graph we try to make the models as performant as possible on some held-out set, so, you know, we do try to give them some tactile-adjacent stuff.
>> I'm not sure they'd score zero.
>> Sure, sure. We do have some examples.
>> Yeah, spatial judgments, things like that. But we've obviously seen fine motor control and stuff like that with other robotics work.
It's just, I don't even know if anybody, maybe somebody has, listed out all of the capabilities we would expect in the future. If we actually wanted AGI, what is the entire list?
>> That's a way to start a debate that doesn't end. I think Basil Halperin and Arjun Ramani hopefully have a paper on this out in a small number of months.
>> Yeah. And that would let you think about where we're at, and whether all of the capabilities, all the capabilities we currently measure, follow the same log-linear trend.
>> Yeah, it does seem like a reasonable null hypothesis, to you as well as to me. Not a certainty, I mean, who knows. Yeah.
Oh, there was something I wanted to add there. Here's another thing I'm thinking about. It's not super research-y, although kind of. So, some people like me are somewhat skeptical of a software-only singularity, that is, the idea that you could automate AI research without also automating chip design, and maybe chip production as well. The worry is that you'd quickly get bottlenecked by compute, because for fixed hardware there are only so many experiments you can run that will be sufficiently productive to push progress upwards. But even for people like me who are skeptical of that, you might think that chip production is in fact going to get automated: the robots are coming, they can do the stuff humans do, and then maybe you really do have a fully self-sustaining robots-plus-AI economy. So you'd have some slow trend from compute slowing down, but then a surge back upwards once the whole thing is in a tight loop. One interesting debate I heard about recently, and would like to think about more: in the public discussion there's some sense that robotics capabilities are lagging LLM-like capabilities so much because of training data, something like that, or maybe because of hardware constraints.
>> I'm curious whether it really is to do with hardware constraints; like, what exactly are these hardware constraints? If we put a superintelligence inside, hypothetically, the robot hardware that exists today, could it build chip production facilities? I have no idea, because I'm beyond novice here, but it's not obvious to me what the answer is. I think it's kind of plausible. I'm not sure you need very flexible fine motor control in order to do it, and also I think maybe the fine motor control is there, subject to having a superintelligence controlling it.
>> I mean, to be fair, the key aspects of chip production are already done by machines.
>> Oh, but I'm also thinking about building the robots, and...
>> Yeah, the whole thing, you know.
>> And I'll tell you, I have a friend who spent most of his career doing software development but during COVID started working on manufacturing things to help people, and he found out how hard the manufacturing world is and how slow the iteration process is. He put it like this: he knew it was going to be worse, but he didn't understand that it was next level, like an order of magnitude worse. From our perspective, people who don't do it, it seems like, how bad can it be, right? But the feedback I've had from everybody who actually works in that space is that it's way, way different.
That's what I've heard as well. I've only talked a little with people who work in fabs, but I was surprised when I did talk to them at the level of human expertise required to work at the fabs. A lot of those jobs are fairly high-paying engineering jobs, because that's what it takes to succeed there.
>> Also, the rate of improvement is glacial compared to software, right?
>> I think that's partly because it costs billions of dollars to build a fab.
>> Iteration is a huge cost in time and money. It's brutal, right? So I think that's why it's been hard to get it all the way there. Give them a couple more centuries, maybe they can get it done.
>> Is that really your view? Centuries?
>> Centuries. I do, I do. I think I'm skeptical, like you, about how easy some of these tasks are.
>> Yeah.
>> We think they're easy, but in my experience... I remember when the self-driving thing came out, when people were pushing that. I actually worked in that space for a while, and I get that we can get really close, but getting all the way to something that's acceptable is extremely difficult, right? We underestimate how much work is involved in getting that last little bit done. The first 90%, I knew we could do that with computers ten years ago, pretty much, but getting the last bit, so that everyone's happy with it...
>> Yeah, that's the work.
>> I feel this myself, you know. I didn't get a driver's license when I was younger because I expected self-driving cars to come. So yeah, totally, but it hasn't been that long, you know, and they're expanding to the entire Bay Area.
>> They're going to get there. I don't think it's going to take...
>> But is the robot economy, building out chip production, going to take centuries?
>> I don't know. Well, I could see that it might. Part of the trick with self-driving is that the economic incentive is moving it along faster, right? And probably the same for the robots-building-robots kind of thing, but also...
>> Yeah.
>> You know, where we're at right now, RepRap is about as far along as we've got with robots building robots, right? Which is...
>> Oh, but I feel like, you know, is that paying sufficient attention to the chart? GPT-2: 2019. It's so recent, you know. This is somewhat nonsensical, but I'm like, maybe we're in a sort of GPT-2 moment.
>> No, it's a fair point. I could be wrong. It's just my guess that it's going to take a lot longer than we think.
>> At least to be able to do real mass production.
>> Yeah, at a scale that causes the kind of global impact you're talking about. Right. I think they can already do a great job building one-offs; robots are very good at one-off builds at a small scale, but it's totally impractical at a large scale.
There is one fact I think is kind of remarkable, maybe it's this one, yeah: the rate of growth of compute put into robotics models is about the same as for LLMs, but the levels are about two orders of magnitude lower. I'm curious what we'd see if that gap got closed. It does seem like considerably more capable robots are very much on the table as something that could happen quite soon if it did. I'm not saying all the way, I'm certainly not saying chip production, it just does seem like there's some sort of data and compute overhang there.
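For intuition, a toy calculation (all growth rates hypothetical, not figures from the talk): if robotics training compute grows at the same rate as the frontier, a 100x level gap never closes; it only closes if robotics compute grows faster, and the closing time depends on the ratio of the two growth rates.

```python
import math

gap = 100.0                 # assumed level gap: two orders of magnitude
frontier_growth = 4.0       # hypothetical frontier compute growth per year (x)
robotics_growth = 10.0      # hypothetical faster robotics compute growth per year (x)

# The gap shrinks by (robotics_growth / frontier_growth) each year.
relative = robotics_growth / frontier_growth
years_to_close = math.log(gap) / math.log(relative) if relative > 1 else float("inf")
print(f"years to close a {gap:.0f}x gap: {years_to_close:.1f}")
```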
>> Yeah, intuitively.
>> That's interesting.
>> I'm also thinking that you don't just need to scale data. You can also scale parameters on the same amount of data; there are flexible ways to use compute to close some of that gap.
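A minimal sketch of that parameters-versus-data tradeoff, using the common rule-of-thumb approximation that training compute scales like 6 × parameters × tokens (the specific budgets below are hypothetical): for a fixed data budget, spending more compute means training a bigger model.

```python
def training_flops(params: float, tokens: float) -> float:
    """Common rule-of-thumb approximation: FLOPs ≈ 6 * N * D."""
    return 6.0 * params * tokens

# Hypothetical fixed robotics data budget (tokens-equivalent) and compute budgets.
tokens = 1e10
for compute in (1e21, 1e22, 1e23):
    params = compute / (6.0 * tokens)   # largest model trainable on that data budget
    print(f"compute {compute:.0e} FLOPs -> ~{params:.1e} parameters on {tokens:.0e} tokens")
```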
>> Interesting.
>> Yeah.
>> It just gave me a very interesting overview of where AI is going in fabrication.
>> And what does it say?
>> So it says there are a lot of areas where it's probably going to help pretty dramatically in the near future, and a lot of it is the computational aspects. There are a lot of computational aspects that are extremely expensive, like designing the mask, basically the holes you're using for the laser to get the transistors. Calculating that, how to build it, and ensuring it conforms to the spec you've written is extremely computationally expensive, and there's a lot of opportunity for AI to help there. There's also, theoretically, the possibility, since chip manufacturing is extremely precise but also fragile, for an AI to detect parameters that are out of whack and leading to potential failure in, say, imaging a wafer. That could theoretically dramatically improve yield, and yield is a big problem in chip manufacturing. The reason you get different speeds out of your CPUs is that there's actually just one line producing all those CPUs, and some of them come out better and some come out worse. That's why the higher-gigahertz models are more expensive. Like your Nvidia home GPUs: your 5040, your 5050, your 5060, your 5090 are all the same chip, right?
>> They just had different quality, different tolerances, essentially.
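To make the yield point concrete, here is a toy calculation using the standard textbook Poisson die-yield model (not something discussed in the talk; the die area and defect densities are made up): yield falls off exponentially with die area times defect density, which is why keeping process parameters in check matters so much.

```python
import math

def poisson_yield(die_area_cm2: float, defects_per_cm2: float) -> float:
    """Classic Poisson die-yield model: Y = exp(-A * D0)."""
    return math.exp(-die_area_cm2 * defects_per_cm2)

die_area = 6.0  # cm^2, hypothetical large die
for d0 in (0.05, 0.10, 0.20):  # hypothetical defect densities per cm^2
    print(f"D0={d0:.2f}/cm^2 -> yield ≈ {poisson_yield(die_area, d0):.0%}")
```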
>> Yeah. But the problem is that...
>> We have to cut the recording. They're going to kick us out soon, but feel free to continue the discussion.
>> Yeah, cool.
>> You can also hang out here, but I'm just going to...
The discussion centers on the relationship between compute growth and AI capabilities, noting that a slowdown in compute (due to physical or financial constraints) could significantly delay AI milestones. Studies on AI's impact on developer productivity, particularly for experts on complex open-source projects, show mixed results, with skepticism about self-reported speed-ups and observations of initial slowdowns. AI also struggles with messy, internal corporate data, limiting its immediate value for data scientists. The conversation explores new methods for measuring AI capabilities, including "in-the-wild" transcripts and "AI Village" simulations, often revealing current AI limitations in open-ended, real-world tasks. An interesting analogy posits AI as neuro-divergent individuals struggling to operate in a human-designed world. Finally, the automation of complex manufacturing processes, such as chip production, is viewed as an immense challenge that could take centuries, despite AI's potential to assist with computational design and yield improvement.