Uber: Leading engineering through an agentic shift - The Pragmatic Summit

Transcript

0:00

Awesome. Good afternoon, folks. Thank you so much for joining; I really appreciate it. My name is Anshu. I lead the developer platform organization at Uber, and Ty is a principal engineer. He's been one of the leading engineering voices that's led our agentic shift, and also our overall AI strategy over the past couple of years.

0:25

All right, let me try to get out of the way. We have a pretty packed agenda, but I want to get to the end as fast as possible, because I'm curious about your perspective on what I'm going to talk about. We're going to walk through what has motivated our push into agentic AI and some of the key ROI that we've realized. And there is key ROI; I'm really excited about the impact that we've realized for Uber. Then I'm going to hand off to Ty, and he's going to get into the specifics: the actual technologies that we built or integrated that have resulted in that impact. And then I'm going to end the talk with some of the non-technical challenges that we've been dealing with: organizational and cultural challenges (basically people challenges), measurement, and of course cost.

1:15

Okay. So AI is not new to Uber. Our fares platform and our matching platform have been using AI methodology for years and years. It's one of the things that sets us apart from the competitors. But over the past few years, using AI as part of engineering productivity and the engineering life cycle, that's fairly new. And AI's integration into not just engineering but all aspects of what Uber employees do has become a bigger and bigger part of what we want to focus on. So much so that Dara has stated that AI is one of our six strategic shifts: we move from a human-plus-early-AI-powered company into a generative-AI-powered company.

2:04

Now, I will say that term is really passé these days; nowadays it's more fashionable to be an agentic-powered company. But the concept still holds: from the metrics and the data that we've gathered, Dara made this quote, which is that AI is enabling people to become superhuman in terms of their productivity and the impact that we can realize for our end users. So from that standpoint, we want to enable all tasks that people do at Uber to be supported by generative AI, to augment human productivity.

2:40

And that last part is really important, because what we're not pushing for is AI automating away all the humans in the company, right? Especially on the engineering side, what we found is that we want to focus on enabling our engineers to do creative work rather than toil. I'm going to get into some of the metrics and the impact that we realized from that.

3:06

the as we've unlocked AI, what we found

3:09

is when we push some of the boring stuff

3:13

to it, upgrades, migrations, bug fixes.

3:18

Um, not only does it result in much

3:20

higher satisfaction from our engineers,

3:21

they're able to push our product and

3:24

create features for end users in ways

3:26

that we didn't even thought was possible

3:28

and at velocities that were have just

3:30

been incredible. So, this has been the

3:32

place that we've really been uh been

3:34

doubling down on. Now one of the reasons

3:37

Now, one of the reasons we've been able to push on this is the capabilities of the technology and the industry, and the first part of it is going from pair programming to peer programming. If you think back to the good old days of 2022 and 2023, when GitHub Copilot first came out, it was a pretty novel way of augmenting development. You had a system where you could do synchronous tab completion, plus an IDE chat window, that would help developers move faster. And we saw it ourselves in our metrics: we saw maybe a 10 to 15% bump in overall diff velocity.

4:18

That's pretty phenomenal. But this by itself didn't push us in the direction we've seen over the past year or so, where the paradigm has shifted to peer programming: you can hand off workloads that run asynchronously, and the models that we use are so good, so accurate, that all you need to do with the AI agent is redirect it in certain ways, maybe give it some course correction.

4:49

Um, and that's that's uh all culminated

4:52

in this model where we imagine

4:55

developers acting as their own tech

4:57

leads, right? Developers are are uh

5:00

directing AI agents using a variety of

5:02

the different models and capabilities

5:03

that are available to be able to execute

5:06

uh asynchronously and come back for for

5:09

direction. Now, this doesn't work for

5:11

every single task, but again, when we

5:13

think about some of the toil work that

5:15

developers need to do, dead code

5:17

cleanup, writing docs, library

5:19

migrations, these all seem basic um

5:23

these are basic operations, but they're

5:25

absolutely essential for maintaining a

5:27

healthy code base, but they don't by

5:29

themselves help to grow the business.

5:32

So, pushing this workloads to AI agents

5:34

in effect helps to grow the business.

5:38

Another thing that's really helped is the growth of the capabilities of the systems. On the right we see a logarithmic chart that shows how long models have been able to execute over time: we go from less than one second to agents that can operate for hours and hours. And that's helped with this paradigm shift, where, again, when Copilot was first introduced it was still augmenting traditional development. But as the capabilities have gotten better, we've seen the concept of vibe coding, which was just a joke a couple of years ago, become much more prominent and much more of a serious concept. And it has shown up at companies like Ramp (I know they have a talk today), and other companies have had similar examples. In fact, even Anthropic talked about how they were able to release Cowork very quickly using a variety of agents. This example is not unique, but it is representative of where the industry has been going so far.

6:40

Okay. So talking about toil um Tai's

6:43

going to talk about the um the Agentic

6:45

system that we've deployed out. As soon

6:47

as we made our Agentic workflows

6:49

available to developers, what we saw is

6:52

70% of the workloads that developers are

6:54

pushing into the system were toil tasks.

6:57

There's a couple reasons for that. One

6:58

is the accuracy of these tasks uh was

7:02

much higher compared to the more

7:04

ambiguous workloads. And it makes sense,

7:06

right? like the start and end state of

7:08

some of these tasks if you think about a

7:10

library upgrade or a migration is much

7:13

more straightforward versus say building

7:15

a brand new feature or an experience out

7:17

that requires an experiment because the

7:20

accuracy was higher. Developers were

7:22

more likely to push more workloads into

7:24

the agent system that were toil oriented

7:26

and it became a virtuous cycle that that

7:28

we saw. Uh and based on that, based on

7:31

the success, we we uh pushed for making

7:34

this as one of uh my org's developer

7:37

platform's top priorities in terms of AI

7:39

augmentation.

7:41

Okay, I'm going to hand off to Ty now to get into the specifics.

7:44

>> I'll start by saying that we are not building this in isolation at Uber. Uber has had a long history of building AI solutions and a lot of engineers across the organization building infrastructure. So we really see ourselves as building on the shoulders of giants. We have our historic Michelangelo platform, which has had some public content in the past, that provides things like a model gateway, so that we can proxy and talk to the main frontier models or host internal models, plus traditional inference and training platforms and all the other things you would expect in an ML platform. Within the last couple of years it has really started to lean more into the agentic side and the APIs that we're using to talk to OpenAI and Anthropic and those folks.
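
To make the gateway idea concrete, here's a minimal sketch of what calling a central model gateway can look like from application code. The gateway URL, the token, and the model alias are hypothetical; the pattern of pointing an OpenAI-compatible client at an internal proxy that handles auth, quotas, and logging is the real point.

```python
# Minimal sketch: route all LLM traffic through a central model gateway.
# The gateway URL, auth token, and model alias below are hypothetical;
# the pattern is an OpenAI-compatible proxy owned by the platform team.
from openai import OpenAI

client = OpenAI(
    base_url="https://model-gateway.internal.example.com/v1",  # hypothetical internal proxy
    api_key="service-identity-token",  # issued by internal auth, not a vendor key
)

resp = client.chat.completions.create(
    model="frontier-default",  # gateway alias resolved to a vendor or self-hosted model
    messages=[{"role": "user", "content": "Summarize this stack trace: ..."}],
)
print(resp.choices[0].message.content)
```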

8:29

On top of that, we have a lot of the

8:31

traditional infrastructure and context

8:33

at Uber that we would want to take

8:34

advantage of. Things like having access

8:36

to our source code, our engineering

8:38

documentation, uh, Jira tickets, Slack

8:41

information, like these are all things

8:42

that to have an effective agent have

8:45

organizational memory, it needs to start

8:47

to get access to. Um, one of the key

8:49

ones that I'll dig a little more into in

8:51

the later slides is our deployment of

8:53

MCPs throughout Uber. Um on top of that

8:56

On top of that, we see a lot of industry agents. We really take the perspective of trying to enable the latest and greatest for our engineers: allowing them to experiment, to have a learning culture, and to use best-in-class tools. That means there are a lot of clients coming in that folks are using, and we use a lot of those to build specialized agents. This could be our background agent platform that we're going to be talking about, our test generation platform, or many other kinds of internal ones. And then at the top of that we have the "engineering enablement" phrase that's been going around the industry: measurement and cost control and education and everything else that you would expect.

9:36

So let's dig in a little bit to how we think about MCPs. This became a very popular piece of technology in the industry last year, and we moved very quickly to make sure it was deployed and secure for our engineers, so that we could make them as productive as possible. We ended up putting a tiger team together from across the company that designed a strategy and built a central MCP gateway. This allows us to proxy both external and internal MCPs from our service infrastructure and expose them in a consistent way to engineers, handling things like authorization, telemetry, logging, and everything else that you might expect. We also provide a registry and a sandbox, so that developers can come in, play with these MCPs, make sure each one is going to do what they're expecting, and discover new ones.
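
As a rough illustration (not Uber's actual implementation), a central MCP gateway boils down to a registry of upstream servers plus a proxy layer that enforces authorization and records telemetry before forwarding tool calls. All names and fields below are hypothetical.

```python
# Rough sketch of a central MCP gateway: a registry of upstream MCP servers
# plus a proxy hook that enforces authorization and emits telemetry.
# Everything here is hypothetical and illustrative, not Uber's real code.
import logging
import time
from dataclasses import dataclass, field

log = logging.getLogger("mcp-gateway")

@dataclass
class McpServerEntry:
    name: str            # e.g. "jira" or "code-search"
    upstream_url: str    # where the real MCP server lives (internal or vendor)
    allowed_groups: set[str] = field(default_factory=set)

class McpGateway:
    def __init__(self) -> None:
        self._registry: dict[str, McpServerEntry] = {}

    def register(self, entry: McpServerEntry) -> None:
        self._registry[entry.name] = entry

    def call_tool(self, caller_groups: set[str], server: str, tool: str, args: dict):
        entry = self._registry[server]
        if not (caller_groups & entry.allowed_groups):   # authorization check
            raise PermissionError(f"caller not allowed to use {server}")
        start = time.monotonic()
        try:
            return self._forward(entry.upstream_url, tool, args)  # proxy to upstream
        finally:                                          # telemetry on every call
            log.info("mcp_call server=%s tool=%s ms=%.1f",
                     server, tool, 1000 * (time.monotonic() - start))

    def _forward(self, url: str, tool: str, args: dict):
        raise NotImplementedError("transport (HTTP/stdio) elided in this sketch")
```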

10:32

Continuing

10:34

on with that, we also have at Uber uh

10:38

through our ML our Michelangelo platform

10:41

built agent uh the ability to build

10:43

agents with both SDKs and with no code

10:46

solutions, building agent builders, the

10:48

ability to visualize, have telemetry, do

10:51

tracing, um that way as folks around the

10:54

company are building some of these

10:55

solutions that have access to uh the

10:58

internal services, the data sets, we can

11:01

reuse these in other systems. This can

11:03

be discoverable and that we can provide

11:06

a registry that's then can be found by

11:08

other engineers or uh non-engineering

11:10

alike to to deploy these and they're

11:13

deployed consistently in a lot of our

11:15

environment. This can be uh from our uh

11:18

devpod infrastructure which is our

11:20

remote um dev environment local laptops

11:23

uh through our background agent which is

11:24

called minions. We're going to introduce

11:25

that here in a minute or deploy these in

11:28

production.
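
A hedged sketch of what a registry entry for such an agent might carry: enough metadata for another team to discover it, see what context it touches, and deploy it consistently. All field names are illustrative assumptions, not the Michelangelo schema.

```python
# Hypothetical sketch of an agent registry entry and publication. The fields
# are assumptions; the point is discoverability plus consistent deployment.
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentSpec:
    name: str                      # e.g. "test-generator"
    owner_team: str
    entrypoint: str                # SDK module or no-code flow id
    mcp_servers: tuple[str, ...]   # context the agent is allowed to reach
    environments: tuple[str, ...]  # e.g. ("devpod", "laptop", "minions", "prod")

REGISTRY: dict[str, AgentSpec] = {}

def register(spec: AgentSpec) -> None:
    """Publish an agent so other engineers (or tools) can discover and reuse it."""
    REGISTRY[spec.name] = spec

register(AgentSpec(
    name="test-generator",
    owner_team="dev-platform",
    entrypoint="agents.testgen:main",
    mcp_servers=("code-search", "ci-results"),
    environments=("devpod", "minions"),
))
```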

11:31

So we talked about the agents, the kind

11:34

of registry there, the MCPs, the

11:35

registry there, all the different uh

11:37

agent clients that our engineers might

11:39

be using, be it cloud code or or codecs

11:42

or cursor. Uh, one thing that we

11:44

recognized we needed to platformize

11:45

pretty quickly was a central ability to

11:48

uh, provision and configure and update

11:52

uh, the clients, the agent clients

11:53

themselves, the ability to install and

11:55

discover MCPs from the registry,

11:57

configure those inside of the agent

11:59

clients, uh, deploy standard

12:01

configuration management so that people

12:03

who are just new to the space are having

12:05

more effective uh, prompts and uh,

12:08

configurations right away and management

12:10

in connection into our background. task

12:13

infrastructure. Uh so we built this tool

12:15

called AIFX. It's a CLI. Uh and it is

12:18

the kind of the forefront of what

12:20

developers are using to access our agent

12:23

infrastructure.
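
AIFX itself is internal, so its flags and behavior aren't public. But the core job described here, writing a vetted MCP list into each agent client's local config, can be sketched as follows; the file path and config schema mimic common clients and are assumptions, not AIFX's real behavior.

```python
# Hypothetical sketch of client provisioning: merge a centrally vetted MCP
# server list into an agent client's local JSON config. Path and schema are
# assumptions modeled on common clients, not what AIFX actually does.
import json
from pathlib import Path

VETTED_MCP_SERVERS = {
    "jira": {"url": "https://mcp-gateway.internal.example.com/jira"},
    "code": {"url": "https://mcp-gateway.internal.example.com/code-search"},
}

def provision(client_config: Path) -> None:
    config = json.loads(client_config.read_text()) if client_config.exists() else {}
    servers = config.setdefault("mcpServers", {})
    servers.update(VETTED_MCP_SERVERS)          # add or refresh the standard set
    client_config.parent.mkdir(parents=True, exist_ok=True)
    client_config.write_text(json.dumps(config, indent=2))

provision(Path.home() / ".claude.json")  # hypothetical client config location
```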

12:25

So let's take a minute, before I jump into our specific product, and think about the traditional developer workflow. If you looked at how people were spending time: a little bit in planning, historically probably a lot in code authorship, and then a small amount in review. And typically they'd be in an edit-build-run loop: editing their code, building it, and doing the verification using standard IDEs. Now, of course, this has been changing significantly in the agentic world.

12:55

If we look at what the first agent workflow looked like, it might look something like this: you have a developer who's in the middle of using Cursor or Claude Code. They're giving a prompt, it's asking for approval to proceed and run commands, and they're very interactive, in the loop, trying to drive it to an outcome that they want.

13:16

but what we're seeing emerge now uh in

13:18

the industry and at Uber is both

13:21

background agents that are running fully

13:23

autonomously as well as the ability for

13:26

uh multiple of these to be run at once.

13:28

Right? This gets into this place where

13:30

as an engineer, you're giving a prompt.

13:32

You're waiting for something uh to

13:34

you're waiting for some time while it's

13:36

running. You're thinking, "Oh, what am I

13:37

going to do? Am I going to go have a

13:39

coffee or browse Reddit? Uh, might as

13:41

well kick off another background agent."

13:42

Uh, and so they get into this mode of

13:44

the the new flow looks like running

13:46

several agents at once, right? Um, this

13:49

This sounds great, and I think we and a lot of the industry are trying to push towards it, but a lot of challenges start to emerge with this different way of working. One of them for us was that we wanted these background agents to be running autonomously, and looking at the external vendors offering tools like Cursor and Claude Code and Codex, all of them were running their background agents on other people's infrastructure. And while we can get there, while that may make sense long term, having the ability to bootstrap on our own infrastructure was really important to us and allowed us to move really quickly. So we built a product called Minion. Minion is our formal background agent platform. It's built on top of state-of-the-art agent CLIs and SDKs, and it leverages all of Uber's existing infrastructure: it runs on our CI platform, it has our monorepos checked out and ready to work in quickly, it handles all of the network access into the rest of the infra, and it allows connection to all of those MCP servers we talked about earlier, through AIFX.

14:55

Um it's integrated for the developer in

14:57

a bunch of different work workflows and

14:59

uh panes of glass. There's the web

15:01

interface we're looking at here, which

15:03

is one of the main um interaction

15:05

paradigms, but it's also available

15:07

through Slack, uh through GitHub PRs in

15:10

the code review process, through the CLI

15:12

that we saw earlier, and we have APIs

15:15

exposed so that it can start to be uh

15:17

connected to by other workflows and

15:19

other services uh throughout the rest of

15:21

Uber. And one other powerful thing is

15:24

this offers good defaults. So when

15:26

people are coming here and kicking off

15:27

these background jobs, you know, they're

15:29

giving a prompt. They're expecting a PR

15:31

out of this. They may not be giving the

15:34

the ideal prompt or have the ideal setup

15:36

for it. And we can provide great

15:37

defaults for each of our monor repos,

15:39

make sure that this is more likely to

15:41

have a successful task that the engineer

15:44

is authoring than if they um were, you

15:46

know, if they just did this locally and

15:48

they didn't have a lot of the u cloud MD

15:50

setup or the other context that may want

15:53

to provide. So, let's walk through a
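
The talk mentions APIs so other services can kick off tasks; here's a hedged sketch of what submitting a background task programmatically might look like. The endpoint, payload fields, and auth are all illustrative assumptions.

```python
# Hypothetical sketch of kicking off a Minions background task via an internal
# API. Endpoint, payload fields, and auth are illustrative assumptions only.
import requests

task = {
    "repo": "go-monorepo",            # which monorepo to check out
    "agent": "claude-code",           # which agent CLI runs the task
    "prompt": "Fix the crash a user reported on macOS; error log attached below.",
    "output": "github-pr",            # expected artifact
    "notify": "slack:@reporter",      # where completion pings go
}

resp = requests.post(
    "https://minions.internal.example.com/api/v1/tasks",  # hypothetical endpoint
    json=task,
    headers={"Authorization": "Bearer <service-token>"},
    timeout=10,
)
resp.raise_for_status()
print("task id:", resp.json()["task_id"])  # then poll, or await the Slack ping
```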

15:55

So let's walk through a quick demo of what using Minions is like, with an example. We have this web interface, and this is one that I actually ran. We had a user report an error. They said, "Hey, this is crashing on my machine when I run this command. Here's what the error is." And I threw that into Minion. I said, "Hey, we're having this issue. The user's on a Mac. Here's the error they're seeing. Here's the command that was run." You can see a few things here that are cool. One, we have these existing templates that users can choose from, which are well-written prompts with placeholders they can fill in. We have the ability to choose and run in our different monorepos. It can run either on a branch, or we can switch it to a follow-up task on existing PRs or diffs. We have all the task history here. I can select the agent; in this case, I'm going to run it in Claude Code and put it out as a GitHub PR. We've been in a long, multi-year migration from Phabricator to GitHub, so having this dual mode is important for our internal engineers. And one interesting thing you'll see is this red icon here. What it indicates is that this wasn't a great-quality prompt, and it would have a lower chance of success. So one tool we built into this was a prompt improver: the ability to analyze the prompt and make suggestions, which the user can accept, on how to have a higher chance of success.
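
A hedged sketch of the prompt-improver shape: run a cheap grading pass over the prompt before the expensive background run. The rubric, model alias, and gateway URL are assumptions, not Uber's implementation.

```python
# Hedged sketch of a prompt improver: grade a task prompt against a rubric
# and suggest a rewrite. The rubric and model alias are assumptions; the
# pattern is a cheap LLM pass before the expensive background run.
import json
from openai import OpenAI

client = OpenAI(base_url="https://model-gateway.internal.example.com/v1",
                api_key="service-identity-token")  # hypothetical gateway, as above

RUBRIC = """Score this coding-task prompt from 1-5 on: concrete goal,
reproduction steps or error text, affected repo/paths, and definition of done.
Return JSON: {"score": n, "suggested_rewrite": "..."}"""

def grade_prompt(prompt: str) -> dict:
    resp = client.chat.completions.create(
        model="small-fast",  # cheap model; grading doesn't need a frontier model
        messages=[{"role": "system", "content": RUBRIC},
                  {"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

result = grade_prompt("fix the crash")  # would score low: no error text, no repo
```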

17:18

Now once that kicks off, this is

17:20

running, you know, background agents can

17:22

take a little bit of time. We ping the

17:24

users on Slack, uh give them links so

17:26

that they can go ahead and track this.

17:28

And a few minutes later, uh in this

17:30

case, it was 7 minutes later, the Slack

17:32

notification pings them again. It says,

17:33

"Hey, the minion task is done. You can

17:36

go look at uh the PR here. You can go

17:38

look at the artifacts."

17:40

Let's not go to the PR quite yet; let's jump back into the task completion. I have a view here now where I can see what ran, and I can investigate the agent logs if I need to. If this failed, I can retry or create follow-up tasks: say it failed, then I can search through the logs here, start to understand what the agent was doing, and maybe give a follow-up. But in this case it was successful, and we got a PR out of it immediately. It was a very straightforward one. Our Minion bot co-authors it with the person who kicked it off. Here we have a linked Jira ticket; we have the test plan of how it verified the change; you can see which agent authored it (Minion was running Claude); and it was a very straightforward fix that we got. So this was a very simple workflow, but it was much easier for the developer to just dump in a prompt, "Hey, here's a problem the user is having," and get a PR out of it, as opposed to all of the context switching they would traditionally need to do.

18:41

Right now the work the workflow for the

18:44

developers has changed and is changing

18:46

further. They're spending more and more

18:47

time in planning and code review because

18:51

there's so much more code being

18:52

generated that they're being forced to

18:54

do it. This probably isn't the favorite

18:57

type of work that developers love doing

19:00

code review. And there's a lot of

19:02

challenges with that. If people are

19:04

doing code review and it's taking more

19:05

time, they're maybe slowing down. they

19:08

maybe let more bugs in because they're

19:09

missing it in review because there's

19:10

much more. So let's jump into a few of

19:13

the investments we made to try to

19:15

improve that.

19:17

One of the big problems is context switching among all the background agents. This could be on PRs that are coming out, or the agent itself needing attention. So we built a tool called Code Inbox, designed to help with this situation. It's a unified inbox for PRs that a developer needs to review. What's interesting about it is that it's designed to remove noise: it surfaces only the actionable items that are directly relevant to a user, and only when they need attention, not when they're sitting there waiting for someone else.

19:48

And we put a lot of work into smart assignments with Code Inbox, so that we try to find the most relevant person to review the code, both from an ownership and compliance perspective and also based on the history of how that person has been working, their time zone, and their calendar availability. We try to find the right person and assign, and then have strict SLOs that we track, so they can see how long it's been assigned, help reassign, and do automatic reassignment or escalation if necessary. It also does a smart job with the Slack notifications to devs: batching notifications so they don't see a bunch of noise, accounting for their focus time so it's not bothering them in the middle of it, or their holiday time if they're out. It'll also handle integrating this into teams' existing processes: if teams have existing Slack usage with code review queues, we can plug directly into that and inject the reviews at that level as well.
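
As a hedged sketch of the assignment idea: score candidate reviewers on ownership, familiarity, time zone overlap, and availability, then pick the best and track an SLO. The signals and weights below are illustrative assumptions, not Uber's actual scoring.

```python
# Hedged sketch of reviewer selection: score candidates on ownership, recent
# activity in the touched files, time zone overlap, and calendar availability,
# then pick the best. Weights and fields are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Candidate:
    user: str
    owns_code: bool          # from ownership / compliance rules
    touched_recently: bool   # history with these files
    tz_offset_hours: int     # relative to the PR author
    free_now: bool           # calendar / focus-time signal

def score(c: Candidate) -> float:
    s = 0.0
    s += 5.0 if c.owns_code else 0.0                  # ownership and compliance first
    s += 2.0 if c.touched_recently else 0.0           # familiarity with the change
    s += max(0.0, 2.0 - abs(c.tz_offset_hours) / 4)  # prefer overlapping hours
    s += 1.0 if c.free_now else 0.0                   # don't interrupt focus time
    return s

def assign(candidates: list[Candidate]) -> str:
    best = max(candidates, key=score)
    return best.user  # downstream: track an SLO, escalate or reassign if it expires
```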

20:49

Some of the other cool stuff we built into this: we try to understand the risk of the change, and we're going to continue to invest in that. There's a much different risk profile to a small change in tests versus a change in one of our key services. So we try to highlight that here by analyzing the surface area, the blast radius, how much the change is going to affect, and what type of service it's hitting, and then make those estimates so that we can raise them to the developer. They might put more scrutiny on the review, bring in another person, or make whatever decision they want for a riskier change.

21:24

In the code review space, I want to move on to a second product. We just talked about notifications and bringing context awareness; this product is called uReview, and it's aimed more at the review help itself. There are a bunch of external products right now, and we've all seen them in the market, everything from CodeRabbit to Graphite. All of those are trying to solve this problem, and we've played with a bunch and will continue to use external ones as well. But what we found is that we had a lot of internal context, and a lot of complexity like the migration between Phabricator and GitHub, that made it make sense for us to have a platform where we controlled the surface area for the comments coming through. Here's how it works: we have a pre-processor for the code, and at that point we have a set of plugins that run. There can be general defect bots analyzing the code, and plugins pulling from best practices or MCPs or other types of information around the organization.

22:22

Uh, and we also have an API so that we

22:24

can plug in external bots. So, if we

22:26

were using, you know, one of those

22:28

external code review tools, we can just

22:30

plug it into the API here and have it

22:32

surfaced with the rest of the comments

22:33

that are coming in to the developer to

22:36

help minimize duplicate comments or or

22:38

extra noise. that then runs through a

22:41

review grader. Um, this has been one of

22:43

the co common problems that we've seen a

22:45

lot of lowv value comments surface from

22:47

those because they'd rather uh give

22:49

something to the developer to do even if

22:51

it isn't maybe necessary. Uh, and we

22:54

really only want to put the high

22:55

confidence changes that the developer

22:57

really needs to focus on, not little

22:59

nits. And so this continues through the

23:01

flow. It looks for duplicates from these

23:03

different systems uh, and finally

23:05

categorizes these. Now each one of these

23:08

layers we've done evaluations and have

23:10

different models running based on the uh

23:13

the performance that each model has on

23:16

the type of behavior.
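
A hedged sketch of the pipeline shape just described: pre-process, fan out to plugins (internal bots plus external tools via an API), grade for value, de-duplicate, categorize. Names and the confidence threshold are assumptions, not uReview's implementation.

```python
# Hedged sketch of a review pipeline in the shape described: pre-process,
# fan out to plugins, grade each comment for value, drop low-confidence nits,
# then de-duplicate. Field names and thresholds are illustrative assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class Comment:
    file: str
    line: int
    text: str
    source: str          # which plugin or external bot produced it
    confidence: float    # set by the review grader

def run_review(diff: str, plugins, grader, min_confidence: float = 0.8):
    context = preprocess(diff)                        # shared input for all plugins
    comments = [c for plugin in plugins for c in plugin(context)]
    graded = [grader(c) for c in comments]            # score usefulness per comment
    kept = [c for c in graded if c.confidence >= min_confidence]  # drop nits
    deduped = {(c.file, c.line, c.text.lower()): c for c in kept}.values()
    return sorted(deduped, key=lambda c: (c.file, c.line))

def preprocess(diff: str) -> dict:
    # Elided in this sketch: language detection, hunk parsing, ownership lookup.
    return {"diff": diff}
```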

23:19

This has been something we've been

23:20

working on for most of last year. So we

23:22

saw uh some growth uh and some progress

23:24

in the system as as it matured. Um, one

23:27

we saw that we were able to get higher

23:30

quality comments uh at a higher rate as

23:33

as we invested in integrated uh

23:35

additional best practices and other

23:37

rules. Um, we also saw the rate of the

23:40

comments and the best practices increase

23:43

while maintaining a high rate of uh

23:46

comments being addressed. uh this is the

23:49

the specific piece of feedback that

23:51

we're looking at to make sure um that

23:53

this isn't noise that developers are

23:54

actually fixing these and it's not uh

23:56

just annoying them. And then here's a

23:58

screenshot in fabricator not in GitHub

24:00

where I mentioned we have to have the

24:02

dual kind of UI uh because of the two

24:04

systems at the moment. Uh and so this

24:06

was a a custom one we built to try to

24:08

have a feedback loop for the developer

24:12

in the code review space. it would it

24:15

It wouldn't be complete if I didn't talk about verification and validation: CI and tests. I think that's the other big part we're really concerned about, to make sure mistakes aren't slipping through code review as more code comes in. So we built a system called AutoCover, which we've talked a little bit about in the past (actually, I saw its author around here somewhere). This was a system we designed to generate unit tests. Now, you might say, well, you can just do that with Claude Code or many other products, and you can. But what we found is that by really focusing on this project, building a custom agent on top of our internal LangX SDK built on LangChain, we were able to get much higher-quality unit test output. At this point we're seeing about 5,000 tests generated and merged per month around the company from this, and almost a 3x rate of quality versus something generated by your typical generic agent. Now, as we were doing this, we were quite concerned about bad-quality tests, change-detector tests, things like that coming in. So we built a critic engine into this: it has both the generation engine and the critic engine. And we separated the critic out into an independent test validator that developers can now use on its own, whether it's a human-generated test or an AI-generated test. That's great for upleveling test quality in general and avoiding any false confidence we might get from higher coverage numbers.
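
A hedged sketch of the critic-as-independent-validator idea: a generator proposes a test, and a critic rejects low-value tests before anything merges. The specific signals below (build status, mutation testing, coverage delta) are illustrative assumptions about what such a critic could check.

```python
# Hedged sketch of an independent test validator: reject tests that don't
# build, assert nothing meaningful, or add no coverage. Works the same on
# human- and AI-written tests. The signals are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class TestReport:
    builds: bool                 # test compiles and passes on current code
    fails_under_mutation: bool   # test catches a seeded behavior change
    new_coverage_lines: int      # coverage added beyond existing tests

def critique(report: TestReport) -> list[str]:
    """Return reasons to reject; an empty list means the test looks valuable."""
    problems = []
    if not report.builds:
        problems.append("does not build or fails on current code")
    if not report.fails_under_mutation:
        # A useful test should fail when the target's behavior is changed;
        # if it survives mutation, it is likely asserting nothing meaningful.
        problems.append("asserts nothing meaningful (survives mutation)")
    if report.new_coverage_lines == 0:
        problems.append("adds no new coverage")
    return problems

# Usable as an independent gate in CI, regardless of who wrote the test:
print(critique(TestReport(builds=True, fails_under_mutation=False, new_coverage_lines=12)))
```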

25:46

So I'm going to talk about one more

25:47

category before I hand it back to Anu.

25:49

Uh we've talked about um you know

25:51

authoring code initially and the code

25:52

review process but code maintenance is a

25:55

big area and you know at the beginning

25:57

an was talking about the toil work. This

25:59

is where a lot of folks kind of consider

26:01

toil the heaviest.

26:03

As we were looking at how to build this out, we looked at the space. We looked at the messages coming out from other companies, where their CEOs are going in front of the news, or to their boards or their investors, and saying x% of our code is generated by AI now. We looked at that and asked, well, how is some of this done? Some of these were very mature companies, like Google or Meta, and we'd had a lot of discussions and seen that they had fundamentals that Uber hadn't invested in before: the ability to scale out large-scale changes, so that the AI can then build on top of that. So we got together last year and decided we needed to run a big program to create a scalable version of how we handle large-scale change. We called this AutoMigrate.

We broke the program up into four key areas. First is problem identification, where someone looks at a migration or an upgrade and decides what the risk of the migration is, what the surface area of the change is, and how to cut up the PRs so that they make sense and de-risk the rollout. Then you have the code transformer piece, which could be an agent (we could be using Claude Code or any others), but it could also be something deterministic like OpenRewrite, which we've made a lot of investments in. Then it gets into the validation phase, where we need to understand how we get confidence that the automated change is going to be successful without relying solely on human review. This might be CI, or unit tests, or sometimes even staging or production signal; it's a key area as we think about this. And then finally, campaign management was something we specifically needed to build from scratch.

27:42

If you have 100 PRs that need to go out to developers for a migration, how do you get them all into the right spot? How do you track them? How do you make sure the right folks are notified? How do you refresh them? This became the key of the platform that we called Shepherd. Here's the Shepherd experience: the surface is a web UI where developers can go, migration authors specifically, and track all of the PRs associated with a migration. It allows them to define a migration simply through a YAML file, where they can either give a prompt, if it's an agent, or point it to the script that's going to handle it. And then Shepherd takes care of generating those PRs, refreshing them on whatever cadence you defined, keeping them fresh for the developers, notifying the people that need to review them, getting them into the right queues, and integrating with Code Inbox, the last product I showed.
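
To illustrate the YAML-defined campaign idea, here's a minimal sketch of loading such a definition in Python. All field names are hypothetical, not Shepherd's schema; the point is a prompt-or-script transformer plus validation and a refresh cadence.

```python
# Hedged sketch of a campaign definition like the YAML file described, loaded
# and validated in Python. Field names are hypothetical assumptions.
from dataclasses import dataclass
import yaml  # PyYAML

EXAMPLE = """
name: java-21-upgrade
transformer:
  kind: openrewrite          # or "agent", with a prompt instead of a script
  script: recipes/java21.yml
validation: [ci, unit-tests]
refresh_cadence: weekly      # how often stale PRs are regenerated
"""

@dataclass
class Campaign:
    name: str
    transformer: dict
    validation: list[str]
    refresh_cadence: str

def load_campaign(text: str) -> Campaign:
    raw = yaml.safe_load(text)
    return Campaign(**raw)   # in practice: schema validation, defaults, etc.

print(load_campaign(EXAMPLE))
```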

28:36

Let's walk through two quick demos of this. Here's a PR using one of the deterministic transformers; this one used OpenRewrite. We had Shepherd generate all of the PRs to move our Java services to Java 21. Here we had a PR generated that correctly found the owners and created a limited PR, scoped to just that code owner's area, that upgrades it to Java 21. We can see what was generated here, and it's a small change needed for that upgrade.

29:07

Here's one more that's similar, but in this case it's using the Minions platform, integrated as an agent. We have separate tools in our programming systems group to do analysis and find performance issues, and a lot of these generate really great data. One of these we called Dr. Fix (actually, it's a different one; sorry, it's not Dr. Fix), but it identified these performance issues and was able to generate a lot of PRs, or diffs in this case, to account for them, run them through Shepherd, and have a standard format that accounts for how the change was tested, how it was verified, and what the developer needs to know to review it safely.

29:46

And with that, we've walked through the major deep dive. I want to hand it back to Anshu to talk about the couple of last topics.

29:53

>> All right. Um, so Tai talked through a

29:56

lot of the engineering investments that

29:58

we made to make um, you know, the the

30:00

agentic shift. Uh, I'm going to talk

30:02

through a few um, non-technical

30:05

challenges that we're still dealing

30:06

with. First up is uh, on the people side

30:10

um, and the business side. So on the

30:12

business side um, I have a diagram here

30:14

that uh, it's very topical since the

30:16

Olympics are on right now. um the the

30:19

leaders when it comes to AI tech is are

30:22

changing pretty frequently. Um you know

30:24

the the models that are the most

30:26

powerful for certain tasks whether you

30:29

should build something inhouse versus

30:32

use a SAS provider. These uh these

30:35

decisions need to be re revisited on a

30:38

pretty regular basis. Unfortunately, in

30:40

a large organization like ours, some of

30:42

the investments that Tai talked about,

30:43

whether it's autocover, auto migrate,

30:46

these are not trivial decisions to make.

30:48

We need to commit dozens of people on

30:50

projects that might uh be running for

30:53

months. So, we can't just change our

30:55

mind after a quarter. Uh there's there's

30:58

two things that that we've done to

30:59

mitigate this. One is seemingly pretty

31:02

basic, making sure that we have the

31:03

right abstraction layers in place. We

31:06

talked about the Tai showed the the

31:08

minions infrastructure under the covers.

31:09

If we need to swap out the model or we

31:12

need to swap out the technology that

31:13

we're using, we're we're now able to do

31:15

so. Uh if if a better technology comes

31:18

around that can solve some of the

31:20

underlying um pieces more effectively,

31:23

we can do so. Um but the the second part

31:27

is just having this um this uh this

31:30

belief that the tech we're building um

31:34

will likely be replaced with something

31:36

better in the industry. And so uh it's

31:39

really important for us to not be

31:40

married to the tech that we're building

31:42

and being okay if something comes along

31:44

like if the the co-founder cursor talked

31:47

about the uh the auto the test coverage

31:51

system that uh that might be coming in a

31:53

couple weeks. I'm really excited about

31:54

that. It might make our auto uh auto

31:56

cover infrastructure um obsolete and

31:59

that's okay because at the end of the

32:01

day we need to deliver impact for Uber.

32:04

The second part is another people problem. A lot of the challenges that Ty alluded to deal with historic infrastructure that's been built out over the last 10 to 15 years at Uber. We have some really sophisticated code that we built out, and then we have some really, I would say, archaic code that very few people know about. Getting that technology integrated into places where AI can reach it is challenging; just getting MCP endpoints set up to reach different parts of our ecosystem has been a challenge. Similarly, the tech that Ty talked about, I've seen it in action, and it's magic. I ran a demo session with some of my VPs, and in 24 minutes I had four VPs land code for the first time in years. It was a pretty amazing experience, and they were pretty satisfied by it too. But our adoption of this technology has been relatively slow, slower than I expected, and part of it is because we're trying to have developers do something they're so not used to. They're used to looking at code and generating it from scratch, operating in their IDE, and we're telling them to take a risk by operating in a very different way.

In both of these cases, we've tried different tactics to get around this people issue. We've tried a top-down approach: directives from leaders saying you must do X, Y, and Z, you must adopt. It's had some impact; as you folks know, if you track a metric, it's going to go up, it's going to improve. The more successful technique that we've applied is actually just sharing wins. As we share examples between different engineers, cool things they've tried that have resulted in wins, adoption of that technology has erupted. So that's the tactic we're pushing on now: key promoters pushing techniques to their peers, because those promoters are typically engineers, and engineers trust other engineers as opposed to directors like me.

34:13

Okay, I'm going to touch on measurements now. We have tons and tons of metrics, and I can say with confidence that objectively AI is having a positive impact. Our net promoter score for the overall developer experience at Uber has never been higher. Developers' self-reported net satisfaction and productivity have never been higher. The amount of code that we're landing through AI is amazing, and the overall engineering velocity is fantastic. You can see the graph over here: we see the inflection point when we introduced the Minions agentic system, along with when the models became really, really good, with Sonnet and Opus being introduced. The delta between developers that are using it very casually versus the power users that are using it at least 20 days a month has only exploded; the deviation has only gone up. So I'm really pleased about this.

Now, the issue is that these are activity metrics, right? These are not necessarily business outcomes. And when we start talking about the costs of this technology, our CFO has asked me, what is the impact of this? I can't point him to diffs; I need to show him the impact on revenue. I'm sure you folks are dealing with the same problem; this is not necessarily a solved problem for us. One of the tactics we're taking this year is to instrument our overall feature infrastructure so that we can time from when a design is first created to when an experiment is launched in production, and then see how we're able to speed that pipeline up.

36:01

And then, speaking of costs: the cost of AI is too damn high. Since 2024 our costs have gone up at least 6x. Now, I will say this technology is amazing; again, there's no question that it's had positive impact. But it's gone from something that I could self-fund using my own budget to something that I need to ask permission for from the CFO. And it's not necessarily Cursor's or Anthropic's fault that it's going up: GPU costs are high, and memory costs are really high. So we've had to be more responsible about how we use tokens, how we think about the right model for the job, and then helping developers select those models. Going back to the example with Minions, we help developers think about the right model to form the plan for the project, and then lower-cost but still pretty effective models to do the execution.
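
A hedged sketch of that plan-versus-execute split: one call to an expensive model drafts the plan, and cheaper models carry out the steps. The model aliases and gateway are the same hypothetical ones as above; the split itself is the cost-control pattern being described.

```python
# Hedged sketch of plan/execute model routing: an expensive model drafts the
# plan once, cheaper models carry out the steps. Model aliases are assumptions
# resolved by the gateway; the split itself is the cost-control pattern.
from openai import OpenAI

client = OpenAI(base_url="https://model-gateway.internal.example.com/v1",
                api_key="service-identity-token")  # hypothetical gateway, as above

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

task = "Upgrade service X to the new logging library."
plan = ask("frontier-planner", f"Write a step-by-step migration plan: {task}")
for step in plan.splitlines():                    # fan the steps out cheaply
    if step.strip():
        ask("small-executor", f"Carry out this step and show the diff: {step}")
```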

37:01

We don't necessarily want developers to have to think about it; we want the infrastructure to decide for them, so that we reduce friction for them while also optimizing our costs. But this is something we continuously have to keep evaluating and adjusting, especially as new technologies are introduced. This year, for example, we introduced JetBrains AI and Warp, which we hadn't introduced in the past; they all have their own cost models and their own complexities with regards to how developers use them.

Summary

Uber is undergoing a strategic shift towards agentic AI, focusing on augmenting human productivity and reducing developer toil rather than automating jobs. This move, driven by the capabilities of generative AI, has seen a transition from "pair programming" to "peer programming," where developers direct AI agents for tasks like dead code cleanup and library migrations. Uber has developed a robust internal platform, including Minion for autonomous background agents, Code Inbox for intelligent PR review management, uReview for AI-assisted code reviews, AutoCover for high-quality unit test generation, and AutoMigrate/Shepherd for scalable large-scale code changes. While these initiatives have significantly boosted engineering velocity and developer satisfaction, Uber faces non-technical challenges: navigating the rapidly evolving AI landscape, overcoming slow developer adoption by fostering peer-to-peer sharing of successes, linking AI activity metrics to core business outcomes, and managing a 6x increase in AI-related costs since 2024, which necessitates responsible token usage and intelligent model selection.
