Context Platform Engineering to Reduce Token Anxiety — Val Bercovici, WEKA

Transcript

0:00

This is Val Bercovici, WEKA's chief AI

0:03

officer, and I am joined by

0:05

>> Callan Fox, head of the product

0:07

management team here at WEKA

0:09

>> and we're both thrilled to present

0:10

context platform engineering to you at

0:13

the AI Engineer Code Summit. Now,

0:16

let's kick this off with an

0:18

announcement we're making. We're

0:20

actually open sourcing our context

0:23

platform engineering toolkit.

0:26

And this toolkit features a really cool

0:28

load generator that Callan wrote that

0:31

lets you configure agent swarms and

0:33

agent subtasks with very specific SLOs,

0:37

being able to cycle through

0:38

deterministic and random prompt cycles

0:41

and engineer context platforms with all

0:44

sorts of model parallelism options,

0:46

disaggregated or aggregated prefill and

0:48

decode options and some really important

0:51

memory tiering options we're going to be

0:52

discussing here. So, if we advance the

0:55

next slide, we'll see that this is an

0:58

open-source toolkit that's already

1:00

available to you on GitHub. So, Callan and

1:03

I really encourage you to just get on

1:05

GitHub, download this, play with it, and

1:07

give us your feedback. Let us know what

1:09

you need changed. Feel free to contribute

1:11

and fork the project and advance the

1:14

field of context platform engineering,

1:16

which we're going to be introducing to

1:18

you later today. So moving on, one of

1:22

the key requirements for context

1:24

platform engineering really relates to

1:26

the context engineering insight that

1:29

our friends at Manus shared with us

1:31

earlier this summer in their pretty

1:33

infamous now context engineering blog

1:36

and they highlighted the fact that KV

1:38

cache hit rate is the single most

1:40

important metric for production grade AI

1:43

agents. And the reason context platform

1:46

engineering is so important is it

1:48

dramatically simplifies reaching maximum

1:51

KV cache hit rates as we're about to

1:52

show you.

1:54

On a more personal level, if we think

1:56

about token anxiety, I know that each

1:59

and every one of us, you know, feels that

2:00

anxiety. The reason context platform

2:03

engineering is so important is shared by

2:06

the context engineering blog from Manus

2:09

earlier this summer where they

2:10

particularly emphasize KV cache hit

2:13

rates are the single most important

2:15

metrics for production grade AI agents

2:18

and context platform engineering quite

2:20

simply maximizes KV cache hit rates in a

2:24

very straightforward manner.

2:26

On a more personal note, if you think

2:28

about the concept of token anxiety,

2:31

as we all regularly hit token rate

2:33

limits, context platform engineering

2:35

helps to engineer platforms that

2:38

eliminate token rate limits and help

2:40

us be more productive with regards to

2:42

developing our software.

2:48

Now in the absence of context platform

2:50

engineering, we often resort to context

2:52

financial engineering and that's

2:54

fundamentally prompt arbitrage, where we

2:57

balance the needs of pricing between the

2:59

bookends of input and output tokens with

3:02

these new token pricing categories that

3:04

have appeared in the landscape over the

3:05

past few months, focusing on cache writes

3:08

and cache reads. And we've got to be

3:10

somewhat clairvoyant

3:12

when we're doing the arbitrage to figure

3:14

out how many cache writes you want to

3:16

invest in, for either a five-minute time to

3:18

live. In some cases, with Anthropic, for

3:20

example, we can do a one-hour time to

3:23

live. And that's all balanced

3:25

against the predictions we need to make

3:27

on how many cache reads and cache hits we

3:29

think we're going to have during those

3:31

intervals. This becomes very very tricky

3:33

to be clairvoyant and predict the

3:35

future. And I think it's much better to

3:37

apply context platform engineering

3:38

techniques to overcome token anxiety and

3:41

prompt cache arbitrage than to continue

3:43

to do the arbitrage and context

3:45

financial engineering.
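
To make the arbitrage concrete, here is a minimal sketch of the break-even calculation a developer is implicitly doing when choosing a cache write. The price multipliers (1.25x and 2x for writes, 0.1x for reads) are assumptions chosen for illustration, not figures from the talk or from any provider's price list, and all function names are hypothetical.

```python
# Illustrative sketch only: the multipliers below are assumed for the example,
# not quoted from any provider's price list.

def cached_cost(prefix_tokens, reads, base_price, write_mult, read_mult):
    """Cost of one cache write of a prefix plus `reads` cache reads of it."""
    write = prefix_tokens * base_price * write_mult
    read = prefix_tokens * base_price * read_mult * reads
    return write + read


def uncached_cost(prefix_tokens, reads, base_price):
    """Cost of resending the prefix as ordinary input on all reads + 1 requests."""
    return prefix_tokens * base_price * (reads + 1)


def break_even_reads(write_mult, read_mult):
    """Reads within the TTL at which caching and not caching cost the same.

    Solves write_mult + r * read_mult = 1 + r for r.
    """
    return (write_mult - 1.0) / (1.0 - read_mult)


if __name__ == "__main__":
    # Hypothetical: short-TTL write at 1.25x input price, long-TTL write at 2x,
    # cache read at 0.1x input price.
    for label, write_mult in (("short TTL", 1.25), ("long TTL", 2.0)):
        r = break_even_reads(write_mult, 0.1)
        print(f"{label}: break-even at {r:.2f} cache reads before expiry")
```

The hard part is not the arithmetic but the input: the number of reads that will land before the entry expires is exactly the prediction the speaker calls clairvoyance.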

3:47

And so one of the ways we're going to be

3:48

doing that

3:51

is looking at, and Callan's going to

3:53

dive into this deeply, the cadence

3:55

mismatch between the relatively slow

3:58

human feedback loops for agents and then

4:00

the agent swarms and the agent subtasks

4:02

themselves that iterate at much higher

4:05

cadence, often in parallel, waiting on

4:08

humans, but conducting a lot of really

4:10

cool work in the background, consuming a

4:12

lot of tokens in the background, many of

4:14

which are cachable, but we just never

4:16

know how the platform is able to

4:18

respond. And that's one thing we're

4:19

going to be diving into here is the fact

4:22

that if we go to the next slide, we're

4:24

looking at fundamentally a token storage

4:27

problem. And what we're going to be

4:29

doing is explaining how the service

4:31

level agreements we sign up to when we

4:34

subscribe to our various, you know,

4:36

token tiers or we actually commit in our

4:39

instructions and our agentic

4:40

instructions to specific token cache

4:43

writes and cache reads, how those SLAs

4:46

convert to service level objectives

4:48

delivered by the context platform

4:50

itself. And more particularly, one of

4:53

the insights that Callan reached from his

4:55

research at WEKA Labs is that what we're

4:58

doing when we actually subscribe to our

5:01

token tiers or we actually pay for

5:03

particular token writes is we're really

5:05

purchasing KV cache slots in token

5:09

storage. So there's definitely a whole

5:11

science around the context platform

5:12

engineering to how context platforms

5:15

take those SLA requirements, optimize

5:18

infrastructure, optimize KV caching and

5:20

memory tiers, and deliver specific SLOs

5:24

to try and meet those SLAs as much as

5:26

possible. So with that let me actually

5:28

hand it over to Callan for actual

5:31

research findings and lab and test

5:34

results from WEKA Labs.

5:36

>> Thanks Val. So, look, what I want to do

5:37

is just go back to one of the slides

5:39

that Val showed earlier. And what I'm

5:41

going to do from now on is I'm going to

5:42

focus on that right hand loop. And the

5:45

first thing I'm going to do is I'm going

5:46

to start by visualizing what that loop

5:49

actually looks like. And then we're

5:50

going to go into a little bit more

5:51

detail.

5:56

So, if you think about that loop

5:58

as a column, and I've got a graph here

6:00

that shows a very, very common pattern

6:02

that happens in agents. So the

6:05

salmon color is showing new tokens that

6:07

the system's being exposed to. The gray

6:09

is something that could be cached, again

6:11

within a limited amount of cache. We'll get

6:13

into that shortly. The blue is the

6:14

output tokens. And these blue dots down

6:16

the bottom are showing when the user is

6:18

actually giving responses in this

6:20

particular case. This is a really common

6:22

example you get where basically you

6:24

start off you consume context all the

6:26

way up until you hit a high

6:30

watermark set by either the model

6:32

maximum length or by the inference

6:34

provider itself. There's a summarization

6:36

phase and then you start a new cycle

6:39

and everybody knows that summarization

6:41

phase where sometimes you know the agent

6:43

loses a little bit of its fidelity a bit

6:45

of its intelligence, and that's

6:47

why we're trying to, you know, get more

6:50

context engineering to a larger set of

6:52

platforms, so we can raise that

6:54

watermark.

6:56

So if we go into this in a little bit

6:57

more detail the question I often get is

6:59

okay, well, what is that? That's a lot of

7:02

gray. What's that made out of? So

7:04

here I'm able to get the data and

7:07

actually look at individual prompts and

7:10

what actually makes them up. So when you

7:12

look at agentic data, especially agentic

7:15

coding, the actual user input is only a

7:18

really small part of it and you can kind

7:19

of see it here just visually that if you

7:21

just scan across, the lighter, whiter

7:23

colors are the system prompt and

7:26

the user text itself and the rest of it

7:29

is tool use and tool responses. So

7:32

this one in particular is from

7:34

Claude Code, where you're spending a lot of

7:37

time where the system is, you know,

7:39

doing, for example, a bash command:

7:42

it's grepping something, it's getting a

7:43

result and then it's doing something

7:44

else. So where this really shows

7:47

out in the data is if you actually look

7:48

at the median time between requests it

7:51

may be, for a conversation that looks

7:54

like that, and we have data for billions and

7:55

billions and billions and billions of

7:57

tokens, the median time is 10

8:00

seconds, 15 seconds maybe. That

8:02

heavily depends on whether the human's

8:04

involved in checking every single

8:06

tool use, but the mean time is in the

8:09

minutes, or even hours,

8:11

because the human time to respond is

8:14

much much much higher. And that's what

8:15

we were showing before with the two sides of

8:17

a loop.
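
The gap between median and mean described here is what a heavy-tailed distribution looks like: most requests follow each other within seconds, and a few human pauses dominate the average. A minimal sketch, using made-up inter-request gaps rather than the speakers' data:

```python
import statistics

# Hypothetical gaps (seconds) between consecutive requests in one session:
# mostly fast tool-call round trips, plus two long human pauses.
gaps_s = [8, 10, 12, 9, 15, 11, 10, 1800, 9, 14, 10, 7200]


def gap_summary(gaps):
    """Return the median and mean gap so the skew between them is visible."""
    return {
        "median_s": statistics.median(gaps),
        "mean_s": statistics.mean(gaps),
    }


if __name__ == "__main__":
    summary = gap_summary(gaps_s)
    print(f"median: {summary['median_s']} s, mean: {summary['mean_s']} s")
```

A cache time to live sized to the median would cover the tool-call loop but expire during every human pause, which is the mismatch the two loops illustrate.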

8:18

So the other thing that's interesting

8:20

and and something that's very common

8:21

today is multi-agent. So you

8:23

might have a core agent which I've shown

8:25

here is the orchestrator and then you've

8:27

got these sub-agents that are spun

8:29

up to do individual tasks and depending

8:31

on the type of agentic coding, or

8:35

just any agentic software in general.

8:37

These agents, or these sub-agents, may be

8:40

short-lived as in their context does not

8:43

endure between one wake up and the next

8:45

or there are some where they do

8:48

endure and it's really important to use

8:51

sub-agents because it allows us to

8:52

effectively target more

8:54

context at very particular parts of

8:57

the problem you're trying to solve. But

8:59

as a result, you do actually end up

9:01

using more context and I'll explain that

9:02

very shortly. But if you visualize this

9:05

gray section a different way and I show

9:07

you the colors, you can kind of see how

9:09

there's this common relationship of the

9:11

common context between all of them.

9:14

Again, this varies a little bit

9:16

depending on Codex versus Claude Code

9:18

versus others. But you can see

9:21

how it changes over time and how the

9:23

agents relate to each other and have

9:25

this common understanding and then back

9:27

to the orchestrator to wake up the

9:29

next agent.

9:32

The thing that we're here to

9:34

talk about today though mainly is that

9:35

while there's a lot of gray that

9:37

could be cached, the reality is very

9:40

different. So if you send this to an

9:41

inference provider, what ends up

9:44

happening is you don't actually get 100%

9:47

of the cache hits that you

9:49

could get. Now why does this matter?

9:52

Well, there's two ways to look at this.

9:54

If you're paying for API tokens,

9:58

it's literally costing

10:00

you more money because every time you

10:02

see a yellow here, and this is just a

10:03

simple example, you're paying input

10:05

token cost. So, you're

10:07

refreshing your cache and you're paying a

10:09

full hit for that. So, potentially 10

10:11

times more than what you would if it

10:13

was cached. If you're a subscription user

10:16

and you're thinking, well, I don't care

10:17

about the cost. I don't pay for that. I

10:19

pay a flat rate. That is true, but

10:20

you're still, like we said before,

10:23

you're paying for a subscription and

10:25

that subscription is rate limited due to

10:28

your cache usage, and you may actually

10:31

hit rate limits further or quicker. So,

10:34

that's something that we want to be able

10:36

to do. We work with a lot of providers

10:37

today to remove as much of this as

10:40

possible. That's good for the user

10:41

experience and it's also good for the

10:43

provider.

10:45

So, why does this happen? Well, I mean

10:47

if you think about the last graph

10:49

where I show the columns,

10:51

they don't take into account

10:52

time. They're just one after the other

10:54

after the other. But there's obviously

10:56

a temporal way to look at this. So

11:00

this is the way that I like to think

11:02

about it. And I know this is a little

11:03

bit more of a complex graph to look at,

11:05

but bear with me for a second. So on the

11:07

left hand side, I'm talking about

11:08

working set. So that's the number of

11:10

tokens that the cache system is holding in

11:13

its memory based on different time to

11:16

lives of the actual cache itself.

11:19

And then the bit at the top, the

11:21

dotted lines, based on the right-hand

11:23

secondary axis, is showing the cache hit

11:25

rate as a result. So the red is showing

11:28

one minute time to live. And what you

11:30

can see is there's prompts here at the

11:32

start on the left where it's

11:36

thrashing up and down. And the reason

11:38

it's doing that is the time between

11:41

requests at that period is longer

11:44

than 1 minute. So you're getting a

11:47

period where you might take the cache,

11:49

get a hit or two, and then drop the cache

11:51

and then you get another one. You got to

11:52

refresh it. So it just doesn't

11:54

really make sense, right? You go to 5

11:57

minutes, which is the blue, and you can

11:58

now ride out more and more of those cache

12:01

hits, and as a result, you get a higher

12:02

cache hit rate. You can see it at that

12:04

very start up there, comparing the

12:07

two. But then you're still missing many

12:11

others. There's still many times where

12:13

the time between requests is even

12:15

larger. So the next one up is showing 1

12:18

hour. And while that requires the cache

12:21

system to hold, you know, a little bit

12:23

more tokens in cache, and eventually quite a

12:26

fair bit more tokens in cache, it's got to

12:28

hold it for a longer period of time. But

12:30

the result to the end user is a better

12:33

actual experience, and to the

12:36

inference provider, which we'll

12:38

show very shortly it's a much better

12:40

experience for them as well. The problem

12:42

though is to do that you need to be able

12:44

to hold a lot of tokens in cache, and you

12:46

need good memory tiers to support that.
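
The relationship between time to live and cache hit rate can be sketched with a very simple model. This is an illustration under stated assumptions, not the speakers' methodology: a single conversation, a cache entry whose TTL restarts on every request, a first request that always misses, and hypothetical request gaps.

```python
from itertools import accumulate


def cache_hit_rate(request_times_s, ttl_s):
    """Fraction of requests whose prefix is still cached when they arrive.

    Simplified model: one conversation, the entry's TTL restarts on every
    request, and the first request is always a miss.
    """
    hits = sum(
        1
        for prev, cur in zip(request_times_s, request_times_s[1:])
        if cur - prev <= ttl_s
    )
    return hits / len(request_times_s)


# Hypothetical gaps (seconds): fast tool-call loops separated by human pauses.
gaps_s = [10, 90, 10, 400, 10, 10, 2000, 10]
request_times_s = [0] + list(accumulate(gaps_s))

if __name__ == "__main__":
    for ttl_s in (60, 300, 3600):
        rate = cache_hit_rate(request_times_s, ttl_s)
        print(f"TTL {ttl_s:>4} s -> hit rate {rate:.2f}")
```

A per-request hit rate like this understates the effect on long contexts, where each miss re-prefills the whole accumulated prefix; a token-weighted version would be closer to what the graph shows.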

12:49

So the next thing I want to go into

12:52

is that a lot of people think cache hit

12:56

rate isn't really something that a

12:58

human's able to really internalize.

13:00

Well, so another way that I can

13:02

visualize it is by thinking about it in

13:04

terms of the number of times on average

13:07

that a chunk of tokens, which is a

13:09

group of tokens, is refreshed. So in

13:12

this particular conversation that we're

13:13

looking at here, you can see that

13:15

this is showing the relationship

13:16

of, as I increase the time to live, how

13:20

that affects my cache hit rate. But it

13:22

also shows, based on the secondary axis,

13:24

that at 1 minute I'm literally

13:28

re-prefilling, like, 15 or 16 times, the same

13:31

tokens. And over time we can get that

13:33

all the way down to approaching one

13:37

and make significant differences to

13:40

again the experience of both the user

13:43

and the inference provider.
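
The "number of times a chunk is refreshed" view can be sketched the same way. Under the simplifying assumption that a shared prefix is prefilled once up front and once more after every gap longer than the TTL, the count falls toward one as the TTL grows. The gaps below are hypothetical, and the function name is illustrative only.

```python
def prefill_count(gaps_s, ttl_s):
    """Times a shared prefix is prefilled: once up front, plus once after
    every gap between requests that outlasts the TTL."""
    return 1 + sum(1 for gap in gaps_s if gap > ttl_s)


# Hypothetical gaps (seconds) between consecutive requests in one conversation.
gaps_s = [10, 90, 10, 400, 10, 10, 2000, 10]

if __name__ == "__main__":
    for ttl_s in (60, 300, 3600):
        print(f"TTL {ttl_s:>4} s -> prefix prefilled {prefill_count(gaps_s, ttl_s)} time(s)")
```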

13:46

So with that what I'd like to do now is

13:48

go into the context engineering side

13:50

of it, some of the lessons we learned

13:52

and just sort of really drive this

13:55

home. So now I want you to think about

13:58

what I think will be common in 2026

14:00

and onwards: people hosting their own

14:03

or having their own dedicated systems

14:05

hosted for them. So imagine you being

14:07

an inference provider now. Okay. So now

14:09

what I want you to think of is think of

14:11

yourself as an inference provider.

14:13

Maybe you've, you know, worked

14:15

with us or one of our partners to build

14:17

your own self-hosted instance

14:20

and you want to get the most out

14:22

of it. What this graph is showing you is

14:25

a relationship between a certain

14:27

context length and the cache hit rate and

14:30

how many output tokens you get as a

14:32

result of that cache hit rate. Now the first

14:34

thing you'll see is it's not linear and

14:36

and the shape of this curve will

14:39

change based on the context length, based

14:41

on the accelerators you use. There's

14:44

lots of things that come into it. how

14:45

you do P/D disaggregation and prefill. There's a

14:49

lot of stuff that comes into it, but the

14:51

curve is more or less the

14:53

same. And if I asked you as an inference

14:55

provider, where do you want to be? You'd

14:57

obviously say C. And if you're in A or

15:01

B, you're you're not making money or

15:03

you're not getting enough value out of

15:04

the system. And inference providers that

15:06

we work with, they have the

15:08

same answer obviously. So the question

15:10

is, well, how do they keep in C? And

15:13

this is where it goes back to a slide

15:15

that Val showed earlier, where what

15:19

they're doing is they're incentivizing

15:20

users to stay within C. And this is

15:24

where we came to the realization that

15:26

a lot of the time, because of how much cache

15:29

hit rate impacts your actual output,

15:33

that's why you're buying cache

15:35

allotments in storage when you're actually

15:37

buying subscription services because it

15:39

is so important to them that you stay in

15:42

a certain cache hit rate band, especially

15:44

for agentic workflows. Otherwise

15:47

you'll literally just melt the GPU

15:49

clusters that they have. And I

15:52

think it's a really powerful thing to

15:54

have in your head about how that works.

15:57

So what we're going to do now is go

15:59

through and think about, okay, what

16:01

makes up this token storage.

16:05

So when you think about the token

16:06

storage, there's lots of aspects that

16:10

the memory tiers that support the token

16:12

storage need to be able to do. But to

16:14

really make it really, really simple, it's

16:17

literally as simple as: you need

16:20

enough capacity in these memory tiers so

16:22

that you can hold an optimal amount of

16:25

cache. If you think back to the

16:28

slides I just showed, there's this point

16:30

where having more cache helps you a

16:32

little bit, but it kind of gets to a

16:34

point of diminishing returns. You

16:36

need to get at least to that point and

16:39

you need to be able to store extremely

16:41

fast into it because if you can't,

16:42

you're going to be dropping KVs

16:44

before they're in the memory tier or

16:46

you're going to be blocking GPUs, which

16:47

is probably even worse. And then the

16:50

other way you need to do it is you need

16:51

to be able to fetch from that token

16:53

storage very very rapidly so that you

16:55

can again not block the GPUs. They're

16:57

the primary first class citizen of this

16:59

whole system.
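
The three requirements (capacity, fast store, fast fetch) can be put into rough numbers. The sketch below uses the standard per-token KV cache size for a transformer with grouped-query attention; the model dimensions and the bandwidth figure are hypothetical, chosen only to make the arithmetic concrete.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache per token: one K and one V vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def session_kv_bytes(context_tokens, bytes_per_token):
    """KV cache footprint of one session at a given context length."""
    return context_tokens * bytes_per_token


def fetch_seconds(num_bytes, bandwidth_bytes_per_s):
    """Time to move a session's KV cache from a memory tier to the GPU,
    ignoring latency and protocol overhead."""
    return num_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    # Hypothetical model: 32 layers, 8 KV heads, head dimension 128, 16-bit values.
    per_token = kv_bytes_per_token(32, 8, 128, 2)
    session = session_kv_bytes(100_000, per_token)
    print(f"{per_token / 1024:.0f} KiB per token")
    print(f"{session / 1e9:.1f} GB for a 100,000-token session")
    # Hypothetical tier delivering 50 GB/s to the GPU.
    print(f"{fetch_seconds(session, 50e9):.2f} s to fetch at 50 GB/s")
```

Multiplying the per-session footprint by the number of sessions to retain across the TTL gives the capacity target; comparing fetch time against the time to recompute the prefill indicates whether a tier is fast enough to be worth hitting.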

17:01

So what does it look like? So there's a

17:03

few different types of memory tiers. The

17:04

most common obviously is HBM, and Val

17:07

and I would love it if all our sessions

17:08

are in HBM at all times. It's just not

17:11

reasonable. There's many reasons for

17:13

this around how the batch works which

17:15

we're not going to go into today. But

17:17

the point is that the main

17:19

common way that this is done today is

17:21

DRAM. And there's nothing really wrong

17:23

with DRAM as such. It's sort of a

17:26

means to an end, but it's quite limited

17:28

in size. It's okay in terms of

17:30

performance. But the other thing is it's

17:32

tightly coupled with the compute. So if

17:34

you want to expand your DRAM, there's

17:36

not really many good ways to do that.

17:39

There are some technologies out there

17:40

that kind of do this, but the way

17:42

they're implemented, they kind of

17:43

just hurt your performance. And that's

17:45

what I'm showing with pooled DRAM. You

17:47

could pool more together, but it's, you

17:49

know, it doesn't

17:52

help that much. So what we at WEKA

17:56

did is we took all the durable

17:59

advantages of our product which has been

18:01

you know tried and tested in AI training

18:04

in HPC environments, and Augmented Memory

18:07

Grid is basically a supported,

18:11

optimized connector between the

18:13

inference systems and our existing

18:16

product. And because we're backed by

18:19

NVMe, we're much denser. We're

18:22

like a thousand times, depending on how you

18:23

look at it, denser. It's quite significant.

18:26

And then I show another example of a

18:28

storage at the top there where you know

18:30

not something sluggish, something

18:32

that can still get 50 or 60 GB a second,

18:35

and it has the capacity, but still,

18:38

relative to what we're talking about is

18:40

it's still quite slow.

18:43

Okay. So then moving on to how do we

18:46

test this? So again, we talked

18:49

about how we've open sourced this.

18:50

Basically, Val already covered the

18:54

main part of it, the fact

18:56

that it acts like it's an

18:58

inference provider. It's trying to keep

18:59

the load within two SLOs, if you enable

19:02

them. You actually don't have to enable

19:04

them and it'll just go as hard as it can

19:06

regardless of an SLO, being time to

19:09

first token or output tokens per

19:11

request. But the main thing that it can

19:13

do is you can either set a static number

19:16

of coding agent users, or you can

19:20

increase the number of those users over

19:22

time so that you can slowly utilize more

19:26

of the memory tiers and be able to

19:27

compare different configurations.

19:30

So there's two ways that it works.

19:33

I'll just be quick through these

19:34

sections because you can read about

19:35

this. I have a blog that explains how I

19:38

do the testing that goes through all of

19:40

this in detail. And there's obviously

19:41

the GitHub as well, but basically it can

19:45

do the initial working set and then

19:47

sequentially go through those prompts.

19:49

So this will be very very very

19:51

deterministic because as soon as you

19:52

overflow the memory tier even the

19:55

slightest bit, you'll see a massive drop

19:57

off in performance. But the other way

20:00

that it can be done and realistically

20:01

the more fair way that it can be done is

20:03

you can increase the size over time.

20:06

So the amount of concurrent users that

20:07

you're accessing out of a pool and you

20:10

can randomly sample where in that sample

20:14

set you'll get that prompt from. So

20:16

sometimes you might be hitting HBM,

20:18

sometimes you might be hitting your your

20:20

memory tier 2. Let's say

20:21

that's DRAM, and you get a really nice

20:24

blended number.
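
The two modes described here can be sketched as two ways of choosing which session's prompt to send next. This is an illustration of the idea only, not the toolkit's code: the function names, the linear ramp, and the seed parameter are all assumptions.

```python
import random
from itertools import cycle, islice


def sequential_schedule(num_sessions, num_requests):
    """Deterministic mode: walk a fixed working set of sessions in order,
    wrapping around, so every session is revisited at a fixed interval."""
    return list(islice(cycle(range(num_sessions)), num_requests))


def ramped_random_schedule(max_sessions, num_requests, seed=0):
    """Random mode: the pool of active sessions grows linearly over the run
    (an assumed ramp shape), and each request samples a session uniformly
    from the current pool."""
    rng = random.Random(seed)
    schedule = []
    for i in range(num_requests):
        pool = 1 + (i * (max_sessions - 1)) // max(1, num_requests - 1)
        schedule.append(rng.randrange(pool))
    return schedule


if __name__ == "__main__":
    print(sequential_schedule(3, 7))
    print(ramped_random_schedule(8, 12))
```

In the sequential mode every session waits the same number of requests before it is reused, so a tier either holds the whole working set or misses on all of it, which matches the sharp drop-off described. In the random mode reuse distances vary, so some requests land in a faster tier and some in a slower one, giving the blended number.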

20:26

So with that, let's go in and show

20:29

you some results and just sort of

20:30

explain and show why we're so

20:33

excited about what we're talking about

20:35

today.

20:36

So this is showing three comparisons.

20:38

Comparison number one is HBM with WEKA.

20:41

That's the purple. There's orange,

20:43

which is HBM and DRAM. And there's the

20:45

you know, orangey-pinky color with HBM

20:48

plus DRAM plus that other POSIX

20:52

system that I talked about earlier. The

20:54

dotted line is showing concurrent

20:56

users. So the number of users

20:59

that are in a pool and that's increasing

21:01

over time. So in the initial shaded area

21:05

you can see that all three of them get

21:07

an advantage of HBM. The primary hit

21:10

out of cache hit rate is coming out of

21:13

HBM. But then over time as we increase

21:16

the users more and more and more you're

21:18

overflowing what

21:20

the DRAM memory tier can do and both

21:23

orange and the pinky color start to drop

21:26

off quite dramatically. We also, from

21:29

a WEKA perspective, drop off

21:30

because we get less and less advantage

21:32

from HBM. So we have to pull back our

21:35

concurrency a little bit. The system

21:37

does that automatically, the

21:39

benchmarking tool. But then once we've

21:41

sort of got down to the steady state,

21:43

all three start to level out a

21:46

little bit. But the main difference is

21:49

that once you get down to that steady

21:51

state, we can maintain that at a much

21:54

higher amount of users at a much higher

21:57

amount of output tokens.

22:00

The other way that you look at this is

22:02

that was a decode-focused role. If

22:05

you look at a prefill-focused role, if

22:07

you're doing disaggregated prefill, then the

22:10

prefill is actually an even better result

22:12

for us because the systems the GPUs are

22:15

so much more efficient when you're doing

22:17

large batches of prefill tokens with

22:20

a single decode. Then we can

22:24

basically saturate things more fairly

22:27

and it continues. Now the main

22:30

difference between pink and orange is

22:32

that we, sorry, purple and orange, is

22:36

that we have a lot more cache. So we can

22:38

hit a lot more. The interesting thing

22:40

about the orangey pinky color is that it

22:42

also has the ability to hit every single

22:44

thing that's possible, but it's not

22:47

fast enough to get it into the GPU for

22:49

it to make a difference. And that's why

22:51

we're sort of showing the difference

22:52

between these three because with purple

22:55

you're getting the advantage of capacity

22:56

but at DRAM speeds, so you can maintain

22:58

that benefit for longer periods of time.

23:03

And then maybe, Val, I'll hand back to

23:05

you.

23:07

>> Absolutely. That was a great walk

23:09

through, Callan, of all of your research and

23:10

benchmark results in WEKA Labs. So once

23:13

again we're thrilled to be announcing

23:14

the open sourcing of this context

23:16

platform engineering toolkit today.

23:18

Please do download it, use it, give us

23:20

your feedback. Again, feel free to fork

23:22

it and improve it yourself. And we look

23:24

forward to contributing together to

23:26

less token anxiety overall, less prompt

23:29

cache arbitrage, and more context and

23:31

context platform engineering in the

23:33

future. A nice QR code for you to find

23:36

out even more information. And at the

23:38

end of this video, in the actual

23:40

transcript section and so forth,

23:42

there'll be links to all the blogs we

23:43

referenced here. So, thank you for

23:45

joining us today and we look forward to

23:47

pairing on the context platform

23:49

engineering conversation with you in the

23:50

future.

Interactive Summary

In this video, Val Bercovici and Callan Fox from WEKA introduce an open-source context platform engineering toolkit aimed at optimizing AI agent performance. They explain how maximizing KV cache hit rates is critical for production-grade AI agents and how their toolkit helps developers manage context storage, memory tiering, and agent swarms to reduce 'token anxiety' and avoid the complexities of prompt cache arbitrage. The presentation includes technical insights into benchmarking these systems, emphasizing the importance of efficient memory access for maintaining high performance as user load increases.
