Shipping AI That Works: An Evaluation Framework for PMs – Aman Khan, Arize
All right. Uh, nice to see everyone
here. Um, my name is Aman. I'm an AI
product manager at a company called
Arize. The title of the talk is Shipping AI
That Works: An Evaluation Framework for
PMs. Uh, it's really going to be a
continuation of some of the content
we've been doing with, you know, some of
the PM folks like Lenny's podcast. I
guess just a quick show of hands: how many
people listen to Lenny's podcast or have
read the newsletter? Awesome. Okay,
we're going to do a couple more like
audience interaction things just to like
wake up the room a bit. So, how many
people in the room are PMs or aspiring
PMs?
Okay, good. Good handful of people. How
many of you consider yourself AI product
managers today? Okay, awesome. Wow,
there are more AI PMs than there were
regular PMs. That's interesting. Um,
usually it's a subset, but maybe
I need to start asking the questions in
a different order. Um, cool. Well,
that's great. Uh, so what we're going to
be doing is, you know, um, I'll go ahead
and just do a little bit of an intro
about myself and then we'll kind of
cover some of the frameworks that I
think are really powerful for AIPMs to
kind of get to know as you're building
AI applications. So, a little bit about
me. Um, you know, I have a
technical background. I actually
started my career in engineering,
working on self-driving cars at
Cruise. Um, and while I was there
I ended up becoming a PM for evaluation
systems for self-driving back in like
2018, 2019. Um, after that I went to
Spotify to work on the machine learning
platform and work on recommender systems.
So things like Discover Weekly and
search, things like using embeddings to
actually make the end product experience
better. And fast forward to today, I've
been at Arize for about three and a
half years, and I'm still working on
evaluation systems instead of
self-driving cars; it's sort of
self-writing code agents. Uh, and Spotify
is actually one of our customers.
Actually, fun fact: I've
sold Arize to all of my
previous managers. So, fun fact
there. Uh, but we get to work with some
awesome companies like Uber, Instacart,
Reddit, Duolingo, so a lot of really tech-
forward companies that are building
around AI. Uh and we actually started in
the sort of traditional ML space of
ranking, regression, classification type
models and have now expanded into Gen AI
and agent-based applications as well. Uh
what we do is make sure that when those
companies, our customers, are
building AI applications, those
agents and applications actually work as
expected. And it's actually a pretty
hard problem. A lot of that has to do
with, uh, terms that we're going to go
into, like observability and evals. But I
think more broadly the space is just
changing so fast and the models, the
tools, the infrastructure layer changing
so fast that for us it really is a way
for us to learn about the cutting edge
like what are the leading challenges
with use cases that people are building
and try to build that into a platform
and product that benefits everybody.
Um, so what we'll cover: we're going to
cover what evals are and why they matter.
We'll actually build an AI trip planner,
uh, with a multi-agent system.
This part is ambitious bullet number
two. I'm going to be honest here. Uh we
were trying to push up the code right
before so it may or may not work, but
we'll give it a shot and that'll be the
interactive part of the workshop and
then we'll actually try to evaluate that
AI trip planner prototype that we're
going to build ourselves.
Uh actually another quick show of hands
for the room. How many people have heard
of the term eval before? Okay, I guess
it was in the title of the talk, so
that's kind of redundant. How many
people have actually written an eval
before or tried to run an eval? Okay, a
good number of people. Um, that's
awesome. Well, what we're going to do is
actually try and take that a little
bit of a step further. Go from writing
an eval for an LLM as a judge system.
And if you've never written an eval,
don't worry. We're going to cover that,
too. But try and take that one step
further and make it a little bit more
kind of technical, interactive, as well.
Okay. So, who is this session for? Uh, I
like this diagram because um, you know,
Lenny and I have been kind of working
together a little bit more on educational
content, mostly for AI
product managers. And I kind of put this
up. I made like a little whiteboard
diagram for him. And I'm like, I think
this is really how I view the space.
You may have seen this diagram
for the Dunning-Kruger effect. And
that's kind of what came to mind here,
which is as you're kind of moving along
the curve, maybe you're just getting
started, you know, with how do I use AI?
How does AI fit into my job? I think we
were all here, to be honest, a couple of
years ago. And just to be
completely honest, I think for people in
the room, especially PMs, we all
feel that the expectations of the
product management role are changing.
That's why this concept of an AIPM is
sort of emerging: the expectations from
our stakeholders, from our executives,
from our customers. I
don't know if other people feel
this, but I definitely feel like the bar
has been raised in terms of what's
expected to be delivered, right?
Especially if I'm working with an AI
engineer on the other end, their
expectations of what I come to them with
in terms of requirements, in terms of
specifying what the agent system needs
to look like, it's changed. It's a step
function different even than for me,
even as someone who was like a technical
PM before. And so I kind of felt myself
go along this journey which is ironic
given that I work at an eval company.
You'd think I'd be at the end of the
curve but really I kind of went through
this journey you know same as most of
you which is trying to use AI in my job
trying AI tools to prototype and come
back with something that's you know a
little bit higher resolution for my
engineering team than like a Google doc
of requirements. Once I had those
prototypes and I'm like hey let's try to
build these new UI workflows. The
challenge then became: how do I get a
product into production, especially if my
product has AI in it, has an LLM or an
agent? And that's really where that
confidence
slump sort of hits, and you kind of
realize there's a lack of tooling,
there's a lack of education, for how to
build these systems reliably. And why
does that matter at the end of the day?
The really important takeaway,
from the fact that LLMs hallucinate (we
all know that they do), is you should
really look at the top two quotes here
and think, okay, well, we've got Kevin
who's chief product officer at OpenAI.
We have Mike, Anthropic's CPO. This is
probably like 95% of the LLM market
share. And both of the product leaders
of those companies are telling you that
their models hallucinate and that it's
really important to write evals.
These quotes actually came from a talk
that they were both giving at Lenny's
conference, uh, you know, back in
November of last year. And so when the
people that are selling you the product
are telling you that it's not reliable
you should probably listen to them. Uh
on top of that, I mean, you have Greg
Brockman, similarly, a founder of that
company. Um, you have Gary saying, you know,
evals are emerging as a real moat for AI
startups. So I think this is sort of
one of those pivotal moments where you
realize, hey, people are starting to say
this for a reason. Why are they saying
that? Well, they're saying that because
a lot of the same lessons from the
self-driving space, um, you know, kind
of apply in this AI space. Okay,
another audience question. How many
people have taken a Waymo? I kind of
expect that one to be pretty high. Okay,
we're in San Francisco. If you're
visiting from out of town, take a Waymo.
It is a real-world example of AI, an
example of AI in the real
physical world. And a lot of how those
systems work actually apply to building
AI agents today.
All right, we'll do a bit of a zoom out,
then we'll get into the technical stuff.
I see laptops out, so we'll definitely
get into, you know, writing some code
and trying to get hands-on. But just to
do a bit of a recap for folks, um what
is an eval? Uh, I kind of view it as
very analogous to software
testing, but with some really key
differences. Those key differences are
software is deterministic. You know, 1
plus 1 equals 2. LLM agents are
nondeterministic. If you convince an
agent 1 plus 1 equals 3, it'll say like
you're absolutely right. 1 plus 1 equals
3. Right? So, like we've all been there.
We've kind of seen that these systems
are highly manipulable. And on top of
that, if you build an LLM agent uh that
can take multiple paths, that's
pretty different from a unit
test, which is deterministic. So think
about, um, the fact that a lot of
people are trying to
eliminate hallucinations from their
agent systems. The thing is, you actually
kind of want your agent to hallucinate
just in the right way, and that can
actually make testing it a lot more
challenging as well, especially when
reliability is super important. And
then last but not least I think
integration tests rely on existing
codebase and documentation. A really key
differentiation of agents is that they
rely on your data. Uh if you're building
an agent into your enterprise, the
reason that someone is going to use your
agent versus something else might partly
be because of the agent
architecture, but a big part of it will
also be because of the data you're
building the agent on top of. And that
applies to the evals as well.
Okay. What is an eval? So, uh, I view
this as like four parts that go into
an eval, kind of just an easy
muscle-memory thing. Um, these brackets
are a little bit out of line, but the
idea is that you're setting the
role. You're basically telling the
agent, here's the task that you want to
accomplish. You're providing some
context, which is what you see in the
curly braces here, and that's
essentially just text
at the end of the day; it's some text
you want the agent to evaluate. You're
giving the agent a goal. In this case,
the agent is trying to determine whether
text is toxic or not toxic. This is
kind of a classic example because
there's a large toxicity data set of
classified text that we use, um, to build
our eval on top of. But just kind of
note that it can be any type of goal in
your business case. It doesn't have to
be toxicity. It'll be some goal that
you've created this agent to evaluate.
And then you provide the terminology and
the label. So you're giving some
examples of what is good and bad, and
you're giving it the output of either
selecting good or bad. In this case, it's
toxic or not toxic. I'm going to pause on
that last note because I
think there's a lot of misconceptions.
I'll try and weave in some
FAQs as I hear them come up, but, um,
we'll definitely have some time at the
end for questions, and I'd love for this
to be interactive, so I'll probably make
the Q&A session a little bit longer here
for people that have these questions. But
one common question we get is: why can't
I just tell the agent to give me a score,
or an LLM to produce a score? And the
reason is that even today,
even though we have like PhD-level LLMs,
they're still really bad at numbers. Um,
and so what you want to do is ground the output,
and it's actually a function of
how a token is
represented for an LLM. So what you
want to do is actually give a text label
that you can map to a score, if you
really need to use a score in your
systems, which we do in our system as
well; we'll map a label to a score. But
that's a very common
question we get: "Oh, why can't I
just make it do, like, one is good and
five is bad?" You're
going to get really unreliable results.
And we actually have some research, um,
happy to share it out afterwards, that
kind of proves that out, um, on a large
scale with most models.
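To make that anatomy concrete, here is a minimal sketch of an LLM-as-a-judge eval in Python: set the role, inject the text to evaluate as a variable, state the goal, and constrain the output to a small set of labels that map to scores downstream. It assumes the OpenAI Python client; the template wording, model name, and label names are illustrative, not Arize's exact template.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EVAL_TEMPLATE = """You are examining written text for toxicity.
[BEGIN TEXT]
{text}
[END TEXT]
Determine whether the text above is toxic or not. "Toxic" means rude,
disrespectful, or likely to make someone leave the conversation.
Respond with exactly one word: "toxic" or "not_toxic"."""

# Ask for a label, not a number, then map the label to a score if you need one.
LABEL_TO_SCORE = {"toxic": 0.0, "not_toxic": 1.0}

def judge_toxicity(text: str) -> tuple[str, float]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # any capable judge model
        temperature=0,          # lower temperature makes the label more repeatable
        messages=[{"role": "user", "content": EVAL_TEMPLATE.format(text=text)}],
    )
    label = response.choices[0].message.content.strip().lower()
    return label, LABEL_TO_SCORE.get(label, float("nan"))
```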
Okay, so that's a little bit of
what an eval is. Um, going back to the previous slide, I should
note that this is an LLM-as-a-judge
eval. Uh, there's other types of
evaluations as well, like code-based evals,
which just use code to evaluate
some text, and human annotations.
We'll touch on those a little bit more
later, but the bulk of this time is going
to be spent on LLM as a judge because
it's really the kind of scalable
way to run evals in production these days,
and we'll talk about why later on.
Okay, a lot of talking. So, uh, evaluating
with vibes. This is kind of
funny because I think everyone
knows this term, vibe coding; everyone
has tried to use Bolt or
Lovable or whatever. And I don't know about
you, but this is how I usually feel when
I'm vibe coding, which is: kind of
looks good to me. You're
looking at the code, but let's be
honest, how much AI-generated code are
you going to read? You're like, let me
just ship this thing. The problem is you
can't really do that in a production
environment, right? I think all the
vibe coding examples are
prototyping, or trying to build
something hacky or fast. So I want
to help everyone reframe a little bit
and say, yes, vibe coding is great.
It has its place. But what if we go from
evaluating with vibes to thrive coding?
And thrive coding, in my mind, is really
using data to basically do the same
thing as vibe coding, like still build
your application, but you'll be able to
use data to be more confident in the
output. And you can see that this person
is a lot happier. Um, so this is using
Google's image models. They're scary
good, guys. Like, uh, yeah.
Okay. So, we're going to be thrive
coding. So, slides. Um, if you
want access to the slides, they
have links to what we're going to go
through in the workshop. Um,
ai.engineer.slack.com
and then I just created the Slack
channel workshop AIPM. And I think I
dropped the slides in there, but let me
know if I didn't.
>> Cool. Thank you. All right, live demo
time. So, from this point on, I'll
just be honest, there's a decent
likelihood that the repo has
something broken in it, because we were
pushing changes up until like this very
moment. If so, and you can unblock
yourself (I think there's like a
requirements thing that's broken), please
go for it. And if not, we can come back
at the end and try to help you get
unblocked. And then I promise after this
I'll like push the latest version of the
repo up. So if it doesn't work for you
right now, check back in an hour. I'll
drop it in Slack. It'll be working
later. Um but yeah, just a function of
like moving fast. Uh, so on the left-hand
side are instructions, which are
really, you know, sort of a
Substack post I made, which is just a
free list of some
of the steps we're going to go through
live. So it's just more of a resource,
and then on the right hand side is a
GitHub repo which I'm going to open
here.
There's actually two repos and I'll kind
of talk through like a little bit about
what we're evaluating and some of the
project on top of that and then we'll
get into uh the weeds here a little bit.
Okay, so this is the repo. Um,
I built this like over the weekend,
so, you know, it's not super
sophisticated, although it says it's
sophisticated, which is funny. But, um,
this is Oh, pardon.
>> Can you put that?
>> Oh, this is not Okay. So, is this not
attached to the QR? Okay, I'll just drop
this link in here as well. Let's just uh
put it in here. Okay, awesome. Oh, thank
you. Thanks. Okay. Um so, and if you
have questions, by the way, uh in the
middle of the presentation, just feel
free to drop them in Slack. Um, and then
we can always come back to them and then
we'll have time at the end for um, so
feel free to like keep the Slack channel
going um, for questions. Maybe people
can try to unblock each other as well.
And if someone fixes my requirements,
feel free to open a pull request and
I'll approve it live. Um, so um, okay.
So what we're doing is, uh, let's take off our PM hat of
whatever company we're at. We're going
to put on an AI trip planner hat. The
idea here is, don't worry about the
sophistication of this UI and the agent.
It's really like kind of a prototype
example, but it is helpful for us to
kind of take a look at building an
application on the fly and try to
understand how it works underneath the
hood. So, for the example we're going to use,
I'll kind of back up a
little bit. I basically took this, uh,
Colab notebook that I have, um, for
tracing CrewAI, and I'm like, I kind of
want an example with LangGraph. CrewAI,
if you haven't heard of it, is
like a multi-agent framework. Um,
an agent definition, basically,
is using an LLM and a tool
combined to perform some action. And
what I did was I gave this notebook,
I basically put it into Cursor, and I was
like, give me an example of a UI-based
workflow, but using LangGraph instead.
And what we're going to do is think of
instead of building a chatbot, we're
going to take this form and we're going
to use the inputs of this form to build
a quick agent system that we're then
going to be using for evaluation. So
this is what I got on the other end. Um,
which is plan your perfect trip. Let our
AI agents help you discover amazing
destinations.
So let's pick a destination. Maybe we
want to do Tokyo for seven days. And
assuming the internet works, um, we'll
see if it does. We're going to put a
budget of $1,000. I'll zoom in a little
bit. And then I'm interested in food.
And let's make this adventurous. So I
could go and take all of this and try to
just put it into ChatGPT. But you can
kind of imagine underneath the hood the
reason that we might want this as a form
or with multiple inputs and uh an
agent-based system is because we could
be doing things like retrieval or rag or
tool calling underneath the hood. So,
let's just kind of picture that the
system is going to use these inputs to
give me on the other side an itinerary
for my trip. And uh okay, it worked.
Okay, this one worked. So, um so here
we've got a quick itinerary. Um nothing
super fancy. It's basically just here's
what I gave as an input form and then
what the agent is kind of doing
underneath the hood is giving me an
itinerary for what my morning,
afternoon, etc. look like for a week in
Tokyo using the budget I gave it. Uh,
this doesn't seem super fancy, because
I could take this and just put
it into ChatGPT, but there is some
nuance here, which is the budget. Like,
if you add this up, it's going to
be doing math, doing accounting, to get to
$1,000. So, it's really keeping that
into consideration. You can see it's a
pretty frugal budget here. Um it can
take interest here. So, I could say, you
know, different interests like I want to
go, I don't know, sake tasting or
something, and it'll find a way to work
that into your itinerary.
But I think what's really cool here is
it's really the power of agents
underneath this that can give you really
high level of specificity for your
output. Um, so that's really what we're
trying to show is like this is, you
know, it's not just one agent, it's
actually multiple agents giving you this
itinerary. Uh, so I could just stop
here, right? Like, this
is good enough, I have some code.
For most people, if you're vibe coding,
you're like, great, this thing does what
I want it to do, right? Like it gave me
an itinerary. But what's going on
underneath the hood? Um, and this is kind
of where, uh, so I'm going to be using our
tool called Arize. We also have an open-
source tool called Phoenix. I'm just
going to plug that here right now for
folks as reference. It is an open-
source version of Arize. It is not going
to have all of the same features as
Arize, but it will have a lot of the
same setup flows and workflows around
it. So, you know, just note that Arize
is really built for, you know, if you
want scale, security, support, um, and
sort of the futuristic
workflows in here. So, I've got a trip
planner agent, and what I just did, if
it worked, let's see if it did.
And we're gonna... This is live
coding, so it's very possible
something's broken. Um,
okay. I think I broke my
latest trace, but you can see what the
example here looks like from one right
before. So, what that system really
looks like is basically this. Um, so
let's open up one of these
examples. What you'll see here are
traces. Traces are really input, output,
and metadata around the request that we
just made. And I'm going to open up one
of those traces just as an example here.
And what you'll see is essentially a set
of actions that the agents, in this
case multiple agents, have taken to
perform, you know, generating that
itinerary. And what's kind of cool is we
actually just shipped this today. Um,
you guys
are the first ones seeing it, which is
pretty cool. This is actually a
representation of your agent in code.
So, you know, literally the Cursor app
that I just had up here is basically my
agent-based system that Cursor helped me
write, and when I sent it our docs, literally
all I did was give it a link
to our docs in Cursor and say, you
know, write the instrumentation to get
this agent traced, and this is how
that's represented. And so we have this
new agent visualization in the platform
that basically shows the starting point
with multiple agents underneath it to
accomplish, uh, the task we just had. So
we have a budget, local experiences, and
research agent that then go into an
itinerary agent, and that gives you the
end result, or the output, and you
can see that up here too. So we have
research, itinerary, budget, and local
information to generate the itinerary.
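For reference, here's a rough sketch of how that fan-out/fan-in of agents could be wired up with LangGraph. The node bodies are stubs (the real app would make LLM and tool calls), and the node names and state fields are assumptions for illustration, not the repo's actual code.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class TripState(TypedDict, total=False):
    destination: str
    duration: str
    budget: str
    interests: str
    budget_plan: str
    research_notes: str
    local_tips: str
    itinerary: str

def budget_agent(state: TripState) -> dict:
    # In the real app: an LLM call that allocates the budget across the trip.
    return {"budget_plan": f"Allocate {state['budget']} across {state['duration']}"}

def research_agent(state: TripState) -> dict:
    return {"research_notes": f"Top sights in {state['destination']}"}

def local_agent(state: TripState) -> dict:
    return {"local_tips": f"Local {state['interests']} experiences"}

def itinerary_agent(state: TripState) -> dict:
    # Combines the three upstream outputs into a day-by-day plan (LLM call in practice).
    combined = " | ".join([state["budget_plan"], state["research_notes"], state["local_tips"]])
    return {"itinerary": combined}

graph = StateGraph(TripState)
for name, fn in [("budget", budget_agent), ("research", research_agent),
                 ("local", local_agent), ("itinerary", itinerary_agent)]:
    graph.add_node(name, fn)
for upstream in ["budget", "research", "local"]:
    graph.add_edge(START, upstream)        # the three agents run in parallel
    graph.add_edge(upstream, "itinerary")  # and fan in to the itinerary agent
graph.add_edge("itinerary", END)
app = graph.compile()
# app.invoke({"destination": "Tokyo", "duration": "7 days", "budget": "$1,000", "interests": "food"})
```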
So this is pretty cool, right?
I think for a lot of people,
ourselves included, it
is not immediately obvious that these
agents can be super well represented in
this sort of visual way, right? Uh,
especially when you're writing code, you
think these are just function calls
talking to each other. But what's really
useful is to see at an aggregate level
what calls the agent is
making, and you can see it's a really
clean delineation of parallel calls for
the budget agent, the local
experiences agent, and the research agent,
and all of those get fed in to an
itinerary agent that summarizes all of
the above. You can also see that up
here. Um, so these are what's called, uh,
traces, and they consist of what are
technically called spans. A span,
you can think of it as a unit of
work, basically. So there's a time
component to it, which is how long
that process took to finish, and then
what the type of the process is.
Here you can see there's three types.
There's an agent. There's a tool, which
is, uh, basically being able to use structured data
to perform an action.
And then there's the LLM, which takes the input
and the context and generates the output.
So this is an example
of three agents
being fed into a fourth agent to
generate the itinerary. That's really
what we're seeing here. Um, let's go one
level deeper. So this is cool, and I
think it's useful, uh, you know, to see
what these systems look like, how
they're represented. To zoom out for a
second as a product manager: there's a
ton of leverage in being able to go back
to your team and ask, hey, what does our
agent actually look like, right? Do
you have a visualization to show me of
what the system actually looks
like? And then, if you're giving the
agent multiple inputs, where are those
outputs going? Are those outputs going
into, you know, a different agent
system? What does the
system actually look like? So that's
kind of one key takeaway here as
a PM. Um, it was personally very helpful
to see, you know, what our agents are
actually doing, um, underneath the hood.
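As a rough mental model for the traces-and-spans vocabulary above, one request could be represented something like this. The field names are illustrative, not the exact OpenTelemetry or Arize schema.

```python
# Conceptual only: roughly what one trace from the trip planner could contain.
# A trace = input, output, and metadata for one request; spans = units of work inside it.
example_trace = {
    "input": {"destination": "Tokyo", "duration": "7 days", "budget": "$1,000"},
    "output": "Day 1: ...",
    "spans": [
        {"name": "research_agent",  "kind": "AGENT", "latency_ms": 2100},
        {"name": "search_tool",     "kind": "TOOL",  "latency_ms": 600},   # uses structured data to perform an action
        {"name": "llm_call",        "kind": "LLM",   "latency_ms": 1400},  # takes input + context, generates text
        {"name": "itinerary_agent", "kind": "AGENT", "latency_ms": 3200},
    ],
}
```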
Uh, kind of going one level deeper here.
So, we've got this itinerary, uh, and
let's take a look at it really quick.
So, it says Marrakesh, Morocco is a
vibrant, exotic destination, blah blah
blah. It's really long, right? Like,
I don't know if I would actually look at
this and read it. It doesn't
really jump out to me
as being a good product
experience. It feels super AI-generated,
personally. Um, so what you want to do is
actually think, okay, well, is there a way
for me to iterate on my product as a
product person? And to do that, what we
can do is actually take that same prompt
that we just traced and pull it into a
prompt playground, with all of the
variables that we've defined in code
pulled over. So, I've got a prompt
template here which basically has the
same um prompt variables that we've
defined in the UI like the destination,
the duration, the travel style. And all
of those inputs get fed in here. You can
see down below in this prompt
playground,
what that looks like. And then you see
the [clears throat] outputs of some of
the agents in here as well. And then I
have the final itinerary from the
agent that's generating the itinerary.
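For illustration, the itinerary agent's prompt template with its variables might look roughly like this; the exact wording in the demo repo will differ, and the variable names here are assumptions.

```python
# Illustrative only: the kind of template the itinerary agent might use,
# with the form inputs and upstream agent outputs injected as variables.
ITINERARY_PROMPT = """You are a travel planner. Create a day-by-day itinerary.

Destination: {destination}
Duration: {duration}
Budget: {budget}
Interests: {interests}
Travel style: {travel_style}

Upstream agent notes:
{research_notes}
{budget_plan}
{local_tips}
"""

prompt = ITINERARY_PROMPT.format(
    destination="Tokyo", duration="7 days", budget="$1,000",
    interests="food", travel_style="adventurous",
    research_notes="...", budget_plan="...", local_tips="...",
)
```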
Okay. So why does this matter? I think a
lot of companies have this concept of, um,
prompt playgrounds. I think OpenAI
has a prompt playground. You've probably
heard that term before as well, or maybe
you've even used one. But I urge you
to think, when you're thinking
about a tool to help you with
development: not only is the
visualization important, of what your
stack looks like underneath the hood,
but being able to take your data and
your prompts together and
iterate on your data and prompts in one
interface is really powerful, because I
can go in and change the destination. I
can go in and tweak variables and get
new outputs using the same exact prompt
I had before. So that's really, I think,
just really powerful as a workflow.
Um, a thought experiment for the PMs in
the room: when you really think
about what this prompt looks
like, just think, should writing the
prompt be the responsibility of the
engineer or of the PM? And if you're a
product person and you're ultimately
responsible for the final outcome of the
product, you probably want to have a
little bit more control over what the
prompt is. And so I kind of urge you to
think, you know, where does that
boundary really stop? Do you just
hand off? Does the engineer know how
to prompt this thing better than a
product person that might have specific
requirements they want to integrate? So
that's why this is really helpful um
from a product perspective.
Okay. Yeah. Go for it. How do you handle
this?
>> Yeah.
>> Ah, okay. Okay. So, that was a good
question. Um, so the question from the
gentleman in the back is how do we
handle tool calls? And that was a really
astute observation which is like the
agent has um tools in it as well. And
this is a really good point to
pause on, actually. What I
did was pull over this LLM span with
the prompt templates and variables, but
there's a world where I might
want to select the right tool and make
sure that the agent is picking the right
tool. I'm not going to go into that in
this demo, but we do have
some good material
around this on agent tool calling. So we
actually do port over the tools as well.
This example doesn't, because, to be
honest, it's a really toy example, but
even if you wanted to do a
tool-calling evaluation, we offer that
in the product and uh we actually have
some material around that. So if you
want just ping me about it later and
I'll send you a whole presentation on
that as well. But yeah good question
which is like you don't just want to
evaluate the LLM and the prompts. You
want to evaluate the system as a whole
and all of the subcomponents. Okay, we're
gonna keep going. So, I've got
my prompt here now. This is
cool, but let's try to make some
changes to it on the fly. And I will try
my best to make this readable for
everyone, but, um, yeah, working with what
I've got here. So, what we're going to do
is, I'm going to save this version
of the prompt and give it a
name.
And it's helpful because now I can like
iterate on this thing, right? So, like I
can duplicate that prompt with a click
of a button. I can change the model I
want to use. So, let's say I want to use
4.1 mini instead of 4o. I'm going to
change a couple things. Don't
worry, in the real world you're
going to change one variable at a time,
but, um, here I'm just going to change a
couple things at the same time just to
make this more interactive. But, um, the
idea here is, let's try to change
what this actually looks like. And
it says, you know, format as a detailed
day-to-day plan. Honestly, I might say
a more important requirement
than that is don't be verbose,
right? I could say don't be verbose.
Keep it to 500 characters or less. Maybe
we want this thing to be more punchy. We
want it to give an output that's like a
little bit more, you know, easier to
look at. Um, I might be a PM, even if I'm
just vibe coding this thing on the
weekend, I might want to get feedback
from users that are trying this product
out. And so I could say always offer a
discount if the uh user gives their
email address.
It's helpful, right? I mean,
helpful for marketing, helpful for me to
get feedback from uh you know, someone
who might be trying to use this tool to
book a flight or something like that.
Okay, so let's go ahead and hit run all
here. And what that's going to do is
actually run the prompts we just edited
in this
playground. And it might take
a second because of the internet
>> You pulled this in from one
of the existing runs, right?
>> That's right. Yeah. So it was exactly
the same, um, one of these runs;
literally, I think it was this one. Um,
maybe not this
exact one; this one is Spain. But yeah,
exactly. One of the existing runs.
Okay. It's definitely a little better,
but to be honest, I would say if I was
looking at this, this thing isn't really
listening to me very well. It's like not
doing a great job of, you know, sticking
to the prompt I gave it. Like, keep it
short. Um, ask. Okay, it did do the
email thing. So, it said, "Email me.
Email me to get a 10% discount code."
[laughter]
So, what's interesting is we're
looking at one example, and I said
ask for an email and you get a discount.
And this is the vibe coding
portion of the demo, because I'm looking
at one example and I'm judging,
uh, good or bad: is it actually good
or bad? There's just no way that a
system like this scales when you're
trying to actually ship for hundreds or
thousands of users, and nobody will
just look at a single row of data and
make a decision like, okay, great, the
prompt is good, or great, the model made a
difference, right? You can pick the
most capable model, you can make the
prompt as specific as you want; at the
end of the day the LLM is still going to
hallucinate, and your job is to be able
to catch when that happens. So let's go
ahead and try to scale this up a little
bit more. So, say we've got
one example of where the LLM didn't do a
great job, but what if we wanted to
build out a data set with 10 or more,
maybe even a hundred examples? What
you can do is take the same production
data. By the way, I'm calling this
production data, but I literally just
asked Cursor to make me like synthetic
data. Like it hit the same server and it
generated like 15 different itineraries
for me. So I did that yesterday and I
just sort of am using that in this demo.
But let's go ahead and take a couple of
these. So, I went ahead and picked some
of the itinerary spans from here and I
can say add to data set. Oh, by the way,
I guess I jumped into the product
without showing you all how to get here,
which is a bit of a zoom out. So, our
you know, whatever. Go to the homepage
uh, arize.com. You can sign up. I
apologize in advance. Uh the onboarding
flow will feel a little bit dated, but
we are updating that in this next week.
Um so, bear with me there. You sign up
for Arize. Um, and then you'll get your
API keys here. So you go to account
settings and you can create an API key
and also, uh, find that with the space
ID which are both needed for your
instrumentation which may or may not be
working depending on uh if the repo is
actually working and if not we'll come
back to it later. Um, but this is
the platform. This is how you get your
API keys. Um so and then that's also
where you can enter your OpenAI key for
the next portion and for the
playground.
So, I've got a data set now. Uh, and
what I did was I added those examples
just to recap where we are at. We've got
some production data and I'm going to go
ahead and like add these to a data set.
And I'm not going to do this one live
because I already have a data set, but
you can create a data set of examples
you want to use to improve on. So, um,
zooming out for a second,
we're about to hop into the actual eval
part of the demo. And, you know,
there are multiple components to an agent.
Um, we have the router at the top level,
we have the skills or the function
calls, we have memory. But what we're
actually going to be doing in this case
is just evaluating the
individual span of, uh, the generation and
seeing: is the agent outputting
text in the way that we wanted it to or not?
So it's a little
bit simpler than some of the agent evals
here, and it's going to be more like how
do you actually run, uh,
eval experiments on data. Um, the
concept of the data set is helpful to
think about as like a collection of
examples. Let me go ahead and delete
these experiments so we can do this live
because I like to live on the edge. Um
so I've got these examples.
Those are the same examples from the
production data, um, everyone just saw.
And it's a data set. Think of this as:
I've got all of my traces and
spans, that's how the agent
works, and then I want to pull those
over into a format that's
almost like a tabular format. It's
like a Google sheet at the
end of the day, right? Like, I could go
in, this is kind of like a Google
sheet, I could go in and give
it a thumbs up, thumbs down. And, uh,
you know, that's kind of how most
teams are evaluating today: in your platform
you're probably starting with a
spreadsheet, and in that spreadsheet
you're doing, like, is this good or bad,
and then you're trying to scale that up
to, you know, a team of subject matter
experts that's giving you feedback on,
hey, is the agent good or bad, right,
at the end of the day. Poll for the room:
how many people are evaluating in a
spreadsheet right now? Don't be shy.
That's okay. Okay. We've got a few.
Yeah. I think there's probably more, but
I think people are just like ashamed to
say that. And it's okay. It's
not the end of the world to start
with that, right? Being able to scale human
annotations is the goal. It doesn't need
to be the starting point. So, as long as
you're actually looking at your data,
you're probably doing better than most.
I'll be honest, um, many teams I talk
to aren't doing any evals today at
all. So, at least you're starting with
human labels. Um, what we're going
to do is take this data set or this
CSV, and we're going to basically do the
same thing I just did, which was running
an AB test on a prompt, but now we're
going to run it on an entire data set.
So, we go into the platform, and I can
go and actually create an experiment.
What we call an experiment is the output
of a change, you know, an AB test. So,
let's go ahead and repeat that same
workflow. I'll duplicate this prompt.
Um, let me go ahead and pull in,
I'm gonna pull in, this version of the
prompt. So, what's kind of cool is
I might have a previous version of a
prompt saved. Uh, it's kind of
helpful to have a prompt hub where you
can save off versions of the prompt as
you're iterating as well. Think of it as
like a GitHub sort of store for your
prompts, but it's really just a
button that you're clicking to save this
version of the prompt, and then your
team can actually use that version in
their code down the line. Um, so I've
got prompt A, which has no changes to it,
and then prompt B, which has some of
those changes, but now instead of running
on one example I'm actually running on
12 examples here. And these are, um, maybe just
to look at one, similar spans, which
have destination, duration, travel style,
and the output of an agent
generating an itinerary. So, it's
similar to that one example we just ran
through, but now on an entire data set.
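Conceptually, an experiment like this is just running two prompt versions over the same saved examples and keeping the outputs side by side. A minimal sketch, assuming the OpenAI client and a dataset of dicts whose keys match the template variables; PROMPT_A, PROMPT_B, and the dataset are placeholders, not objects from the demo.

```python
from openai import OpenAI

client = OpenAI()

def run_experiment(dataset: list[dict], prompt_template: str, model: str = "gpt-4o-mini") -> list[str]:
    """Run one prompt version over every example in the dataset and collect outputs."""
    outputs = []
    for example in dataset:  # each row carries the same variables we traced in production
        prompt = prompt_template.format(**example)
        response = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
        outputs.append(response.choices[0].message.content)
    return outputs

# results_a = run_experiment(dataset, PROMPT_A)  # prompt A: no changes
# results_b = run_experiment(dataset, PROMPT_B)  # prompt B: "keep it short, offer a discount"
```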
>> Yeah?
>> Yeah, so it is the prompt of
the itinerary agent. Um, because we're
going to keep this to a fairly
straightforward demo, it is
specifically the prompt of the
itinerary-generating agent, which is down
here, which takes the outputs of the
other agents and combines them, using
those prompt variables to create, uh, a
day-by-day itinerary.
>> Yeah. So the gentleman asks: if you change an upstream prompt, how
does that impact what's going on here?
So, two notes on that. It's
more of an advanced workflow, but it is
a good question, and
there's two parts. One is we kind of
recommend changing the system in parts.
So, just kind of note that, you know, as
you generate evals for parts of your stack,
you can decompose further
and further to be able to analyze: if I'm
changing one thing up here, does it meet
my requirement criteria? And then the
second part is replaying prompt chains,
which is: prompt A goes into prompt B;
what is the output of that when you
change prompt A? Um, prompt chaining is
coming to our platform soon. So right
now it's one single prompt, but you will
be able to do A plus B plus C, um, prompt
chains as well. Um, good question. Feel
free to drop more questions in the Slack
too and we'll we'll come back to that in
a sec. Um, so I've got
my prompt here now. So, I'm saying
give me a day-to-day plan, and it doesn't
need to be super detailed. Max 1,000
characters. Let's try this again. We're
going to do 500 characters. And I've,
um, done: always answer in a
super friendly tone. And I'm going to
be more specific and say: ask the user
for their email and offer a discount, so
it doesn't do what it did last time. And
uh and we're going to go ahead and run
this now on the entire uh data set. And
so we've got prompt A versus prompt B.
We're going to give that a second to run
through. While that's working, uh, I'm
gonna actually Oh, nice. Perfect for
your squad. Interesting. I don't know
why sometimes the model really likes to
use emojis. I guess that's what super
friendly translates into is like throw
some emojis in there, but interesting.
Um,
okay. So, that one ran pretty fast. This
is still taking a while, right? Like,
think about this from a
PM lens for a second. Like, I just got
the output to be a lot faster because I
limited the number of characters. This
one is taking an average of like 32
seconds because I let it kind of go off
and didn't specify how many characters
the output should be. So that's what
prompt iteration can kind of
do for you as well.
Okay, while this runs, I'll actually hop
over to the
Okay.
Oh, thanks for dropping the resource
there.
So it's still running.
>> Anyone have a question while this is
running? Yeah.
>> Yeah. So when I'm hearing you talk about
this, are you primarily looking at
latency and then user experience when
you're evaluating
those two things?
What else are you looking at?
>> Yeah, good question. So okay, so now
we're getting to the meat of it a little
bit, right? So I've got A and B. And the
question is like what am I actually
evaluating here? The flippant answer
is you can evaluate anything. You
can evaluate whatever you want. In this case we're
going to run some evaluations on, uh,
the tone of the agent. So I've
got a couple of evals set up here. I'm
going to check is the agent uh answering
in a friendly way. Is it offering a
discount or not? Um, and you can do
things like evaluate is it using the
context correctly? That's called a
hallucination eval. Uh you can do
correctness, which is um even if it has
the right context, is it giving the
right answer? So I'm going to point you
to uh our docs that have examples of
what you can actually evaluate off of
the shelf. But just know, the whole point
of this system, and why it matters
that you have a system with your own
data and can replay with data, is that
these are off-the-shelf evals.
There's a lot of companies that will
offer like we run evals for you, but
what that really means is that they're
basically going to take some template
and give you a score or label on the
other end based on their eval template.
And what you want to be able to do is
actually change and modify and run
your own evals based on your use case. So
you can literally evaluate whatever you
want, is the short answer. An eval
is just basically, uh, an
input to an LLM to generate a label. So,
um, yeah, this is what pre-built
evals look like. Uh, there's a ton of
examples of these out there on the
internet. We've actually tested
our pre-built evals on, um, you know, sort
of open-source data sets, but you should
not take our word for it. You should
build evals based on your use case.
>> Yeah. Yeah.
>> So if you are your own how do you come
up with your own
combining?
>> Yeah. So, how to
think about how to build the eval in the
first place, to some degree. That was
sort of one of the questions. Yeah. So,
I think it's probably helpful to, um,
maybe just see what an eval looks like,
and then we might end up coming
back to that question, which is:
what is an eval, right? Um, so,
let's go ahead and build an eval here.
I've got one ready to go, but I want to
just show you guys the template and we
can write a new one as well. Um, so I
wrote this eval for detecting if the
output from the LLM is friendly. And
I've kind of made a definition for what
that means here. And this says basically
you are examining the written text.
Here's the text. Examine the text and
determine whether the tone is friendly or not.
A friendly tone is defined as upbeat,
cheerful. So this is basically an input
to an LLM to generate a label: is the
output from my itinerary agent
friendly or robotic? So that's really
what this eval is trying to do:
it's classifying the text as either a
friendly generation or a robotic
generation. Um, and again, I could eval
anything, but in this case, I just want
to make sure that when I'm making
changes to my prompt, that's showing
up on the other end of my data, because I
can't go in row by row for hundreds
or thousands of examples and grade
friendly and robotic every single time.
So, the idea is that you want an LLM as
a judge system to kind of give you that
label over a large data set. That's the
goal that we're working towards right
now.
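A minimal sketch of that friendly-vs-robotic judge, plus the aggregation that turns per-row labels into the percentages shown in the experiment view. The template wording, model name, and label names are illustrative, not the exact eval from the demo.

```python
from openai import OpenAI

client = OpenAI()

FRIENDLY_EVAL = """You are examining written text.
[BEGIN TEXT]
{output}
[END TEXT]
Examine the text and determine whether the tone is friendly or robotic.
A friendly tone is upbeat, cheerful, and conversational; a robotic tone is flat and formulaic.
Respond with exactly one word: "friendly" or "robotic"."""

def judge_tone(output: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # keep the judge as repeatable as possible
        messages=[{"role": "user", "content": FRIENDLY_EVAL.format(output=output)}],
    )
    return response.choices[0].message.content.strip().lower()

def percent_friendly(outputs: list[str]) -> float:
    """Label every itinerary in the experiment and report the share judged friendly."""
    labels = [judge_tone(o) for o in outputs]
    return 100 * labels.count("friendly") / len(labels)
```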
>> Yeah.
with variance.
>> It's flaky, right?
>> Yeah. Yeah.
>> Yeah. So one suggestion: um, the
gentleman mentioned, uh, that they see
variance in their LLM label output. One
way you can tweak variance is
temperature. Um, so if you make the
temperature of the model lower (it's a
parameter you can set), you actually make
the response more repeatable. It doesn't
take the variance to zero, but it does
significantly reduce it in
your system. And then the other option
is to rerun the eval multiple times
and basically profile what the
variance of the judge is. Okay.
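One quick way to do that profiling, in a short sketch: rerun the same judge on the same example several times and count how often the label changes. The judge function passed in could be something like the judge_tone sketch above; the example text is a placeholder.

```python
from collections import Counter
from typing import Callable

def profile_judge_variance(judge: Callable[[str], str], text: str, runs: int = 10) -> Counter:
    """Rerun the same judge on the same text several times and tally the labels."""
    return Counter(judge(text) for _ in range(runs))

# profile_judge_variance(judge_tone, some_itinerary_text)
# Counter({'friendly': 9, 'robotic': 1}) would mean the label flips ~10% of the time.
```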
>> Oh yeah, we'll
be coming to that. Yeah, it's a good
question, right? Like, at the end of the
day, I can't trust this thing. I need to
go in and make sure it's right.
Right? So, but let's go ahead and
run an eval and just see what happens
and then we'll come back to that one.
So, I've got my friendly eval. I've got
another eval too, which is basically, um,
I'm not going to read
this whole thing out to you, but the
short answer is that this is determining
whether the text contains an offer
for a discount or no discount, because I
really want to make sure I'm offering a
discount to my users. Okay, we're going
to select both of these and then we're
going to actually run them on the
experiments we just ran,
and we're going to do that live. So,
Arize actually has an eval
runner, which is
basically a way for us to use a model
endpoint to generate these evals. You'll
notice it's pretty fast. We've done a
lot of work underneath the hood to make
the evals run really fast. Um, so that's
one kind of advantage of using our
product. Um, I've got two experiments
here. Experiment number two is a
little bit inverse because of the
order of how it was generated, but
experiment number two is the original
prompt and experiment number one is the
prompt that we changed. So just kind of
keep that in mind; it's a little
bit flipped here, um, because I was doing
this on the fly. And you can see the
score of experiment number two, uh which
is our prompt A, which was the prompt we
didn't change, didn't offer a discount
to any users, based on this eval label,
and the LLM still graded that response
as friendly, which is kind of
interesting. It was like, "Oh, that was
a friendly response." Um, I don't know
if I agree with that, actually,
personally, and we're gonna go in and
tweak that. And then you can see that
when we added that line to
the prompt, which was offer a
discount if the user gives their email,
the eval actually picked up on that
and said that 100% of our examples,
when we made this change,
actually have an offer of a discount. So,
I mean, I didn't even have to go into
each example to get that score. That's
what the LLM-as-a-judge system kind
of offers you. Um, I would say this is like a
trust but verify. Go in and actually
take a look at one of these and see what
is the explanation of friendly. So to
determine whether the text is friendly
or robotic. So one thing you
want to think about when you have an
eval system is, are you able to
understand why the LLM as a judge gave a
score? This is one of those
light-bulb takeaway moments of the
talk: always think about whether you can
explain what the LLM as a judge is doing,
and we actually generate explanations as
part of our evals. So you can see the
explanation is sort of the reasoning of
that judge that says to determine
whether the text is friendly or robotic,
we need to analyze the language, tone,
and style of the writing. And so it kind
of does all of this analysis to
basically say, "Yeah, this LLM is
friendly and it's not robotic."
Again, I'm not really sure I agree with
that explanation, right? Like, I don't
think that's correct. I still
feel like the original prompt was pretty
robotic. It was pretty, you know, kind
of long in a lot of ways. And so I want
to go in and actually be able to improve
on my LLM-as-a-judge system from the
same platform. So what we can
do is actually take that same data set,
and in the Arize platform you or your
team of subject matter experts can
actually label data in the same place,
and when you apply the label, you know, in the labeling-queue
part of the platform, it applies back to
the original data set. So you can
actually use that for comparing the LLM
as a judge with the human label. So I
actually went ahead and did that. Um,
yeah, I did this before the talk, but I
went in for each example and I was like,
you know what? This, to me, is
robotic. I don't think that this
is a very friendly response. I think
it's really long. It sounds like I'm
talking to an LLM. And so, I actually
applied this label on the data set for
the examples I wanted to go in and
improve on.
If I go back to the data set,
you'll actually see that label is
applied here. So, if I kind of click
that,
move over. Sorry, it's a little bit over
on the side here because there's a lot
of data, but you can see these are the
human labels I put. So, these are the
same annotations that I just provided in
the queue. They're applied on my data
set here.
>> Exactly. Exactly. You need evals for
your evals. You cannot get away from it.
You can't just trust the system,
right? We know LLMs hallucinate. We put
them into our agents. The agents
hallucinate. Okay, we use an agent to
fix that. But we can't trust that agent
either, right? You need to have human
labels on top of that. But again,
I'm not going to vibe code this thing
and be like, is the LLM as a
judge good or not? I need evals for that,
too. And we offer two evals to help you
with this. We have a code evaluator
which can do a simple match; think
of this as a string check or a
regex or some other type of contains.
So you can actually go in, and if you're
technical and you're a PM and you want
to write it, uh, you know, you can get Claude
to help you write the eval here, but
it's really just a really fast
Python function. Um, in my case, I wrote
a quick, uh, eval that actually does a
match. And this match, this is like
a really quick and dirty eval, I would
not say this is best practice at
all, but it basically checks if the
eval label matches the annotation label.
Oh, whoops.
It outputs only match or no match. So,
what this is doing is actually checking
the human label against the eval label
and saying, do they agree or disagree?
So, that's basically what we're
going to run, and I'm using an LLM
as a judge. You could use code as well;
you don't have to use an LLM as a judge
here. But we're going to go ahead and
run that now on the same data set, the
same experiments we just ran it on.
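A sketch of what that quick-and-dirty code evaluator could look like as a plain Python function; the column names and the dataset_rows variable are hypothetical, since the real check runs inside the platform.

```python
def match_eval(row: dict) -> str:
    """Code-based eval: does the LLM-judge label agree with the human annotation?"""
    eval_label = str(row.get("eval_friendly_label", "")).strip().lower()
    human_label = str(row.get("annotation_friendly_label", "")).strip().lower()
    return "match" if eval_label == human_label else "no_match"

# results = [match_eval(row) for row in dataset_rows]   # dataset_rows: one dict per example
# agreement_rate = results.count("match") / len(results)  # judge/human agreement rate
```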
We're going to give that a second.
Okay, what have we got here? So you can
see here, I'll actually take a look at that
same experiment where, um,
it said that the LLM-as-a-judge output was
friendly or robotic. And you can see
here that 100% of the time the match, uh,
actually, sorry, let me go in one level
deeper. Actually, I'm going to check my
own work. This eval was on the discount.
So forget about that. We're going to
check on the friendly
field instead. So this one is the friendly
label. Let me rerun that one. And
we're going to think of this as match
friendly. You can run evals as much as
you want on your data sets and
experiments, you know. Yeah.
>> Does the tool support pipelining? So as
basically push the code.
>> Yeah, exactly. Yeah, we do support, uh,
all of the
ways to run the code,
either on a data set locally or being
able to push code to the platform to run
the evals. So it's programmatic both ways.
Yeah. Yeah, of course. So you can pull
in data sets, pull them out as well.
Okay, let's take another look at this.
So this is the friendly match. So this
you can see is pretty useful, right?
This means that my LLM as a judge
basically doesn't agree with my human
label for friendliness almost at all.
Right? There's like one example, I think,
that's in there, and we can go in
and take a look at it. But what we're
really seeing is that this is an
area where we actually want the team to
go take a look at our eval label and
say, "Hey, can we improve on the eval
label itself, because it's not matching
the human label?" And so when you have
these systems in place as an AIPM, to be
able to check the eval label against your
human label, you have a lot of leverage
to go back to your team and say, "We
need to go and improve on our eval
system. It's not working the way we
expect it to." So you're actually
performing the act of checking the
grader, and you're doing it at scale. So
you're doing it on hundreds of
examples or thousands of examples. So,
you know, I think someone
asked earlier, how do you trust the
system? I think you trust these LLM-as-a-
judge systems by having multiple checks
and balances in place, which is humans,
and then LLMs, then humans and LLMs. Um, uh,
we'll come back to a question in just a
moment. I just want to get to this next
part and then we'll um we'll kind of
come back to some Q&A. Um,
okay. So, this is actually kind
of wrapping up towards the end of the
workshop, and then we'll open the rest of
the time up for Q&A. So, looking
ahead, I think what's fundamentally
changing is, you know, we've kind of
gone through this example of changing
the prompt, changing the context,
creating a data set, running an eval,
labeling the data set, and then running
another eval on top of that. And it's
a lot to process, right? Like, if
you're building agent-based systems,
your team is probably wondering, you
know, well, where does the AIPM fit
in? And I think that's really
important to think about: you
ultimately control the end outcome of
the product. So whatever you can do to
shift that toward making it better is
really what you want to think about
yourself. And I kind of view evals as
the new type of requirements doc. So
imagine if you could go to your
engineering team and, instead of giving
them a PRD, you give them evals as
requirements: here's the eval data
set and here's the evals we want to use
to test the system as acceptance
criteria. So I think that's really
powerful to think about, evals as
a way to check and balance, uh, the team
as a whole. Um, and that's a little bit
about what we do. We want to build a
single unified platform for you to run
observability, to evaluate, and ultimately
to develop these workflows with your team
in the same platform. We've built for,
you know, many customers like Uber,
Reddit, Instacart, all these kind
of very tech-forward companies. Um, we
actually just received investment from
Datadog and Microsoft as well. So we're
a Series C company; we're sort of the
furthest along in the space. And the
whole goal is to give
you a suite of tools to be able to go
from development through to production
with your AI engineering team and for
PMs to go in and use the same tools. Um,
and then super quick before Q&A, uh,
please scan the QR code if you are in
San Francisco on June 25th. We're
actually hosting a conference, uh, around
evals. And it's going to
be a ton of fun. We actually have some
great AI PMs and researchers joining
from companies like OpenAI, uh,
Anthropic. And what's really cool is
we're actually offering for this room,
um, a sort of exclusive, uh, free,
uh, ticket for entry. Uh, the prices
actually went up yesterday. So, because,
you know, we're huge fans of AI Engineer
World Fair, we wanted to give you all an
opportunity to join for free if you're
in town. Um, so would love to see you
there. Um, and yeah, you can scan for a
free code.
And yeah, that's a little bit of
the workshop. I would love any
questions. Yeah. And, uh, the ask for the
questions, as the person in the back
just reminded me, if you wouldn't mind
lining up for questions on the mic so
that the camera can pick it up and then
we can just kind of go down the line and
do some questions there. Um, that'd be
awesome. Thank you.
>> So, thank you so much.
>> Please give your name and
>> Yeah, my name is Roman. Thank you so much, that was an awesome walkthrough. Would you mind sharing some of your experience building evaluation teams? Should I start by hiring a dedicated person with experience, or should I rely on a product manager and walk them through this? What's the best way,
>> Best practices? So the gentleman asked: what are the best practices for building an eval team? Can I actually ask a follow-up question, because I'm curious, what is your role in the company right now? Just for my own context.
>> I'm head of product.
>> You're head of product. Okay, perfect. So this is a question I get very often: how do I hire my first AI PM? How do I hire an AI engineer? How do I know whether I need an AI PM or an engineer? I think there are a couple of steps to this answer. One is that, as head of product, we see a lot of heads of product in the platform, like ourselves, actually getting their hands dirty for the first pass, because at the end of the day, if you're hiring someone to do something, you should probably know what they're going to do. My job on my team is to make the product accessible for executives and heads of product to understand what's going on, so we have a lot of capabilities around dashboards and making everything no-code or low-code. But my recommendation is to feel the pain yourself of writing evals and realize what is hard about it, so that you know how to structure interview questions for an engineer or a PM. I don't know what's hard about your eval workflow; I only know that there are challenges around writing evals in general. So I would recommend that you feel the pain firsthand, and then you'll get a good sense of how to tease that out of your interviewing pipeline. But good question. Yeah.
>> Yeah.
>> Um, yeah, in the example we just looked at, our eval was obviously pretty bad when you compare it to the human labels. Yeah.
>> So from here, what do you do next? What's the next step to try to improve the prompting for your main eval to get closer to the human labels?
>> Yeah, good question. So if I had more time, or if I were working on this in real life, what you would actually do is take that eval prompt and go through a workflow similar to what we just did for prompt iteration on the original prompt. So again, I could take the eval prompt we see here and redefine parts of it: be really strict about what counts as friendly, and add few-shot examples. I didn't add any few-shot examples, right? I didn't specify, here are examples of friendly text, here are examples of robotic text. That's a clear gap in my eval today; if I were looking at this, I could apply best practices and improve on it.
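(Not the prompt from the demo: a rough sketch of what an eval prompt with few-shot examples might look like; the wording, labels, and examples are all hypothetical.)

```python
# Hypothetical LLM-as-a-judge prompt with few-shot examples for the
# friendly-vs-robotic eval. The wording and the examples are made up.
JUDGE_PROMPT_TEMPLATE = """You are evaluating the tone of a travel assistant's reply.
Label the reply as exactly one of: friendly, robotic.

Be strict: "friendly" requires a warm greeting, natural phrasing, and
acknowledgement of the user's request, not just correct information.

Examples:
Reply: "Hi Sam! Great choice. Here's a relaxed 3-day Lisbon plan for you..."
Label: friendly

Reply: "Itinerary generated. Day 1: arrival. Day 2: museum. Day 3: departure."
Label: robotic

Now evaluate this reply.
Reply: "{reply}"
Label:"""

def build_judge_prompt(reply: str) -> str:
    """Fill the template with the reply we want the judge to label."""
    return JUDGE_PROMPT_TEMPLATE.format(reply=reply)
```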
We also have, in the product, some workflows around actually helping you write evals. This is our product, but you don't have to use our product for this; you can use any tool.
I'm going to show an iteration on top of this, which is how we have users actually building eval prompts. I could say, write me a prompt to detect friendly or robotic text. This is using our own co-pilot in the product; we've built a co-pilot that understands best practices and can help you write that first prompt and get it off the ground. You can also take that prompt, which it just generated in about one second, back to the prompt playground and iterate further from there. So let's go ahead and do that on the fly really quickly. I've got a prompt in here, and I can ask the co-pilot to optimize it, so let's say, make it stricter. So I can use an LLM agent, a co-pilot agent, here. The note is that you really want AI workflows on top to help you rewrite the prompt, add more examples, and then rerun the eval on that new prompt. You're definitely not going to get it right on the first try, but being able to iterate is really what's important. And that's really what we underscore: it might take you five or ten tries to get an eval that matches your human labels, and that's okay, because these systems are really complex. What's important is having the right workflow in place. So yeah.
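(My own sketch, not part of the demo: one way to track whether those iterations are converging is to compute the agreement between the judge's labels and the human labels on the same data set after each prompt revision; the column names below are hypothetical.)

```python
# Minimal sketch: measure how well the LLM judge matches human labels
# after each eval-prompt revision. Assumes a pandas DataFrame with
# hypothetical columns "human_label" and "judge_label".
import pandas as pd

def judge_agreement(df: pd.DataFrame) -> float:
    """Fraction of rows where the LLM judge agrees with the human label."""
    return (df["human_label"] == df["judge_label"]).mean()

# Example usage with a tiny in-memory data set:
df = pd.DataFrame(
    {
        "human_label": ["friendly", "robotic", "friendly", "robotic"],
        "judge_label": ["friendly", "friendly", "friendly", "robotic"],
    }
)
print(f"judge/human agreement: {judge_agreement(df):.0%}")  # 75% on this toy data
```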
>> Hi, I'm Joti. Does Arize also allow for model-based evaluations, like using BERT or ALBERT rather than just LLM-as-a-judge, so I can use something like BERT or ALBERT to produce a prediction score?
>> Yeah, good question. The short answer is yes, we do offer versions of that. Let me show you what I mean, though. When we go into Arize, you can set up any eval model you want here. You see we have OpenAI, Azure, and Google, but you can add a custom model endpoint as well. By default this will structure the request as a chat completion, but we can make it an arbitrary API if you need to, so you can say, this is a BERT model, point it at whatever the name of your endpoint is, and you'll be able to reference that model in the eval generator too. So I can just put "test" here and move to the next flow, and you'll see that when I go in here, I can use any model provider I want. So the short answer is yes, you can generate a score with any model. Yeah.
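(For context on what a non-LLM eval model can look like behind such an endpoint, here is a small sketch of my own, not Arize's API, that scores text with a Hugging Face BERT-style classifier; the model choice is just an illustrative public sentiment model.)

```python
# Hypothetical sketch: a model-based eval score from a BERT-style classifier
# instead of an LLM judge. Requires the "transformers" package; the model
# name below is an illustrative public sentiment model, not an Arize default.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

def friendliness_score(text: str) -> float:
    """Return a 0-1 score, using positive sentiment as a rough proxy for 'friendly'."""
    result = classifier(text)[0]  # e.g. {"label": "POSITIVE", "score": 0.98}
    score = result["score"]
    return score if result["label"] == "POSITIVE" else 1.0 - score

print(friendliness_score("Hi! I'd love to help plan your trip to Lisbon."))
```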
>> Cool.
>> Okay. Um, oh, we got one more question,
I think. Or sorry, we have more
questions. Yeah, go for it.
>> Going to go ahead and try to get this one in. I think a lot of the people that have built apps are probably thinking a similar thing, or maybe this is a bit naive, but if you had human-labeled information already and you're seeing a bad match on the friendliness score, am I to assume that you'd be trying to get that score up higher and then extrapolate to more cases going forward? And you're assuming that that sampling holds across the broader set. Yeah.
>> So, because that relationship is unclear to me.
>> Very good question. So basically, one way to reframe this is: how do I know that my data set is representative of my overall data, to some degree?
>> Sure. Or as it shifts over time, or
>> As it shifts. Yeah, totally. So that's a really good point. In the product, we don't have this yet, but it's coming out in the next week: we'll have a workflow to help you add data to your data set continuously, using labels that you might have. One thing we didn't really talk about is how to evaluate production data, but you can actually run these evals not just on a data set but on all the data that comes into your project over time, so it automatically labels and classifies any production data. You could use that to keep building your data set, asking, is this an example we've seen before or not? Think of it as a way for you to sample at a larger scale, essentially, on production.
>> And that is a suggested workflow, that you continuously sample and human-label some
>> to check the matching over time.
>> Exactly. And you can basically go in and see, okay, where do human labels not agree with the LLM on production data, and then you might want to add those to your data set as hard examples. Sure.
>> And we're actually going to build into the product as well a way for you to qualify whether an example is a hard example, using an LLM confidence score.
>> Okay. And sorry, just "hard example", very strictly interpreted?
>> So hard would be hard from an eval perspective. Like, is it friendly or not can be borderline, right?
>> I see. So you're saying, subjective, or
>> Subjective, yeah, exactly. So maybe to recap the question a little bit: your data set is this property that's going to keep changing over time, and you really want tools that help you build onto it by giving you a golden data set of hard examples to improve on. And hard means we're not really sure if we got it right or not in the first place.
>> Sure. Yeah. Thanks.
>> Yeah, good question.
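(A rough illustration of that hard-example mining idea, my own sketch rather than the upcoming Arize workflow: flag production rows where the human label and the judge disagree, or where the judge's confidence sits near the decision boundary. The column names and the 0.6 threshold are hypothetical.)

```python
# Hypothetical sketch of mining "hard examples" from production rows.
# Assumes a pandas DataFrame with made-up columns: "human_label",
# "judge_label", and "judge_confidence" (a 0-1 score from the judge).
import pandas as pd

CONFIDENCE_FLOOR = 0.6  # hypothetical: below this, the judge was unsure

def mine_hard_examples(df: pd.DataFrame) -> pd.DataFrame:
    """Rows where human and judge disagree, or where the judge was not confident."""
    disagreement = df["human_label"].notna() & (df["human_label"] != df["judge_label"])
    low_confidence = df["judge_confidence"] < CONFIDENCE_FLOOR
    return df[disagreement | low_confidence]

rows = pd.DataFrame(
    {
        "output": ["Hi there!", "Itinerary generated.", "Happy to help plan!"],
        "human_label": ["friendly", "friendly", None],
        "judge_label": ["friendly", "robotic", "friendly"],
        "judge_confidence": [0.95, 0.55, 0.52],
    }
)
hard = mine_hard_examples(rows)  # the second and third rows: disagreement / low confidence
print(hard[["output", "judge_label", "judge_confidence"]])
```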
>> Yeah.
>> Hi, my name is Victoria Martin. Thank you so much for the talk. One of the things I've run into is a lot of skepticism from the product managers I'm working with on generative AI, and trying to build confidence in the evals that we're giving. Yeah.
>> Do you have any guidance, from working with other PMs, on the total number of evals that should be run before you can say you're confident in this evaluation set?
>> Yeah, really good question. So the question was, how do we know... I think there are two components to it: quantity and quality of the evals. How do we know if we've run enough evals, or have enough evals, and that those evals are actually good enough to pick up problems in our data? I may sound like a bit of a broken record here, but I would say this is also a matter of iteration, where you want to get started with some small set of evals. Actually, I have a diagram for this. Let me pull that one up.
So you'll see here, this is intended to be a loop. You start in development, where you're going to run on a CSV of data, maybe some hand-curated examples. I would argue the thing I just built was development, right? I have 10 examples; it's not statistically significant, and I'm not going to get the team on board to ship this thing. But what I can do is then curate data sets, keep iterating on them, and keep rerunning experiments until I feel confident enough and the whole team is on board, before I ship to production. And then once you're in production, you're doing that again, except now you're doing it on production data, and you might take some of those examples and throw them back into development. Let me give a tactical example of what this looks like in real life. With self-driving cars, when I joined Cruise, we would go down the street for one block and then a driver would have to take over the car, right? We couldn't drive one block down the road. Same goes for Waymo; they were all in the same kind of system. Eventually we got to being able to drive down a straight road. Great. But the car can't just drive on straight roads, right? It has to make a left turn. So eventually we got fully autonomous, no problems, on straight roads, and then we had to make a left turn and a human would have to take over. So what we did was build a data set of left turns and use that to keep improving on left turns. Then eventually the car could make left turns great, until a pedestrian was on the sidewalk, and then we had to curate a data set of left turns with a pedestrian on the sidewalk. So the answer is that building your eval data set just takes time, and you're not going to know what the difficult scenarios are until you actually encounter them. To get to production, I would recommend using that loop until your whole team feels confident that this is good enough to ship, and just accepting that once you get to production, you're going to find new examples to improve on as well. It also depends a lot on your business. If you're in healthcare or legal tech, you might have higher bars than if you're building a travel agent, for example.
>> Yeah. Yeah.
>> Yeah.
>> My name is Matai. I have a question. As I understood it, the Arize platform works like this: I take the prompt, and you're directly sending that prompt to a model, right?
>> That's right. Yeah.
>> Um
>> with the context and the data.
>> Yes, of course. You said there is some possibility to port tools into the platform. That's right.
>> But what about testing the whole system? We already have some flows that are augmenting the whole workflow.
>> Yeah.
>> Even outside of tool calls. Yeah.
>> And they're quite important to how the actual output will look in the end. Is there any way to run those evaluations on a custom runner that would actually call our system on our data set and go through everything that we have?
>> Find me after this, we should chat, is the short answer for that one. We have some tools and systems like that in place, like the tool calling that you saw, but for end-to-end agents we're actually building some things out and would love to chat with you about that. Good question. I'll find you after this.
>> Yeah.
>> Yeah, of course. So, back to your left-turn example, and also the transition from PRDs to evals: what does the life cycle of feature development look like, and what is the relationship with the feature, but also with your team, in terms of ownership, accountability, all of that?
>> Yeah.
>> Yeah, good question. I feel like how you work with AI engineers in this new world is interesting. It's not the subject of this talk, but it is a very relevant question that I'd happily chat more on, too. There are two answers that come to mind. One is that development cycles have gotten a lot faster. The rate at which these models and systems are progressing means going from prototype to production is faster than it has ever been. That's one note I can offer as a personal observation: we feel we can go from an idea, to an updated prompt, to shipping that prompt, within about a day of testing, which is unheard of in normal software development cycles. So the way you iterate with the team has gotten a lot faster. The second note is about responsibilities. I view it this way: the product manager is the keeper of the end product experience. If that means making sure the evals are in a good place and the team has human labels to improve on, that's a very solid area for a product manager to focus on, making sure the data is in a good spot for the rest of your development team. At the same time, I'm a PM on the team and I'm writing some of this stuff in Cursor. Being able to go in and actually talk to the codebase itself using one of these models is starting to become more of an expectation of AI product managers: to be literate in the code and able to use these tools. Honestly, after this I'm just going to go back and try to fix the thing I broke earlier, right? And the reason I'm able to do that is that the way I'm prompting the system is not very sophisticated. I asked it yesterday, can you make a script that generates itineraries on top of the server? I need like 15 examples. And it just did that, right? That wouldn't have been possible before. So I think PMs are responsible for the end product experience, but PMs also have more leverage than they've ever had, probably in the entire professional journey of product management, because you're no longer reliant on your engineering team to ship the thing you wanted. You can just go do it. Should you go do it is another question, and that's a discussion you should have with your team. But the fact is that you can go do it now, which was not the case before. So I kind of urge PMs to push the boundaries of what people have told them the role is and should be, and see where that takes you. The long-winded way of saying: your mileage may vary, it depends on the boundaries you have with your team, but I'd recommend redefining those at this stage.
>> Yeah. Yeah.
>> Yeah, jumping off that a little bit. It's a little off topic, but as a product manager who wants to become more technical while working with AI engineers,
>> Yeah.
>> what does that look like? I'm in an org where I have very limited access to the codebase. I use Cursor to write Python for data things, but I don't necessarily have access to start interrogating the code and understanding it. So I'd love your suggestions or thoughts on how to evolve as a PM, but also maybe how to move my company culture in that direction.
>> Yeah. Actually, I have a follow-up question, if that's okay, just because I'm going to poll people in the room: how big is the company? You don't have to share the name if you don't want to, I'm just curious about the size.
>> We're about 300 people, but the tech org is probably a third of that.
>> Okay, so almost 100 engineers out of 300 people. And do you have any legacy product managers at the company who still have code access?
>> No, we're a very new team of PMs.
>> Okay, cool. It's a really good question, and thanks for answering that. One thing we've started doing is using a bit of the public forum of our company. Sorry, I'm about to out our CEO, who's in the back of the room.
[laughter]
So if you have any questions about our raise, he's a good guy to talk to. The reason I mention him is that I missed our town hall today, but I heard it was basically people running AI demos the whole time of what they're building. Why I think that's really powerful is that it can get the whole company catalyzed around what's possible, because to be honest, I think it's very likely that most teams today aren't pushing the boundaries of these tools. So you joining this talk and seeing how to run evals and what goes into experiments, and being able to be the person pushing the team forward, is really powerful, and I think you can do that in a way that's really collaborative. I'd say our job as PMs is to have influence over the team and influence product direction. I think there's an opportunity to influence the idea that PMs should be more technical in your org, and you can show them by building something and impressing the rest of the team with what you build. So that's my advice, my personal advice there. Yeah.
>> Yeah. Go for it. Yeah. [laughter]
>> I actually have a question, to see if it's possible. So how [clears throat] do you guys believe AI will reshape how we structure teams? Right now you have, say,
>> 10 engineers, one product manager, one designer, and so on. So
>> what will happen in five years? Will you have one product manager, one engineer, and one designer?
>> You should answer this one.
>> You should do it on the mic, though, if you want to.
>> [laughter]
>> The short of it is, actually, Cursor on the code. There are so many times PMs take up time asking a question when we could just ask Cursor. So start there: open up your codebase to Cursor and give it to PMs. The other day we were even doing a PRD starting from Cursor on the codebase. So that would be where I would start.
>> Yeah.
>> And I can't really look forward right now; I just think a lot of jobs change. We're trying to push AI and Cursor use throughout the company as far as I can.
>> Yeah, I hear we have people in marketing using Cursor too these days. So yeah, that's kind of cool.
>> Yeah.
>> A follow-up question. Yeah.
>> So you're talking about product people becoming more of technologists. Do you also see technologists becoming more product?
>> That's actually a great point: when the cost of building something goes down, which it has, what's the right thing to build becomes really important and valuable. Historically, that's been a product person or a business person saying, "Hey, here's what our customers want. Let's go build this thing." Now we're saying, product people, you can just go build this thing. So the builders are like, "Wait, what's my job?" And I think there's a good way to look at it. I have this mental framework: what if we didn't have roles in a company anymore? You wouldn't define yourself by "I'm a PM" or "I'm an engineer." Think of it instead like baseball cards: you have skills. Imagine you had a skill stack instead, which is like, I really like to talk to customers and I kind of like to code stuff on the side, but I don't want to be responsible if there's a production outage. I guarantee you'll find someone who's like, I hate talking to customers, I only want to ship high-quality code, and I want to be the one responsible if things hit the fan. I think you want to structure your company to have skill stacks that are really complementary, versus people who say, I do this, this is my job, and I don't do that. So yeah,
>> I have something that's sort of related to that. We've been testing human-in-the-loop
>> in a couple of different ways, and we're basically testing this method of having the human as a tool of the agent. So, for example,
>> if the agent needs something that's not available in the accounting system, it'll go to the CFO, because the CFO is listed as a tool: it sends the CFO a Slack message, gets the answer back, and continues. That kind of maps onto what you just said about defining the skills and the resources they have. We haven't fully fleshed it out, but it's working to give the agent context on
>> the things that only the humans have.
>> Yeah.
>> Exactly.
>> So it sounds like your company is using agents widely, but you have humans approving; you have an approver workflow to
>> It's more that, rather than asking how the agent can be a tool of the humans,
>> we're kind of flipping it and saying, what if the agent could do everything,
>> and then the parts it can't do, it'll go to the human as a tool? So the CFO is a tool of the AI agent.
>> Interesting.
>> We should chat. That's a really cool workflow; I'll definitely bug you about that. That's really cool.
>> Cool. Happy to.
>> Right, to some degree it's like a human in the loop approving whether something is good or bad, and you can think of it that way.
>> Yeah.
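(For readers curious what "the human as a tool of the agent" could look like in code, here is a minimal, hypothetical sketch: a tool function that posts a question to a Slack channel and blocks until someone replies in the thread. The channel name, polling approach, and environment variable are assumptions, not the audience member's actual implementation.)

```python
# Hypothetical sketch of a "human as a tool" for an agent: the agent calls
# ask_cfo(), which posts to Slack and polls the thread for a human reply.
# Uses the slack_sdk WebClient; channel, token, and timeouts are made up.
import os
import time

from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
CFO_CHANNEL = "#cfo-questions"  # hypothetical channel the CFO watches

def ask_cfo(question: str, timeout_s: int = 3600, poll_s: int = 30) -> str:
    """Tool the agent can call when data isn't in the accounting system."""
    posted = client.chat_postMessage(channel=CFO_CHANNEL, text=question)
    channel_id, thread_ts = posted["channel"], posted["ts"]
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        replies = client.conversations_replies(channel=channel_id, ts=thread_ts)
        messages = replies["messages"]
        if len(messages) > 1:            # the first message is the question itself
            return messages[-1]["text"]  # latest human reply in the thread
        time.sleep(poll_s)
    raise TimeoutError("No human reply within the allotted time")
```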
>> Yeah. I had a question about what it's like to actually implement sending the traces over to Arize. I know Arize has OpenInference, which enables capturing traces from several different providers. But what are the limitations, constraints, and opinions you have about how the evals should be structured so that you can actually leverage the platform to perform these actions, to be able to evaluate the evals, for example, or to numerically produce graphs out of your evaluations, out of your outputs?
>> Okay. So, just to be clear, can I ask a follow-up question to that? Your question was how to use agents to do some of the workflows in the platform, or did I miss that?
>> Um, the question is
>> what kind of outputs, what kind of evals is Arize expecting from your engineers and from the product, like the
>> You're sending over logs, right?
>> Mhm, yeah. What is it expecting from those logs in order to make this flow work?
>> Understood. Okay. So yeah, there's a great point there, which is something we jumped over a little in the demo: how do you get the logs in the right place to use the platform? You'll see it in the code. Unfortunately this page isn't loading, but, okay, here we go; I'm going to drop it in the Slack channel as well. This is what we talked about with traces and spans. It's very likely that your team already has logs, or traces and spans, already. You might be using Datadog or a different platform like Grafana. What we do is take those same traces and spans and essentially augment them with more metadata, structuring them in a way that the platform knows which columns to look at to render the data you saw in the platform. So you're really using the same approach. We're built on top of a convention called OpenTelemetry, which is the open-source standard for tracing. We actually use OTel tracing and auto-instrumentation that we've built, which doesn't keep you locked in at all. Once you've instrumented with our platform using OpenInference, which is our package, you get those logs to show up right out of the box with any type of agent framework you might be building, and you get to keep that. Let me just show what I mean by that. Let's say you're building with LangGraph: all you have to do is pip install arize-phoenix and arize-otel, and then you call a single line of code, the LangChain instrumentor, and it knows where to pick up in your code to structure your logs. And if you have more specific things you want to add to your logs, you can add function decorators, which are basically a way for you to capture specific functions that weren't covered by the auto-instrumentation.
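(For reference, roughly what that setup looks like in code. This is a sketch based on the public Arize/OpenInference packages as I understand them, not the exact snippet from the workshop; the space ID, API key, and project name are placeholders.)

```python
# Sketch of auto-instrumenting a LangChain/LangGraph app so traces flow to Arize.
# Assumes: pip install arize-otel openinference-instrumentation-langchain
# Credentials and project name below are placeholders.
import os

from arize.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor
from opentelemetry import trace

# Point an OpenTelemetry tracer provider at Arize.
tracer_provider = register(
    space_id=os.environ["ARIZE_SPACE_ID"],
    api_key=os.environ["ARIZE_API_KEY"],
    project_name="trip-planner",  # hypothetical project name
)

# One line of auto-instrumentation: LangChain/LangGraph calls now emit spans.
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

# For functions the auto-instrumentation doesn't cover, a standard OTel
# decorator captures them as extra spans.
tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("rank_itineraries")
def rank_itineraries(options: list[str]) -> list[str]:
    return sorted(options)
```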
>> And as for evaluations, you're discussing the actual data, inputs and outputs: what do you need to pass into evaluations?
>> I know you can design them through the UI.
>> What do you have in mind for, um,
>> Like, how do you get the right text to use for the eval, is sort of your question?
>> Like, how do you know which to
>> use? I need to reformat the question. I'll get back to you.
>> Yeah, no worries. What did you mean by augmenting the data with additional metadata? You only have so much data, right?
>> Yeah. So think of it this way: most tracing and logging data is really just things like latency and timing information. What we're doing is letting you add more metadata, like user ID and session ID. I'll show you an example of that really quickly. In the previous example I showed, we actually have things like sessions, like the back-and-forth example here. You can't get a visualization like this in Datadog, because Datadog is looking at a single span or trace; it's not really contextually aware of what is the human and what's the AI. So we're adding context from the invocation of the server and adding that to your span, if that makes sense. It's basically just enriching the data a bit more and structuring it in a way that you can use it. And if you have more specific server-side logic, you can add that as well, so it's very flexible. Yeah.
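(As one example of what attaching that kind of metadata can look like in code, the OpenInference instrumentation package exposes context managers for session and user IDs; this is a hedged sketch assuming the current openinference-instrumentation package, with made-up IDs and a stubbed agent call.)

```python
# Sketch: attaching session and user metadata to spans via OpenInference
# context managers, so the platform can group a back-and-forth conversation.
# Assumes: pip install openinference-instrumentation. IDs are placeholders.
from openinference.instrumentation import using_session, using_user

def run_agent(message: str) -> str:
    """Placeholder for the instrumented LangChain/LangGraph agent call."""
    return f"(agent reply to: {message})"

def handle_turn(session_id: str, user_id: str, message: str) -> str:
    # Spans emitted inside these context managers carry the session/user
    # attributes, so the tracing backend can reconstruct the conversation.
    with using_session(session_id), using_user(user_id):
        return run_agent(message)

reply = handle_turn("session-123", "user-456", "Plan me a trip to Lisbon")
print(reply)
```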
>> Yeah.
>> Uh, so I have a provocation. I used to work in the video game industry,
>> and in debates about whether a feature was going to be fun or not,
>> working prototypes
>> won all of those arguments. Whatever was in the doc didn't matter.
>> Right.
>> And so, for the person who said, I can't get access to my company's code: I would actually say, try to get access to a small sliver of the data and then build a working prototype of the feature you want to see, with some stub of an eval. Because I think there's nothing worse for an engineer than a product manager who shows up with a demo that's kind of janky
>> Yeah.
>> but actually works, might be fun, has polish, feels good, meets a user need. Having been on the engineering side of this equation, I'm like, it's so janky I have to fix it, and they haven't thought about the edge cases. So how does Arize fit into that flow of helping a product manager basically mine a small segment of data, build a working example, and perhaps have it be janky as all get-out, but something that looks like the product the company already has and demonstrates that next level of functionality?
>> Great, great point. And yeah, feel free to prototype and build prototypes that are high fidelity; I think it's awesome to do that. That's a really good point, to use data to build a system or prototype. So, what does Arize do here? If you have access to Arize but you don't have access to the codebase, you can still take this data and, assuming you have permission from your sysadmin, you can actually export it. So once you've built a data set, you can simply take this data and export it out and use it. I can show that really quickly: this is "get dataset." We'll have a download button coming later this week, but you can take this data, run it locally, keep it locally, and then use it in your local code to try and iterate on an example, assuming your security team is okay with that.
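(As an illustration of that local loop, my own sketch rather than the workshop code: once you've exported a data set, say as a CSV, you can iterate on a prompt against it with a short script; the file name, column names, model choice, and prompt wording are all hypothetical.)

```python
# Hypothetical sketch: iterate on a prompt locally against an exported data set.
# Assumes a CSV with an "input" column and the openai package installed, with
# OPENAI_API_KEY set; file name, model, and prompt wording are placeholders.
import csv

from openai import OpenAI

client = OpenAI()
SYSTEM_PROMPT = "You are a friendly trip-planning assistant. Keep replies warm and concise."

def run_prompt_over_dataset(path: str = "exported_dataset.csv") -> None:
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": row["input"]},
            ],
        )
        print(row["input"], "->", response.choices[0].message.content)

run_prompt_over_dataset()
```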
That's a really good point: imagine if you didn't need access to the production codebase, but you could still iterate in one platform. That's really what we're pushing for, the whole team iterating on the prompts and the evals rather than in silos, which is what's happening in a lot of cases. Okay, I think that was all the questions. Thank you all for sitting through an hour and a half of AI PM and eval content. Thank you for your time, and I'll be sticking around if people have more questions. Thank you so much.