Build a Prompt Learning Loop - SallyAnn DeLucia & Fuad Ali, Arize

Watch on YouTube

Now Playing

Transcript

1492 segments

0:13

[music]

0:21

Hey everyone, gonna get started here.

0:23

Thanks so much for joining us today. Um,

0:25

I'm Sally. I'm the director of RISE. I'm

0:28

going to be walking you through some of

0:29

crowd prompt learning. Uh we're actually

0:31

going to be building a driven

0:33

optimization loop for the part of the

0:34

workshop. Um I come from a technical

0:36

background and started off in data

0:38

science before I made my way over to

0:40

product. Uh I do like to still be

0:42

touching code today. I think one of my

0:44

favorite projects that I work on is

0:45

building our own agent Alex into our

0:47

platform. So I'm very familiar with all

0:49

of the pain points um and how important

0:51

it is to optimize your prompt. So I'm

0:54

going to spend a little bit time on

0:55

slides. I like to like just set the

0:56

scene, make sure everybody here has

0:58

context on what we're going to be doing

0:59

and then we'll jump into the code with

1:01

me. So, I'll let you do a little bit of

1:02

an intro.

1:03

>> Yeah, thank you so much, Ellen. Great to

1:05

meet all of you. Excited to be walking

1:07

through prompt learning with you all. I

1:09

don't know if you got a chance to see a

1:10

harness talk yesterday, but hopefully

1:12

that gave you some good background on

1:15

how powerful prompting and prompt

1:16

learning can be. Uh, so my name is I'm a

1:19

product manager here at Arise as well.

1:21

And like Sally said, we like to stay in

1:23

code. We'll be doing a few slides, then

1:25

we'll walk through the code and we'll be

1:26

floating around helping you guys debug

1:28

and things like that. My background is

1:30

also technical. So, I was a backend

1:32

distributed systems engineer for a long

1:34

time. So, no stranger to how important

1:36

observability infrastructure really is.

1:38

Um, and I think it's an appropriate

1:40

setting in AWS for that. So, yeah,

1:42

excited to dive deep into front loading

1:44

with you all. Thank you.

1:46

>> Awesome. All right, so we're gonna get

1:47

started. Just give you a little bit of

1:48

an agenda of the things I'm going to be

1:50

covering. Uh, so we're gonna talk about

1:51

why agents fail today. what is evening

1:53

prom learning? I want to go through a

1:55

case study kind of show youall why this

1:57

actually works. Uh and we'll talk about

1:59

learning versus GA. I think everybody I

2:01

had a few people come up to me over the

2:02

conference about like what about GEA? Uh

2:04

we have some benchmarking against that

2:05

and then we'll hop into our workshop. Um

2:08

but with this I want to ask a question.

2:09

How many people here are building agents

2:11

today?

2:12

>> Okay, that's what I expected. Um and how

2:14

many people actually feel like the

2:15

agents they're building are reliable?

2:19

>> Yeah, that's what I also thought. So

2:21

let's talk a little bit about why agents

2:22

fail today. So why do they fail? Well,

2:25

there's a few things that we're seeing

2:26

with a lot of our folks and we're seeing

2:27

even internally as we build with Alex

2:29

for why agents are b breaking. So um I

2:33

think that a lot of times it's not

2:34

because the models are weak. It's a lot

2:36

of times the environment um and the

2:38

instructions are weak. So uh having no

2:41

instructions um from their learned

2:43

environment uh no planning or very

2:45

static planning. I feel like a lot of

2:47

agents right now don't have planning. We

2:49

do have some good examples of planning

2:50

like we have cloud code cursor. Those

2:52

are really great examples but I'm not

2:53

seeing it make its way into every agent

2:56

that I come across. Uh missing tools big

2:58

one. Sometimes you just don't have the

2:59

tool sets that you need. Uh and then

3:01

missing kind of tool guidance on like

3:03

which of the tools we should be picking

3:04

and then context engineering continues

3:06

to be a big struggle for folks. If I

3:10

were to distill this out, I think it's

3:12

like these three core issues. So

3:14

adaptability and selfarning. Um so no

3:17

system instructions learned from the

3:19

environment touched on determinism

3:20

versus non-determinism balance. So

3:23

having the planning um or no planning

3:25

versus doing like a very static

3:26

planning. You want to kind of have some

3:28

flexibility there. And then context

3:30

engineering I think is a term that just

3:32

kind of emerged in the last like you

3:34

know six to eight months but it's

3:35

something that's really really important

3:36

that we're finding you know missing

3:38

tools tool guidance just not having

3:40

context or confirming your data and not

3:42

giving the LM enough context. So these

3:46

are um kind of the core issues to still.

3:48

But I think there's one other pretty

3:50

important thing. Um and that is kind of

3:52

this distribution of who's responsible

3:54

for what. So um there's these technical

3:56

users, your AI engineers, your data

3:58

scientists, developers, and they're

3:59

really responsible for the code

4:00

automation pipelines actually, you know,

4:03

managing the performance and costs. But

4:05

then we have our domain experts, subject

4:07

matter experts, AI product managers.

4:09

These are the ones that actually knew

4:10

what the user experience would be. they

4:11

probably are super familiar with um the

4:14

principles that we're actually building

4:16

to our AI applications. They're tracking

4:18

our evals and they're really trying to

4:20

ensure that the product success. So

4:21

there's this split between

4:22

responsibilities but everybody is

4:24

contributing but then there's this

4:26

difference um in terms of like maybe

4:29

technical abilities. And so with prompt

4:30

learning it's going to be a combination

4:32

of all these things. So everybody's

4:33

going to really need to be involved and

4:35

we can talk about that uh a little bit

4:37

more. So [clears throat] what even is

4:39

prompting? I'm going to first kind of go

4:41

through some of the um approaches that

4:45

we kind of borrowed when we came up with

4:46

prompt learning. So this is something

4:47

that Arise has been really really uh

4:49

dedicated to doing some research. And so

4:51

one of the first things we borrow from

4:53

uh which is reinforcement learning. How

4:55

many folks here are familiar with how

4:56

reinforcement learning works? All right,

4:58

cool. Um so if I were to give like a

5:00

really like silly kind of analogy, we

5:02

have a reinforcement model. Uh pretend

5:04

it's like a a student brain that we're

5:06

trying to kind of, you know, boost up.

5:08

And so they're going to take an action

5:10

uh which might be something like you're

5:11

just going to take a test an exam and

5:13

there's going to be a score. A teacher

5:14

is going to come through and actually

5:15

you know score the exam here um that's

5:17

going to produce this kind of like

5:19

scaler reward um and you know pretend

5:22

the student has an algorithm in their

5:23

brain that can just kind of take those

5:24

scores and update the weights in their

5:26

brain and kind of like the learning

5:28

behavior there and then we kind of

5:29

reprocess. So you know in this kind of

5:31

reinforcement one we're updating weights

5:33

based off of some scalers. Um, but it's

5:36

really actually difficult to update the

5:37

weights directly, especially in like the

5:40

LLM world. So, reinforcement learning

5:42

isn't going to quite work that well uh

5:44

when we're we're doing things like

5:46

prompting. So, then there's

5:48

metaprompting, which is very close to

5:50

what we do with uh prompt learning, but

5:53

still not quite right. So, here with

5:56

metal prompting, we're asking LM to

5:58

improve the prompt. Uh, so again, we use

6:00

that kind of like student example. We

6:02

have an agent which is our student. Um,

6:04

and it's going to produce some kind of

6:05

output like that's a user asking a

6:07

question getting an output. That's our

6:08

test in this example. And then we're

6:10

going to score. Eval is pretty much what

6:12

you can think of there. Uh, where it's

6:14

going to output a score and from there

6:17

we have like the metapromp thing. So now

6:18

the teacher is kind of like the metapar

6:20

prompt. It's going to take the result uh

6:22

from our scorer and update the prompts

6:25

based off of that. Um, but it's still

6:28

not quite what we want to do. And that's

6:31

where we kind of introduce this idea of

6:32

prompt learning. So prompt learning is

6:34

going to take the the exam going to

6:36

produce an output. Um we're going to

6:38

have our enlumm evals on there. But

6:40

there's also this really important piece

6:42

which is the English feedback. So which

6:44

answers were wrong? Why were the answers

6:46

wrong? Where the student needs to

6:47

actually study? Really pinpointing those

6:49

issues. And then we still aren't using

6:51

metapro. We still are asking an LLM uh

6:54

to improve the prompt. It's just the

6:55

information that we are giving that LLM

6:58

uh is quite different. And so we're

7:00

going to update uh the prompt there with

7:02

all of this kind of feedback. So from

7:05

our evals from a subject matter expert

7:07

going in and labeling and use that uh to

7:10

kind of boost our prompt with better

7:11

instructions and sometimes exams.

7:15

So this is kind of like the traditional

7:17

prompt optimization where it's like we

7:19

have we're kind of treating it like an

7:20

ML where we have our data and we have

7:23

the prompt. We're saying optimize this

7:24

prompt and maximize our like prediction

7:26

impulse. Um but that doesn't quite work

7:29

uh for Allens were missing a lot of

7:31

context. So what we really found um is

7:34

that the human instructions of why it

7:37

failed. So imagine you have your

7:38

application data, your traces, a data

7:40

set, whatever it is. Your subject matter

7:42

expert goes in and they're not only

7:44

annotating correct or incorrect. They're

7:46

saying this is why this is wrong. It

7:48

failed to adhere to this key

7:49

instruction. It didn't adhere to the

7:51

context. It's missing out whatever it

7:53

is. Um, and then you also have your ego

7:55

explanations from Ellen as a judge,

7:56

which is same kind of principle where

7:58

instead of just the label, it provides

7:59

the reasoning behind the label. And then

8:02

we're pointing it at the exact

8:03

instructions um to change. We're

8:05

changing the system prompt to help it

8:07

improve so that we then get, you know,

8:09

prediction labels, but we also get those

8:11

evals um and explanations of it. So,

8:14

we're just kind of optimizing more than

8:16

just um our outlet here. And I think a

8:19

really key learning that we've had is

8:21

the explanations in human instructions

8:23

or through your own as a judge. That

8:25

text is really really valuable. I think

8:27

that's what we see not being utilized in

8:29

a lot of other broad optimization

8:31

approaches. Um they're either kind of

8:33

optimizing for a score uh or they're

8:35

just paying attention to the output. But

8:38

you can think of it this way. It's like

8:39

these elements are operating in the text

8:40

domain. So we have all this rich text

8:42

that tells us exactly what it needs to

8:43

do to improve. why wouldn't we use that

8:45

to actually improve our so um that's

8:50

kind of the basics of prompt learning

8:53

but everybody always comes up to me and

8:54

like sounds great s but does it actually

8:57

work um it does and we have some

8:59

examples of when we do this so we did a

9:01

little bit of a case study um I think

9:03

coding agents everybody is pretty much

9:05

using them at this point there's a quite

9:07

a few that have been really really

9:08

successful I think cloud code is a great

9:10

example cursor but there's also client

9:12

uh which is more of a um an open version

9:15

of this and so we decided to take a look

9:19

and compare to see if we could you know

9:21

do anything to improve. So these are

9:22

kind of the the baseline of where we

9:24

started here. Um you can see the

9:26

difference between the different models.

9:28

U obviously using two and throttle kind

9:30

of the state-of-the-art there but we

9:32

also had this opportunity where CL was

9:34

using you know 45 and it was working

9:37

decently well at 30% versus 40. Um and

9:40

then there was kind of the conversation

9:42

around. So this is where we started um

9:45

and we took a pass optimizing the system

9:48

prompt here. So you can see this is what

9:50

the old one was looking like. It has

9:52

like no rules section. So it was just

9:54

very like you are a cloud agent. You're

9:56

built on this model. You're you're here

9:58

to do coding. Um but there was no rules

10:01

and so we took a pass at updating the

10:04

system. So there were all of these

10:05

different uh rules associated. So when

10:07

dealing with errors or exceptions,

10:08

handle them in a specific way. make sure

10:10

that the changes align with, you know,

10:12

the systems design. Um any changes to be

10:15

accompanied by appropriate test. So

10:17

really just kind of building in like the

10:18

rules that like a good engineer would

10:20

have uh which was completely missing

10:22

before. Um and so we found that plan

10:25

performs better with updated system

10:27

problem. Pretty kind of simple. It's

10:28

kind of the whole concept here. It's

10:30

like you can see these different

10:31

problems and we're seeing you know

10:33

things that were incorrect now being

10:34

correctly done just by simply adding

10:36

more instructions. [clears throat]

10:38

So it really demonstrates pretty well

10:40

here um how those system prompts can

10:43

improve and we benchmarked again with a

10:46

s bench light to get another just like

10:48

kind of coding uh benchmark for these

10:50

coding agents and we were able to

10:52

improve by 15% just through the addition

10:56

of rules. Uh so I think that that's

10:57

pretty powerful. So no fine-tuning, no

10:59

tool changes, no architecture changes. I

11:01

think those are the big things folks

11:02

like reach for when they're trying to

11:04

improve their agents. Uh but sometimes

11:07

it's just about your system prompt and

11:08

just adding rules. I think we've really

11:09

seen that and that's why we're really

11:10

passionate about prompt learning and

11:12

prompt optimization in general is it

11:14

feels like the lowest lift way to get

11:16

massive improvement gains in your agent.

11:18

Uh 4.1 achieved performance near 4.5

11:21

which is pretty much considered right

11:22

now state-of-the-art when it comes to

11:24

coding questions and it's twothirds of

11:26

the cost which is always uh really

11:28

beneficial. So uh these are some of kind

11:31

of the tables here. will definitely

11:32

distribute this so you can kind of take

11:33

a closer look. But I think the main

11:35

point I want y'all to come away with is

11:37

the fact that like, you know, 15% is

11:39

pretty, you know, powerful uh

11:41

improvement in our performance.

11:44

Now, a question we get all the time is

11:46

we're taking these examples of perform

11:48

learning. So, how this is really

11:49

important is we're going to take a data

11:50

set. A lot of time that data set is

11:52

going to be a set of examples that

11:54

didn't perform well. either a human went

11:56

through and uh labeled them and found

11:58

that they you know were incorrect or you

12:00

have your emails that are labeling them

12:02

incorrect and so you've gathered all

12:04

these examples and that's what we're

12:06

going to use to optimize our prompt. So

12:07

I get a question all the time like well

12:09

aren't we going to overfitit uh based

12:11

off of these bad examples but there's

12:14

this rule of generalization where

12:16

mending properly enforces high level

12:18

reusable coding coding standards rather

12:19

than repo specific fixes and we are

12:22

doing this train test split uh to ensure

12:24

that the rules are generalized beyond

12:26

just like local quirks and whatever our

12:28

uh training data set is. But if you kind

12:31

of think of this as like you hire an

12:33

engineer, right, to to be an engineer at

12:35

your company, you do kind of want them

12:36

to overfit to the database that they're

12:38

working on. So, uh we kind of feel that

12:40

overfitting is maybe a better term for

12:43

it is expertise. Uh we are again not

12:45

kind of training in the traditional

12:47

world. We are trying to build expertise

12:49

and as we'll talk about this is not

12:50

something we feel that you do once.

12:52

You're actually going to kind of

12:53

continuously be running this. So, um

12:55

more maybe questions before we get to

23:05

the bridge club? Any

23:08

switch back?

23:13

All right. Um, so here is going to be a

23:17

QR code uh for our prompt learning repo.

23:20

Um, so I'll give everyone a few minutes

23:22

to get such with that. Get it on your

23:25

laptops. I know it's a little bit clunky

23:27

to add this QR and like airdrop it. was

23:30

not sure a better way. Um I can just

23:33

show you also here if you want to find

23:35

it. Um it is going to be in our Rise AI

23:39

uh repo here and under prompt learning

23:42

and you just want to kind of clone that.

23:43

We are going to kind of be running it uh

23:45

locally here.

23:48

>> You go back to the page with the URL.

23:50

>> Yes. Sorry about that.

23:55

>> no, the page with the URL. Oh,

24:08

>> we'll give folks just a few minutes to

24:09

get

24:10

>> What do you What's your process when

24:12

you're building a new agent or work for

24:16

anything that could be evaluated? Do you

24:18

guys start by just like, oh, try

24:22

something prototype and then see where

24:23

it's bad and then do eval?

24:25

[clears throat]

24:26

>> Yeah, I think there's different

24:27

perspectives on this. Our perspective is

24:29

EOS should never block you. Like you

24:30

need to get started and you need to just

24:32

build something really scrappy. We don't

24:34

think like you should, you know, waste

24:35

time doing eval. I think it's helpful to

24:37

pull something out of the box sometimes

24:39

in those situations just because it's

24:41

hard to comb through your data. like

24:42

that's something we've experienced with

24:43

Alex of like when you're getting started

24:45

just running a test manually reviewing

24:47

like it it's kind of painful. Um so I

24:50

think that having eval is helpful but

24:52

shouldn't be a blocker. Pull something

24:53

off the shelf maybe start with that then

24:55

as you're iterating you're understanding

24:56

where your issues are then you're

24:57

starting to refine your evals as you're

24:59

refining your agent.

25:01

>> Yeah.

25:03

One last question. Yeah.

25:06

>> So it makes sense to like optimize the

25:07

system

25:10

like sub aents or commands or how are

25:13

you thinking about this like multi-

25:15

aent?

25:16

>> Yeah. So the question is is like are you

25:18

just doing one single prompt or how do

25:20

you think about this in a multi- aent? I

25:22

think we're kind of thinking that this

25:23

right now is kind of independent tasks

25:25

that can optimize your prompts kind of

25:27

independently and then running tests um

25:29

to get into like the agent simulation of

25:30

running them all together. But right

25:32

now, our approach is a little bit

25:33

isolated, but I definitely see a future

25:35

where we're going to kind of meet the

25:37

the standard of like sub agents and

25:39

everything else that's going on right

25:41

now.

25:43

>> No, I think that's pretty accurate. And

25:44

also like I mean even in a single agent

25:47

use case versus like a multi- aent use

25:50

case like ultimately like each of those

25:51

agents may be specialized. They may have

25:53

their own prompts that they need to

25:55

learn from. So I think doing this in

25:58

isolation still has benefits for the

26:00

multi- aent system as a whole that can

26:02

pass on over time in scenarios like hand

26:04

off etc and making something like really

26:07

really specialized. So I guess like what

26:09

we're talking about with like the

26:11

overfitting as well which is again like

26:13

question we get all the time but really

26:15

you want to be over fit on your code

26:17

base as an engineer. Um you don't want

26:19

to be so generalized that you're no

26:21

longer good at picking up specific works

26:23

in your code base.

26:25

Yeah.

26:28

>> All right. Everybody kind of getting to

26:30

read the mode. Okay. Anybody need any

26:32

help?

26:36

>> All right. So, we are going to be using

26:38

OpenAI for this. So, I think the next

26:40

thing that I'll have everyone do is

26:41

probably spend some time just grabbing

26:43

your API key. We'll get to it and then

26:44

I'll just kind of start walking through

26:47

our notebook here.

26:51

So, we are going to be doing a JSON

26:52

webpage prompt example. So, you're going

26:54

to find that under notebooks here. Um,

26:57

and so we'll give everybody a second to

26:59

pull it out. There's going to be just

27:00

some slight adjustments we're going to

27:01

add to this example uh just to make it

27:04

run a little faster and work a little

27:05

better. The first is um what this is

27:09

even doing this is going to be a very

27:10

simple example uh for just a JSON web

27:14

page prompts. If anybody has like a

27:15

prompt or use case that they want to

27:17

kind of like code along, Van and I are

27:19

absolutely help like glad to help kind

27:21

of adapt what you're working on to the

27:23

use case here. It's something very

27:24

simple just to kind of demonstrate um

27:27

the the principles and we are going to

27:29

be using we can definitely experiment.

27:32

If you want to swap out any other

27:33

providers that you want to use, we can

27:34

also definitely help you do that.

27:38

Um but the the goal of this is

27:39

essentially going to be to iterate

27:41

through different versions of a prompt

27:43

using a data set. Um

27:46

and we will optimize. So the first thing

27:48

is obviously we need to do some

27:49

installs. Um I am just going to have you

27:51

all update it. It says like greater than

27:54

2.00.

27:55

Uh but we're going to actually just use

27:57

I think 22 today.

28:02

And then the next thing is just to make

28:03

this run a little faster. So we're going

28:05

to run things in async which is missing.

28:08

So you can go ahead and add these lines

28:11

in the cell as well. All right, everyone

28:14

kind of follow along and I never know

28:16

want to move too fast. Seems to head

28:18

nuts. Cool. Let's talk about

28:19

configuration. So um I kind of talked

28:21

about it a little bit when I was going

28:23

through the slides. So we are going to

28:24

be doing some looping. So the general

28:26

idea is is we start out with the data

28:28

set uh with some feedback in it and

28:30

we'll we'll look through the data set

28:31

once we get it. Um, but you're going to

28:33

want to have either human evaluation.

28:36

Um, so like annotations, either free

28:38

text, labels, um, or you're going to

28:40

want to have some evaluation data. But

28:41

the feedback is really important. That's

28:43

what makes this kind of work. Um, we're

28:45

going to then, you know, pass that to

28:48

Allen to do the optimization and then

28:50

it's going to basically have eval. So as

28:53

it's optimizing, it's using that kind of

28:55

data set to then run and assess whether

28:57

or not it should, you know, kind of keep

28:59

optimizing. Um, and then it also

29:02

provides you data that you can kind of

29:03

like use to gauge which of the prompts

29:05

that it outputs um, in you know a

29:09

production setting. So we're going to do

29:11

some configuration. Um, so I've kind of

29:13

wrote out here kind of what each of

29:14

these means. So we have the number of

29:16

samples. So this controls how many rows

29:18

of the sample data set. Um, you can, you

29:21

know, set to zero to use all data or you

29:23

can, you know, use a positive number to

29:25

limit for, you know, faster

29:26

experimentation. So I think that

29:28

sometimes folks use different um

29:30

approaches here. Sometimes you want to

29:32

just move really quick so you set a low

29:34

sample. Sometimes you want to be a

29:35

little bit more representative so you up

29:37

it. Um I have it here set as 100. Feel

29:40

free to adjust. Um and then the next

29:42

thing is train split. Um so I think

29:45

folks are probably pretty familiar with

29:47

the concept here of like a train test

29:49

split, but it's just how much of the

29:50

data do we want to use into our

29:52

training? Again, that's what we're using

29:53

to actually optimize. Then how much of

29:55

it do we want to use when we're testing

29:57

when we're running the eval um on the

29:59

new prompt? Um

30:02

and there's number of rules. Uh

30:04

basically the specific number of rules

30:06

to use for evaluation. This just

30:09

determines which prompts to use. Um and

30:12

so this is like as we're running these

30:13

loops, we're outputting, you know, a

30:15

bunch of different prompts. So this is

30:17

just saying how many um we should use

30:19

for evaluation. And then key one here,

30:22

number of optimization loops. So this

30:24

sets how many optimization iterations to

30:26

run per experiment. Um and each loop

30:30

basically generates those outputs,

30:31

evaluates them and refineses the prompt.

30:34

And so these just control the experiment

30:35

scope the data splitting um just went

30:38

through the whole prompt learning loop

30:39

and and how much data we want to use. So

30:42

you can kind of just run these as you

30:45

are or if you want to adjust them feel

30:47

free. Uh and then the next step pretty

30:49

simple. We're just going to uh grab that

30:51

open AI key if you haven't already uh

30:54

set that up. So, get passage is going to

30:56

like pop up. Um I'll show you here

30:58

quick. It's going to pop up there. You

30:59

can just paste in your API key there

31:03

before we start looking at the the data

31:05

a little bit.

31:12

Just if anybody runs into any issues,

31:14

you just give this away. All right. I

31:16

think this particular

31:18

we get through this

31:26

>> I'm doing good but if you have a free

31:27

one you want to give me

31:30

that

31:31

>> I wish

31:34

>> all right let's talk about the data so

31:36

we provided data with you with queries

31:39

um you can see here that we're doing the

31:40

8020 split based off of kind of

31:43

configuration we set above I'm just

31:45

going to pull this um train set here and

31:48

let's just

31:49

>> Yeah, I run because in the minus

31:54

31:55

>> Oh, yep. You're right. That's a mistake

31:57

on my part.

32:01

Yeah, it is the 50. Um

32:04

let's take a look at what this data set

32:08

looks like. No. Uh just so folks can

32:10

kind of understand. Um so kind of

32:13

starting here with some just basic input

32:15

and output. Um

32:19

transcept we don't have any of the the

32:21

feedback in these rows that I printed

32:23

out here but you can imagine you can

32:24

have different uh correctness labels

32:26

here explanations any real validation

32:28

data can be whatever it is that um you'd

32:30

like it to be. Some folks use multiple

32:32

eval

32:34

feedback sometimes it's a combination

32:36

but you really want to have you know the

32:38

input and output that will use that way.

32:40

Should my output of train set be the

32:42

same as you?

32:44

>> Not necessarily. Depends on

32:46

>> I didn't know if head was sort or not.

32:50

>> It all depends on kind of what the the

32:52

same but we could look at like you know

32:53

if I did this this should be the same

32:55

for you maybe just to make sure.

33:04

>> Yeah.

33:06

>> That's what you're saying. Okay. Yeah.

33:09

Quick question. [clears throat]

33:10

Um, is it possible for the input to be

33:13

like a chat history and not just

33:17

>> Great question. So, I think it depends

33:19

on like what it is you're trying to do.

33:21

If you're doing just like a simple kind

33:22

of uh system of the input, you kind of

33:24

want it to be one to one. You don't want

33:26

to give it a ton of um like conversation

33:28

data that's not relevant to the prompt

33:30

that you're optimizing. um we we

33:32

generally just use like the single input

33:34

but I think that there are applications

33:35

that you could do like conversation

33:37

level um inputs.

33:39

>> Yeah. Because because quite often the

33:41

failure is somewhere middleation

33:45

right. So so if you put just the

33:47

original task in uh then the probability

33:51

of you hitting you know a failure in the

33:53

middle of the

33:56

>> totally. So in that case, what we

33:57

generally see is like different rows of

33:58

like having each of like the back and

34:00

forths be like kind of independent rows

34:02

because you're probably going to

34:03

evaluate each of them and um honestly

34:07

probably like get the human feedback on

34:08

each of them. So we usually separate

34:10

them out in that way.

34:12

>> But it's a good point. If you just

34:13

always are focusing on the first turn,

34:15

there's probably a lot of redundancy

34:17

there. uh you definitely will have to

34:19

like say over parts of the conversation

34:21

>> and and how we can biferate like

34:24

instructions and we have some context

34:26

also. So

34:27

>> should not touch the context. It should

34:29

only uh whatever the manipulate the

34:32

system instruction or the prompting

34:35

context it should be the static it

34:37

should not be like

34:39

based on the answer it will change my

34:41

context.

34:42

>> Yes. What you're saying is like looking

34:43

at the input there might be like a tool

34:44

volume context you're kind of passing

34:46

that in. You can absolutely include that

34:47

in your data set um so that the

34:49

application kind of understands what

34:51

other or not the application

34:52

[clears throat] but the prompt learning

34:53

um LM can understand all of the data

34:57

that's kind of like available. So you

34:58

can just have that passed in as extra

35:00

column if you want. Most people start

35:02

with just kind of input and the

35:04

feedback. Um but you can absolutely add

35:06

what other data you think is relevant

35:10

and if when for the rerunning when we're

35:12

doing the experiment of testing you'll

35:14

definitely always want to have the data

35:15

that would be required to answer

35:17

>> any even very simple some call some call

35:21

or some context it is pulling some API

35:24

call

35:26

whatever the prompt engineering it

35:28

should be based on the out

35:31

getting the output right and whatever

35:33

the context front my plus whatever the

35:37

tool call I have done API call all the

35:40

uh contact engineering and then last

35:43

finalize

35:44

>> totally yeah so again at this point

35:47

we're we're testing just like one prompt

35:48

and not that kind of end to end but you

35:50

definitely want to have everything that

35:52

like is flowing into the prompt that

35:53

you're optimizing so uh if your system

35:55

prompt takes in the user input for

35:56

example some data from an external API

35:59

you would definitely want to provide all

36:01

of that data does that make sense

36:04

Because because you're saying that like

36:06

the the like trajectories

36:08

[clears throat] the like tool calls and

36:09

what the agent's going to do depending

36:11

on what the tool call was is what you're

36:13

trying to proper to.

36:14

>> Yeah, exactly. We want to just like

36:15

because we're kind of trying to replay

36:16

and optimize one step of it. We

36:18

definitely don't want to do it

36:19

completely in isolation. So if there's

36:21

like data that flows into that prompt um

36:23

that's context that's using that's

36:25

producing the output, right? So we want

36:26

to be sure that we're including that. We

36:28

don't want to exclude anything. But if

36:29

it's data that comes like at a different

36:32

step probably not then you don't want to

36:34

do that that way. It's just like think

36:36

about what's relevant for the the step

36:37

that we're trying to optimize in this.

36:44

All right. Any other questions coming?

36:48

All right. Cool. So we're going to set

36:49

up our initial system prompt. You can

36:51

see this is something very very basic.

36:53

Uh we'll definitely I think we can do a

36:54

whole lot better than this, but I just

36:56

kind of want to illustrate something uh

36:58

that we're going to test and optimize.

37:01

So we're just saying you are an expert

37:02

in JSON web page creation. Your task is

37:04

input. And then so all these inputs that

37:06

we're seeing are going to be what we're

37:08

actually generating outputs for and

37:10

trying to optimize. Now I already kind

37:12

of touched on this. Um evaluators are

37:15

extremely important to make all of this

37:18

work, right? Um so we're going to uh

37:20

initialize two evaluators that use

37:23

elements as a judge to assess the

37:24

quality of generated outputs. So we are

37:26

using elements a judge. If you have any

37:28

other like codebased evaluations,

37:29

whatever you need to do to evaluate, you

37:32

can definitely swap those out. Uh but

37:34

we're going to do evaluate output. This

37:36

is going to be a comprehensive evaluator

37:37

that assesses the JSON webpage

37:39

correctness against the input query and

37:41

the evaluation rules. It's going to

37:43

provide an output label of correct or

37:45

incorrect. So pretty simple binary.

37:47

Again, you can use multilel. And then

37:50

it's going to have the detailed

37:51

explanations as well. Um, and then we

37:54

have a rule checker. This is a more

37:55

specialized evaluator that performs a

37:57

granular rule by rule analysis. Um, and

38:01

it examines if each rule um was

38:04

compliant.

38:06

And then both of these are going to

38:07

generate feedback that goes into our

38:08

optimization loop uh to iteratively

38:11

improve the system prompt. Um,

38:13

explanation role violations guide. Um

38:15

and we'll get to this the prompt

38:16

learning optimizer and creating the more

38:19

effective prompts. So I have some

38:21

imports here. Let's take a look at what

38:22

the actual eval output has. Um so we do

38:26

have some rules that are in um in here

38:33

wait

38:36

um they're going to be in a repo. Um so

38:38

we're going to open that as a file. We

38:40

have this llm provider and we're using

38:41

open AI here. And then we're going to do

38:44

our classification evaluator. So, uh

38:46

we're just calling it uh evaluate

38:48

output. It's al we have an evaluation

38:51

template that we're reading from the

38:54

bottom here. Um then we just have

38:56

choices correct and correct. Now we're

38:57

mapping a label to a score. Sometimes

38:59

it's helpful to be able to like add or

39:00

score. Sometimes a number is easier than

39:02

just looking at a bunch of labels. Uh it

39:04

is optional you want to map these if you

39:06

have like a multiclass use case. You can

39:08

set the scores u accordingly. But these

39:10

are just going to be our choices like

39:11

the rails that we want our elements as a

39:13

judge to adhere to. And then all we're

39:16

doing here is getting our results. I

39:18

have it doing some printing so you can

39:20

kind of take a look. So this is going to

39:21

be slightly different than what you're

39:22

seeing in the notebook. So I'm just

39:23

going to pause here. Uh if you want to

39:25

make the code changes from what you're

39:27

seeing in probably your version, this is

39:29

a a good time for that.

39:32

Does kind of the setup of the evaluator

39:34

make sense to all kind of the key. It's

39:37

going to be the rails. It's going to be

39:38

the output. Uh and of course our

39:40

template. [clears throat]

39:44

>> Yeah, you will want to grab your own uh

39:46

OpenAI key here uh to set

39:52

[clears throat]

39:52

>> and we can help you if you want to use

39:53

different provider. We can help you swap

39:55

this out like that is helpful to

39:57

anybody.

39:59

>> Okay, I'm going to start walking you

40:01

through the output generation. So uh

40:05

this is just kind of you know you can

40:06

imagine this as your own agent logic or

40:08

the the part that you're kind of

40:10

testing. Uh this is just going to

40:12

function that actually generates the

40:13

JSON u outputs. We're using for one here

40:16

with JSON response format zero

40:19

temperature for consistent outputs. Um

40:21

it's taking a data set a system prompt

40:23

generates outputs for all rows returns

40:25

the results for evaluation. Um and it's

40:28

called during each iteration to produce

40:30

output. So this is like our

40:31

experimentation function that we're

40:32

writing. So as we're passing in data,

40:34

it's producing new uh prompts. We need a

40:37

way to test it, evaluate, understand uh

40:39

how we are kind of moving the needle

40:41

here. So that's all this is. So it's

40:43

pretty straightforward function just

40:44

called generate output. We have that

40:46

output model. Again, we're using OpenAI.

40:48

If anybody wants help switching things

40:50

around, happy to help. Uh we are using

40:52

response format because we are dealing

40:54

with JSON here. So uh we know that what

40:56

you just prompted. I mean some of the

40:57

the newer models are decent at it, but

41:00

using response format is really helpful.

41:02

And then we're also setting temperature

41:03

to zero. Um, and here is just kind of

41:06

where we're passing all the data in. So

41:08

the data set because again we want to

41:09

run this on all of the the testing data,

41:13

the system prompt that will be input. So

41:15

as we get to the optimization loop,

41:17

we're going to be passing in a new

41:18

prompt to this with the data set and

41:20

then evaluating. Um, we have our output

41:23

model that we've already passed,

41:24

concurrency, all that good stuff. And

41:25

it's just returning all of the outputs

41:28

there. Would you for the uh the current

41:32

generation of models since this one's

41:33

basically like in in AI terms ancient uh

41:36

would you like still recommend setting

41:38

the temperature to zero or would you

41:40

actually want to try to encourage some

41:41

of the creativity to like

41:43

>> I think it depends on the use case a

41:45

little bit and what you're you're trying

41:46

to do. You can definitely experiment

41:47

that and kind of take it through the

41:48

lens of how how important is consistency

41:50

to use something like I feel like JSON

41:52

web page I feel like consistency

41:53

probably like temperature zero makes

41:55

sense but I definitely think not for

41:57

every agent every use case do you want

41:58

to use zero

42:03

any other questions get moving all right

42:06

additional metric so we kind of talked

42:07

about before that we are kind of using

42:09

some score mapping uh this part is

42:11

optional you want to use the metrics

42:13

that make sense [clears throat] to you

42:14

we're not directly using this um as like

42:17

we are kind of like using it to know

42:20

whether or not we optimize but it's not

42:21

like we're you know using this as our

42:24

sole kind of indicator for the success.

42:26

Uh here we are just going to calculate

42:30

some very basic metrics. Um it's just

42:33

you can you know choose something like

42:35

accuracy, F1 precision, recall just some

42:38

basic kind of classification metrics for

42:40

us to understand and because we are

42:42

using binary mapping scores we can do

42:43

that. Um and so that's what you're

42:46

seeing happen here. We're mapping to

42:47

binary and then just based off the score

42:48

we calculate the metric. So very simple

42:52

uh helper function here.

42:57

All right, the good stuff the

42:58

optimization loop. We made it. Um okay,

43:00

so this cell implements the core prompt

43:02

optimization algorithm. It's a

43:04

three-part process. Uh so we want to

43:06

generate and evaluate. So generate

43:07

outputs using the current pump on the

43:09

test data set and evaluate their

43:10

correctness. Uh we want to train and

43:13

optimize. If results are unsatisfactory,

43:15

generate uh outputs on the training set,

43:18

evaluate them, use the feedback to

43:19

improve the prompt, and then iterate. So

43:22

we kind of want to repeat until either

43:23

the threshold is met or all the loops

43:25

are u kind of completed. So if you

43:28

remember above, um we're kind of setting

43:31

that to just like five loops. Um and

43:34

then you know we can kind of repeat um

43:37

based off of that or the thresh met um

43:41

it's going to track metrics across all

43:42

the iterations. So turn to detailed

43:44

results including a train test accuracy

43:46

scores the optimized prompts and the raw

43:48

value. So as I kind of mentioned at the

43:50

beginning as we're running these

43:51

different loops on the experiments we're

43:53

going to be producing a lot of different

43:54

prompts. Um and so we're kind of getting

43:56

that information back that you can use.

43:59

Um and these are our key parameters.

44:01

I'll kind of go through them, you know,

44:02

as we get to the code, but just to give

44:04

you a heads up. Uh, this is the target

44:06

accuracy score to stop optimizations.

44:09

Um, it could also be whatever other

44:10

metric you'll see, we have a score so

44:12

you can kind of determine the number of

44:14

loops of the optimization iterations.

44:16

We've set that score and then the number

44:18

of rules. Again, these are some

44:20

configurations we've already set.

44:23

Um, cool. So, optimization loop. This is

44:27

um going to take in all of those um you

44:30

know parameters that I've mentioned

44:32

there. Um it just kind of kicks off

44:34

saying hey we're starting um it's going

44:37

to do the initial evaluation so we

44:39

understand uh how things are starting

44:41

off. Again you can kind of pass in data

44:44

too. You can kind of skip this initial

44:45

evaluation. We're kind of running it um

44:48

at the start here. But if you were

44:49

running production setting, you might

44:50

already have evalu.

44:55

Um, and then it's going to assess the

44:58

threshold against kind of our initial

45:00

valuation. Again, this could kind of be

45:01

skipped when we're coming from a

45:02

production setting, but wanted to kind

45:04

of start us off from scratch so that we

45:06

can get a real feel for this. Um, and

45:08

then it starts the loop. So, we're

45:10

generating output. Um, it's setting that

45:13

as the train output. So, when I printed

45:15

train, you kind of saw the outputs. I

45:16

kind of skipped ahead there. Um and then

45:18

it also will set um you know

45:20

correctness, explanation, any rule

45:22

violations. Um and then we'll actually

45:25

use our prompt learning optimizer. So

45:27

this comes with like the SDK uh the prop

45:29

learning SDK that you can use um with

45:31

the rise. Uh so we're sending in that

45:33

prompt optimization the b choice um and

45:36

then that API. So under the hood as we

45:39

talked about in the slides taking in

45:40

that feedback um taking in the original

45:44

prompt and trying to optimize to get

45:46

better results and then spinning out

45:47

prompt um and then can also add an

45:50

evaluator. So again those three um kind

45:53

of feedback columns we're looking to get

45:55

back as correctness explanation for that

45:57

if there are any rule violations and

45:59

then from there we just kind of kicked

46:02

off the optimizer and optimize with our

46:04

train set output those feedback columns

46:06

again and then you know any context size

46:08

limitations you want to add um next step

46:12

so the optimizer again is going to take

46:14

our data produce a prompt we want to

46:15

evaluate so we understand how we're

46:17

doing what this code block doing is

46:19

doing here so trying to get that new

46:21

prompting again with all those details

46:23

getting our result and then we do that

46:27

with our test set as well and then we're

46:30

getting back like our score and our

46:32

metric value and then doing the checks

46:34

and then we repeat it all again till we

46:36

either get above our threshold or we've

46:38

hit the max number of loops and then

46:41

returning our results. So that's kind of

46:43

what's going to be happening other here.

46:45

Any questions on that?

46:53

uh just some result saving function more

46:55

helper functions here. So we do want to

46:57

obviously save all these results. We

46:58

don't want them just be ephemeral that

47:00

we can't ever access again. So just

47:01

saving them all. Um you can also save

47:03

all the single experimentation so you

47:05

have all of that data towards the end.

47:06

We'll be able to kind of pull this um

47:08

and determine what the best prompt is.

47:10

But these are just very basic helper

47:12

functions. I don't spend too much time

47:13

just saving them to CSV at the end of

47:15

the day. Now we execution it. Um, so

47:19

this cell runs the prop optimization

47:21

experiment, saves the results. We're

47:22

getting the JSON format, the CSV format.

47:24

Um, it includes calls for the iteration

47:26

number, the number of rules, test,

47:28

train, accuracy scores, all the data

47:30

that we're actually going to need to

47:31

evaluate uh whether or not this thing is

47:34

successful, and then we're going to

47:35

start getting uh results here. So, um,

47:39

this does take quite a while to run. So,

47:41

we'll run and I think this will be a

47:42

great point for discussion, but as you

47:44

kind of are running it, you're going to

47:45

start seeing the different loops. um

47:48

kind of outputs coming out as well. Um

47:52

and yeah, we'll just kind of like work

47:54

through it as it it runs. It's probably

47:56

going to take like 20 30 minutes for

47:57

things to run, but um happy to take any

48:00

questions and help anybody out as they

48:02

run into issues.

48:04

>> One thing, can you scroll back to the

48:05

part of code that we needed to change?

48:12

>> Change. It's gonna be

48:22

>> So, one reminder

48:24

are running into this. I don't think I

48:25

was

48:26

>> for this line here, like when you're

48:28

doing install, you do want to be equals

48:30

2.2. Um, because I think there's a a

48:33

little bit of a package issue. Um, so

48:36

just make sure that's you're hitting

48:38

errors with the eval. If not, let me

48:40

down and try to fix it. This is the

48:43

reason why [clears throat]

48:45

>> uses like a generic evaluation.

48:48

>> Yes. And you can kind of see the

48:50

evaluation problem if you go to the

48:52

We've kind of just taken that part out

48:54

of this, but we can definitely go

48:55

through that. Um so if you look here um

48:58

on this line here, we're reading in

49:02

um under prompts here,

49:05

you can find the evaluation if you're

49:07

curious. [snorts]

49:12

And this is the reason why everyone

49:14

hates on docker. This is why we use

49:19

all.

49:20

>> Yes, absolutely.

49:23

>> The notebook.

49:27

>> So I would also recommend uh patching

49:30

your code with nestio if you haven't

49:32

already. Helps it run a lot faster. Also

49:34

for the purpose of the workshop um I

49:37

switched our loops to one. uh that took

49:39

me six minutes to run. So would

49:41

recommend also doing that instead of

49:43

having five. Obviously wouldn't

49:45

recommend doing that when you're

49:46

actually optimizing your prompt, but for

49:48

now it'll help you get through the

49:49

workshop.

49:52

>> All right, I just want to kind of call

49:53

out the the last little bit here. Um

49:58

>> the last step

50:02

before folks, let's see. Okay. Um so the

50:06

the last little bit of code here um is

50:08

just to extract the prompt that achieves

50:10

the best test accuracy. So I mentioned

50:12

how we're kind of like saving up all the

50:13

results to use. Uh we just have a

50:15

function that essentially gets the last

50:17

or the best uh version of that kind of

50:20

showing you the original and then the

50:21

best optimized version uh which you can

50:24

then use to kind of [clears throat] pull

50:25

and put into your um code. I did want to

50:27

kind of just give one kind of call out

50:29

as you kind of saw today can be a little

50:32

bit um difficult to to manage and so I

50:35

want to call out for those of you who

50:36

are kind of maybe looking for more of

50:37

like an enterprise solution to this in

50:39

Arise uh you do have these prompt

50:40

optimization tasks. Uh you can have your

50:42

prompts living in our prompt hub um data

50:45

sets with all of your human annotations

50:46

or ebal that you can either create from

50:48

traces or just by ingesting it into

50:50

Arise. Um, and then from there, all you

50:52

really need to do is like give it a task

50:54

name, choose what you want your training

50:56

data set to be, where the output lives,

50:58

where all your feedback columns are. Uh,

51:00

you can adjust all of the parameters uh

51:03

that you'd like. And then from there,

51:05

you can just like kick it off and it

51:06

will produce an optimized prompt in the

51:08

hub for you. Um, so if I go over here, I

51:10

think I have some. No, maybe not.

51:17

it will basically just create a new

51:18

version here that says it's optimized

51:19

prompts with all the results and we are

51:22

building on this so you can add all your

51:23

ebots to it have that all running in the

51:25

loop but just wanted to call out that if

51:27

you're not interested in maybe

51:28

maintaining code loops and having to

51:31

build uh like a task infrastructure

51:33

yourself it is something that we do

51:34

offer in Arise um but yeah hopefully I

51:38

know some folks are hanging out we'll be

51:39

sticking around here for a little while

51:41

as we um can help you kind of work

51:43

through issues But uh thanks so much for

51:46

joining us. Um hopefully you learned

51:48

something useful.

51:50

[music]

51:55

[music]

Interactive Summary

Ask follow-up questions or revisit key timestamps.

The video introduces "prompt learning," a method for optimizing AI agent performance. It begins by outlining common reasons for agent failures, such as weak instructions and lack of robust planning. The speakers then differentiate prompt learning from reinforcement learning and metaprompting, emphasizing its unique incorporation of detailed English feedback from both LLMs and human experts to precisely identify and address issues. A case study on coding agents demonstrates prompt learning's effectiveness, showing a 15% performance improvement simply by adding specific rules to the system prompt, without needing complex architectural changes or fine-tuning. The presentation also clarifies that "overfitting" in this context is re-framed as building