
Build a Prompt Learning Loop - SallyAnn DeLucia & Fuad Ali, Arize


Transcript


0:13

[music]

0:21

Hey everyone, gonna get started here.

0:23

Thanks so much for joining us today. Um,

0:25

I'm Sally. I'm a director of product here at Arize. I'm

0:28

going to be walking you through some of

0:29

our prompt learning work. Uh we're actually

0:31

going to be building a data-driven

0:33

optimization loop for the part of the

0:34

workshop. Um I come from a technical

0:36

background and started off in data

0:38

science before I made my way over to

0:40

product. Uh I do like to still be

0:42

touching code today. I think one of my

0:44

favorite projects that I work on is

0:45

building our own agent Alex into our

0:47

platform. So I'm very familiar with all

0:49

of the pain points um and how important

0:51

it is to optimize your prompt. So I'm

0:54

going to spend a little bit of time on

0:55

slides. I like to like just set the

0:56

scene, make sure everybody here has

0:58

context on what we're going to be doing

0:59

and then we'll jump into the code with

1:01

me. So, I'll let you do a little bit of

1:02

an intro.

1:03

>> Yeah, thank you so much, Sally. Great to

1:05

meet all of you. Excited to be walking

1:07

through prompt learning with you all. I

1:09

don't know if you got a chance to see our

1:10

harness talk yesterday, but hopefully

1:12

that gave you some good background on

1:15

how powerful prompting and prompt

1:16

learning can be. Uh, so my name is Fuad. I'm a

1:19

product manager here at Arize as well.

1:21

And like Sally said, we like to stay in

1:23

code. We'll be doing a few slides, then

1:25

we'll walk through the code and we'll be

1:26

floating around helping you guys debug

1:28

and things like that. My background is

1:30

also technical. So, I was a backend

1:32

distributed systems engineer for a long

1:34

time. So, no stranger to how important

1:36

observability infrastructure really is.

1:38

Um, and I think it's an appropriate

1:40

setting in AWS for that. So, yeah,

1:42

excited to dive deep into prompt learning

1:44

with you all. Thank you.

1:46

>> Awesome. All right, so we're gonna get

1:47

started. Just give you a little bit of

1:48

an agenda of the things I'm going to be

1:50

covering. Uh, so we're gonna talk about

1:51

why agents fail today, what even is

1:53

prompt learning. I want to go through a

1:55

case study to kind of show y'all why this

1:57

actually works. Uh and we'll talk about

1:59

prompt learning versus GEPA. I think I

2:01

had a few people come up to me over the

2:02

conference asking, like, what about GEPA? Uh

2:04

we have some benchmarking against that

2:05

and then we'll hop into our workshop. Um

2:08

but with this I want to ask a question.

2:09

How many people here are building agents

2:11

today?

2:12

>> Okay, that's what I expected. Um and how

2:14

many people actually feel like the

2:15

agents they're building are reliable?

2:19

>> Yeah, that's what I also thought. So

2:21

let's talk a little bit about why agents

2:22

fail today. So why do they fail? Well,

2:25

there's a few things that we're seeing

2:26

with a lot of our folks and we're seeing

2:27

even internally as we build with Alex

2:29

for why agents are breaking. So um I

2:33

think that a lot of times it's not

2:34

because the models are weak. It's a lot

2:36

of times the environment um and the

2:38

instructions are weak. So uh having no

2:41

instructions um learned from their

2:43

environment uh no planning or very

2:45

static planning. I feel like a lot of

2:47

agents right now don't have planning. We

2:49

do have some good examples of planning

2:50

like we have Claude Code, Cursor. Those

2:52

are really great examples but I'm not

2:53

seeing it make its way into every agent

2:56

that I come across. Uh missing tools big

2:58

one. Sometimes you just don't have the

2:59

tool sets that you need. Uh and then

3:01

missing kind of tool guidance on like

3:03

which of the tools we should be picking

3:04

and then context engineering continues

3:06

to be a big struggle for folks. If I

3:10

were to distill this out, I think it's

3:12

like these three core issues. So

3:14

adaptability and self-learning. Um so no

3:17

system instructions learned from the

3:19

environment touched on determinism

3:20

versus non-determinism balance. So

3:23

having the planning um or no planning

3:25

versus doing like a very static

3:26

planning. You want to kind of have some

3:28

flexibility there. And then context

3:30

engineering I think is a term that just

3:32

kind of emerged in the last like you

3:34

know six to eight months but it's

3:35

something that's really really important

3:36

that we're finding you know missing

3:38

tools tool guidance just not having

3:40

context or confirming your data and not

3:42

giving the LLM enough context. So these

3:46

are um kind of the core issues still.

3:48

But I think there's one other pretty

3:50

important thing. Um and that is kind of

3:52

this distribution of who's responsible

3:54

for what. So um there's these technical

3:56

users, your AI engineers, your data

3:58

scientists, developers, and they're

3:59

really responsible for the code

4:00

automation pipelines actually, you know,

4:03

managing the performance and costs. But

4:05

then we have our domain experts, subject

4:07

matter experts, AI product managers.

4:09

These are the ones that actually know

4:10

what the user experience should be. They

4:11

probably are super familiar with um the

4:14

principles that we're actually building

4:16

into our AI applications. They're tracking

4:18

our evals and they're really trying to

4:20

ensure the product's success. So

4:21

there's this split between

4:22

responsibilities but everybody is

4:24

contributing but then there's this

4:26

difference um in terms of like maybe

4:29

technical abilities. And so with prompt

4:30

learning it's going to be a combination

4:32

of all these things. So everybody's

4:33

going to really need to be involved and

4:35

we can talk about that uh a little bit

4:37

more. So [clears throat] what even is

4:39

prompt learning? I'm going to first kind of go

4:41

through some of the um approaches that

4:45

we kind of borrowed when we came up with

4:46

prompt learning. So this is something

4:47

that Arise has been really really uh

4:49

dedicated to doing some research. And so

4:51

one of the first things we borrow from

4:53

uh is reinforcement learning. How

4:55

many folks here are familiar with how

4:56

reinforcement learning works? All right,

4:58

cool. Um so if I were to give like a

5:00

really like silly kind of analogy, we

5:02

have a reinforcement model. Uh pretend

5:04

it's like a a student brain that we're

5:06

trying to kind of, you know, boost up.

5:08

And so they're going to take an action

5:10

uh which might be something like you're

5:11

just going to take a test an exam and

5:13

there's going to be a score. A teacher

5:14

is going to come through and actually

5:15

you know score the exam here um that's

5:17

going to produce this kind of like

5:19

scalar reward um and you know pretend

5:22

the student has an algorithm in their

5:23

brain that can just kind of take those

5:24

scores and update the weights in their

5:26

brain and kind of like the learning

5:28

behavior there and then we kind of

5:29

reprocess. So you know in this kind of

5:31

reinforcement one we're updating weights

5:33

based off of some scalars. Um, but it's

5:36

really actually difficult to update the

5:37

weights directly, especially in like the

5:40

LLM world. So, reinforcement learning

5:42

isn't going to quite work that well uh

5:44

when we're we're doing things like

5:46

prompting. So, then there's

5:48

metaprompting, which is very close to

5:50

what we do with uh prompt learning, but

5:53

still not quite right. So, here with

5:56

meta prompting, we're asking an LLM to

5:58

improve the prompt. Uh, so again, we use

6:00

that kind of like student example. We

6:02

have an agent which is our student. Um,

6:04

and it's going to produce some kind of

6:05

output like that's a user asking a

6:07

question getting an output. That's our

6:08

test in this example. And then we're

6:10

going to score. Eval is pretty much what

6:12

you can think of there. Uh, where it's

6:14

going to output a score and from there

6:17

we have like the meta prompt. So now

6:18

the teacher is kind of like the meta

6:20

prompt. It's going to take the result uh

6:22

from our scorer and update the prompts

6:25

based off of that. Um, but it's still

6:28

not quite what we want to do. And that's

6:31

where we kind of introduce this idea of

6:32

prompt learning. So prompt learning is

6:34

going to take the the exam going to

6:36

produce an output. Um we're going to

6:38

have our LLM evals on there. But

6:40

there's also this really important piece

6:42

which is the English feedback. So which

6:44

answers were wrong? Why were the answers

6:46

wrong? Where the student needs to

6:47

actually study? Really pinpointing those

6:49

issues. And then we still are using a

6:51

meta prompt. We still are asking an LLM uh

6:54

to improve the prompt. It's just the

6:55

information that we are giving that LLM

6:58

uh is quite different. And so we're

7:00

going to update uh the prompt there with

7:02

all of this kind of feedback. So from

7:05

our evals from a subject matter expert

7:07

going in and labeling and use that uh to

7:10

kind of boost our prompt with better

7:11

instructions and sometimes examples.
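As a rough illustration of the loop being described — an LLM judge that returns a label plus an English explanation, and a second LLM call that rewrites the system prompt using that feedback — here is a minimal sketch using the OpenAI Python client. The function names, prompts, and model choice are illustrative, not the Arize SDK.

from openai import OpenAI

client = OpenAI()

def judge(question: str, answer: str) -> dict:
    # LLM-as-a-judge: return a correct/incorrect label *and* the English
    # explanation of why -- the explanation is the part prompt learning uses.
    resp = client.chat.completions.create(
        model="gpt-4.1",  # illustrative model choice
        messages=[{"role": "user", "content": (
            f"Question: {question}\nAnswer: {answer}\n"
            "On the first line write exactly 'correct' or 'incorrect'. "
            "Then explain why, pointing at the instruction that was violated.")}],
    )
    text = resp.choices[0].message.content
    label = text.splitlines()[0].strip().lower()
    return {"label": label, "explanation": text}

def improve_prompt(system_prompt: str, feedback: list[dict]) -> str:
    # Meta-prompt step: hand the English feedback back to an LLM and ask it
    # to rewrite the system prompt with better rules and instructions.
    notes = "\n".join("- " + f["explanation"] for f in feedback if f["label"] != "correct")
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": (
            f"Current system prompt:\n{system_prompt}\n\n"
            f"Feedback on failures:\n{notes}\n\n"
            "Rewrite the system prompt so these failures are addressed. "
            "Return only the new prompt.")}],
    )
    return resp.choices[0].message.content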

7:15

So this is kind of like the traditional

7:17

prompt optimization where it's like we

7:19

have we're kind of treating it like an

7:20

ML where we have our data and we have

7:23

the prompt. We're saying optimize this

7:24

prompt and maximize our like prediction

7:26

outputs. Um but that doesn't quite work

7:29

uh for LLMs; we're missing a lot of

7:31

context. So what we really found um is

7:34

that the human instructions of why it

7:37

failed. So imagine you have your

7:38

application data, your traces, a data

7:40

set, whatever it is. Your subject matter

7:42

expert goes in and they're not only

7:44

annotating correct or incorrect. They're

7:46

saying this is why this is wrong. It

7:48

failed to adhere to this key

7:49

instruction. It didn't adhere to the

7:51

context. It's missing out whatever it

7:53

is. Um, and then you also have your eval

7:55

explanations from LLM as a judge,

7:56

which is same kind of principle where

7:58

instead of just the label, it provides

7:59

the reasoning behind the label. And then

8:02

we're pointing it at the exact

8:03

instructions um to change. We're

8:05

changing the system prompt to help it

8:07

improve so that we then get, you know,

8:09

prediction labels, but we also get those

8:11

evals um and explanations of it. So,

8:14

we're just kind of optimizing more than

8:16

just um our output here. And I think a

8:19

really key learning that we've had is

8:21

the explanations in human instructions

8:23

or from your LLM as a judge. That

8:25

text is really really valuable. I think

8:27

that's what we see not being utilized in

8:29

a lot of other prompt optimization

8:31

approaches. Um they're either kind of

8:33

optimizing for a score uh or they're

8:35

just paying attention to the output. But

8:38

you can think of it this way. It's like

8:39

these LLMs are operating in the text

8:40

domain. So we have all this rich text

8:42

that tells us exactly what it needs to

8:43

do to improve. Why wouldn't we use that

8:45

to actually improve our prompt? So um that's

8:50

kind of the basics of prompt learning

8:53

but everybody always comes up to me and

8:54

like, sounds great, Sally, but does it actually

8:57

work um it does and we have some

8:59

examples of when we do this so we did a

9:01

little bit of a case study um I think

9:03

coding agents everybody is pretty much

9:05

using them at this point there's a quite

9:07

a few that have been really really

9:08

successful. I think Claude Code is a great

9:10

example, Cursor, but there's also Cline,

9:12

uh which is more of a um an open version

9:15

of this and so we decided to take a look

9:19

and compare to see if we could you know

9:21

do anything to improve. So these are

9:22

kind of the the baseline of where we

9:24

started here. Um you can see the

9:26

difference between the different models.

9:28

Obviously one of them is kind

9:30

of the state-of-the-art there but we

9:32

also had this opportunity where Cline was

9:34

using, you know, 4.5 and it was working

9:37

decently well, at 30% versus 40%. Um and

9:40

then there was kind of the conversation

9:42

around. So this is where we started um

9:45

and we took a pass optimizing the system

9:48

prompt here. So you can see this is what

9:50

the old one was looking like. It has

9:52

like no rules section. So it was just

9:54

very like, you are a coding agent. You're

9:56

built on this model. You're here

9:58

to do coding. Um but there was no rules

10:01

and so we took a pass at updating the

10:04

system prompt. So there were all of these

10:05

different uh rules associated. So when

10:07

dealing with errors or exceptions,

10:08

handle them in a specific way. Make sure

10:10

that the changes align with, you know,

10:12

the system's design. Um any changes should be

10:15

accompanied by appropriate tests. So

10:17

really just kind of building in like the

10:18

rules that like a good engineer would

10:20

have uh which was completely missing

10:22

before. Um and so we found that Cline

10:25

performs better with the updated system

10:27

prompt. Pretty simple. It's

10:28

kind of the whole concept here. It's

10:30

like you can see these different

10:31

problems and we're seeing you know

10:33

things that were incorrect now being

10:34

correctly done just by simply adding

10:36

more instructions. [clears throat]

10:38

So it really demonstrates pretty well

10:40

here um how those system prompts can

10:43

improve, and we benchmarked against

10:46

SWE-bench Lite to get another, just like,

10:48

kind of coding uh benchmark for these

10:50

coding agents and we were able to

10:52

improve by 15% just through the addition

10:56

of rules. Uh so I think that that's

10:57

pretty powerful. So no fine-tuning, no

10:59

tool changes, no architecture changes. I

11:01

think those are the big things folks

11:02

like reach for when they're trying to

11:04

improve their agents. Uh but sometimes

11:07

it's just about your system prompt and

11:08

just adding rules. I think we've really

11:09

seen that and that's why we're really

11:10

passionate about prompt learning and

11:12

prompt optimization in general is it

11:14

feels like the lowest lift way to get

11:16

massive improvement gains in your agent.

11:18

Uh 4.1 achieved performance near 4.5

11:21

which is pretty much considered right

11:22

now state-of-the-art when it comes to

11:24

coding questions, and it's two-thirds of

11:26

the cost which is always uh really

11:28

beneficial. So uh these are some of kind

11:31

of the tables here. We will definitely

11:32

distribute this so you can kind of take

11:33

a closer look. But I think the main

11:35

point I want y'all to come away with is

11:37

the fact that like, you know, 15% is

11:39

pretty, you know, powerful uh

11:41

improvement in our performance.

11:44

Now, a question we get all the time is

11:46

we're taking these examples for prompt

11:48

learning. So, how this works is

11:49

we're going to take a data

11:50

set. A lot of time that data set is

11:52

going to be a set of examples that

11:54

didn't perform well. Either a human went

11:56

through and uh labeled them and found

11:58

that they you know were incorrect or you

12:00

have your evals that are labeling them

12:02

incorrect and so you've gathered all

12:04

these examples and that's what we're

12:06

going to use to optimize our prompt. So

12:07

I get a question all the time like well

12:09

aren't we going to overfit uh based

12:11

off of these bad examples but there's

12:14

this rule of generalization where

12:16

prompt learning properly enforces high-level,

12:18

reusable coding standards rather

12:19

than repo specific fixes and we are

12:22

doing this train test split uh to ensure

12:24

that the rules are generalized beyond

12:26

just like local quirks and whatever our

12:28

uh training data set is. But if you kind

12:31

of think of this as like you hire an

12:33

engineer, right, to to be an engineer at

12:35

your company, you do kind of want them

12:36

to overfit to the codebase that they're

12:38

working on. So, uh we kind of feel that

12:40

overfitting — maybe a better term for

12:43

it is expertise. Uh we are again not

12:45

kind of training in the traditional

12:47

world. We are trying to build expertise

12:49

and as we'll talk about this is not

12:50

something we feel that you do once.

12:52

You're actually going to kind of

12:53

continuously be running this. So, um

12:55

more problems are going to come up.

12:56

we're going to kind of optimize our

12:58

prompt for what the application is

13:00

seeing now. Um, and then we'll kind of

13:04

So, we don't actually think it's a flaw.

13:06

We feel like it's expertise instead. Um,

13:08

we can kind of adapt as needed and kind

13:11

of mirroring what humans would do if

13:12

they were taking on a task themselves.

13:16

Um this is just another set of

13:18

benchmarking again kind of proving here

13:21

um that this diverse evaluation suite

13:24

that focuses on the task for those

13:26

difficult or tasks that are difficult

13:28

for large language models um and we're

13:30

seeing again success with our

13:32

improvements.

13:34

Now GEPA just kind of came out recently

13:36

and I think that's something everybody's

13:37

really excited about. I think the

13:39

previous uh DSPy optimizers were a

13:42

little bit more focused on optimizing a

13:43

metric and as we talked about like we

13:45

really want to be using uh the text

13:47

modality that these applications are

13:48

working in um that have a lot of the the

13:52

reasons or how we need to improve and so

13:54

we definitely wanted to do some

13:56

benchmarking here. So how many people

13:57

are familiar with GEPA or have read about it? All

14:00

right, cool. Well, I'll just give like

14:01

sort of high level. I just kind of noted

14:03

that the main difference between their

14:05

other, like, newer prompt optimizers is that

14:07

they are actually um using this Pareto

14:12

reflection and evaluation while they

14:14

are doing the optimization. So it's this

14:16

evolutionary optimization um where

14:19

there's this Pareto-based candidate

14:20

selection and probabilistic merging of

14:22

prompts. What this really does under the

14:23

hood is we take candidate prompts, uh, we

14:26

evaluate them. Then there's this

14:27

reflection LM that's reviewing the

14:29

evaluations and then kind of making some

14:31

mutations some changes um and kind of

14:34

repeating until it feels like it has the

14:36

right set of prompts. So I think

14:38

something that is important to notice

14:39

about GEPA is it doesn't really choose

14:40

kind of just one. It does try to keep

14:42

the top candidates um and then you know

14:44

do the merging from there. But we

14:48

benchmarked it and prompt learning

14:50

actually does do a little bit of a

14:51

better job. And I think something that's

14:53

really key is it does it in a lower

14:56

number of loops. And I think something

14:58

that we'll we'll talk about in just a

15:00

second here is that it does actually

15:02

matter what your evals look like and

15:04

how reliable those are. I think that's

15:05

something we really feel strongly about

15:07

at Arize is uh you definitely want to be

15:09

optimizing your agent prompts, but I

15:11

think a lot of people forget about the

15:12

fact that you should also be optimizing

15:14

your eval prompts because if you're

15:15

using evals as a signal, um you can't

15:18

really rely on them if you don't feel

15:19

confident in them. So, it's just as

15:20

important to invest there, making sure

15:22

you're kind of applying the same

15:23

principles that you are to your agent

15:25

prompt to your eval prompts so you have

15:26

a really reliable signal that you can

15:28

trust and then feed that into your

15:30

prompt optimization. But, um in both of

15:32

these graphs, the pink line is prompt

15:34

learning. Uh we did also benchmark it

15:36

against MIPRO, their older optimization

15:38

technique that I was mentioning kind of

15:39

functions off like um optimizing around

15:42

a score.

15:44

And evals make the difference. So it kind

15:46

of I I highlighted on this slide here

15:48

like — with eval engineering we were

15:50

able to do this. So we did have to make

15:52

sure that the evals as part of prompt

15:54

learning uh were really high quality

15:57

because again it's this only works um if

16:00

the eval itself is working.

16:03

So, yep, evals make all the difference.

16:05

Kind of spend some time optimizing your eval

16:06

prompts here. Um, again, it's all about

16:09

making sure you have proper instruction.

16:10

The same kind of rules apply.

16:13

So, I want to kind of walk through. I

16:15

know there's a lot of content. I think

16:17

it's really important to have context.

16:19

But before we jump into any of the

16:20

workshops, any questions I could answer

16:22

about what I discussed so far?

16:27

>> Uh, I have a question slash comment. So I

16:30

think you know coding is the greatest

16:31

example in terms of having the structure

16:33

and evals. Uh one thing I'm sort of

16:35

curious about is if you have other

16:36

examples, sort of general prompts for

16:37

conversational

16:39

interactions with systems that are not

16:41

as easily quantifiable. I'm just curious

16:43

about any experience you guys have

16:44

there.

16:44

>> Yeah. Is that for, like, evals in

16:47

general?

16:47

>> Well, I think it's just about how you

16:49

would set up what the eval would look

16:50

like and I'm just wondering how you

16:51

would do that for other types.
>> So the

16:54

question is like is there any kind of

16:56

instruction for how you should set up

16:57

your evals? Coding seems like a very

16:58

straightforward example. You kind of

17:00

want to make sure the code's correct,

17:01

right? But where some of these other

17:02

agent tasks um it's a little bit harder.

17:05

I think the advice that I usually give

17:06

folks is we do have a set of like out of

17:08

the box evals. You can always start with things

17:09

like QA correctness or focus on the

17:11

task. But what I always suggest is like

17:13

getting all the stakeholders kind of in

17:15

the room. So getting those you know

17:17

subject matter experts and security you

17:19

know leadership and really defining what

17:21

success would look like and then start

17:23

kind of converting that to different

17:25

evaluations. So um I think an example is

17:27

with Alex. Um I have some task-

17:30

level evaluations. So like I really care:

17:32

did it find the right data uh that it

17:34

should have. Um did it create

17:37

a filter using semantic search or

17:39

structured search, like making the right tool

17:40

call? Um and then I care did it call

17:42

things in the right order? Was the plan

17:44

correct? So kind of thinking about like

17:45

what each step was and then like even

17:47

security will be like well we care how

17:48

often people are trying to jailbreak

17:50

Alex. So, it's just taking each of those

17:52

success criteria, converting it to an eval.

17:54

Um, and we do have different tools that

17:56

can help you, but that's usually the

17:57

framework I give folks is like start

17:59

with just success and then worry about

18:00

converting it into an eval after.
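To make that concrete, one of those success criteria ("did it make the right tool call?") might turn into an LLM-as-a-judge template along these lines — a hypothetical example, not one of the out-of-the-box evals:

# Hypothetical eval template for a single success criterion: tool selection.
# The rails constrain the judge to a small label set you can aggregate on.
TOOL_CHOICE_RAILS = ["correct", "incorrect"]

TOOL_CHOICE_TEMPLATE = """You are evaluating an AI agent's tool selection.

User request: {input}
Tools available: {tool_definitions}
Tool the agent called: {tool_call}

Did the agent call the most appropriate tool for this request?
On the first line answer exactly "correct" or "incorrect", then explain why."""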

18:02

>> Yeah. Just to add to that, maybe like

18:05

more of like a subjective use case is

18:08

like for example like Booking.com is one

18:10

of our clients and so when they do like

18:13

what is a good posting for a property

18:16

like what is a good picture?

18:19

[clears throat] Defining that is really

18:20

hard, right? Like to you, you might

18:23

think something is a very attractive

18:24

posting for like a hotel or something,

18:26

right? But to someone else, it might

18:28

look really different. And sometimes, as

18:30

kind of Sally was alluding to, it's

18:32

sufficient to just grade it as good or

18:34

bad, and then kind of iterate from

18:36

there. So like, is this a good picture

18:37

or bad picture? Let the LLM decide and then go

18:40

from there into specifics,

18:41

like, oh, this was dimly lit, the layout

18:44

of the room was different, etc., etc.

18:46

Yeah.

18:46

>> Yeah. That's that you're actually

18:48

building on the question I was going to

18:49

ask which is that they end up with that

18:51

binary outcome which doesn't necessarily

18:53

give you a gradient to advance upon are

18:55

you then effectively using those

18:56

questions, like dimly lit or not, to

18:58

get like a more continuous space is that

19:01

>> exactly right and then from there as you

19:03

get more signal you can refine your

19:05

evaluator further and further and then

19:06

use those states and you can actually

19:08

put a lot of that in your prompting

19:10

itself right so yeah

19:13

>> I have two questions and I'm not sure if

19:15

I should ask both of them or maybe your

19:16

workshop will answer it. One is about

19:19

rules and the rule section or like

19:21

operating procedures. I'm curious how

19:24

you uh do you just continuously refine

19:28

that in the English language and uh

19:31

maybe reduce the friction of any

19:33

contradictory rules. That's the first

19:35

question. And then the other was I would

19:36

love to see the slide on eval. if you

19:39

could just say a little bit more on how

19:40

you approach that because my issue

19:42

[clears throat] in doing this work is um

19:46

whether or not to have like a

19:47

simulator of the product and then the

19:49

simulator is evaluating or to do what

19:52

I'd like to do which is like an end

19:54

to-end evaluation that I build, but I

19:57

would love to see you talk about that if

19:59

you could.

19:59

>> Yeah, absolutely. So from the first one

20:01

about like how the instructions it's

20:03

definitely something I think that like

20:05

you iterate over time on them. So a lot

20:07

of times I think we take our best bet

20:09

like we write them by hand, right? And I

20:10

think what we're trying to do with

20:11

prompt optimization is like leverage the

20:13

data uh to dynamically change them. Uh

20:16

and it is, I think, great at like removing

20:18

redundant instructions, things like

20:19

that. But the goal is is we want to move

20:21

away from static instructions. We feel

20:23

very confidently that like that is not

20:25

going to really scale. It's not going to

20:26

lead to like sustainable um performance.

20:30

So the idea exactly with prompt learning

20:31

is something that you can kind of run

20:33

over time. We see this even like a long

20:34

running task eventually uh where you're

20:37

building up examples of incorrect things

20:40

uh maybe having a human annotate them

20:41

and then the task is kind of always

20:43

running producing optimized prompts that

20:45

you can then pull in production and it

20:46

it kind of is like a cycle that repeats

20:48

over time.

20:49

>> Sorry just to intervene. So, are you

20:50

saying that when you're doing this over

20:52

a long period of time and then you have

20:54

examples, you're just running the shots

20:56

back into your rules section?

20:58

>> Kind of. It's going to pass it like when

21:00

we get to the optimization actual like

21:02

loop we're going to build, you'll kind

21:03

of see it as like you are feeding the

21:05

data in that's going to build a new set

21:07

of instructions that you would then, you

21:09

know, push to production to use.

21:11

>> Okay.

21:12

>> Uh I think your second question was

21:14

around evals and like how to where to

21:16

start, how to like write them and like

21:18

how to optimize those. Is that right?

21:19

>> Yes.

21:20

>> Yeah. So, it's a very similar approach.

21:22

I think it's like the data that you're

21:23

reviewing is almost a little bit

21:25

different. So, uh I should have pulled

21:27

up the the loops. I don't know if you

21:28

can find it.

21:33

Let me just try something really quick

21:35

to kind of show this.

21:40

There we go.

21:42

So, this is kind of like how we we see

21:44

it is you have two co-evolving loops.

21:47

I've been talking about the one on the

21:48

left, the blue one a lot about we're

21:50

improving agent, we're collecting

21:51

failures, kind of setting that to do

21:53

kind of fine-tuning or prompt learning,

21:55

but you basically want to do the same

21:57

thing with your evals where uh we're

21:59

collecting the data set of failures, but

22:01

instead of thinking about the failures

22:02

being the output of your agent, we're

22:04

actually talking about the eval output.

22:06

So having somebody go through and you

22:08

know evaluate the evaluators or using

22:11

things like log probs as confidence

22:12

scores or jury as a judge to determine

22:15

where things are not confident. We're

22:16

kind of doing the same thing. So

22:17

figuring out where your eval is low

22:19

confidence and then you're collecting

22:21

that annotating maybe having somebody go

22:23

through and say okay this is where the

22:24

eval went wrong. And so it's the same

22:25

pretty much process of optimizing your

22:28

eval prompt. It's just you know I think

22:30

folks think they can just grab something

22:32

off the shelf or write something once

22:34

and then they can just forget about it.

22:35

But this loop, I've said it a few times,

22:37

but the left loop only works as well

22:39

as your eval.
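A minimal sketch of the "log probs as confidence scores" idea mentioned here, written against the OpenAI API; the 0.8 threshold and the idea of routing low-confidence rows to a human reviewer are assumptions for illustration:

import math
from openai import OpenAI

client = OpenAI()

def judge_with_confidence(eval_prompt: str) -> tuple[str, float]:
    # Ask the judge for a one-word label and read back token log probs.
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": eval_prompt}],
        logprobs=True,
        temperature=0,
    )
    choice = resp.choices[0]
    label = choice.message.content.strip().lower()
    # Probability of the first generated token, used as a rough confidence proxy.
    confidence = math.exp(choice.logprobs.content[0].logprob)
    return label, confidence

# label, conf = judge_with_confidence(prompt)
# if conf < 0.8:                          # assumed threshold
#     send_to_human_review(label, conf)   # hypothetical annotation queue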

22:41

>> Sorry, I think my question is actually

22:42

way more static and basic. It's like do

22:44

you are you talking about this orange

22:46

circle as like are you building a system

22:49

or simulator for the eval or are you

22:51

just talking about like system prompt,

22:53

user prompt, eval?

22:54

>> Yeah, I think it's more right now what

22:55

we're talking about is just like kind of

22:56

the different prompts. You could

22:58

definitely do simulation, but I think

22:59

that's a whole different workshop.

23:00

>> Thank you. Any [clears throat]

23:03

more maybe questions before we get to

23:05

the workshop? Uh, can we

23:08

switch back?

23:13

All right. Um, so here is going to be a

23:17

QR code uh for our prompt learning repo.

23:20

Um, so I'll give everyone a few minutes

23:22

to get set up with that. Get it on your

23:25

laptops. I know it's a little bit clunky

23:27

to add this QR code and, like, airdrop it. I was

23:30

not sure of a better way. Um I can just

23:33

show you also here if you want to find

23:35

it. Um it is going to be in our Arize AI

23:39

uh repo here and under prompt learning

23:42

and you just want to kind of clone that.

23:43

We are going to kind of be running it uh

23:45

locally here.

23:48

>> Can you go back to the page with the URL?

23:50

>> Yes. Sorry about that.

23:55

Oh

23:55

>> no, the page with the URL. Oh,

24:08

>> we'll give folks just a few minutes to

24:09

get

24:10

>> What's your process when

24:12

you're building a new agent or workflow, or

24:16

anything that could be evaluated? Do you

24:18

guys start by just like, oh, try

24:22

something prototype and then see where

24:23

it's bad and then do eval?

24:25

[clears throat]

24:26

>> Yeah, I think there's different

24:27

perspectives on this. Our perspective is

24:29

evals should never block you. Like you

24:30

need to get started and you need to just

24:32

build something really scrappy. We don't

24:34

think like you should, you know, waste

24:35

time doing eval. I think it's helpful to

24:37

pull something out of the box sometimes

24:39

in those situations just because it's

24:41

hard to comb through your data. like

24:42

that's something we've experienced with

24:43

Alex of like when you're getting started

24:45

just running a test manually reviewing

24:47

like it it's kind of painful. Um so I

24:50

think that having eval is helpful but

24:52

shouldn't be a blocker. Pull something

24:53

off the shelf maybe start with that then

24:55

as you're iterating you're understanding

24:56

where your issues are then you're

24:57

starting to refine your evals as you're

24:59

refining your agent.

25:01

>> Yeah.

25:03

One last question. Yeah.

25:06

>> So it makes sense to like optimize the

25:07

system

25:10

like sub-agents or commands, or how are

25:13

you thinking about this like multi-

25:15

agent?

25:16

>> Yeah. So the question is is like are you

25:18

just doing one single prompt or how do

25:20

you think about this in a multi-agent setup? I

25:22

think we're kind of thinking that this

25:23

right now is kind of independent tasks

25:25

that can optimize your prompts kind of

25:27

independently and then running tests um

25:29

to get into like the agent simulation of

25:30

running them all together. But right

25:32

now, our approach is a little bit

25:33

isolated, but I definitely see a future

25:35

where we're going to kind of meet the

25:37

the standard of like sub agents and

25:39

everything else that's going on right

25:41

now.

25:43

>> No, I think that's pretty accurate. And

25:44

also like I mean even in a single agent

25:47

use case versus like a multi-agent use

25:50

case like ultimately like each of those

25:51

agents may be specialized. They may have

25:53

their own prompts that they need to

25:55

learn from. So I think doing this in

25:58

isolation still has benefits for the

26:00

multi-agent system as a whole that can

26:02

pass on over time in scenarios like hand

26:04

off etc and making something like really

26:07

really specialized. So I guess like what

26:09

we're talking about with like the

26:11

overfitting as well which is again like

26:13

question we get all the time but really

26:15

you want to be overfit on your code

26:17

base as an engineer. Um you don't want

26:19

to be so generalized that you're no

26:21

longer good at picking up specific quirks

26:23

in your code base.

26:25

Yeah.

26:28

>> All right. Everybody kind of getting to

26:30

the README? Okay. Anybody need any

26:32

help?

26:36

>> All right. So, we are going to be using

26:38

OpenAI for this. So, I think the next

26:40

thing that I'll have everyone do is

26:41

probably spend some time just grabbing

26:43

your API key. We'll get to it and then

26:44

I'll just kind of start walking through

26:47

our notebook here.

26:51

So, we are going to be doing a JSON

26:52

webpage prompt example. So, you're going

26:54

to find that under notebooks here. Um,

26:57

and so we'll give everybody a second to

26:59

pull it out. There's going to be just

27:00

some slight adjustments we're going to

27:01

add to this example uh just to make it

27:04

run a little faster and work a little

27:05

better. The first is um what this is

27:09

even doing this is going to be a very

27:10

simple example uh for just a JSON web

27:14

page prompts. If anybody has like a

27:15

prompt or use case that they want to

27:17

kind of like code along, Fuad and I are

27:19

absolutely glad to help kind

27:21

of adapt what you're working on to the

27:23

use case here. It's something very

27:24

simple just to kind of demonstrate um

27:27

the the principles and we are going to

27:29

be using OpenAI, but we can definitely experiment.

27:32

If you want to swap out any other

27:33

providers that you want to use, we can

27:34

also definitely help you do that.

27:38

Um but the the goal of this is

27:39

essentially going to be to iterate

27:41

through different versions of a prompt

27:43

using a data set. Um

27:46

and we will optimize. So the first thing

27:48

is obviously we need to do some

27:49

installs. Um I am just going to have you

27:51

all update it. It says like greater than

27:54

2.00.

27:55

Uh but we're going to actually just use

27:57

I think 22 today.

28:02

And then the next thing is just to make

28:03

this run a little faster. So we're going

28:05

to run things in async which is missing.
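The lines being added are presumably along these lines — nest_asyncio is a common way to let async code run inside a notebook, though the exact cell contents in the repo may differ:

# Allow nested event loops so the async calls run inside the Jupyter notebook.
import nest_asyncio
nest_asyncio.apply()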

28:08

So you can go ahead and add these lines

28:11

in the cell as well. All right, everyone

28:14

kind of following along? I never

28:16

want to move too fast. Seeing some head

28:18

nods. Cool. Let's talk about

28:19

configuration. So um I kind of talked

28:21

about it a little bit when I was going

28:23

through the slides. So we are going to

28:24

be doing some looping. So the general

28:26

idea is is we start out with the data

28:28

set uh with some feedback in it and

28:30

we'll we'll look through the data set

28:31

once we get it. Um, but you're going to

28:33

want to have either human evaluation.

28:36

Um, so like annotations, either free

28:38

text, labels, um, or you're going to

28:40

want to have some evaluation data. But

28:41

the feedback is really important. That's

28:43

what makes this kind of work. Um, we're

28:45

going to then, you know, pass that to

28:48

an LLM to do the optimization and then

28:50

it's going to basically have eval. So as

28:53

it's optimizing, it's using that kind of

28:55

data set to then run and assess whether

28:57

or not it should, you know, kind of keep

28:59

optimizing. Um, and then it also

29:02

provides you data that you can kind of

29:03

like use to gauge which of the prompts

29:05

that it outputs um, in you know a

29:09

production setting. So we're going to do

29:11

some configuration. Um, so I've kind of

29:13

written out here kind of what each of

29:14

these means. So we have the number of

29:16

samples. So this controls how many rows

29:18

of the sample data set. Um, you can, you

29:21

know, set to zero to use all data or you

29:23

can, you know, use a positive number to

29:25

limit for, you know, faster

29:26

experimentation. So I think that

29:28

sometimes folks use different um

29:30

approaches here. Sometimes you want to

29:32

just move really quick so you set a low

29:34

sample. Sometimes you want to be a

29:35

little bit more representative so you up

29:37

it. Um I have it here set as 100. Feel

29:40

free to adjust. Um and then the next

29:42

thing is train split. Um so I think

29:45

folks are probably pretty familiar with

29:47

the concept here of like a train test

29:49

split, but it's just how much of the

29:50

data do we want to use into our

29:52

training? Again, that's what we're using

29:53

to actually optimize. Then how much of

29:55

it do we want to use when we're testing

29:57

when we're running the eval um on the

29:59

new prompt? Um

30:02

and there's number of rules. Uh

30:04

basically the specific number of rules

30:06

to use for evaluation. This just

30:09

determines which prompts to use. Um and

30:12

so this is like as we're running these

30:13

loops, we're outputting, you know, a

30:15

bunch of different prompts. So this is

30:17

just saying how many um we should use

30:19

for evaluation. And then key one here,

30:22

number of optimization loops. So this

30:24

sets how many optimization iterations to

30:26

run per experiment. Um and each loop

30:30

basically generates those outputs,

30:31

evaluates them, and refines the prompt.
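Put together, that configuration cell amounts to a handful of constants, roughly like the following; the names here are illustrative stand-ins for the ones in the notebook:

# Experiment configuration described above (illustrative names).
NUM_SAMPLES = 100            # rows of the sample dataset to use; 0 = use all data
TRAIN_SPLIT = 0.8            # share of rows used for optimization vs. held-out testing
NUM_RULESETS = 3             # how many of the generated prompts/rule sets to evaluate
NUM_OPTIMIZATION_LOOPS = 5   # max optimization iterations per experiment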

30:34

And so these just control the experiment

30:35

scope, the data splitting — we just went

30:38

through the whole prompt learning loop

30:39

and how much data we want to use. So

30:42

you can kind of just run these as you

30:45

are or if you want to adjust them feel

30:47

free. Uh and then the next step pretty

30:49

simple. We're just going to uh grab that

30:51

OpenAI key if you haven't already uh

30:54

set that up. So, getpass is going to

30:56

like pop up. Um I'll show you here

30:58

quick. It's going to pop up there. You

30:59

can just paste in your API key there

31:03

before we start looking at the the data

31:05

a little bit.
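That cell is essentially the standard getpass pattern, something like:

import os
from getpass import getpass

# Prompt for the key without echoing it, then expose it to the OpenAI client.
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")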

31:12

Just if anybody runs into any issues,

31:14

you can just give us a wave. All right. I

31:16

think this particular

31:18

we get through this

31:26

>> I'm doing good but if you have a free

31:27

one you want to give me

31:30

that

31:31

>> I wish

31:34

>> all right let's talk about the data so

31:36

we provided a data set for you with queries,

31:39

um you can see here that we're doing the

31:40

80/20 split based off of kind of the

31:43

configuration we set above I'm just

31:45

going to pull this um train set here and

31:48

let's just

31:49

>> Yeah, when I ran it, it came out as minus

31:54

50

31:55

>> Oh, yep. You're right. That's a mistake

31:57

on my part.

32:01

Yeah, it is the 50. Um

32:04

let's take a look at what this data set

32:08

looks like now, uh, just so folks can

32:10

kind of understand. Um so kind of

32:13

starting here with some just basic input

32:15

and output. Um

32:19

In this snippet we don't have any of the

32:21

feedback in these rows that I printed

32:23

out here but you can imagine you can

32:24

have different uh correctness labels

32:26

here explanations any real validation

32:28

data can be whatever it is that um you'd

32:30

like it to be. Some folks use multiple

32:32

eval

32:34

feedback sometimes it's a combination

32:36

but you really want to have you know the

32:38

input and output that we'll use.
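For reference, a minimal sketch of what the dataset and the 80/20 split could look like in pandas — the column names are illustrative, not necessarily the ones in the provided data:

import pandas as pd

# Each row: the model input, the generated output, and feedback on that output
# (a correctness label plus a free-text explanation from a human or an eval).
df = pd.DataFrame({
    "input":       ["Create a landing page for a bakery"],
    "output":      ['{"title": "Bakery", "sections": []}'],
    "correctness": ["incorrect"],
    "explanation": ["Missing the required navigation section"],
})

train_split = 0.8                       # the 80/20 split mentioned above
split_idx = int(len(df) * train_split)
train_df = df.iloc[:split_idx]          # rows that drive the optimization
test_df = df.iloc[split_idx:]           # held out to evaluate the new prompt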

32:40

>> Should my output of the train set be the

32:42

same as yours?

32:44

>> Not necessarily. Depends on

32:46

>> I didn't know if head was sorted or not.

32:50

>> It all depends — it should be kind of the

32:52

same but we could look at like you know

32:53

if I did this this should be the same

32:55

for you maybe just to make sure.

33:04

>> Yeah.

33:06

>> That's what you're saying. Okay. Yeah.

33:09

Quick question. [clears throat]

33:10

Um, is it possible for the input to be

33:13

like a chat history and not just

33:17

>> Great question. So, I think it depends

33:19

on like what it is you're trying to do.

33:21

If you're doing just like a simple kind

33:22

of, uh, system prompt and input, you kind of

33:24

want it to be one-to-one. You don't want

33:26

to give it a ton of um like conversation

33:28

data that's not relevant to the prompt

33:30

that you're optimizing. um we we

33:32

generally just use like the single input

33:34

but I think that there are applications

33:35

that you could do like conversation

33:37

level um inputs.

33:39

>> Yeah. Because because quite often the

33:41

failure is somewhere in the middle of the conversation,

33:45

right? So if you put just the

33:47

original task in uh then the probability

33:51

of you hitting you know a failure in the

33:53

middle of the

33:56

>> totally. So in that case, what we

33:57

generally see is like different rows of

33:58

like having each of like the back and

34:00

forths be like kind of independent rows

34:02

because you're probably going to

34:03

evaluate each of them and um honestly

34:07

probably like get the human feedback on

34:08

each of them. So we usually separate

34:10

them out in that way.

34:12

>> But it's a good point. If you just

34:13

always are focusing on the first turn,

34:15

there's probably a lot of redundancy

34:17

there. uh you definitely will have to

34:19

like split it up over parts of the conversation.

34:21

>> And how can we bifurcate, like,

34:24

instructions and we have some context

34:26

also. So

34:27

>> It should not touch the context. It should

34:29

only, uh, manipulate the

34:32

system instruction or the prompting. The

34:35

context should be static; it

34:37

should not be like

34:39

based on the answer it will change my

34:41

context.

34:42

>> Yes. What you're saying is like looking

34:43

at the input there might be like a tool

34:44

output or context you're kind of passing

34:46

that in. You can absolutely include that

34:47

in your data set, um, so that —

34:49

not the

34:51

application,

34:52

[clears throat] but the prompt learning

34:53

um LLM — can understand all of the data

34:57

that's kind of like available. So you

34:58

can just have that passed in as an extra

35:00

column if you want. Most people start

35:02

with just kind of input and the

35:04

feedback. Um but you can absolutely add

35:06

whatever other data you think is relevant.

35:10

And when we're rerunning — when we're

35:12

doing the experiment of testing you'll

35:14

definitely always want to have the data

35:15

that would be required to answer

35:17

>> Any, even very simple — some tool call

35:21

or some context it is pulling some API

35:24

call

35:26

— whatever the prompt engineering, it

35:28

should be based on the output —

35:31

getting the output right and whatever

35:33

the context from my prompt plus whatever the

35:37

tool call I have done API call all the

35:40

uh, context engineering, and then at the last

35:43

finalize

35:44

>> totally yeah so again at this point

35:47

we're testing just like one prompt

35:48

and not that kind of end to end but you

35:50

definitely want to have everything that

35:52

like is flowing into the prompt that

35:53

you're optimizing so uh if your system

35:55

prompt takes in the user input for

35:56

example some data from an external API

35:59

you would definitely want to provide all

36:01

of that data does that make sense

36:04

Because you're saying that, like,

36:06

the the like trajectories

36:08

[clears throat] the like tool calls and

36:09

what the agent's going to do depending

36:11

on what the tool call was is what you're

36:13

trying to optimize for.

36:14

>> Yeah, exactly. We want to just like

36:15

because we're kind of trying to replay

36:16

and optimize one step of it. We

36:18

definitely don't want to do it

36:19

completely in isolation. So if there's

36:21

like data that flows into that prompt um

36:23

that's context that's using that's

36:25

producing the output, right? So we want

36:26

to be sure that we're including that. We

36:28

don't want to exclude anything. But if

36:29

it's data that comes like at a different

36:32

step probably not then you don't want to

36:34

do that that way. It's just like think

36:36

about what's relevant for the the step

36:37

that we're trying to optimize in this.

36:44

All right. Any other questions coming?

36:48

All right. Cool. So we're going to set

36:49

up our initial system prompt. You can

36:51

see this is something very very basic.

36:53

Uh we'll definitely I think we can do a

36:54

whole lot better than this, but I just

36:56

kind of want to illustrate something uh

36:58

that we're going to test and optimize.

37:01

So we're just saying you are an expert

37:02

in JSON web page creation. Your task is

37:04

input. And then so all these inputs that

37:06

we're seeing are going to be what we're

37:08

actually generating outputs for and

37:10

trying to optimize. Now I already kind

37:12

of touched on this. Um evaluators are

37:15

extremely important to make all of this

37:18

work, right? Um so we're going to uh

37:20

initialize two evaluators that use

37:23

an LLM as a judge to assess the

37:24

quality of generated outputs. So we are

37:26

using an LLM as a judge. If you have any

37:28

other, like, code-based evaluations,

37:29

whatever you need to do to evaluate, you

37:32

can definitely swap those out. Uh but

37:34

we're going to do evaluate output. This

37:36

is going to be a comprehensive evaluator

37:37

that assesses the JSON webpage

37:39

correctness against the input query and

37:41

the evaluation rules. It's going to

37:43

provide an output label of correct or

37:45

incorrect. So pretty simple binary.

37:47

Again, you can use multi-class. And then

37:50

it's going to have the detailed

37:51

explanations as well. Um, and then we

37:54

have a rule checker. This is a more

37:55

specialized evaluator that performs a

37:57

granular rule by rule analysis. Um, and

38:01

it examines if each rule um was

38:04

compliant.

38:06

And then both of these are going to

38:07

generate feedback that goes into our

38:08

optimization loop uh to iteratively

38:11

improve the system prompt. Um,

38:13

The explanations and rule violations guide —

38:15

and we'll get to this — the prompt

38:16

learning optimizer in creating more

38:19

effective prompts. So I have some

38:21

imports here. Let's take a look at what

38:22

the actual eval output has. Um so we do

38:26

have some rules that are in um in here

38:33

wait

38:36

um they're going to be in a repo. Um so

38:38

we're going to open that as a file. We

38:40

have this llm provider and we're using

38:41

OpenAI here. And then we're going to do

38:44

our classification evaluator. So, uh

38:46

we're just calling it uh evaluate

38:48

output. Also, we have an evaluation

38:51

template that we're reading from the

38:54

bottom here. Um then we just have

38:56

choices, correct and incorrect. Now we're

38:57

mapping a label to a score. Sometimes

38:59

it's helpful to be able to, like, add a

39:00

score. Sometimes a number is easier than

39:02

just looking at a bunch of labels. Uh it

39:04

is optional if you want to map these; if you

39:06

have like a multiclass use case. You can

39:08

set the scores accordingly. But these

39:10

are just going to be our choices like

39:11

the rails that we want our LLM as a

39:13

judge to adhere to. And then all we're

39:16

doing here is getting our results. I

39:18

have it doing some printing so you can

39:20

kind of take a look. So this is going to

39:21

be slightly different than what you're

39:22

seeing in the notebook. So I'm just

39:23

going to pause here. Uh if you want to

39:25

make the code changes from what you're

39:27

seeing in probably your version, this is

39:29

a good time for that.
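A minimal sketch of that kind of classification evaluator — an LLM judge constrained to two rails, returning a label, an optional score, and an explanation. This is written directly against the OpenAI client rather than the evaluation helpers the notebook uses, so treat the names and template as illustrative:

from openai import OpenAI

client = OpenAI()

RAILS = ["correct", "incorrect"]
SCORE_MAP = {"correct": 1, "incorrect": 0}   # optional label-to-score mapping

EVAL_TEMPLATE = """You are evaluating a generated JSON webpage.

Query: {query}
Generated output: {output}
Rules: {rules}

Is the output correct for the query and compliant with the rules?
On the first line answer exactly "correct" or "incorrect", then explain why."""

def evaluate_output(query: str, output: str, rules: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": EVAL_TEMPLATE.format(
            query=query, output=output, rules=rules)}],
        temperature=0,
    )
    text = resp.choices[0].message.content
    label = text.splitlines()[0].strip().lower()
    if label not in RAILS:               # snap anything off the rails
        label = "incorrect"
    return {"label": label, "score": SCORE_MAP[label], "explanation": text}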

39:32

Does kind of the setup of the evaluator

39:34

make sense to y'all? The key pieces: it's

39:37

going to be the rails. It's going to be

39:38

the output. Uh and of course our

39:40

template. [clears throat]

39:44

>> Yeah, you will want to grab your own uh

39:46

OpenAI key here uh to set

39:52

[clears throat]

39:52

>> and we can help you if you want to use

39:53

different provider. We can help you swap

39:55

this out like that is helpful to

39:57

anybody.

39:59

>> Okay, I'm going to start walking you

40:01

through the output generation. So uh

40:05

this is just kind of you know you can

40:06

imagine this as your own agent logic or

40:08

the the part that you're kind of

40:10

testing. Uh this is just going to be the

40:12

function that actually generates the

40:13

JSON outputs. We're using 4.1 here

40:16

with JSON response format and zero

40:19

temperature for consistent outputs. Um

40:21

it's taking a data set a system prompt

40:23

generates outputs for all rows returns

40:25

the results for evaluation. Um and it's

40:28

called during each iteration to produce

40:30

output. So this is like our

40:31

experimentation function that we're

40:32

writing. So as we're passing in data,

40:34

it's producing new uh prompts. We need a

40:37

way to test it, evaluate, understand uh

40:39

how we are kind of moving the needle

40:41

here. So that's all this is. So it's

40:43

pretty straightforward function just

40:44

called generate output. We have that

40:46

output model. Again, we're using OpenAI.

40:48

If anybody wants help switching things

40:50

around, happy to help. Uh we are using

40:52

response format because we are dealing

40:54

with JSON here. So uh we know that you

40:56

can't always just prompt for it. I mean some of

40:57

the newer models are decent at it, but

41:00

using response format is really helpful.

41:02

And then we're also setting temperature

41:03

to zero. Um, and here is just kind of

41:06

where we're passing all the data in. So

41:08

the data set because again we want to

41:09

run this on all of the the testing data,

41:13

the system prompt that will be input. So

41:15

as we get to the optimization loop,

41:17

we're going to be passing in a new

41:18

prompt to this with the data set and

41:20

then evaluating. Um, we have our output

41:23

model that we've already passed,

41:24

concurrency, all that good stuff. And

41:25

it's just returning all of the outputs

41:28

there.
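The generation function being described is roughly this shape — a sketch that loops over the rows serially; the notebook's actual helper also handles concurrency:

from openai import OpenAI

client = OpenAI()

def generate_outputs(dataset: list[dict], system_prompt: str,
                     model: str = "gpt-4.1") -> list[str]:
    # Generate a JSON output for every row using the prompt under test.
    # response_format forces valid JSON; temperature=0 keeps runs consistent.
    outputs = []
    for row in dataset:
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": row["input"]},
            ],
            response_format={"type": "json_object"},
            temperature=0,
        )
        outputs.append(resp.choices[0].message.content)
    return outputs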

41:32

>> Would you, for the current generation of models — since this one's

41:33

basically, like, in AI terms, ancient — uh,

41:36

would you like still recommend setting

41:38

the temperature to zero or would you

41:40

actually want to try to encourage some

41:41

of the creativity to like

41:43

>> I think it depends on the use case a

41:45

little bit and what you're you're trying

41:46

to do. You can definitely experiment

41:47

that and kind of take it through the

41:48

lens of how important consistency is

41:50

to your use case. Something like the JSON

41:52

webpage one — I feel like consistency,

41:53

probably like temperature zero makes

41:55

sense but I definitely think not for

41:57

every agent every use case do you want

41:58

to use zero

42:03

Any other questions before we get moving? All right,

42:06

Additional metrics. So we kind of talked

42:07

about before that we are kind of using

42:09

some score mapping uh this part is

42:11

optional you want to use the metrics

42:13

that make sense [clears throat] to you

42:14

we're not directly using this um as like

42:17

we are kind of like using it to know

42:20

whether or not to keep optimizing, but it's not

42:21

like we're you know using this as our

42:24

sole kind of indicator for the success.

42:26

Uh here we are just going to calculate

42:30

some very basic metrics. Um it's just

42:33

you can you know choose something like

42:35

accuracy, F1, precision, recall — just some

42:38

basic kind of classification metrics for

42:40

us to understand and because we are

42:42

using binary mapping scores we can do

42:43

that. Um and so that's what you're

42:46

seeing happen here. We're mapping to

42:47

binary and then just based off the score

42:48

we calculate the metric. So very simple

42:52

uh helper function here.
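The helper boils down to standard binary-classification arithmetic over the mapped scores; a sketch (precision/recall/F1 only apply when you also have ground-truth labels to compare against):

def binary_metrics(predicted: list[int], actual: list[int] | None = None) -> dict:
    # With no ground truth, "accuracy" is simply the share of rows scored 1.
    metrics = {"accuracy": sum(predicted) / len(predicted) if predicted else 0.0}
    if actual is not None:
        tp = sum(p == 1 and a == 1 for p, a in zip(predicted, actual))
        fp = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))
        fn = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        metrics["precision"] = precision
        metrics["recall"] = recall
        metrics["f1"] = (2 * precision * recall / (precision + recall)
                         if (precision + recall) else 0.0)
    return metrics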

42:57

All right, the good stuff: the

42:58

optimization loop. We made it. Um okay,

43:00

so this cell implements the core prompt

43:02

optimization algorithm. It's a

43:04

three-part process. Uh so we want to

43:06

generate and evaluate. So generate

43:07

outputs using the current prompt on the

43:09

test data set and evaluate their

43:10

correctness. Uh we want to train and

43:13

optimize. If results are unsatisfactory,

43:15

generate uh outputs on the training set,

43:18

evaluate them, use the feedback to

43:19

improve the prompt, and then iterate. So

43:22

we kind of want to repeat until either

43:23

the threshold is met or all the loops

43:25

are kind of completed. So if you

43:28

remember above, um we're kind of setting

43:31

that to just like five loops. Um and

43:34

then you know we can kind of repeat um

43:37

based off of that, or until the threshold is met. Um,

43:41

it's going to track metrics across all

43:42

the iterations. It returns detailed

43:44

results including the train and test accuracy

43:46

scores, the optimized prompts, and the raw

43:48

values. So as I kind of mentioned at the

43:50

beginning as we're running these

43:51

different loops on the experiments we're

43:53

going to be producing a lot of different

43:54

prompts. Um and so we're kind of getting

43:56

that information back that you can use.

43:59

Um and these are our key parameters.

44:01

I'll kind of go through them, you know,

44:02

as we get to the code, but just to give

44:04

you a heads up. Uh, this is the target

44:06

accuracy score to stop optimizations.

44:09

Um, it could also be whatever other

44:10

metric makes sense; you'll see we have a score threshold. So

44:12

you can also determine the number of

44:14

loops, that is, the optimization iterations.

44:16

We've set that, and then the number

44:18

of rules. Again, these are some

44:20

configurations we've already set.
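
To make those parameters concrete, they boil down to something like this illustrative config; the names and values here are placeholders, not the workshop's exact settings.

```python
# Illustrative values only -- the workshop notebook sets its own.
ACCURACY_THRESHOLD = 0.85  # stop optimizing once test accuracy reaches this
MAX_LOOPS = 5              # maximum number of optimization iterations
MAX_RULES = 10             # cap on the number of rules the optimizer may add
```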

44:23

Um, cool. So, optimization loop. This is

44:27

um going to take in all of those um you

44:30

know parameters that I've mentioned

44:32

there. Um it just kind of kicks off

44:34

saying hey we're starting um it's going

44:37

to do the initial evaluation so we

44:39

understand uh how things are starting

44:41

off. Again you can kind of pass in data

44:44

too. You can kind of skip this initial

44:45

evaluation. We're kind of running it um

44:48

at the start here. But if you were

44:49

running in a production setting, you might

44:50

already have evals.

44:55

Um, and then it's going to assess the

44:58

threshold against kind of our initial

45:00

evaluation. Again, this could kind of be

45:01

skipped when we're coming from a

45:02

production setting, but wanted to kind

45:04

of start us off from scratch so that we

45:06

can get a real feel for this. Um, and

45:08

then it starts the loop. So, we're

45:10

generating output. Um, it's setting that

45:13

as the train output. So, when I printed

45:15

train, you kind of saw the outputs. I

45:16

kind of skipped ahead there. Um and then

45:18

it also will set um you know

45:20

correctness, explanation, any rule

45:22

violations. Um and then we'll actually

45:25

use our prompt learning optimizer. So

45:27

this comes with the SDK, uh, the prompt

45:29

learning SDK that you can use um with

45:31

Arize. Uh so we're sending in that

45:33

prompt we're optimizing, the model choice, and

45:36

then the API key. So under the hood, as we

45:39

talked about in the slides taking in

45:40

that feedback um taking in the original

45:44

prompt and trying to optimize to get

45:46

better results and then spinning out

45:47

a new prompt. Um, and you can also add an

45:50

evaluator. So again those three um kind

45:53

of feedback columns we're looking to get

45:55

back: correctness, the explanation for it, and

45:57

whether there are any rule violations. And

45:59

then from there we just kind of kicked

46:02

off the optimizer and optimize with our

46:04

train set output those feedback columns

46:06

again and then you know any context size

46:08

limitations you want to add. Next step:

46:12

the optimizer again is going to take

46:14

our data and produce a prompt, and we want to

46:15

evaluate that so we understand how we're

46:17

doing, which is what this code block is

46:19

doing here: generating with that new

46:21

prompt again with all those details,

46:23

getting our result. And then we do that

46:27

with our test set as well and then we're

46:30

getting back like our score and our

46:32

metric value and then doing the checks

46:34

and then we repeat it all again till we

46:36

either get above our threshold or we've

46:38

hit the max number of loops and then

46:41

returning our results. So that's kind of

46:43

what's going to be happening over here.

46:45

Any questions on that?
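
To make the shape of the loop concrete, here is a condensed sketch. The PromptLearningOptimizer import path and argument names are paraphrased from the walkthrough above and may differ from the SDK's actual API; evaluate_outputs is a hypothetical eval helper, and the other names come from the earlier sketches.

```python
# Condensed sketch of the optimization loop described above; names are assumptions.
from prompt_learning import PromptLearningOptimizer  # assumed import path

def run_optimization_loop(system_prompt, train_df, test_df,
                          threshold=ACCURACY_THRESHOLD, max_loops=MAX_LOOPS):
    prompt = system_prompt
    results = []
    for iteration in range(max_loops):
        # 1. Generate and evaluate: run the current prompt over the test set.
        test_outputs = generate_outputs(test_df, prompt)
        test_evals = evaluate_outputs(test_df, test_outputs)  # hypothetical: correctness / explanation / rule violations per row
        test_accuracy = accuracy_from_correctness(test_evals["correctness"])
        results.append({"iteration": iteration, "prompt": prompt,
                        "test_accuracy": test_accuracy})
        if test_accuracy >= threshold:
            break  # threshold met -- stop optimizing

        # 2. Train and optimize: evaluate the training set and feed that
        #    feedback, plus the current prompt, to the optimizer.
        train_outputs = generate_outputs(train_df, prompt)
        train_evals = evaluate_outputs(train_df, train_outputs)
        optimizer = PromptLearningOptimizer(
            prompt=prompt,
            model_choice="gpt-4o",  # assumption; pick your own optimizer model
        )
        prompt = optimizer.optimize(
            dataset=train_evals,
            feedback_columns=["correctness", "explanation", "rule_violations"],
            # context-size limits can also be passed here, per the walkthrough
        )
        # 3. Iterate: loop back around with the improved prompt.
    return results
```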

46:53

Uh, just some result-saving functions, more

46:55

helper functions here. So we do want to

46:57

obviously save all these results. We

46:58

don't want them to just be ephemeral such that

47:00

we can't ever access them again. So we're just

47:01

saving them all. Um you can also save

47:03

each single experiment so you

47:05

have all of that data towards the end.

47:06

We'll be able to kind of pull this um

47:08

and determine what the best prompt is.

47:10

But these are just very basic helper

47:12

functions, so I won't spend too much time;

47:13

just saving them to CSV at the end of

47:15

the day.
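
A minimal sketch of what those helpers amount to, assuming the results are a list of per-iteration dicts like the ones built in the loop sketch above; file names are illustrative.

```python
# Sketch only: persist the results so they aren't ephemeral.
import json
import pandas as pd

def save_results(results: list[dict], stem: str = "prompt_optimization_results") -> None:
    """Write the per-iteration results to both JSON and CSV."""
    with open(f"{stem}.json", "w") as f:
        json.dump(results, f, indent=2)
    pd.DataFrame(results).to_csv(f"{stem}.csv", index=False)
```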

47:19

Now we execute it. This cell runs the prompt optimization

47:21

experiment, saves the results. We're

47:22

getting the JSON format, the CSV format.

47:24

Um, it includes columns for the iteration

47:26

number, the number of rules, test,

47:28

and train accuracy scores, all the data

47:30

that we're actually going to need to

47:31

evaluate uh whether or not this thing is

47:34

successful, and then we're going to

47:35

start getting uh results here. So, um,

47:39

this does take quite a while to run. So,

47:41

we'll run it, and I think this will be a

47:42

great point for discussion, but as you

47:44

kind of are running it, you're going to

47:45

start seeing the different loops. um

47:48

kind of outputs coming out as well. Um

47:52

and yeah, we'll just kind of like work

47:54

through it as it runs. It's probably

47:56

going to take like 20 to 30 minutes for

47:57

things to run, but um happy to take any

48:00

questions and help anybody out as they

48:02

run into issues.
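
Putting the hypothetical pieces from the earlier sketches together, the execution cell amounts to roughly this; the dataset and prompt variable names here are placeholders.

```python
# Illustrative only: wires together the hypothetical helpers sketched above.
results = run_optimization_loop(
    system_prompt=original_prompt,   # placeholder: the starting system prompt
    train_df=train_dataset,          # placeholder: training split
    test_df=test_dataset,            # placeholder: testing split
    threshold=ACCURACY_THRESHOLD,
    max_loops=MAX_LOOPS,
)
save_results(results)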

48:04

>> One thing, can you scroll back to the

48:05

part of code that we needed to change?

48:12

>> Change. It's gonna be

48:22

>> So, one reminder if you

48:24

are running into this. I don't think I

48:25

was

48:26

>> for this line here, like when you're

48:28

doing the install, you do want it to be equals

48:30

2.2. Um, because I think there's a

48:33

little bit of a package issue. Um, so

48:36

just make sure of that if you're hitting

48:38

errors with the eval. If not, let me come

48:40

down and try to fix it. This is the

48:43

reason why [clears throat]

48:45

>> uses like a generic evaluation.

48:48

>> Yes. And you can kind of see the

48:50

evaluation prompt if you go to the

48:52

We've kind of just taken that part out

48:54

of this, but we can definitely go

48:55

through that. Um so if you look here um

48:58

on this line here, we're reading in

49:02

um under prompts here,

49:05

you can find the evaluation if you're

49:07

curious. [snorts]

49:12

And this is the reason why everyone

49:14

hates on Docker. This is why we use

49:19

all.

49:20

>> Yes, absolutely.

49:23

>> The notebook.

49:27

>> So I would also recommend uh patching

49:30

your code with nest_asyncio if you haven't

49:32

already. Helps it run a lot faster. Also

49:34

for the purpose of the workshop um I

49:37

switched our loops to one. uh that took

49:39

me six minutes to run. So would

49:41

recommend also doing that instead of

49:43

having five. Obviously wouldn't

49:45

recommend doing that when you're

49:46

actually optimizing your prompt, but for

49:48

now it'll help you get through the

49:49

workshop.
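
In the notebook, that patching is just the standard nest_asyncio call, plus dialing the loop count down for the workshop; MAX_LOOPS is the illustrative name from the config sketch above, not necessarily the notebook's variable.

```python
# nest_asyncio lets async calls run inside the notebook's already-running event loop.
import nest_asyncio
nest_asyncio.apply()

# Workshop-only shortcut: a single optimization loop finishes much faster than five.
MAX_LOOPS = 1
```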

49:52

>> All right, I just want to kind of call

49:53

out the last little bit here. Um

49:58

>> the last step

50:02

before folks leave. Let's see. Okay. Um so the

50:06

the last little bit of code here um is

50:08

just to extract the prompt that achieves

50:10

the best test accuracy. So I mentioned

50:12

how we're kind of like saving up all the

50:13

results to use. Uh we just have a

50:15

function that essentially gets the last

50:17

or the best uh version of that kind of

50:20

showing you the original and then the

50:21

best optimized version uh which you can

50:24

then use to kind of [clears throat] pull

50:25

and put into your code.
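
A sketch of that extraction, assuming the per-iteration results structure from the loop sketch above.

```python
# Sketch only: pick whichever iteration scored best on the test set.
def best_prompt(results: list[dict]) -> str:
    """Return the prompt from the iteration with the highest test accuracy."""
    return max(results, key=lambda r: r["test_accuracy"])["prompt"]

print("Original prompt:\n", results[0]["prompt"])
print("Best optimized prompt:\n", best_prompt(results))
```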

50:27

I did want to give one call out:

50:29

as you saw today, this can be a little

50:32

bit difficult to manage, and so I

50:35

want to call out for those of you who

50:36

are kind of maybe looking for more of

50:37

like an enterprise solution to this in

50:39

Arize, uh, you do have these prompt

50:40

optimization tasks. Uh you can have your

50:42

prompts living in our prompt hub um data

50:45

sets with all of your human annotations

50:46

or evals that you can either create from

50:48

traces or just by ingesting it into

50:50

Arize. Um, and then from there, all you

50:52

really need to do is like give it a task

50:54

name, choose what you want your training

50:56

data set to be, where the output lives,

50:58

where all your feedback columns are. Uh,

51:00

you can adjust all of the parameters uh

51:03

that you'd like. And then from there,

51:05

you can just like kick it off and it

51:06

will produce an optimized prompt in the

51:08

hub for you. Um, so if I go over here, I

51:10

think I have some. No, maybe not.

51:17

it will basically just create a new

51:18

version here that says it's optimized

51:19

prompts with all the results and we are

51:22

building on this so you can add all your

51:23

evals to it and have that all running in the

51:25

loop but just wanted to call out that if

51:27

you're not interested in maybe

51:28

maintaining code loops and having to

51:31

build uh like a task infrastructure

51:33

yourself it is something that we do

51:34

offer in Arize. Um, but yeah. I

51:38

know some folks are hanging out, so we'll be

51:39

sticking around here for a little while

51:41

and we can help you kind of work

51:43

through issues. But uh, thanks so much for

51:46

joining us. Um hopefully you learned

51:48

something useful.

51:50

[music]

51:55

[music]

Interactive Summary

The video introduces "prompt learning," a method for optimizing AI agent performance. It begins by outlining common reasons for agent failures, such as weak instructions and lack of robust planning. The speakers then differentiate prompt learning from reinforcement learning and metaprompting, emphasizing its unique incorporation of detailed English feedback from both LLMs and human experts to precisely identify and address issues. A case study on coding agents demonstrates prompt learning's effectiveness, showing a 15% performance improvement simply by adding specific rules to the system prompt, without needing complex architectural changes or fine-tuning. The presentation also clarifies that "overfitting" in this context is re-framed as building
