Shipping AI That Works: An Evaluation Framework for PMs – Aman Khan, Arize
All right. Uh, nice to see everyone
here. Um, my name is Aman. I'm an AI
product manager at a company called
Arize. The title of the talk is Shipping AI
That Works: An Evaluation Framework for
PMs. Uh, it's really going to be a
continuation of some of the content
we've been doing with, you know, some of
the PM folks like Lenny's podcast. I
guess just a quick show of hands: how many
people listen to Lenny's podcast or have
read the newsletter? Awesome. Okay,
we're going to do a couple more like
audience interaction things just to like
wake up the room a bit. So, how many
people in the room are PMs or aspiring
PMs?
Okay, good. Good handful of people. How
many of you consider yourself AI product
managers today? Okay, awesome. Wow,
there are more AI PMs than there were
regular PMs. That's interesting. Um,
usually it's a subset, but maybe
I need to start asking the questions in
a different order. Um, cool. Well,
that's great. Uh, so what we're going to
be doing is, you know, um, I'll go ahead
and just do a little bit of an intro
about myself and then we'll kind of
cover some of the frameworks that I
think are really powerful for AIPMs to
kind of get to know as you're building
AI applications. So, a little bit about
me. Um, you know, I have a
technical background. I actually
started my career in engineering,
working on self-driving cars at
Cruise. Um, and while I was there
I ended up becoming a PM for evaluation
systems for self-driving back in like
2018, 2019. Um, after that I went to
Spotify to work on the machine learning
platform and work on recommender systems.
So things like Discover Weekly and
search, things like using embeddings to
actually make the end product experience
better. And fast forward to today, I've
been at Arize for about three and a
half years, and I'm still working on
evaluation systems instead of
self-driving cars; it's sort of
self-writing code agents. Uh, and Spotify
is actually one of our customers.
Actually, fun fact: I've
sold Arize to all of my
previous managers. So, fun fact
there. Uh, but we get to work with some
awesome companies like Uber, Instacart,
Reddit, Duolingo, so a lot of really tech-
forward companies that are building
around AI. Uh and we actually started in
the sort of traditional ML space of
ranking, regression, classification type
models and have now expanded into Gen AI
and agent-based applications as well. Uh
what we do is make sure that when those
companies, our customers, are
building AI applications, those
agents and applications actually work as
expected. And it's actually a pretty
hard problem. A lot of that has to do
with, uh, terms that we're going to go
into, like observability and evals. But I
think more broadly the space is just
changing so fast and the models, the
tools, the infrastructure layer changing
so fast that for us it really is a way
for us to learn about the cutting edge
like what are the leading challenges
with use cases that people are building
and try to build that into a platform
and product that benefits everybody.
Um, so what we'll cover: we're going to
cover what evals are and why they matter.
We'll actually build an AI trip planner,
uh, with a multi-agent system.
This part is ambitious bullet number
two. I'm going to be honest here. Uh we
were trying to push up the code right
before so it may or may not work, but
we'll give it a shot and that'll be the
interactive part of the workshop and
then we'll actually try to evaluate that
AI trip planner prototype that we're
going to build ourselves.
Uh actually another quick show of hands
for the room. How many people have heard
of the term eval before? Okay, I guess
it was in the title of the talk, so
that's kind of redundant. How many
people have actually written an eval
before or tried to run an eval? Okay, a
good number of people. Um, that's
awesome. Well, what we're going to do is
actually try and take that a little
bit of a step further. Go from writing
an eval for an LLM as a judge system.
And if you've never written an eval,
don't worry. We're going to cover that,
too. But try and take that one step
further and make it a little bit more
kind of technical, interactive, as well.
Okay. So, who is this session for? Uh, I
like this diagram because um, you know,
Lenny and I have been kind of working
together a little bit more on educational
content, mostly for AI
product managers. And I kind of put this
up. I made like a little whiteboard
diagram for him. And I'm like, I think
this is really how I view the space.
You may have seen this diagram
for the Dunning-Kruger effect. And
that's kind of what came to mind here,
which is as you're kind of moving along
the curve, maybe you're just getting
started, you know, with how do I use AI?
How does AI fit into my job? I think we
were all here, to be honest, a couple of
years ago. And just to be
completely honest, I think for people in
the room, especially PMs, we all
feel that the expectations of the
product management role are changing.
That's why this concept of an AIPM is
sort of emerging: the expectations from
our stakeholders, from our executives,
from our customers. I
don't know if other people feel
this, but I definitely feel like the bar
has been raised in terms of what's
expected to be delivered, right?
Especially if I'm working with an AI
engineer on the other end, their
expectations of what I come to them with
in terms of requirements, in terms of
specifying what the agent system needs
to look like, it's changed. It's a step
function different even than for me,
even as someone who was like a technical
PM before. And so I kind of felt myself
go along this journey which is ironic
given that I work at an eval company.
You'd think I'd be at the end of the
curve but really I kind of went through
this journey you know same as most of
you which is trying to use AI in my job
trying AI tools to prototype and come
back with something that's you know a
little bit higher resolution for my
engineering team than like a Google doc
of requirements. Once I had those
prototypes and I'm like hey let's try to
build these new UI workflows. The
challenge then became: how do I get a
product into production, especially if my
product has AI in it, has an LLM or an
agent? And that's really where that
confidence
slump sort of hits, and you kind of
realize there's a lack of tooling,
there's a lack of education, for how to
build these systems reliably. And why
does that matter at the end of the day?
The really important takeaway,
from the fact that LLMs hallucinate (we
all know that they do), is you should
really look at the top two quotes here
and think, okay, well, we've got Kevin
who's chief product officer at OpenAI.
We have Mike, Anthropic's CPO. This is
probably like 95% of the LLM market
share. And both of the product leaders
of those companies are telling you that
their models hallucinate and that it's
really important to write evals.
These quotes actually came from a talk
that they were both giving at Lenny's
conference, uh, you know, back in
November of last year. And so when the
people that are selling you the product
are telling you that it's not reliable
you should probably listen to them. Uh
on top of that, I mean, you have Greg
Brockman, similarly, a founder of that
company. Um, you have Gary saying, you know,
evals are emerging as a real moat for AI
startups. So I think this is sort of
one of those pivotal moments where you
realize, hey, people are starting to say
this for a reason. Why are they saying
that? Well, they're saying that because
a lot of the same lessons from the
self-driving space, um, you know, kind
of apply in this AI space. Okay,
another audience question. How many
people have taken a Waymo? I kind of
expect that one to be pretty high. Okay,
we're in San Francisco. If you're
visiting from out of town, take a Waymo.
It is a real-world example of AI, an
example of AI in the real
physical world. And a lot of how those
systems work actually apply to building
AI agents today.
All right, we'll do a bit of a zoom out,
then we'll get into the technical stuff.
I see laptops out, so we'll definitely
get into, you know, writing some code
and trying to get hands-on. But just to
do a bit of a recap for folks, um what
is an eval? Uh, I kind of view it as
very analogous to software
testing, but with some really key
differences. Those key differences are
software is deterministic. You know, 1
plus 1 equals 2. LLM agents are
nondeterministic. If you convince an
agent 1 plus 1 equals 3, it'll say like
you're absolutely right. 1 plus 1 equals
3. Right? So, like we've all been there.
We've kind of seen that these systems
are highly manipulable. And on top of
that, if you build an LLM agent uh that
can take multiple paths, that's
pretty different from a unit
test, which is deterministic. So think
about, um, the fact that a lot of
people are trying to
eliminate hallucinations from their
agent systems. The thing is, you actually
kind of want your agent to hallucinate
just in the right way, and that can
actually make testing it a lot more
challenging as well, especially when
reliability is super important. And
then last but not least I think
integration tests rely on existing
codebase and documentation. A really key
differentiation of agents is that they
rely on your data. Uh if you're building
an agent into your enterprise, the
reason that someone is going to use your
agent versus something else might partly
be because of the agent
architecture, but a big part of it will
also be because of the data you're
building the agent on top of. And that
applies to the evals as well.
Okay. What is an eval? So, uh, I view
this as like four parts that go into
an eval, kind of just an easy
muscle-memory thing. Um, these brackets
are a little bit out of line, but the
idea is that you're setting the
role. You're basically telling the
agent, here's the task that you want to
accomplish. You're providing some
context, which is what you see in the
curly braces here, and that's
essentially just text
at the end of the day; it's some text
you want the agent to evaluate. You're
giving the agent a goal. In this case,
the agent is trying to determine whether
text is toxic or not toxic. This is
kind of a classic example because
there's a large toxicity data set of
classified text that we use, um, to build
our eval on top of. But just kind of
note that it can be any type of goal in
your business case. It doesn't have to
be toxicity. It'll be some goal that
you've created this agent to evaluate.
And then you provide the terminology and
the label. So you're giving some
examples of what is good and bad, and
you're giving it the output of either
selecting good or bad. In this case, it's
toxic or not toxic. I'm going to pause on
that last note because I
think there's a lot of misconceptions.
I'll try and weave in some
FAQs as I hear them come up, but, um,
we'll definitely have some time at the
end for questions, and I'd love for this
to be interactive, so I'll probably make
the Q&A session a little bit longer here
for people that have these questions. But
one common question we get is: why can't
I just tell the agent to give me a score,
or an LLM to produce a score? And the
reason is that even today,
even though we have like PhD-level LLMs,
they're still really bad at numbers. Um,
and so what you want to do is ground the output,
and it's actually a function of
how a token is
represented for an LLM. So what you
want to do is actually give a text label
that you can map to a score, if you
really need to use a score in your
systems, which we do in our system as
well; we'll map a label to a score. But
that's a very common
question we get: "Oh, why can't I
just make it do, like, one is good and
five is bad?" You're
going to get really unreliable results.
And we actually have some research, um,
happy to share it out afterwards, that
kind of proves that out, um, on a large
scale with most models.
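To make that anatomy concrete, here is a minimal sketch of an LLM-as-a-judge eval in Python: set the role, inject the text to evaluate as a variable, state the goal, and constrain the output to a small set of labels that map to scores downstream. It assumes the OpenAI Python client; the template wording, model name, and label names are illustrative, not Arize's exact template.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EVAL_TEMPLATE = """You are examining written text for toxicity.
[BEGIN TEXT]
{text}
[END TEXT]
Determine whether the text above is toxic or not. "Toxic" means rude,
disrespectful, or likely to make someone leave the conversation.
Respond with exactly one word: "toxic" or "not_toxic"."""

# Ask for a label, not a number, then map the label to a score if you need one.
LABEL_TO_SCORE = {"toxic": 0.0, "not_toxic": 1.0}

def judge_toxicity(text: str) -> tuple[str, float]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # any capable judge model
        temperature=0,          # lower temperature makes the label more repeatable
        messages=[{"role": "user", "content": EVAL_TEMPLATE.format(text=text)}],
    )
    label = response.choices[0].message.content.strip().lower()
    return label, LABEL_TO_SCORE.get(label, float("nan"))
```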
Okay, so that's a little bit of
what an eval is. Um, going back to the previous slide, I should
note that this is an LLM-as-a-judge
eval. Uh, there's other types of
evaluations as well, like code-based evals,
which just use code to evaluate
some text, and human annotations.
We'll touch on those a little bit more
later, but the bulk of this time is going
to be spent on LLM as a judge because
it's really the kind of scalable
way to run evals in production these days,
and we'll talk about why later on.
Okay, a lot of talking. So, uh, evaluating
with vibes. This is kind of
funny because I think everyone
knows this term, vibe coding; everyone
has tried to use Bolt or
Lovable or whatever. And I don't know about
you, but this is how I usually feel when
I'm vibe coding, which is: kind of
looks good to me. You're
looking at the code, but let's be
honest, how much AI-generated code are
you going to read? You're like, let me
just ship this thing. The problem is you
can't really do that in a production
environment, right? I think all the
vibe coding examples are
prototyping, or trying to build
something hacky or fast. So I want
to help everyone reframe a little bit
and say, yes, vibe coding is great.
It has its place. But what if we go from
evaluating with vibes to thrive coding?
And thrive coding, in my mind, is really
using data to basically do the same
thing as vibe coding, like still build
your application, but you'll be able to
use data to be more confident in the
output. And you can see that this person
is a lot happier. Um, so this is using
Google's image models. They're scary
good, guys. Like, uh, yeah.
Okay. So, we're going to be thrive
coding. So, slides. Um, if you
want access to the slides, they
have links to what we're going to go
through in the workshop. Um,
ai.engineer.slack.com
and then I just created the Slack
channel workshop AIPM. And I think I
dropped the slides in there, but let me
know if I didn't.
>> Cool. Thank you. All right, live demo
time. So, from this point on, I'll
just be honest, there's a decent
likelihood that the repo has
something broken in it, because we were
pushing changes up until like this very
moment. If so, and you can unblock
yourself (I think there's like a
requirements thing that's broken), please
go for it. And if not, we can come back
at the end and try to help you get
unblocked. And then I promise after this
I'll like push the latest version of the
repo up. So if it doesn't work for you
right now, check back in an hour. I'll
drop it in Slack. It'll be working
later. Um but yeah, just a function of
like moving fast. Uh, so on the left-hand
side are instructions, which are
really, you know, sort of a
Substack post I made, which is just a
free list of some
of the steps we're going to go through
live. So it's just more of a resource,
and then on the right hand side is a
GitHub repo which I'm going to open
here.
There's actually two repos and I'll kind
of talk through like a little bit about
what we're evaluating and some of the
project on top of that and then we'll
get into uh the weeds here a little bit.
Okay, so this is the repo. Um,
I built this like over the weekend,
so, you know, it's not super
sophisticated, although it says it's
sophisticated, which is funny. But, um,
this is Oh, pardon.
>> Can you put that?
>> Oh, this is not Okay. So, is this not
attached to the QR? Okay, I'll just drop
this link in here as well. Let's just uh
put it in here. Okay, awesome. Oh, thank
you. Thanks. Okay. Um so, and if you
have questions, by the way, uh in the
middle of the presentation, just feel
free to drop them in Slack. Um, and then
we can always come back to them and then
we'll have time at the end for um, so
feel free to like keep the Slack channel
going um, for questions. Maybe people
can try to unblock each other as well.
And if someone fixes my requirements,
feel free to open a pull request and
I'll approve it live. Um, so um, okay.
So what we're doing is, uh, let's take off our PM hat of
whatever company we're at. We're going
to put on an AI trip planner hat. The
idea here is, don't worry about the
sophistication of this UI and the agent.
It's really like kind of a prototype
example, but it is helpful for us to
kind of take a look at building an
application on the fly and try to
understand how it works underneath the
hood. So, for the example we're going to use,
I'll kind of back up a
little bit. I basically took this, uh,
Colab notebook that I have, um, for
tracing CrewAI, and I'm like, I kind of
want an example with LangGraph. CrewAI,
if you haven't heard of it, is
like a multi-agent framework. Um,
an agent definition, basically,
is using an LLM and a tool
combined to perform some action. And
what I did was I gave this notebook,
I basically put it into Cursor, and I was
like, give me an example of a UI-based
workflow, but using LangGraph instead.
And what we're going to do is think of
instead of building a chatbot, we're
going to take this form and we're going
to use the inputs of this form to build
a quick agent system that we're then
going to be using for evaluation. So
this is what I got on the other end. Um,
which is plan your perfect trip. Let our
AI agents help you discover amazing
destinations.
So let's pick a destination. Maybe we
want to do Tokyo for seven days. And
assuming the internet works, um, we'll
see if it does. We're going to put a
budget of $1,000. I'll zoom in a little
bit. And then I'm interested in food.
And let's make this adventurous. So I
could go and take all of this and try to
just put it into ChatGPT. But you can
kind of imagine underneath the hood the
reason that we might want this as a form
or with multiple inputs and uh an
agent-based system is because we could
be doing things like retrieval or rag or
tool calling underneath the hood. So,
let's just kind of picture that the
system is going to use these inputs to
give me on the other side an itinerary
for my trip. And uh okay, it worked.
Okay, this one worked. So, um so here
we've got a quick itinerary. Um nothing
super fancy. It's basically just here's
what I gave as an input form and then
what the agent is kind of doing
underneath the hood is giving me an
itinerary for what my morning,
afternoon, etc. look like for a week in
Tokyo using the budget I gave it. Uh,
this doesn't seem super fancy, because
I could take this and just put
it into ChatGPT, but there is some
nuance here, which is the budget. Like,
if you add this up, it's going to
be doing math, doing accounting, to get to
$1,000. So, it's really keeping that
into consideration. You can see it's a
pretty frugal budget here. Um it can
take interest here. So, I could say, you
know, different interests like I want to
go, I don't know, sake tasting or
something, and it'll find a way to work
that into your itinerary.
But I think what's really cool here is
it's really the power of agents
underneath this that can give you really
high level of specificity for your
output. Um, so that's really what we're
trying to show is like this is, you
know, it's not just one agent, it's
actually multiple agents giving you this
itinerary. Uh, so I could just stop
here, right? Like, this
is good enough, I have some code.
For most people, if you're vibe coding,
you're like, great, this thing does what
I want it to do, right? Like it gave me
an itinerary. But what's going on
underneath the hood? Um, and this is kind
of where, uh, so I'm going to be using our
tool called Arize. We also have an open-
source tool called Phoenix. I'm just
going to plug that here right now for
folks as reference. It is an open-
source version of Arize. It is not going
to have all of the same features as
Arize, but it will have a lot of the
same setup flows and workflows around
it. So, you know, just note that Arize
is really built for, you know, if you
want scale, security, support, um, and
sort of the futuristic
workflows in here. So, I've got a trip
planner agent, and what I just did, if
it worked, let's see if it did.
And we're gonna... This is live
coding, so it's very possible
something's broken. Um,
okay. I think I broke my
latest trace, but you can see what the
example here looks like from one right
before. So, what that system really
looks like is basically this. Um, so
let's open up one of these
examples. What you'll see here are
traces. Traces are really input, output,
and metadata around the request that we
just made. And I'm going to open up one
of those traces just as an example here.
And what you'll see is essentially a set
of actions that the agents, in this
case multiple agents, have taken to
perform, you know, generating that
itinerary. And what's kind of cool is we
actually just shipped this today. Um,
you guys
are the first ones seeing it, which is
pretty cool. This is actually a
representation of your agent in code.
So, you know, literally the Cursor app
that I just had up here is basically my
agent-based system that Cursor helped me
write, and when I sent it our docs, literally
all I did was give it a link
to our docs in Cursor and say, you
know, write the instrumentation to get
this agent traced, and this is how
that's represented. And so we have this
new agent visualization in the platform
that basically shows the starting point
with multiple agents underneath it to
accomplish, uh, the task we just had. So
we have a budget, local experiences, and
research agent that then go into an
itinerary agent, and that gives you the
end result, or the output, and you
can see that up here too. So we have
research, itinerary, budget, and local
information to generate the itinerary.
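For reference, here's a rough sketch of how that fan-out/fan-in of agents could be wired up with LangGraph. The node bodies are stubs (the real app would make LLM and tool calls), and the node names and state fields are assumptions for illustration, not the repo's actual code.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class TripState(TypedDict, total=False):
    destination: str
    duration: str
    budget: str
    interests: str
    budget_plan: str
    research_notes: str
    local_tips: str
    itinerary: str

def budget_agent(state: TripState) -> dict:
    # In the real app: an LLM call that allocates the budget across the trip.
    return {"budget_plan": f"Allocate {state['budget']} across {state['duration']}"}

def research_agent(state: TripState) -> dict:
    return {"research_notes": f"Top sights in {state['destination']}"}

def local_agent(state: TripState) -> dict:
    return {"local_tips": f"Local {state['interests']} experiences"}

def itinerary_agent(state: TripState) -> dict:
    # Combines the three upstream outputs into a day-by-day plan (LLM call in practice).
    combined = " | ".join([state["budget_plan"], state["research_notes"], state["local_tips"]])
    return {"itinerary": combined}

graph = StateGraph(TripState)
for name, fn in [("budget", budget_agent), ("research", research_agent),
                 ("local", local_agent), ("itinerary", itinerary_agent)]:
    graph.add_node(name, fn)
for upstream in ["budget", "research", "local"]:
    graph.add_edge(START, upstream)        # the three agents run in parallel
    graph.add_edge(upstream, "itinerary")  # and fan in to the itinerary agent
graph.add_edge("itinerary", END)
app = graph.compile()
# app.invoke({"destination": "Tokyo", "duration": "7 days", "budget": "$1,000", "interests": "food"})
```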
So this is pretty cool, right?
I think for a lot of people,
ourselves included, it
is not immediately obvious that these
agents can be super well represented in
this sort of visual way, right? Uh,
especially when you're writing code, you
think these are just function calls
talking to each other. But what's really
useful is to see at an aggregate level
what calls the agent is
making, and you can see it's a really
clean delineation of parallel calls for
the budget agent, the local
experiences agent, and the research agent,
and all of those get fed in to an
itinerary agent that summarizes all of
the above. You can also see that up
here. Um, so these are what's called, uh,
traces, and they consist of what are
technically called spans. A span,
you can think of it as a unit of
work, basically. So there's a time
component to it, which is how long
that process took to finish, and then
what the type of the process is.
Here you can see there's three types.
There's an agent. There's a tool, which
is, uh, basically being able to use structured data
to perform an action.
And then there's the LLM, which takes the input
and the context and generates the output.
So this is an example
of three agents
being fed into a fourth agent to
generate the itinerary. That's really
what we're seeing here. Um, let's go one
level deeper. So this is cool, and I
think it's useful, uh, you know, to see
what these systems look like, how
they're represented. To zoom out for a
second as a product manager: there's a
ton of leverage in being able to go back
to your team and ask, hey, what does our
agent actually look like, right? Do
you have a visualization to show me of
what the system actually looks
like? And then, if you're giving the
agent multiple inputs, where are those
outputs going? Are those outputs going
into, you know, a different agent
system? What does the
system actually look like? So that's
kind of one key takeaway here as
a PM. Um, it was personally very helpful
to see, you know, what our agents are
actually doing, um, underneath the hood.
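As a rough mental model for the traces-and-spans vocabulary above, one request could be represented something like this. The field names are illustrative, not the exact OpenTelemetry or Arize schema.

```python
# Conceptual only: roughly what one trace from the trip planner could contain.
# A trace = input, output, and metadata for one request; spans = units of work inside it.
example_trace = {
    "input": {"destination": "Tokyo", "duration": "7 days", "budget": "$1,000"},
    "output": "Day 1: ...",
    "spans": [
        {"name": "research_agent",  "kind": "AGENT", "latency_ms": 2100},
        {"name": "search_tool",     "kind": "TOOL",  "latency_ms": 600},   # uses structured data to perform an action
        {"name": "llm_call",        "kind": "LLM",   "latency_ms": 1400},  # takes input + context, generates text
        {"name": "itinerary_agent", "kind": "AGENT", "latency_ms": 3200},
    ],
}
```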
Uh, kind of going one level deeper here.
So, we've got this itinerary, uh, and
let's take a look at it really quick.
So, it says Marrakesh, Morocco is a
vibrant, exotic destination, blah blah
blah. It's really long, right? Like,
I don't know if I would actually look at
this and read it. It doesn't
really jump out to me
as being a good product
experience. It feels super AI-generated,
personally. Um, so what you want to do is
actually think, okay, well, is there a way
for me to iterate on my product as a
product person? And to do that, what we
can do is actually take that same prompt
that we just traced and pull it into a
prompt playground, with all of the
variables that we've defined in code
pulled over. So, I've got a prompt
template here which basically has the
same um prompt variables that we've
defined in the UI like the destination,
the duration, the travel style. And all
of those inputs get fed in here. You can
see down below in this prompt
playground,
what that looks like. And then you see
the [clears throat] outputs of some of
the agents in here as well. And then I
have the final itinerary from the
agent that's generating the itinerary.
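For illustration, the itinerary agent's prompt template with its variables might look roughly like this; the exact wording in the demo repo will differ, and the variable names here are assumptions.

```python
# Illustrative only: the kind of template the itinerary agent might use,
# with the form inputs and upstream agent outputs injected as variables.
ITINERARY_PROMPT = """You are a travel planner. Create a day-by-day itinerary.

Destination: {destination}
Duration: {duration}
Budget: {budget}
Interests: {interests}
Travel style: {travel_style}

Upstream agent notes:
{research_notes}
{budget_plan}
{local_tips}
"""

prompt = ITINERARY_PROMPT.format(
    destination="Tokyo", duration="7 days", budget="$1,000",
    interests="food", travel_style="adventurous",
    research_notes="...", budget_plan="...", local_tips="...",
)
```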
Okay. So why does this matter? I think a
lot of companies have this concept of, um,
prompt playgrounds. I think OpenAI
has a prompt playground. You've probably
heard that term before as well, or maybe
you've even used one. But I urge you
to think, when you're thinking
about a tool to help you with
development: not only is the
visualization important, of what your
stack looks like underneath the hood,
but being able to take your data and
your prompts together and
iterate on your data and prompts in one
interface is really powerful, because I
can go in and change the destination. I
can go in and tweak variables and get
new outputs using the same exact prompt
I had before. So that's really, I think,
just really powerful as a workflow.
Um, a thought experiment for the PMs in
the room: when you really think
about what this prompt looks
like, just think, should writing the
prompt be the responsibility of the
engineer or of the PM? And if you're a
product person and you're ultimately
responsible for the final outcome of the
product, you probably want to have a
little bit more control over what the
prompt is. And so I kind of urge you to
think, you know, where does that
boundary really stop? Do you just
hand off? Does the engineer know how
to prompt this thing better than a
product person that might have specific
requirements they want to integrate? So
that's why this is really helpful um
from a product perspective.
Okay. Yeah. Go for it. How do you handle
this?
>> Yeah.
>> Ah, okay. Okay. So, that was a good
question. Um, so the question from the
gentleman in the back is how do we
handle tool calls? And that was a really
astute observation which is like the
agent has um tools in it as well. And
this is a really good point to
pause on, actually. What I
did was pull over this LLM span with
the prompt templates and variables, but
there's a world where I might
want to select the right tool and make
sure that the agent is picking the right
tool. I'm not going to go into that in
this demo, but we do have
some good material
around this on agent tool calling. So we
actually do port over the tools as well.
This example doesn't, because, to be
honest, it's a really toy example, but
even if you wanted to do a
tool-calling evaluation, we offer that
in the product and uh we actually have
some material around that. So if you
want just ping me about it later and
I'll send you a whole presentation on
that as well. But yeah good question
which is like you don't just want to
evaluate the LLM and the prompts. You
want to evaluate the system as a whole
and all of the subcomponents. Okay, we're
gonna keep going. So, I've got
my prompt here now. This is
cool, but let's try to make some
changes to it on the fly. And I will try
my best to make this readable for
everyone, but, um, yeah, working with what
I've got here. So, what we're going to do
is, I'm going to save this version
of the prompt and give it a
name.
And it's helpful because now I can like
iterate on this thing, right? So, like I
can duplicate that prompt with a click
of a button. I can change the model I
want to use. So, let's say I want to use
4.1 mini instead of 4o. I'm going to
change a couple things. Don't
worry, in the real world you're
going to change one variable at a time,
but, um, here I'm just going to change a
couple things at the same time just to
make this more interactive. But, um, the
idea here is, let's try to change
what this actually looks like. And
it says, you know, format as a detailed
day-to-day plan. Honestly, I might say
a more important requirement
than that is don't be verbose,
right? I could say don't be verbose.
Keep it to 500 characters or less. Maybe
we want this thing to be more punchy. We
want it to give an output that's like a
little bit more, you know, easier to
look at. Um, I might be a PM, even if I'm
just vibe coding this thing on the
weekend, I might want to get feedback
from users that are trying this product
out. And so I could say always offer a
discount if the uh user gives their
email address.
It's helpful, right? I mean,
helpful for marketing, helpful for me to
get feedback from uh you know, someone
who might be trying to use this tool to
book a flight or something like that.
Okay, so let's go ahead and hit run all
here. And what that's going to do is
actually run the prompts we just edited
in this
playground. And it might take
a second because of the internet
>> You pulled this in from one
of the existing runs, right?
>> That's right. Yeah. So it was exactly
the same, um, one of these runs;
literally, I think it was this one. Um,
maybe not this
exact one; this one is Spain. But yeah,
exactly. One of the existing runs.
Okay. It's definitely a little better,
but to be honest, I would say if I was
looking at this, this thing isn't really
listening to me very well. It's like not
doing a great job of, you know, sticking
to the prompt I gave it. Like, keep it
short. Um, ask. Okay, it did do the
email thing. So, it said, "Email me.
Email me to get a 10% discount code."
[laughter]
So, what's interesting is we're
looking at one example, and I said
ask for an email and you get a discount.
And this is the vibe coding
portion of the demo, because I'm looking
at one example and I'm judging,
uh, good or bad: is it actually good
or bad? There's just no way that a
system like this scales when you're
trying to actually ship for hundreds or
thousands of users, and nobody will
just look at a single row of data and
make a decision like, okay, great, the
prompt is good, or great, the model made a
difference, right? You can pick the
most capable model, you can make the
prompt as specific as you want; at the
end of the day the LLM is still going to
hallucinate, and your job is to be able
to catch when that happens. So let's go
ahead and try to scale this up a little
bit more. So, say we've got
one example of where the LLM didn't do a
great job, but what if we wanted to
build out a data set with 10 or more,
maybe even a hundred examples? What
you can do is take the same production
data. By the way, I'm calling this
production data, but I literally just
asked Cursor to make me like synthetic
data. Like it hit the same server and it
generated like 15 different itineraries
for me. So I did that yesterday and I
just sort of am using that in this demo.
But let's go ahead and take a couple of
these. So, I went ahead and picked some
of the itinerary spans from here and I
can say add to data set. Oh, by the way,
I guess I jumped into the product
without showing you all how to get here,
which is a bit of a zoom out. So, our
you know, whatever. Go to the homepage
uh, arize.com. You can sign up. I
apologize in advance. Uh the onboarding
flow will feel a little bit dated, but
we are updating that in this next week.
Um so, bear with me there. You sign up
for Arize. Um, and then you'll get your
API keys here. So you go to account
settings and you can create an API key
and also, uh, find that with the space
ID which are both needed for your
instrumentation which may or may not be
working depending on uh if the repo is
actually working and if not we'll come
back to it later. Um, but this is
the platform. This is how you get your
API keys. Um so and then that's also
where you can enter your OpenAI key for
the next portion and for the
playground.
So, I've got a data set now. Uh, and
what I did was I added those examples
just to recap where we are at. We've got
some production data and I'm going to go
ahead and like add these to a data set.
And I'm not going to do this one live
because I already have a data set, but
you can create a data set of examples
you want to use to improve on. So, um,
zooming out for a second,
we're about to hop into the actual eval
part of the demo. And, you know,
there are multiple components to an agent.
Um, we have the router at the top level,
we have the skills or the function
calls, we have memory. But what we're
actually going to be doing in this case
is just evaluating the
individual span of, uh, the generation and
seeing: is the agent outputting
text in the way that we wanted it to or not?
So it's a little
bit simpler than some of the agent evals
here, and it's going to be more like how
do you actually run, uh,
eval experiments on data. Um, the
concept of the data set is helpful to
think about as like a collection of
examples. Let me go ahead and delete
these experiments so we can do this live
because I like to live on the edge. Um
so I've got these examples.
Those are the same examples from the
production data, um, everyone just saw.
And it's a data set. Think of this as:
I've got all of my traces and
spans, that's how the agent
works, and then I want to pull those
over into a format that's
almost like a tabular format. It's
like a Google sheet at the
end of the day, right? Like, I could go
in, this is kind of like a Google
sheet, I could go in and give
it a thumbs up, thumbs down. And, uh,
you know, that's kind of how most
teams are evaluating today: in your platform
you're probably starting with a
spreadsheet, and in that spreadsheet
you're doing, like, is this good or bad,
and then you're trying to scale that up
to, you know, a team of subject matter
experts that's giving you feedback on,
hey, is the agent good or bad, right,
at the end of the day. Poll for the room:
how many people are evaluating in a
spreadsheet right now? Don't be shy.
That's okay. Okay. We've got a few.
Yeah. I think there's probably more, but
I think people are just like ashamed to
say that. And it's okay. It's
not the end of the world to start
with that, right? Being able to scale human
annotations is the goal. It doesn't need
to be the starting point. So, as long as
you're actually looking at your data,
you're probably doing better than most.
I'll be honest, um, many teams I talk
to aren't doing any evals today at
all. So, at least you're starting with
human labels. Um, what we're going
to do is take this data set or this
CSV, and we're going to basically do the
same thing I just did, which was running
an AB test on a prompt, but now we're
going to run it on an entire data set.
So, we go into the platform, and I can
go and actually create an experiment.
What we call an experiment is the output
of a change, you know, an AB test. So,
let's go ahead and repeat that same
workflow. I'll duplicate this prompt.
Um, let me go ahead and pull in,
I'm gonna pull in, this version of the
prompt. So, what's kind of cool is
I might have a previous version of a
prompt saved. Uh, it's kind of
helpful to have a prompt hub where you
can save off versions of the prompt as
you're iterating as well. Think of it as
like a GitHub sort of store for your
prompts, but it's really just a
button that you're clicking to save this
version of the prompt, and then your
team can actually use that version in
their code down the line. Um, so I've
got prompt A, which has no changes to it,
and then prompt B, which has some of
those changes, but now instead of running
on one example I'm actually running on
12 examples here. And these are, um, maybe just
to look at one, similar spans, which
have destination, duration, travel style,
and the output of an agent
generating an itinerary. So, it's
similar to that one example we just ran
through, but now on an entire data set.
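Conceptually, an experiment like this is just running two prompt versions over the same saved examples and keeping the outputs side by side. A minimal sketch, assuming the OpenAI client and a dataset of dicts whose keys match the template variables; PROMPT_A, PROMPT_B, and the dataset are placeholders, not objects from the demo.

```python
from openai import OpenAI

client = OpenAI()

def run_experiment(dataset: list[dict], prompt_template: str, model: str = "gpt-4o-mini") -> list[str]:
    """Run one prompt version over every example in the dataset and collect outputs."""
    outputs = []
    for example in dataset:  # each row carries the same variables we traced in production
        prompt = prompt_template.format(**example)
        response = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
        outputs.append(response.choices[0].message.content)
    return outputs

# results_a = run_experiment(dataset, PROMPT_A)  # prompt A: no changes
# results_b = run_experiment(dataset, PROMPT_B)  # prompt B: "keep it short, offer a discount"
```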
>> Yeah?
>> Yeah, so it is the prompt of
the itinerary agent. Um, because we're
going to keep this to a fairly
straightforward demo, it is
specifically the prompt of the
itinerary-generating agent, which is down
here, which takes the outputs of the
other agents and combines them, using
those prompt variables to create, uh, a
day-by-day itinerary.
>> Yeah. So the gentleman asks: if you change an upstream prompt, how
does that impact what's going on here?
So, two notes on that. It's
more of an advanced workflow, but it is
a good question, and
there's two parts. One is we kind of
recommend changing the system in parts.
So, just kind of note that, you know, as
you generate evals for parts of your stack,
you can decompose further
and further to be able to analyze: if I'm
changing one thing up here, does it meet
my requirement criteria? And then the
second part is replaying prompt chains,
which is: prompt A goes into prompt B;
what is the output of that when you
change prompt A? Um, prompt chaining is
coming to our platform soon. So right
now it's one single prompt, but you will
be able to do A plus B plus C, um, prompt
chains as well. Um, good question. Feel
free to drop more questions in the Slack
too and we'll we'll come back to that in
a sec. Um, so I've got
my prompt here now. So, I'm saying
give me a day-to-day plan, and it doesn't
need to be super detailed. Max 1,000
characters. Let's try this again. We're
going to do 500 characters. And I've,
um, done: always answer in a
super friendly tone. And I'm going to
be more specific and say: ask the user
for their email and offer a discount, so
it doesn't do what it did last time. And
uh and we're going to go ahead and run
this now on the entire uh data set. And
so we've got prompt A versus prompt B.
We're going to give that a second to run
through. While that's working, uh, I'm
gonna actually Oh, nice. Perfect for
your squad. Interesting. I don't know
why sometimes the model really likes to
use emojis. I guess that's what super
friendly translates into is like throw
some emojis in there, but interesting.
Um,
okay. So, that one ran pretty fast. This
is still taking a while, right? Like,
think about this from a
PM lens for a second. Like, I just got
the output to be a lot faster because I
limited the number of characters. This
one is taking an average of like 32
seconds because I let it kind of go off
and didn't specify how many characters
the output should be. So that's what
prompt iteration can kind of
do for you as well.
Okay, while this runs, I'll actually hop
over to the
Okay.
Oh, thanks for dropping the resource
there.
So it's still running.
>> Anyone have a question while this is
running? Yeah.
>> Yeah. So when I'm hearing you talk about
this, are you primarily looking at
latency and then user experience when
you're evaluating
those two things?
What else are you looking at?
>> Yeah, good question. So okay, so now
we're getting to the meat of it a little
bit, right? So I've got A and B. And the
question is like what am I actually
evaluating here? The flippant answer
is you can evaluate anything. You
can evaluate whatever you want. In this case we're
going to run some evaluations on, uh,
the tone of the agent. So I've
got a couple of evals set up here. I'm
going to check is the agent uh answering
in a friendly way. Is it offering a
discount or not? Um, and you can do
things like evaluate is it using the
context correctly? That's called a
hallucination eval. Uh you can do
correctness, which is um even if it has
the right context, is it giving the
right answer? So I'm going to point you
to uh our docs that have examples of
what you can actually evaluate off of
the shelf. But just know, the whole point
of this system, and why it matters
that you have a system with your own
data and can replay with data, is that
these are off-the-shelf evals.
There's a lot of companies that will
offer like we run evals for you, but
what that really means is that they're
basically going to take some template
and give you a score or label on the
other end based on their eval template.
And what you want to be able to do is
actually change and modify and run
your own evals based on your use case. So
you can literally evaluate whatever you
want, is the short answer. An eval
is just basically, uh, an
input to an LLM to generate a label. So,
um, yeah, this is what pre-built
evals look like. Uh, there's a ton of
examples of these out there on the
internet. We've actually tested
our pre-built evals on, um, you know, sort
of open-source data sets, but you should
not take our word for it. You should
build evals based on your use case.
>> Yeah. Yeah.
>> So if you are your own how do you come
up with your own
combining?
>> Yeah. So, how to
think about how to build the eval in the
first place, to some degree. That was
sort of one of the questions. Yeah. So,
I think it's probably helpful to, um,
maybe just see what an eval looks like,
and then we might end up coming
back to that question, which is:
what is an eval, right? Um, so,
let's go ahead and build an eval here.
I've got one ready to go, but I want to
just show you guys the template and we
can write a new one as well. Um, so I
wrote this eval for detecting if the
output from the LLM is friendly. And
I've kind of made a definition for what
that means here. And this says basically
you are examining the written text.
Here's the text. Examine the text and
determine whether the tone is friendly or not.
A friendly tone is defined as upbeat,
cheerful. So this is basically an input
to an LLM to generate a label: is the
output from my itinerary agent
friendly or robotic? So that's really
what this eval is trying to do:
it's classifying the text as either a
friendly generation or a robotic
generation. Um, and again, I could eval
anything, but in this case, I just want
to make sure that when I'm making
changes to my prompt, that's showing
up on the other end of my data, because I
can't go in row by row for hundreds
or thousands of examples and grade
friendly and robotic every single time.
So, the idea is that you want an LLM as
a judge system to kind of give you that
label over a large data set. That's the
goal that we're working towards right
now.
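A minimal sketch of that friendly-vs-robotic judge, plus the aggregation that turns per-row labels into the percentages shown in the experiment view. The template wording, model name, and label names are illustrative, not the exact eval from the demo.

```python
from openai import OpenAI

client = OpenAI()

FRIENDLY_EVAL = """You are examining written text.
[BEGIN TEXT]
{output}
[END TEXT]
Examine the text and determine whether the tone is friendly or robotic.
A friendly tone is upbeat, cheerful, and conversational; a robotic tone is flat and formulaic.
Respond with exactly one word: "friendly" or "robotic"."""

def judge_tone(output: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # keep the judge as repeatable as possible
        messages=[{"role": "user", "content": FRIENDLY_EVAL.format(output=output)}],
    )
    return response.choices[0].message.content.strip().lower()

def percent_friendly(outputs: list[str]) -> float:
    """Label every itinerary in the experiment and report the share judged friendly."""
    labels = [judge_tone(o) for o in outputs]
    return 100 * labels.count("friendly") / len(labels)
```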
>> Yeah.
with variance.
>> It's flaky, right?
>> Yeah. Yeah.
>> Yeah. So one suggestion: um, the
gentleman mentioned, uh, that they see
variance in their LLM label output. One
way you can tweak variance is
temperature. Um, so if you make the
temperature of the model lower (it's a
parameter you can set), you actually make
the response more repeatable. It doesn't
take the variance to zero, but it does
significantly reduce it in
your system. And then the other option
is to rerun the eval multiple times
and basically profile what the
variance of the judge is. Okay.
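One quick way to do that profiling, in a short sketch: rerun the same judge on the same example several times and count how often the label changes. The judge function passed in could be something like the judge_tone sketch above; the example text is a placeholder.

```python
from collections import Counter
from typing import Callable

def profile_judge_variance(judge: Callable[[str], str], text: str, runs: int = 10) -> Counter:
    """Rerun the same judge on the same text several times and tally the labels."""
    return Counter(judge(text) for _ in range(runs))

# profile_judge_variance(judge_tone, some_itinerary_text)
# Counter({'friendly': 9, 'robotic': 1}) would mean the label flips ~10% of the time.
```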
>> Oh yeah, we'll
be coming to that. Yeah, it's a good
question, right? Like, at the end of the
day, I can't trust this thing. I need to
go in and make sure it's right.
Right? So, but let's go ahead and
run an eval and just see what happens
and then we'll come back to that one.
So, I've got my friendly eval. I've got
another eval too, which is basically, um,
I'm not going to read
this whole thing out to you, but the
short answer is that this is determining
whether the text contains an offer
for a discount or no discount, because I
really want to make sure I'm offering a
discount to my users. Okay, we're going
to select both of these and then we're
going to actually run them on the
experiments we just ran,
and we're going to do that live. So,
Arize actually has an eval
runner, which is
basically a way for us to use a model
endpoint to generate these evals. You'll
notice it's pretty fast. We've done a
lot of work underneath the hood to make
the evals run really fast. Um, so that's
one kind of advantage of using our
product. Um, I've got two experiments
here. Experiment number two is a
little bit inverse because of the
order of how it was generated, but
experiment number two is the original
prompt and experiment number one is the
prompt that we changed. So just kind of
keep that in mind; it's a little
bit flipped here, um, because I was doing
this on the fly. And you can see the
score of experiment number two, uh which
is our prompt A, which was the prompt we
didn't change, didn't offer a discount
to any users, based on this eval label,
and the LLM still graded that response
as friendly, which is kind of
interesting. It was like, "Oh, that was
a friendly response." Um, I don't know
if I agree with that, actually,
personally, and we're gonna go in and
tweak that. And then you can see that
when we added that line to
the prompt, which was offer a
discount if the user gives their email,
the eval actually picked up on that
and said that 100% of our examples,
when we made this change,
actually have an offer of a discount. So,
I mean, I didn't even have to go into
each example to get that score. That's
what the LLM-as-a-judge system kind
of offers you. Um, I would say this is like a
trust but verify. Go in and actually
take a look at one of these and see what
is the explanation of friendly. So to
determine whether the text is friendly
or robotic. So one thing you
want to think about when you have an
eval system is, are you able to
understand why the LLM as a judge gave a
score? This is one of those
light-bulb takeaway moments of the
talk: always think about whether you can
explain what the LLM as a judge is doing,
and we actually generate explanations as
part of our evals. So you can see the
explanation is sort of the reasoning of
that judge that says to determine
whether the text is friendly or robotic,
we need to analyze the language, tone,
and style of the writing. And so it kind
of does all of this analysis to
basically say, "Yeah, this LLM is
friendly and it's not robotic."
Again, I'm not really sure I agree with
that explanation, right? Like, I don't
think that's correct. I still
feel like the original prompt was pretty
robotic. It was pretty, you know, kind
of long in a lot of ways. And so I want
to go in and actually be able to improve
on my LLM-as-a-judge system from the
same platform. So what we can
do is actually take that same data set,
and in the Arize platform you or your
team of subject matter experts can
actually label data in the same place,
and when you apply the label, you know, in the labeling-queue
part of the platform, it applies back to
the original data set. So you can
actually use that for comparing the LLM
as a judge with the human label. So I
actually went ahead and did that. Um,
yeah, I did this before the talk, but I
went in for each example and I was like,
you know what? This, to me, is
robotic. I don't think that this
is a very friendly response. I think
it's really long. It sounds like I'm
talking to an LLM. And so, I actually
applied this label on the data set for
the examples I wanted to go in and
improve on.
If I go back to the data set,
you'll actually see that label is
applied here. So, if I kind of click
that,
move over. Sorry, it's a little bit over
on the side here because there's a lot
of data, but you can see these are the
human labels I put. So, these are the
same annotations that I just provided in
the queue. They're applied on my data
set here.
>> Exactly. Exactly. You need evals for
your evals. You cannot get away from it.
You can't just trust the system,
right? We know LLMs hallucinate. We put
them into our agents. The agents
hallucinate. Okay, we use an agent to
fix that. But we can't trust that agent
either, right? You need to have human
labels on top of that. But again,
I'm not going to vibe code this thing
and be like, is the LLM as a
judge good or not? I need evals for that,
too. And we offer two evals to help you
with this. We have a code evaluator
which can do a simple match; think
of this as a string check or a
regex or some other type of contains.
So you can actually go in, and if you're
technical and you're a PM and you want
to write it, uh, you know, you can get Claude
to help you write the eval here, but
it's really just a really fast
Python function. Um, in my case, I wrote
a quick, uh, eval that actually does a
match. And this match, this is like
a really quick and dirty eval, I would
not say this is best practice at
all, but it basically checks if the
eval label matches the annotation label.
Oh, whoops.
It outputs only match or no match. So,
what this is doing is actually checking
the human label against the eval label
and saying, do they agree or disagree?
So, that's basically what we're
going to run, and I'm using an LLM
as a judge. You could use code as well;
you don't have to use an LLM as a judge
here. But we're going to go ahead and
run that now on the same data set, the
same experiments we just ran it on.
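A sketch of what that quick-and-dirty code evaluator could look like as a plain Python function; the column names and the dataset_rows variable are hypothetical, since the real check runs inside the platform.

```python
def match_eval(row: dict) -> str:
    """Code-based eval: does the LLM-judge label agree with the human annotation?"""
    eval_label = str(row.get("eval_friendly_label", "")).strip().lower()
    human_label = str(row.get("annotation_friendly_label", "")).strip().lower()
    return "match" if eval_label == human_label else "no_match"

# results = [match_eval(row) for row in dataset_rows]   # dataset_rows: one dict per example
# agreement_rate = results.count("match") / len(results)  # judge/human agreement rate
```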
We're going to give that a second.
Okay, what have we got here? So you can
see here, I'll actually take a look at that
same experiment where, um,
it said that the LLM-as-a-judge output was
friendly or robotic. And you can see
here that 100% of the time the match, uh,
actually, sorry, let me go in one level
deeper. Actually, I'm going to check my
own work. This eval was on the discount.
So forget about that. We're going to
check on the friendly
field instead. So this one is the friendly
label. Let me rerun that one. And
we're going to think of this as match
friendly. You can run evals as much as
you want on your data sets and
experiments, you know. Yeah.
>> Does the tool support pipelining? So as
basically push the code.
>> Yeah, exactly. Yeah, we do support, uh,
all of the
ways to run the code,
either on a data set locally or being
able to push code to the platform to run
the evals. So it's programmatic both ways.
Yeah. Yeah, of course. So you can pull
in data sets, pull them out as well.
Okay, let's take another look at this.
So this is the friendly match. So this
you can see is pretty useful, right?
This means that my LLM as a judge
basically doesn't agree with my human
label for friendliness almost at all.
Right? There's like one example, I think,
that's in there, and we can go in
and take a look at it. But what we're
really seeing is that this is an
area where we actually want the team to
go take a look at our eval label and
say, "Hey, can we improve on the eval
label itself, because it's not matching
the human label?" And so when you have
these systems in place as an AIPM, to be
able to check the eval label against your
human label, you have a lot of leverage
to go back to your team and say, "We
need to go and improve on our eval
system. It's not working the way we
expect it to." So you're actually
performing the act of checking the
grader, and you're doing it at scale. So
you're doing it on hundreds of
examples or thousands of examples. So,
you know, I think someone
asked earlier, how do you trust the
system? I think you trust these LLM-as-a-
judge systems by having multiple checks
and balances in place, which is humans,
and then LLMs, then humans and LLMs. Um, uh,
we'll come back to a question in just a
moment. I just want to get to this next
part and then we'll um we'll kind of
come back to some Q&A. Um,
okay. So, this is actually kind
of wrapping up towards the end of the
workshop, and then we'll open the rest of
the time up for Q&A. So, looking
ahead, I think what's fundamentally
changing is, you know, we've kind of
gone through this example of changing
the prompt, changing the context,
creating a data set, running an eval,
labeling the data set, and then running
another eval on top of that. And it's
a lot to process, right? Like, if
you're building agent-based systems,
your team is probably wondering, you
know, well, where does the AIPM fit
in? And I think that's really
important to think about: you
ultimately control the end outcome of
the product. So whatever you can do to
shift that toward making it better is
really what you want to think about
yourself. And I kind of view evals as
the new type of requirements doc. So
imagine if you could go to your
engineering team and, instead of giving
them a PRD, you give them evals as
requirements: here's the eval data
set and here's the evals we want to use
to test the system as acceptance
criteria. So I think that's really
powerful to think about, evals as
a way to check and balance, uh, the team
as a whole. Um, and that's a little bit
about what we do. We want to build a
single unified platform for you to run
observability, to evaluate, and ultimately
to develop these workflows with your team
in the same platform. We've built for,
you know, many customers like Uber,
Reddit, Instacart, all these kind
of very tech-forward companies. Um, we
actually just received investment from
Datadog and Microsoft as well. So we're
a Series C company; we're sort of the
furthest along in the space. And the
whole goal is to give
you a suite of tools to be able to go
from development through to production
with your AI engineering team and for
PMs to go in and use the same tools. Um,
and then super quick before Q&A, uh,
please scan the QR code if you are in
San Francisco on June 25th. We're
actually hosting a conference, uh, around
evals. And it's going to
be a ton of fun. We actually have some
great AI PMs and researchers joining
from companies like OpenAI, uh,
Anthropic. And what's really cool is
we're actually offering for this room,
um, a sort of exclusive, uh, free,
uh, ticket for entry. Uh, the prices
actually went up yesterday. So, because,
you know, we're huge fans of AI Engineer
World Fair, we wanted to give you all an
opportunity to join for free if you're
in town. Um, so would love to see you
there. Um, and yeah, you can scan for a
free code.
And yeah, that's a little bit of
the workshop. I would love any
questions. Yeah. And, uh, the ask for the
questions, as the person in the back
just reminded me, if you wouldn't mind
lining up for questions on the mic so
that the camera can pick it up and then
we can just kind of go down the line and
do some questions there. Um, that'd be
awesome. Thank you.
>> So, thank you so much.
>> Please give your name and
>> Yeah, my name is Roman. Thank you so much, that was an awesome walkthrough. Would you mind sharing some of your experience building evaluation teams? Should I start by hiring a dedicated person with experience, or should I rely on a product manager and walk them through this? What's the best way,
>> Best practices? So the gentleman asked: what are the best practices for building an eval team? Can I actually ask a follow-up question, because I'm curious, what is your role in the company right now? Just for my own context.
>> I'm head of product.
>> You're head of product. Okay, perfect. So this is a question I get very often: how do I hire my first AI PM? How do I hire an AI engineer? How do I know whether I need an AI PM or an engineer? I think there are a couple of steps to this answer. One is that, as head of product, we see a lot of heads of product in the platform, like ourselves, actually getting their hands dirty for the first pass, because at the end of the day, if you're hiring someone to do something, you should probably know what they're going to do. My job on my team is to make the product accessible for executives and heads of product to understand what's going on, so we have a lot of capabilities around dashboards and making everything no-code or low-code. But my recommendation is to feel the pain yourself of writing evals and realize what is hard about it, so that you know how to structure interview questions for an engineer or a PM. I don't know what's hard about your eval workflow; I only know that there are challenges around writing evals in general. So I would recommend that you feel the pain firsthand, and then you'll get a good sense of how to tease that out of your interviewing pipeline. But good question. Yeah.
>> Yeah.
>> Um, yeah, in the example we just looked at, our eval was obviously pretty bad when you compare it to the human labels. Yeah.
>> So from here, what do you do next? What's the next step to try to improve the prompting for your main eval to get closer to the human labels?
>> Yeah, good question. So if I had more time, or if I were working on this in real life, what you would actually do is take that eval prompt and go through a workflow similar to what we just did for prompt iteration on the original prompt. So again, I could take the eval prompt we see here and redefine parts of it: be really strict about what counts as friendly, and add few-shot examples. I didn't add any few-shot examples, right? I didn't specify, here are examples of friendly text, here are examples of robotic text. That's a clear gap in my eval today; if I were looking at this, I could apply best practices and improve on it.
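(Not the prompt from the demo: a rough sketch of what an eval prompt with few-shot examples might look like; the wording, labels, and examples are all hypothetical.)

```python
# Hypothetical LLM-as-a-judge prompt with few-shot examples for the
# friendly-vs-robotic eval. The wording and the examples are made up.
JUDGE_PROMPT_TEMPLATE = """You are evaluating the tone of a travel assistant's reply.
Label the reply as exactly one of: friendly, robotic.

Be strict: "friendly" requires a warm greeting, natural phrasing, and
acknowledgement of the user's request, not just correct information.

Examples:
Reply: "Hi Sam! Great choice. Here's a relaxed 3-day Lisbon plan for you..."
Label: friendly

Reply: "Itinerary generated. Day 1: arrival. Day 2: museum. Day 3: departure."
Label: robotic

Now evaluate this reply.
Reply: "{reply}"
Label:"""

def build_judge_prompt(reply: str) -> str:
    """Fill the template with the reply we want the judge to label."""
    return JUDGE_PROMPT_TEMPLATE.format(reply=reply)
```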
We also have, in the product, some workflows around actually helping you write evals. This is our product, but you don't have to use our product for this; you can use any tool.
I'm going to show an iteration on top of this, which is how we have users actually building eval prompts. I could say, write me a prompt to detect friendly or robotic text. This is using our own co-pilot in the product; we've built a co-pilot that understands best practices and can help you write that first prompt and get it off the ground. You can also take that prompt, which it just generated in about one second, back to the prompt playground and iterate further from there. So let's go ahead and do that on the fly really quickly. I've got a prompt in here, and I can ask the co-pilot to optimize it, so let's say, make it stricter. So I can use an LLM agent, a co-pilot agent, here. The note is that you really want AI workflows on top to help you rewrite the prompt, add more examples, and then rerun the eval on that new prompt. You're definitely not going to get it right on the first try, but being able to iterate is really what's important. And that's really what we underscore: it might take you five or ten tries to get an eval that matches your human labels, and that's okay, because these systems are really complex. What's important is having the right workflow in place. So yeah.
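(My own sketch, not part of the demo: one way to track whether those iterations are converging is to compute the agreement between the judge's labels and the human labels on the same data set after each prompt revision; the column names below are hypothetical.)

```python
# Minimal sketch: measure how well the LLM judge matches human labels
# after each eval-prompt revision. Assumes a pandas DataFrame with
# hypothetical columns "human_label" and "judge_label".
import pandas as pd

def judge_agreement(df: pd.DataFrame) -> float:
    """Fraction of rows where the LLM judge agrees with the human label."""
    return (df["human_label"] == df["judge_label"]).mean()

# Example usage with a tiny in-memory data set:
df = pd.DataFrame(
    {
        "human_label": ["friendly", "robotic", "friendly", "robotic"],
        "judge_label": ["friendly", "friendly", "friendly", "robotic"],
    }
)
print(f"judge/human agreement: {judge_agreement(df):.0%}")  # 75% on this toy data
```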
>> Hi, I'm Joti. Does Arize also allow for model-based evaluations, like using BERT or ALBERT rather than just LLM-as-a-judge, so I can use something like BERT or ALBERT to produce a prediction score?
>> Yeah, good question. The short answer is yes, we do offer versions of that. Let me show you what I mean, though. When we go into Arize, you can set up any eval model you want here. You see we have OpenAI, Azure, and Google, but you can add a custom model endpoint as well. By default this will structure the request as a chat completion, but we can make it an arbitrary API if you need to, so you can say, this is a BERT model, point it at whatever the name of your endpoint is, and you'll be able to reference that model in the eval generator too. So I can just put "test" here and move to the next flow, and you'll see that when I go in here, I can use any model provider I want. So the short answer is yes, you can generate a score with any model. Yeah.
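(For context on what a non-LLM eval model can look like behind such an endpoint, here is a small sketch of my own, not Arize's API, that scores text with a Hugging Face BERT-style classifier; the model choice is just an illustrative public sentiment model.)

```python
# Hypothetical sketch: a model-based eval score from a BERT-style classifier
# instead of an LLM judge. Requires the "transformers" package; the model
# name below is an illustrative public sentiment model, not an Arize default.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

def friendliness_score(text: str) -> float:
    """Return a 0-1 score, using positive sentiment as a rough proxy for 'friendly'."""
    result = classifier(text)[0]  # e.g. {"label": "POSITIVE", "score": 0.98}
    score = result["score"]
    return score if result["label"] == "POSITIVE" else 1.0 - score

print(friendliness_score("Hi! I'd love to help plan your trip to Lisbon."))
```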
>> Cool.
>> Okay. Um, oh, we got one more question,
I think. Or sorry, we have more
questions. Yeah, go for it.
>> Going to go ahead and try to get this one in. I think a lot of the people that have built apps are probably thinking a similar thing, or maybe this is a bit naive, but if you had human-labeled information already and you're seeing a bad match on the friendliness score, am I to assume that you'd be trying to get that score up higher and then extrapolate to more cases going forward? And you're assuming that that sampling holds across the broader set. Yeah.
>> So, because that relationship is unclear to me.
>> Very good question. So basically, one way to reframe this is: how do I know that my data set is representative of my overall data, to some degree?
>> Sure. Or as it shifts over time, or
>> As it shifts. Yeah, totally. So that's a really good point. In the product, we don't have this yet, but it's coming out in the next week: we'll have a workflow to help you add data to your data set continuously, using labels that you might have. One thing we didn't really talk about is how to evaluate production data, but you can actually run these evals not just on a data set but on all the data that comes into your project over time, so it automatically labels and classifies any production data. You could use that to keep building your data set, asking, is this an example we've seen before or not? Think of it as a way for you to sample at a larger scale, essentially, on production.
>> And that is a suggested workflow, that you continuously sample and human-label some
>> to check the matching over time.
>> Exactly. And you can basically go in and see, okay, where do human labels not agree with the LLM on production data, and then you might want to add those to your data set as hard examples. Sure.
>> And we're actually going to build into the product as well a way for you to qualify whether an example is a hard example, using an LLM confidence score.
>> Okay. And sorry, just "hard example", very strictly interpreted?
>> So hard would be hard from an eval perspective. Like, is it friendly or not can be borderline, right?
>> I see. So you're saying, subjective, or
>> Subjective, yeah, exactly. So maybe to recap the question a little bit: your data set is this property that's going to keep changing over time, and you really want tools that help you build onto it by giving you a golden data set of hard examples to improve on. And hard means we're not really sure if we got it right or not in the first place.
>> Sure. Yeah. Thanks.
>> Yeah, good question.
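(A rough illustration of that hard-example mining idea, my own sketch rather than the upcoming Arize workflow: flag production rows where the human label and the judge disagree, or where the judge's confidence sits near the decision boundary. The column names and the 0.6 threshold are hypothetical.)

```python
# Hypothetical sketch of mining "hard examples" from production rows.
# Assumes a pandas DataFrame with made-up columns: "human_label",
# "judge_label", and "judge_confidence" (a 0-1 score from the judge).
import pandas as pd

CONFIDENCE_FLOOR = 0.6  # hypothetical: below this, the judge was unsure

def mine_hard_examples(df: pd.DataFrame) -> pd.DataFrame:
    """Rows where human and judge disagree, or where the judge was not confident."""
    disagreement = df["human_label"].notna() & (df["human_label"] != df["judge_label"])
    low_confidence = df["judge_confidence"] < CONFIDENCE_FLOOR
    return df[disagreement | low_confidence]

rows = pd.DataFrame(
    {
        "output": ["Hi there!", "Itinerary generated.", "Happy to help plan!"],
        "human_label": ["friendly", "friendly", None],
        "judge_label": ["friendly", "robotic", "friendly"],
        "judge_confidence": [0.95, 0.55, 0.52],
    }
)
hard = mine_hard_examples(rows)  # the second and third rows: disagreement / low confidence
print(hard[["output", "judge_label", "judge_confidence"]])
```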
>> Yeah.
>> Hi, my name is Victoria Martin. Thank you so much for the talk. One of the things I've run into is a lot of skepticism from the product managers I'm working with on generative AI, and trying to build confidence in the evals that we're giving. Yeah.
>> Do you have any guidance, from working with other PMs, on the total number of evals that should be run before you can say you're confident in this evaluation set?
>> Yeah, really good question. So the question was, how do we know... I think there are two components to it: quantity and quality of the evals. How do we know if we've run enough evals, or have enough evals, and that those evals are actually good enough to pick up problems in our data? I may sound like a bit of a broken record here, but I would say this is also a matter of iteration, where you want to get started with some small set of evals. Actually, I have a diagram for this. Let me pull that one up.
So you'll see here, this is intended to be a loop. You start in development, where you're going to run on a CSV of data, maybe some hand-curated examples. I would argue the thing I just built was development, right? I have 10 examples; it's not statistically significant, and I'm not going to get the team on board to ship this thing. But what I can do is then curate data sets, keep iterating on them, and keep rerunning experiments until I feel confident enough and the whole team is on board, before I ship to production. And then once you're in production, you're doing that again, except now you're doing it on production data, and you might take some of those examples and throw them back into development. Let me give a tactical example of what this looks like in real life. With self-driving cars, when I joined Cruise, we would go down the street for one block and then a driver would have to take over the car, right? We couldn't drive one block down the road. Same goes for Waymo; they were all in the same kind of system. Eventually we got to being able to drive down a straight road. Great. But the car can't just drive on straight roads, right? It has to make a left turn. So eventually we got fully autonomous, no problems, on straight roads, and then we had to make a left turn and a human would have to take over. So what we did was build a data set of left turns and use that to keep improving on left turns. Then eventually the car could make left turns great, until a pedestrian was on the sidewalk, and then we had to curate a data set of left turns with a pedestrian on the sidewalk. So the answer is that building your eval data set just takes time, and you're not going to know what the difficult scenarios are until you actually encounter them. To get to production, I would recommend using that loop until your whole team feels confident that this is good enough to ship, and just accepting that once you get to production, you're going to find new examples to improve on as well. It also depends a lot on your business. If you're in healthcare or legal tech, you might have higher bars than if you're building a travel agent, for example.
>> Yeah. Yeah.
>> Yeah.
>> My name is Matai. I have a question. As I understood it, the Arize platform works like this: I take the prompt, and you're directly sending that prompt to a model, right?
>> That's right. Yeah.
>> Um
>> with the context and the data.
>> Yes, of course. You said there is some possibility to port tools into the platform. That's right.
>> But what about testing the whole system? We already have some flows that are augmenting the whole workflow.
>> Yeah.
>> Even outside of tool calls. Yeah.
>> And they're quite important to how the actual output will look in the end. Is there any way to run those evaluations on a custom runner that would actually call our system on our data set and go through everything that we have?
>> Find me after this, we should chat, is the short answer for that one. We have some tools and systems like that in place, like the tool calling that you saw, but for end-to-end agents we're actually building some things out and would love to chat with you about that. Good question. I'll find you after this.
>> Yeah.
>> Yeah, of course. So, back to your left-turn example, and also the transition from PRDs to evals: what does the life cycle of feature development look like, and what is the relationship with the feature, but also with your team, in terms of ownership, accountability, all of that?
>> Yeah.
>> Yeah, good question. I feel like how you work with AI engineers in this new world is interesting. It's not the subject of this talk, but it is a very relevant question that I'd happily chat more on, too. There are two answers that come to mind. One is that development cycles have gotten a lot faster. The rate at which these models and systems are progressing means going from prototype to production is faster than it has ever been. That's one note I can offer as a personal observation: we feel we can go from an idea, to an updated prompt, to shipping that prompt, within about a day of testing, which is unheard of in normal software development cycles. So the way you iterate with the team has gotten a lot faster. The second note is about responsibilities. I view it this way: the product manager is the keeper of the end product experience. If that means making sure the evals are in a good place and the team has human labels to improve on, that's a very solid area for a product manager to focus on, making sure the data is in a good spot for the rest of your development team. At the same time, I'm a PM on the team and I'm writing some of this stuff in Cursor. Being able to go in and actually talk to the codebase itself using one of these models is starting to become more of an expectation of AI product managers: to be literate in the code and able to use these tools. Honestly, after this I'm just going to go back and try to fix the thing I broke earlier, right? And the reason I'm able to do that is that the way I'm prompting the system is not very sophisticated. I asked it yesterday, can you make a script that generates itineraries on top of the server? I need like 15 examples. And it just did that, right? That wouldn't have been possible before. So I think PMs are responsible for the end product experience, but PMs also have more leverage than they've ever had, probably in the entire professional journey of product management, because you're no longer reliant on your engineering team to ship the thing you wanted. You can just go do it. Should you go do it is another question, and that's a discussion you should have with your team. But the fact is that you can go do it now, which was not the case before. So I kind of urge PMs to push the boundaries of what people have told them the role is and should be, and see where that takes you. The long-winded way of saying: your mileage may vary, it depends on the boundaries you have with your team, but I'd recommend redefining those at this stage.
>> Yeah. Yeah.
>> Yeah, jumping off that a little bit. It's a little off topic, but as a product manager who wants to become more technical while working with AI engineers,
>> Yeah.
>> what does that look like? I'm in an org where I have very limited access to the codebase. I use Cursor to write Python for data things, but I don't necessarily have access to start interrogating the code and understanding it. So I'd love your suggestions or thoughts on how to evolve as a PM, but also maybe how to move my company culture in that direction.
>> Yeah. Actually, I have a follow-up question, if that's okay, just because I'm going to poll people in the room: how big is the company? You don't have to share the name if you don't want to, I'm just curious about the size.
>> We're about 300 people, but the tech org is probably a third of that.
>> Okay, so almost 100 engineers out of 300 people. And do you have any legacy product managers at the company who still have code access?
>> No, we're a very new team of PMs.
>> Okay, cool. It's a really good question, and thanks for answering that. One thing we've started doing is using a bit of the public forum of our company. Sorry, I'm about to out our CEO, who's in the back of the room.
[laughter]
So if you have any questions about our raise, he's a good guy to talk to. The reason I mention him is that I missed our town hall today, but I heard it was basically people running AI demos the whole time of what they're building. Why I think that's really powerful is that it can get the whole company catalyzed around what's possible, because to be honest, I think it's very likely that most teams today aren't pushing the boundaries of these tools. So you joining this talk and seeing how to run evals and what goes into experiments, and being able to be the person pushing the team forward, is really powerful, and I think you can do that in a way that's really collaborative. I'd say our job as PMs is to have influence over the team and influence product direction. I think there's an opportunity to influence the idea that PMs should be more technical in your org, and you can show them by building something and impressing the rest of the team with what you build. So that's my advice, my personal advice there. Yeah.
>> Yeah. Go for it. Yeah. [laughter]
>> I actually have a question, to see if it's possible. So how [clears throat] do you guys believe AI will reshape how we structure teams? Right now you have, say,
>> 10 engineers, one product manager, one designer, and so on. So
>> what will happen in five years? Will you have one product manager, one engineer, and one designer?
>> You should answer this one.
>> You should do it on the mic, though, if you want to.
>> [laughter]
>> The short of it is, actually, Cursor on the code. There are so many times PMs take up time asking a question when we could just ask Cursor. So start there: open up your codebase to Cursor and give it to PMs. The other day we were even doing a PRD starting from Cursor on the codebase. So that would be where I would start.
>> Yeah.
>> And I can't really look forward right now; I just think a lot of jobs change. We're trying to push AI and Cursor use throughout the company as far as I can.
>> Yeah, I hear we have people in marketing using Cursor too these days. So yeah, that's kind of cool.
>> Yeah.
>> A follow-up question. Yeah.
>> So you're talking about product people becoming more of technologists. Do you also see technologists becoming more product?
>> That's actually a great point: when the cost of building something goes down, which it has, what's the right thing to build becomes really important and valuable. Historically, that's been a product person or a business person saying, "Hey, here's what our customers want. Let's go build this thing." Now we're saying, product people, you can just go build this thing. So the builders are like, "Wait, what's my job?" And I think there's a good way to look at it. I have this mental framework: what if we didn't have roles in a company anymore? You wouldn't define yourself by "I'm a PM" or "I'm an engineer." Think of it instead like baseball cards: you have skills. Imagine you had a skill stack instead, which is like, I really like to talk to customers and I kind of like to code stuff on the side, but I don't want to be responsible if there's a production outage. I guarantee you'll find someone who's like, I hate talking to customers, I only want to ship high-quality code, and I want to be the one responsible if things hit the fan. I think you want to structure your company to have skill stacks that are really complementary, versus people who say, I do this, this is my job, and I don't do that. So yeah,
>> I have something that's sort of related to that. We've been testing human-in-the-loop
>> in a couple of different ways, and we're basically testing this method of having the human as a tool of the agent. So, for example,
>> if the agent needs something that's not available in the accounting system, it'll go to the CFO, because the CFO is listed as a tool: it sends the CFO a Slack message, gets the answer back, and continues. That kind of maps onto what you just said about defining the skills and the resources they have. We haven't fully fleshed it out, but it's working to give the agent context on
>> the things that only the humans have.
>> Yeah.
>> Exactly.
>> So it sounds like your company is using agents widely, but you have humans approving; you have an approver workflow to
>> It's more that, rather than asking how the agent can be a tool of the humans,
>> we're kind of flipping it and saying, what if the agent could do everything,
>> and then the parts it can't do, it'll go to the human as a tool? So the CFO is a tool of the AI agent.
>> Interesting.
>> We should chat. That's a really cool workflow; I'll definitely bug you about that. That's really cool.
>> Cool. Happy to.
>> Right, to some degree it's like a human in the loop approving whether something is good or bad, and you can think of it that way.
>> Yeah.
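(For readers curious what "the human as a tool of the agent" could look like in code, here is a minimal, hypothetical sketch: a tool function that posts a question to a Slack channel and blocks until someone replies in the thread. The channel name, polling approach, and environment variable are assumptions, not the audience member's actual implementation.)

```python
# Hypothetical sketch of a "human as a tool" for an agent: the agent calls
# ask_cfo(), which posts to Slack and polls the thread for a human reply.
# Uses the slack_sdk WebClient; channel, token, and timeouts are made up.
import os
import time

from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
CFO_CHANNEL = "#cfo-questions"  # hypothetical channel the CFO watches

def ask_cfo(question: str, timeout_s: int = 3600, poll_s: int = 30) -> str:
    """Tool the agent can call when data isn't in the accounting system."""
    posted = client.chat_postMessage(channel=CFO_CHANNEL, text=question)
    channel_id, thread_ts = posted["channel"], posted["ts"]
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        replies = client.conversations_replies(channel=channel_id, ts=thread_ts)
        messages = replies["messages"]
        if len(messages) > 1:            # the first message is the question itself
            return messages[-1]["text"]  # latest human reply in the thread
        time.sleep(poll_s)
    raise TimeoutError("No human reply within the allotted time")
```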
>> Yeah. I had a question about what it's like to actually implement sending the traces over to Arize. I know Arize has OpenInference, which enables capturing traces from several different providers. But what are the limitations, constraints, and opinions you have about how the evals should be structured so that you can actually leverage the platform to perform these actions, to be able to evaluate the evals, for example, or to numerically produce graphs out of your evaluations, out of your outputs?
>> Okay. So, just to be clear, can I ask a follow-up question to that? Your question was how to use agents to do some of the workflows in the platform, or did I miss that?
>> Um, the question is
>> what kind of outputs, what kind of evals is Arize expecting from your engineers and from the product, like the
>> You're sending over logs, right?
>> Mhm, yeah. What is it expecting from those logs in order to make this flow work?
>> Understood. Okay. So yeah, there's a great point there, which is something we jumped over a little in the demo: how do you get the logs in the right place to use the platform? You'll see it in the code. Unfortunately this page isn't loading, but, okay, here we go; I'm going to drop it in the Slack channel as well. This is what we talked about with traces and spans. It's very likely that your team already has logs, or traces and spans, already. You might be using Datadog or a different platform like Grafana. What we do is take those same traces and spans and essentially augment them with more metadata, structuring them in a way that the platform knows which columns to look at to render the data you saw in the platform. So you're really using the same approach. We're built on top of a convention called OpenTelemetry, which is the open-source standard for tracing. We actually use OTel tracing and auto-instrumentation that we've built, which doesn't keep you locked in at all. Once you've instrumented with our platform using OpenInference, which is our package, you get those logs to show up right out of the box with any type of agent framework you might be building, and you get to keep that. Let me just show what I mean by that. Let's say you're building with LangGraph: all you have to do is pip install arize-phoenix and arize-otel, and then you call a single line of code, the LangChain instrumentor, and it knows where to pick up in your code to structure your logs. And if you have more specific things you want to add to your logs, you can add function decorators, which are basically a way for you to capture specific functions that weren't covered by the auto-instrumentation.
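(For reference, roughly what that setup looks like in code. This is a sketch based on the public Arize/OpenInference packages as I understand them, not the exact snippet from the workshop; the space ID, API key, and project name are placeholders.)

```python
# Sketch of auto-instrumenting a LangChain/LangGraph app so traces flow to Arize.
# Assumes: pip install arize-otel openinference-instrumentation-langchain
# Credentials and project name below are placeholders.
import os

from arize.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor
from opentelemetry import trace

# Point an OpenTelemetry tracer provider at Arize.
tracer_provider = register(
    space_id=os.environ["ARIZE_SPACE_ID"],
    api_key=os.environ["ARIZE_API_KEY"],
    project_name="trip-planner",  # hypothetical project name
)

# One line of auto-instrumentation: LangChain/LangGraph calls now emit spans.
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

# For functions the auto-instrumentation doesn't cover, a standard OTel
# decorator captures them as extra spans.
tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("rank_itineraries")
def rank_itineraries(options: list[str]) -> list[str]:
    return sorted(options)
```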
>> And as for evaluations, you're discussing the actual data, inputs and outputs: what do you need to pass into evaluations?
>> I know you can design them through the UI.
>> What do you have in mind for, um,
>> Like, how do you get the right text to use for the eval, is sort of your question?
>> Like, how do you know which to
>> use? I need to reformat the question. I'll get back to you.
>> Yeah, no worries. What did you mean by augmenting the data with additional metadata? You only have so much data, right?
>> Yeah. So think of it this way: most tracing and logging data is really just things like latency and timing information. What we're doing is letting you add more metadata, like user ID and session ID. I'll show you an example of that really quickly. In the previous example I showed, we actually have things like sessions, like the back-and-forth example here. You can't get a visualization like this in Datadog, because Datadog is looking at a single span or trace; it's not really contextually aware of what is the human and what's the AI. So we're adding context from the invocation of the server and adding that to your span, if that makes sense. It's basically just enriching the data a bit more and structuring it in a way that you can use it. And if you have more specific server-side logic, you can add that as well, so it's very flexible. Yeah.
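(As one example of what attaching that kind of metadata can look like in code, the OpenInference instrumentation package exposes context managers for session and user IDs; this is a hedged sketch assuming the current openinference-instrumentation package, with made-up IDs and a stubbed agent call.)

```python
# Sketch: attaching session and user metadata to spans via OpenInference
# context managers, so the platform can group a back-and-forth conversation.
# Assumes: pip install openinference-instrumentation. IDs are placeholders.
from openinference.instrumentation import using_session, using_user

def run_agent(message: str) -> str:
    """Placeholder for the instrumented LangChain/LangGraph agent call."""
    return f"(agent reply to: {message})"

def handle_turn(session_id: str, user_id: str, message: str) -> str:
    # Spans emitted inside these context managers carry the session/user
    # attributes, so the tracing backend can reconstruct the conversation.
    with using_session(session_id), using_user(user_id):
        return run_agent(message)

reply = handle_turn("session-123", "user-456", "Plan me a trip to Lisbon")
print(reply)
```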
>> Yeah.
>> Uh, so I have a provocation. I used to work in the video game industry,
>> and in debates about whether a feature was going to be fun or not,
>> working prototypes
>> won all of those arguments. Whatever was in the doc didn't matter.
>> Right.
>> And so, for the person who said, I can't get access to my company's code: I would actually say, try to get access to a small sliver of the data and then build a working prototype of the feature you want to see, with some stub of an eval. Because I think there's nothing worse for an engineer than a product manager who shows up with a demo that's kind of janky
>> Yeah.
>> but actually works, might be fun, has polish, feels good, meets a user need. Having been on the engineering side of this equation, I'm like, it's so janky I have to fix it, and they haven't thought about the edge cases. So how does Arize fit into that flow of helping a product manager basically mine a small segment of data, build a working example, and perhaps have it be janky as all get-out, but something that looks like the product the company already has and demonstrates that next level of functionality?
>> Great, great point. And yeah, feel free to prototype and build prototypes that are high fidelity; I think it's awesome to do that. That's a really good point, to use data to build a system or prototype. So, what does Arize do here? If you have access to Arize but you don't have access to the codebase, you can still take this data and, assuming you have permission from your sysadmin, you can actually export it. So once you've built a data set, you can simply take this data and export it out and use it. I can show that really quickly: this is "get dataset." We'll have a download button coming later this week, but you can take this data, run it locally, keep it locally, and then use it in your local code to try and iterate on an example, assuming your security team is okay with that.
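(As an illustration of that local loop, my own sketch rather than the workshop code: once you've exported a data set, say as a CSV, you can iterate on a prompt against it with a short script; the file name, column names, model choice, and prompt wording are all hypothetical.)

```python
# Hypothetical sketch: iterate on a prompt locally against an exported data set.
# Assumes a CSV with an "input" column and the openai package installed, with
# OPENAI_API_KEY set; file name, model, and prompt wording are placeholders.
import csv

from openai import OpenAI

client = OpenAI()
SYSTEM_PROMPT = "You are a friendly trip-planning assistant. Keep replies warm and concise."

def run_prompt_over_dataset(path: str = "exported_dataset.csv") -> None:
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": row["input"]},
            ],
        )
        print(row["input"], "->", response.choices[0].message.content)

run_prompt_over_dataset()
```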
That's a really good point: imagine if you didn't need access to the production codebase, but you could still iterate in one platform. That's really what we're pushing for, the whole team iterating on the prompts and the evals rather than in silos, which is what's happening in a lot of cases. Okay, I think that was all the questions. Thank you all for sitting through an hour and a half of AI PM and eval content. Thank you for your time, and I'll be sticking around if people have more questions. Thank you so much.