Build a Real-Time AI Sales Agent - Sarah Chieng & Zhenwei Gao, Cerebras_v2

Transcript

0:21

Hi everyone, we're about to start the

0:23

next session. Thank you guys so much for

0:25

coming out today. Um, this is going to

0:28

be a build your own sales agent

0:29

workshop. So, we're going to be walking

0:31

through everything you need to know to

0:33

build your own voice agent. My name is

0:34

Sarah Chieng from Cerebras and I am

0:36

excited to be joined by Zhenwei. Um, and

0:38

we are both part of the DevX team at

0:40

Cerebras.

0:41

>> Yeah, thanks Sarah. Um, so today we're

0:44

going to walk through how to build a

0:45

voice sales agent that can actually have

0:48

natural conversations with customers

0:50

and our sales agents will pull product

0:53

context from an external source to

0:54

respond in real time. So, we're going to

0:57

be building an AI agent that can speak,

0:59

listen, and respond intelligently

1:02

um to your company's sales materials.

1:06

And we have the full code for you to

1:08

follow along with. We have a notebook

1:10

that you can scan later, um, to step through,

1:13

and we'll walk you through it step by

1:15

step in just a moment.

1:18

So, before we get started, let's go

1:20

through what you will get out of this

1:22

workshop. So you will get free API

1:24

credits for Cerebras, LiveKit, and Cartesia.

1:26

You will have the quick start. We'll

1:28

again have a full code notebook for

1:30

you to follow along with and at the end

1:32

you will have your very own sales agent

1:33

that you can hook up to your company's

1:35

materials so that you can you know

1:38

implement this in production.

1:41

So here's the starter code that I would

1:43

recommend scanning just so you can

1:45

follow along. Um, again, this is what

1:47

we'll be walking through step by step

1:49

today. And there will be individual

1:50

modules that you'll be able to just run

1:52

and see some good outputs.

1:56

So, I'll give you a few seconds for

1:57

that.

2:01

We'll have the QR code later as well, so

2:03

not to worry. So, before we get started,

2:05

I wanted to talk a little bit about

2:07

Cerebras and, you know, Cerebras

2:09

inference's secret sauce. So, for those

2:11

of you who are unfamiliar, we are a

2:13

hardware company. We are building an AI

2:15

processor that is much larger and much

2:18

faster than what you are probably

2:19

familiar with with Nvidia GPUs. So out

2:23

of curiosity, I'm wondering how many

2:25

people here have heard about Cerebras

2:27

hardware. Not bad. Okay. Higher than

2:31

last year. Okay. Okay. So before we do

2:33

go, I want to share um I want to show

2:37

everyone the speed of what we're talking

2:39

about here. So

2:42

this is just a chat. It's running on

2:45

Cerebras. You can choose any. So, we can

2:47

host any different model on our

2:48

hardware. So, I'm going to choose an

2:50

example model like a llama model. And

2:53

I'm going to give it a prompt. So, I'm

2:54

going to give it a prompt that it's

2:55

intentionally asking it to respond

2:57

something a little longer. This go

3:02

funny dad jokes, but make each joke a

3:07

couple sentences.

3:10

Sentences.

3:12

And that's how fast it generates. Does

3:14

anyone else have a prompt you want to

3:16

try? A longer prompt.

3:30

>> Amazing. There you go.

3:34

So, really quickly before we get

3:36

started, I know we have a lot of

3:37

software geeks here, but I do want to

3:39

for a second talk about hardware. And I

3:42

want to talk a little bit about what

3:43

hardware innovations

3:45

um make such fast inference possible

3:49

especially as we build a new generation

3:51

of AI products.

3:54

And so we're going to a little bit of a

3:55

hardware segment, but one of the main

3:58

secret sauces for Cerebras is that

4:00

Cerebras chips do not have memory

4:02

bandwidth issues. And I don't know how

4:05

familiar you guys are with, you know,

4:06

GPU architecture, but we're actually

4:09

gonna deep dive really quickly into

4:12

how GPU architecture works and how it

4:14

compares to what people are doing today.

4:17

And so for context, this is the hardware

4:20

that, you know, all of our inference

4:22

runs on. It's the Wafer-Scale Engine 3.

4:24

It is quite literally the size of a

4:26

dinner plate. And this has 4 trillion

4:29

transistors, 900,000 cores, and very

4:31

significant amounts of on-chip memory.

4:34

And so this is the comparison of what

4:36

our hardware looks like next to the

4:38

NVIDIA GPU. So you can see some of those

4:40

metrics line up. So significantly more

4:43

transistors.

4:45

But to actually understand what Cerebras

4:48

did with their hardware that makes it

4:50

able to achieve 20x, 30x, 70x faster

4:54

speeds than inference on Nvidia GPUs.

4:57

We're going to actually start by taking

4:59

a look at the Nvidia GPU. So this is a

5:02

diagram of an H100.

5:05

And if you look at the red rectangle,

5:07

that is a core. And so on the H100

5:10

there's about 17,000 cores and each of

5:13

these cores is what is actually

5:16

doing all of the mathematical

5:18

computations needed in training or

5:20

inference or whatever computation you

5:22

need to do. So every core has a subset

5:25

of the computations um that is assigned.

5:29

So when you run inference what are some

5:32

of the types of things that a core will

5:34

need access to to do its computation? It

5:36

needs its weights, activations, KV cache,

5:39

etc. On the H100, all of these values

5:43

are stored off-chip. So, they're stored

5:45

in an off-chip memory. And so, as you can

5:48

imagine, during inference, each of these

5:50

cores, there's thousands of computations

5:53

happening constantly. And each core is

5:55

needing to constantly load and offload

5:58

the KV cache, activations, weights, etc.

6:01

from an off-chip memory location. And as you

6:04

can imagine this creates a very

6:06

significant memory channel um memory

6:09

bandwidth bottleneck.
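
As a rough illustration of the bandwidth bottleneck described above, here is a back-of-the-envelope sketch in Python. It assumes that generating each token means reading every model weight from memory once, so memory bandwidth divided by model size gives an upper bound on single-stream decode speed. The parameter count, weight precision, and bandwidth figure are illustrative assumptions, not numbers from the talk.

```python
def max_tokens_per_second(param_count: float,
                          bytes_per_param: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed if each token requires one full
    read of the model weights from memory (a simplifying assumption)."""
    model_bytes = param_count * bytes_per_param
    return bandwidth_bytes_per_s / model_bytes


# Illustrative assumptions only, not measurements from the workshop:
PARAMS = 8e9          # a hypothetical 8B-parameter model
BYTES_PER_PARAM = 2   # 16-bit weights
OFF_CHIP_BW = 3e12    # roughly 3 TB/s of off-chip memory bandwidth

bound = max_tokens_per_second(PARAMS, BYTES_PER_PARAM, OFF_CHIP_BW)
print(f"{bound:.1f} tokens/s upper bound")  # prints "187.5 tokens/s upper bound"
```

Under this simplified model the bound scales linearly with bandwidth, which is why moving weights and the KV cache closer to the cores matters so much for inference speed.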

6:11

What Cerebras has done instead is that

6:14

instead of storing all these values off

6:17

chip, every single core on the Cerebras

6:20

hardware, the WSE-3, there's 900,000 cores

6:23

which in comparison to 17,000 is already

6:25

a lot larger. Um every single core has

6:29

its own direct on-chip memory. So

6:32

its own SRAM. So every single core on

6:35

this wafer has a memory right next to

6:37

it. And what that means is that all of

6:40

the values that every single core needs

6:42

for computations like weights, KV cache,

6:45

etc. is directly accessible and much

6:47

faster to access, and it's right

6:49

there.

6:50

And so that's a

6:53

little bit, that's one example of what

6:54

Cerebras has done on the hardware side.

6:56

Um, but going back to software, I also

6:58

want to talk about really quickly one

6:59

thing that Cerebras implements on the

7:01

software side to accelerate inference.

7:04

And so one way that you can accelerate

7:06

inference is through a technique called

7:08

speculative decoding, um, as opposed to standard

7:11

decoding. So in standard decoding you

7:14

have one model generate every single

7:15

token one at a time. And this is

7:17

sequential, right? You have to wait for

7:18

the previous token to be generated to

7:20

generate the next token.

7:22

So in speculative decoding, you combine

7:26

two models. And what you're doing is you

7:29

use a smaller model that's like a draft

7:32

model that can generate all of the

7:34

tokens very quickly. And then you use

7:36

your larger model to go back and verify

7:39

that the output of the smaller model is

7:41

correct. And by combining these two

7:44

models, you're able to get the speed of

7:46

the smaller model and the accuracy of

7:48

the larger model. And if you think about

7:51

it, the worst case here, uh, is the

7:54

speed, um, that you would get from the

7:59

larger model on its own. So if the drafts

8:01

get rejected, the speed will be

8:02

that of the larger model, um, but it will

8:05

never be slower than that. So it will only

8:06

ever be faster.
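
The draft-then-verify loop described above can be sketched as a toy in Python. This is a simplified illustration, not Cerebras's implementation: it uses greedy exact-match verification (production systems typically use a probabilistic acceptance rule), and the `target` and `draft` "models" are hypothetical stand-in functions that map a token prefix to the next token. In a real system the verification of all drafted positions happens in one batched forward pass of the large model; here that pass is only counted.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[str]], str]  # maps a token prefix to the next token

SENTENCE = "the quick brown fox jumps over the lazy dog".split()


def target(prefix: List[str]) -> str:
    """Stand-in for the large, accurate model."""
    return SENTENCE[len(prefix) % len(SENTENCE)]


def draft(prefix: List[str]) -> str:
    """Stand-in for the small, fast model: wrong at every third position."""
    return "???" if len(prefix) % 3 == 2 else target(prefix)


def speculative_decode(target: Model, draft: Model, prompt: List[str],
                       max_new: int, k: int = 4) -> Tuple[List[str], int]:
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new:
        # 1. The draft model cheaply proposes k tokens, one after another.
        proposal: List[str] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target model checks the proposal (counted as one pass).
        target_passes += 1
        accepted: List[str] = []
        for guess in proposal:
            correct = target(tokens + accepted)
            accepted.append(correct)  # the target's token is always kept
            if guess != correct:
                break  # first mismatch: discard the rest of the draft
        else:
            # Every guess matched, so the same pass yields one bonus token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new], target_passes


output, passes = speculative_decode(target, draft, [], max_new=9)
print(" ".join(output), "| target passes:", passes)
```

With this toy draft model, nine tokens are produced in three target passes instead of nine, and the output is identical to what the target model would produce alone. If the draft is always wrong, each pass still yields one correct token, which matches the point above that the worst case is the larger model's own speed (ignoring the cost of running the draft model).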

8:10

So as a kind of a short recap: hardware,

8:12

memory bandwidth, we talked through

8:13

that. Software, speculative decoding. But

8:16

that was a little side moment and I want

8:19

to go

8:21

and now back to the workshop. Now that

8:22

you have all the context that you need.

8:25

>> Awesome job.

8:26

>> Yeah, thanks Sarah. Um, for those

8:29

folks who joined in late, you guys can

8:31

scan the QR code to get the starter

8:32

code. We had it in the early slide, but

8:35

um since we'll be teaching you guys how

8:37

to build these sales agents, you can

8:39

follow along with our code. Um yeah, so

8:42

I think in the future, most customer

8:44

interactions will probably be AI

8:46

powered, but you know, instead of just

8:47

typing back and forth with the chatbot,

8:50

the best way to kind of really have

8:52

these customer interactions is really

8:54

through real conversations, which is why

8:56

voice agents are so powerful.

9:00

So before we dive deep into it, what

9:02

exactly is a voice agent?

9:04

>> Absolutely. Um so voice agents are

9:07

stateful intelligent systems that can

9:09

simultaneously run inference while

9:11

constantly listening to you when you're

9:13

speaking and they can actually engage in

9:15

real and very natural conversations. Um

9:18

I would like to highlight four key uh

9:20

capabilities. First, they understand and

9:23

respond to spoken language. um they

9:26

don't just spit out answers based on

9:28

string matching or keywords but rather

9:30

they can actually understand the meaning

9:31

behind what people are saying. Um this

9:34

also means that they can handle a lot of

9:36

complex tasks. So someone might ask like

9:39

I'm looking for a product recommendation

9:41

and the agent can subsequently kind of

9:43

look into the user's purchase history,

9:45

the shop's current stock levels and

9:48

recommend something that they actually

9:49

like. And you actually might see this

9:51

referred to in some places as multi-

9:53

agent workflows. Um speech is

9:57

obviously the fastest way to communicate

9:58

your intent in any system. We're

10:00

speaking now I guess but you can just

10:03

say what you want. There's like no

10:05

typing, no clicking through menus and no

10:06

learning curves. And lastly um

10:09

none of this would be possible unless

10:11

the agent can keep track of the state of

10:13

the conversation. uh which means the

10:15

communication obviously is very highly

10:17

contextual and your agent needs to have

10:19

like state so they can actually hold a

10:21

coherent conversation across time.

10:24

So as you can imagine this makes um

10:27

voice agents perfect. You see a lot of

10:29

startups happening right now especially

10:31

in customer service, sales, tech support

10:33

etc. And so today we're going to be

10:35

focusing on the sales agent use case.

10:39

So, first let's talk about what's

10:40

actually happening inside a voice agent

10:42

when you're having a conversation and

10:44

break it down.

10:46

>> Yeah. So, you guys can see on this

10:48

diagram on the right, once speech is

10:51

detected, the voice data is forwarded to

10:54

STT, or what's called speech-to-text. This

10:57

listens and converts your words

10:59

to text in real time. And the last step

11:01

in this process is end of utterance um

11:04

or end-of-turn detection. Um, being

11:07

interrupted by AI every time you pause

11:09

is, like, very annoying. So, while VAD

11:12

can help the system know when you are

11:14

and you aren't speaking, it's also very

11:16

important to analyze like what you're

11:17

saying, the context of your speech, and

11:19

to predict like whether you're done

11:21

sharing your thoughts. So, we have

11:23

another smaller model here that

11:25

runs quickly on the CPU, which will

11:27

instruct the system to wait if it

11:29

predicts you're still speaking. So, once

11:31

your turn is done, the final text

11:33

transcription is forwarded to the next

11:35

layer.
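
The interplay between VAD silence and end-of-turn prediction described above can be sketched roughly as follows. This is a toy under stated assumptions: the thresholds are made-up values, and the `looks_complete` keyword heuristic is a hypothetical stand-in for the small CPU model mentioned in the talk, which analyzes the actual content of the speech.

```python
# Words that usually signal an unfinished thought (illustrative list only).
TRAILING_WORDS = {"and", "but", "so", "because", "or", "to", "the", "a",
                  "that", "um", "uh"}


def looks_complete(transcript: str) -> bool:
    """Hypothetical stand-in for a learned end-of-turn model."""
    words = transcript.lower().split()
    if not words:
        return False
    last = words[-1].strip(",.?!")
    return last not in TRAILING_WORDS


def end_of_turn(silence_ms: int, transcript: str,
                min_silence_ms: int = 300,
                max_silence_ms: int = 1500) -> bool:
    """Decide whether the agent should start responding."""
    if silence_ms < min_silence_ms:
        return False  # VAD: the user is still speaking or barely paused
    if silence_ms >= max_silence_ms:
        return True   # long silence: respond regardless of content
    # Short pause: wait unless the sentence sounds finished.
    return looks_complete(transcript)


print(end_of_turn(500, "I want to compare it to"))  # pause mid-thought
print(end_of_turn(500, "What does it cost?"))       # finished question
```

The design point is that silence alone is not enough: the same 500 ms pause leads to waiting in the first case and responding in the second.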

11:38

And then after that phase, we have the

11:41

thinking phase. So your entire question

11:43

is now passed onto the large language

11:45

model. Um, and this is basically, you

11:47

know, the brain like understands what

11:49

you're asking. So it might need to look

11:51

things up, which we'll walk through

11:52

later. Um, like checking in this case,

11:54

if we're doing a sales call, we'll want

11:56

to pull additional context like

11:58

documents, your other like more

12:00

information about your company

12:01

basically.

12:04

>> Yeah. And then the third and the final

12:06

step is the speaking phase. So as the LLM

12:08

streams its response back to the agent, the

12:10

agent will immediately start forwarding

12:12

these LLM tokens to the TTS engine or

12:14

text to speech. Um this generated um

12:17

audio from TTS streams back to your

12:19

client's application in real time and

12:21

that's why the agent can actually start

12:23

responding while it's still thinking.
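
To illustrate why the agent can start speaking before the LLM has finished, here is a minimal Python sketch of forwarding streamed tokens to TTS sentence by sentence. The `fake_llm` and `fake_tts` functions are hypothetical stand-ins, not LiveKit or Cartesia APIs, and the punctuation-based chunking is a simplification (it would split on a decimal point, for example).

```python
from typing import Iterable, Iterator, List, Tuple


def sentence_chunks(token_stream: Iterable[str]) -> Iterator[str]:
    """Group streamed tokens into sentences, yielding each one as soon
    as its closing punctuation arrives."""
    buffer = ""
    for token in token_stream:
        buffer += token
        if buffer.rstrip().endswith((".", "!", "?")):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush any trailing partial sentence


events: List[Tuple[str, str]] = []  # ordered log of what happened when


def fake_llm() -> Iterator[str]:
    """Hypothetical LLM that streams a reply token by token."""
    for token in ["Sure", ".", " Our", " Pro", " plan", " is", " popular",
                  ".", " Want", " details", "?"]:
        events.append(("llm", token))
        yield token


def fake_tts(text: str) -> None:
    """Hypothetical TTS call; a real one would return streamed audio."""
    events.append(("tts", text))


for chunk in sentence_chunks(fake_llm()):
    fake_tts(chunk)

spoken = [text for kind, text in events if kind == "tts"]
print(spoken)
```

In the event log, the first sentence reaches the TTS stand-in after only two tokens, while the rest of the reply is still being generated.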

12:26

So the final result is that all of these

12:28

components tied together is what's

12:30

making, you know, an AI agent that feels

12:32

very responsive, that feels very

12:33

cohesive and immediate, even though

12:36

there's a lot of complex processing

12:37

happening behind the scenes. So there's

12:39

a lot of moving pieces. In this case,

12:41

we're going to be using LiveKit's agent

12:43

SDK to handle all this orchestration for

12:45

us. Um, it's going to manage the audio

12:48

streams, keep track of the context, and

12:50

coordinate all these different AI

12:52

services that we've just talked about.

12:54

So, now that we have a little bit of

12:56

context, um you can access the starter

12:58

code here. We shared it already. And if

13:00

you want to run the first section right

13:02

here, it'll allow you to install all of

13:04

the necessary packages. So, if you click

13:06

on it, um you'll be able to see some of

13:08

the output of the packages being

13:10

downloaded. And so, this is going to use

13:12

LiveKit Agents with support for

13:14

Cartesia, Silero for voice activity

13:16

detection, and OpenAI compatibility.

13:21

And so we've very briefly talked about

13:23

Cerebras. It is 50 times faster than

13:25

GPUs. And

13:29

um I'll skip here. And so as a final

13:31

note, so for this um for this workshop,

13:35

we're actually going to be using Llama

13:36

3.3. And if you see in the chart on the

13:40

bottom right, this is a chart from

13:42

Artificial Analysis. Artificial

13:43

Analysis, if you're unfamiliar, is an

13:45

independent benchmark that benchmarks a

13:48

lot of different models, API providers,

13:50

etc. um on intelligence, speed, latency,

13:53

everything. And so you can see a

13:55

comparison here of Cerebras on the very

13:57

left in terms of tokens per second and

13:59

any of your other providers like Nvidia.

14:05

Awesome. Um going back to our code, um

14:09

hopefully everyone has had a second to

14:11

kind of install the packages. Um, and

14:13

now we can also install

14:16

the LiveKit CLI. This is optional for our

14:18

workshop today, but if you want to

14:20

use LiveKit beyond this, um, here are

14:22

the commands depending on your system.

14:24

Um, in general, we're obviously using

14:26

Python notebook today. So, no one has to

14:28

battle with their environment when

14:31

we're getting started. But again, if you

14:33

want to continuously build and deploy uh

14:35

the voice agent, the CLI probably is the

14:37

easiest way to do it. So just

14:39

uh type in lk app create and you can

14:42

instantly clone a pre-built agent like

14:44

this one.

14:49

Cool. And um let's talk a little bit

14:51

about what exactly LiveKit is and why we

14:54

need it for a voice agent. So the

14:57

existing internet isn't exactly designed

14:59

to build voice agent, uh, applications.

15:02

So HTTP stands for hypertext transfer

15:06

protocol. So it was designed for

15:07

transferring text over a network and

15:10

obviously for what we're building we

15:11

need to transfer voice data instead of

15:12

just text over a network with low

15:15

latency. Um, and LiveKit is a real-time

15:17

infrastructure platform for doing just

15:19

that. So instead of using HTTP, it actually

15:21

uses a different protocol called WebRTC

15:24

to transport voice data between your

15:25

client application and AI model with less

15:28

than 100 milliseconds of latency anywhere

15:30

in the world which is awesome. It's very

15:32

resilient, handles a lot of concurrent

15:33

sessions and it's fully open source. So

15:35

you can kind of dig into the code and

15:37

you can see how it works or even host

15:39

infrastructure yourself as well.

15:42

Um

15:44

yeah, so you can use LiveKit to build

15:45

any type of, like, voice agents, the

15:47

ones that can join your meetings, the

15:49

ones answering phone calls in

15:50

call centers, and in our

15:53

case today an agent that can speak to

15:54

prospective customers on your website on

15:56

your behalf. And here you can see

15:59

connecting it to the original diagram

16:01

that we showed. So you see like the LLM,

16:03

TTS, STT and all the AI components that

16:06

we talked about earlier. And now you can

16:08

see, you know, how these actual tools

16:09

like LiveKit, Cartesia, your

16:11

inference provider, all of these things

16:13

are actually playing together to help

16:15

you create a voice agent. And so the

16:17

final component as I mentioned is the

16:19

actual speech processing. Um, so in

16:22

addition to Cerebras and LiveKit, as I

16:24

mentioned, we'll be using Cartesia to

16:26

turn the voice into text and then at the

16:28

end text back to voice.

16:32

So now that our API keys are set up step

16:35

two is all about teaching our AI sales

16:37

agent about our business. So when you

16:39

train a new employee you have to give them

16:41

information and context on your

16:42

business. And so that's what we're going

16:43

to be doing now.

16:45

>> Yeah. Um, I think the challenge a lot of

16:48

the times with LLMs is that they know a

16:50

lot about everything, but they might not

16:51

know many specific things or domain

16:54

things about your company. Um, and

16:55

they're only really as good as their

16:57

training set. So, if we want to respond

16:58

with any information that isn't common

17:00

public knowledge, we should really try

17:02

and load it into the LLM's context to

17:04

minimize hallucination or any sort of

17:05

canned responses such as, "I can't help

17:07

with that."

17:10

So, in this case, we're just going to be

17:11

feeding the LLM a document with

17:13

additional information. So, for example,

17:15

we can load our pricing details if

17:17

someone asks about pricing. But we can

17:19

also load information like product

17:21

descriptions, pricing info, key um key

17:23

benefits. And another big thing that we

17:26

can do is write pre-written responses to

17:28

common objections. So, for example, if

17:31

it's common that someone says it's too

17:32

expensive, you can write a pre-written

17:34

message so that our agent will always

17:36

stay on message and it has the correct

17:38

context. So, if you look at the

17:40

notebook, you can see what that context

17:42

looks like in practice, right? you don't

17:44

have to just give it access to a full

17:46

document. Um you can see that we've in

17:49

um organized all the information that

17:50

our sales agent needs into a very simple

17:53

structured format for the AI to

17:55

understand and reference.
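
As a rough sketch of what such a structured context could look like in Python, the example below organizes product details, pricing, benefits, and pre-written objection responses into a dictionary and serializes it for the prompt. The function name `load_context` follows the one mentioned in the workshop, but its body, the company, the products, and every price here are hypothetical placeholders, not the contents of the actual notebook.

```python
import json


def load_context() -> str:
    """Return the sales context as a structured string for the LLM.

    All values below are made-up placeholders for illustration.
    """
    context = {
        "products": {
            "starter_plan": {
                "description": "Entry-level plan for small teams.",
                "price": "$10 per user per month",
            },
            "pro_plan": {
                "description": "Adds analytics and priority support.",
                "price": "$25 per user per month",
            },
        },
        "key_benefits": [
            "Fast onboarding",
            "No long-term contract",
        ],
        "objections": {
            "It's too expensive": (
                "I understand budget matters. Many teams start on the "
                "starter plan and upgrade once they see results."
            ),
        },
    }
    return json.dumps(context, indent=2)


print(load_context())
```

Keeping the objection responses as explicit key-value pairs is one simple way to keep the agent on message, since the pre-written wording travels with the context rather than being improvised.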

17:58

So you can see everything that you um a

18:00

good salesperson would need like the

18:02

descriptions and then as we mentioned it

18:04

has these pre-written messages as well

18:06

so that you can control the out um the

18:09

behavior of your voice agent more

18:10

closely.

18:13

Um, now we're off to the more exciting

18:16

part, even more exciting part, step

18:18

three, where we actually create our

18:19

sales agent. So, this is where

18:21

everything that we've just talked about,

18:23

the components, and we're going to wire

18:24

them all together into a working system.

18:27

Um, and before you run anything, let's

18:30

actually walk through what is happening

18:31

in the sales agent class. So, in the

18:34

code, you can see we start by loading

18:35

our context by using the load_context

18:37

function we defined earlier. And this

18:39

gives our agent access to all the

18:41

product information, pricing, and

18:44

objection handlers that we set up.

18:50

Oh, sorry.

18:54

So, and finally, I want to look at how

18:56

we're implementing everything in code in

18:58

terms of creating the actual sales

18:59

agent. So there's way more of the

19:02

code in the notebook, but at a high

19:05

level, um, there's kind

19:07

of four components. So you want to start

19:09

by, you know, telling your sales agent

19:10

that it's a voice agent, um, that it's

19:13

communicating by

19:15

voice, um, and give it proper rules like

19:18

you know don't use bullet points because

19:20

everything is spoken aloud. So you want

19:21

to do um a bit of prompting and then

19:23

most importantly only use information

19:26

from the context that you provided. So

19:28

you want to be very careful

19:29

especially with voice agents that you

19:31

are not allowing um that you're reducing

19:32

the risk of hallucinations as much as

19:34

possible. And then the super call is

19:36

what's initializing our agent and passes

19:38

all of our configurations to the parent

19:40

agent. And this is setting up our agent

19:42

with the LLM, STT, TTS, VAD, and all the

19:46

instructions working together. And then

19:48

the last thing that we're going to do is

19:49

we're also going to define an on_enter

19:50

method which is what's going to start

19:52

the actual conversation. So, as soon as

19:54

someone joins the conversation with the

19:56

agent, instead of sitting in silence, it

19:59

immediately um or this is triggered as

20:01

soon as someone joins the conversation.

20:02

So, instead of ever sitting in silence,

20:05

you're going to immediately generate

20:06

that greeting, um, the way a good salesperson

20:08

would.
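
The prompting rules described above (speak as a voice agent, avoid bullet points, use only the provided context) can be sketched as a small Python helper that builds the instruction string. This is an assumption about how the pieces fit together, not the notebook's exact wording: `build_instructions` is a hypothetical helper name, and the resulting string is the kind of value that would be handed to the parent agent in the super call along with the LLM, STT, TTS, and VAD components.

```python
def build_instructions(context: str) -> str:
    """Assemble the system instructions for a voice sales agent.

    `context` is the product, pricing, and objection text loaded earlier.
    """
    rules = [
        "You are a sales agent communicating by voice.",
        "Everything you say is spoken aloud, so do not use bullet points, "
        "lists, or other visual formatting.",
        "Keep answers short and conversational.",
        "Only use information from the context below. If the answer is "
        "not in the context, say you will follow up rather than guessing.",
    ]
    return "\n".join(rules) + "\n\nCONTEXT:\n" + context


# Hypothetical context string, for illustration only.
instructions = build_instructions("Pro plan: $25 per user per month")
print(instructions)
```

The last rule is the one that does the most to reduce hallucinations: it tells the model what to do when the context does not contain the answer, instead of leaving that case open.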

20:11

Yeah. And then we're off to our step

20:14

four. We're actually launching a

20:15

sequence and running the agent. Um,

20:18

think of this entire kind of uh entry

20:21

point function as a start button to our

20:23

agent. And when someone wants to have a

20:25

conversation, obviously it kicks off

20:26

everything into gear and gets the agent

20:28

ready to talk. So this entry point

20:31

function is doing three main things. So

20:33

it's connecting the agent to a virtual

20:35

room where the conversation will happen.

20:37

So it's like dialing into a conference

20:39

call. Um, then it's going to create an

20:41

instance of our sales agent with the

20:42

setup that we just configured. And so

20:44

finally, it's going to start a session

20:47

that manages the back and forth

20:48

conversations. And so that is it for the

20:51

basis or like I guess the main framework

20:53

for how you would set up a sales agent.

20:55

But to make this project a little more

20:58

robust, we're actually going to talk

20:59

about a few ways that you can expand

21:01

your sales agent. So

21:04

here's one example.

21:07

Yeah. So one thing you can do is, um,

21:10

expand our single agent into a multi-

21:12

agent system. Um, you know, if

21:17

someone calls asking really deep

21:20

technical questions about API

21:21

integrations you really want them

21:23

talking to your best technical person

21:24

and not just your pricing

21:26

specialist. Um again LLMs have

21:29

limited context windows which means that

21:31

similar to people they have limits on

21:33

the amount of things that they can

21:34

actually specialize. Um and here are the

21:36

three other agents in addition to that

21:38

single agent that, um, the starter code

21:41

has just helped you guys run. Um three

21:43

of the different agents that we propose

21:45

in this case are, um, a greeting agent, um,

21:48

our main sales agent who qualifies

21:50

leads. We have a technical specialist

21:52

agent as you can see on the left um who

21:54

is obviously specialized in solving

21:56

technical issues, and then

21:59

finally we have the pricing specialist

22:01

agent on the right which handles budget

22:04

ROI and also deal negotiations. So the

22:07

main thing that you want to think about

22:08

here is you know on a real sales team

22:10

you want or any like multi-agent system

22:12

you want all of your agents to be able

22:13

to do very different things. And so one

22:15

of the key things in this um

22:17

implementation is that we have a um is

22:21

that we have a handoff. So our greeting

22:23

agent is what's figuring out what the

22:25

customer actually needs and then being

22:26

able to route to the um to the relevant

22:29

sub agent.
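
The handoff idea described above can be sketched as a tiny router in Python. This is a toy under stated assumptions: in the workshop the greeting agent decides where to route using the LLM, whereas here a hypothetical keyword match stands in for that decision, and the `SubAgent` class and keyword lists are made up for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class SubAgent:
    name: str
    instructions: str


AGENTS = {
    "technical": SubAgent("technical", "Answer API and integration questions."),
    "pricing": SubAgent("pricing", "Handle budget, ROI, and deal negotiation."),
    "sales": SubAgent("sales", "Qualify the lead and explain the product."),
}

# Illustrative keywords standing in for the LLM's routing decision.
KEYWORDS = {
    "technical": {"api", "integration", "sdk", "error"},
    "pricing": {"price", "pricing", "cost", "budget", "discount", "roi"},
}


def route(utterance: str) -> SubAgent:
    """Pick the sub-agent that should take over the conversation."""
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    for name, keys in KEYWORDS.items():
        if words & keys:
            return AGENTS[name]
    return AGENTS["sales"]  # default: the main sales agent


print(route("How do I call your API from Python?").name)
print(route("What's the price for ten seats?").name)
print(route("Tell me about your product").name)
```

Each sub-agent carries its own narrow instructions, which reflects the point above about limited context windows: a specialist only needs the context for its own area.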

22:32

And the code for all of these different

22:33

agents is fully fleshed out in the

22:35

notebook as well. And then the last

22:37

thing, of course, is adding

22:38

tool calling. So for example when

22:40

a customer asks about technical

22:42

details you know we can properly route

22:45

and then this is also implemented as

22:47

well in the code notebook

22:50

and that is it. So thank you guys so

22:53

much for coming. Um, again, all of the

22:56

code, with all the instructions and

22:58

the step by step, is in the notebook that

22:59

we've provided and have built. Um and

23:01

we'll be up here to answer any questions

23:02

that you guys might have. Thank you

23:04

guys.

Interactive Summary

In this workshop, Sarah Chieng and Zhenwei Gao from Cerebras guide participants through building an AI-powered voice sales agent. The session covers the importance of low-latency voice interactions, how Cerebras hardware achieves high performance through on-chip memory to solve bandwidth bottlenecks, and the use of technologies like LiveKit, Cartesia, and Llama 3.3. The instructors explain the architecture of a voice agent, including speech-to-text, text-to-speech, and LLM processing, and demonstrate how to provide domain-specific context to reduce hallucinations and implement multi-agent workflows for better customer interaction.
