Build a Real-Time AI Sales Agent - Sarah Chieng & Zhenwei Gao, Cerebras
[music]
Hi everyone, we're about to start the next session. Thank you so much for coming out today. This is going to be a build-your-own-sales-agent workshop, so we're going to walk through everything you need to know to build your own voice agent. My name is Sarah Chieng from Cerebras, and I am excited to be joined by Zhenwei. We are both part of the DevX team at Cerebras.
>> Yeah, thanks Sarah. So today we're going to walk through how to build a voice sales agent that can actually have natural conversations with customers. Our sales agent will pull product context from an external source to respond in real time. So we're going to be building an AI agent that can speak, listen, and respond intelligently based on your company's sales materials. And we have the full code for you to follow along with: there's a notebook you can scan in a moment, and we'll walk you through it step by step.
So, before we get started, let's go through what you will get out of this workshop. You will get free API credits for Cerebras, LiveKit, and Cartesia. You will have the quickstart; again, we'll have a full code notebook for you to follow along with. And at the end, you will have your very own sales agent that you can hook up to your company's materials so you can implement this in production.
So here's the starter code that I would recommend scanning just so you can follow along. Again, this is what we'll be walking through step by step today, and there are individual modules that you'll be able to just run and see some good outputs. So I'll give you a few seconds for that.
We'll have the QR code later as well, so not to worry. So, before we get started, I wanted to talk a little bit about Cerebras and Cerebras inference's secret sauce. For those of you who are unfamiliar, we are a hardware company. We are building an AI processor that is much larger and much faster than what you are probably familiar with from Nvidia GPUs. Out of curiosity, I'm wondering how many people here have heard about Cerebras hardware. Not bad. Okay, higher than last year. So before we go on, I want to show everyone the speed of what we're talking about here. This is just a chat running on Cerebras. We can host many different models on our hardware, so I'm going to choose an example model, a Llama model, and I'm going to give it a prompt that intentionally asks it to respond with something a little longer: give me funny dad jokes, but make each joke a couple of sentences.
And that's how fast it generates. Does anyone else have a prompt you want to try? A longer prompt.
>> Amazing. There you go.
So, really quickly before we get started, I know we have a lot of software geeks here, but I do want to talk for a second about hardware, and about what hardware innovations make such fast inference possible, especially as we build a new generation of AI products.
So we're going to do a little bit of a hardware segment. One of the main secret sauces for Cerebras is that Cerebras chips do not have memory bandwidth issues. I don't know how familiar you are with GPU architecture, but we're going to deep dive really quickly into how GPU architecture works and how it compares to what people are doing today.
For context, this is the hardware that all of our inference runs on: the Wafer Scale Engine 3. It is quite literally the size of a dinner plate, and it has 4 trillion transistors, 900,000 cores, and a very significant amount of on-chip memory. And this is what our hardware looks like next to an Nvidia GPU, so you can see how those metrics line up: significantly more transistors.
But to actually understand what Cerebras did with its hardware that makes it able to achieve 20x, 30x, even 70x faster inference than Nvidia GPUs, we're going to start by taking a look at the Nvidia GPU. This is a diagram of an H100. If you look at the red rectangle, that is a core. On the H100 there are about 17,000 cores, and each of these cores is what actually does the mathematical computations needed for training, inference, or whatever computation you need to run. Every core has a subset of the computations assigned to it. So when you run inference, what are some of the things a core needs access to in order to do its computation? It needs its weights, activations, KV cache, and so on. On the H100, all of these values are stored off-chip, in off-chip memory. As you can imagine, during inference there are thousands of computations happening constantly, and each core is constantly loading and offloading the KV cache, activations, weights, etc. from an off-chip memory location. And as you can imagine, this creates a very significant memory bandwidth bottleneck.
What Cerebras has done instead is that, rather than storing all these values off chip, every single core on the Cerebras hardware, the WSE-3, has its own direct on-chip memory, its own SRAM. (And with 900,000 cores compared to 17,000, there are already a lot more of them.) So every single core on this wafer has memory right next to it. What that means is that all of the values every core needs for its computations, like the weights and the KV cache, are directly accessible and much faster to access; they're right there. So that's one example of what Cerebras has done on the hardware side.
But going back to software, I also want to talk really quickly about one thing that Cerebras implements on the software side to accelerate inference. One way you can accelerate inference is through a technique called speculative decoding. In standard decoding, you have one model generate every single token, one at a time. This is sequential: you have to wait for the previous token to be generated before you can generate the next one.
In speculative decoding, you combine two models. You use a smaller draft model that can generate tokens very quickly, and then you use your larger model to go back and verify that the output of the smaller model is correct. By combining these two models, you're able to get the speed of the smaller model and the accuracy of the larger model. And if you think about it, your worst case is the speed of the larger model: the combined system will never be slower than running the larger model alone; it can only ever be faster.
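To make the idea concrete, here is a minimal, purely illustrative sketch of one speculative-decoding step in Python. The draft_model and target_model objects and their methods are hypothetical stand-ins, not Cerebras or any real library API.

```python
# Illustrative sketch of one speculative-decoding step.
# draft_model / target_model and their methods are hypothetical stand-ins.

def speculative_decode_step(draft_model, target_model, tokens, k=4):
    # 1. The small draft model cheaply proposes the next k tokens.
    draft_tokens = draft_model.generate(tokens, max_new_tokens=k)

    accepted = []
    for token in draft_tokens:
        # 2. The large target model verifies each proposed token.
        if target_model.accepts(tokens + accepted, token):
            accepted.append(token)  # draft token accepted "for free"
        else:
            # 3. On the first disagreement, fall back to the target model's
            #    own choice and stop; correctness matches the large model.
            accepted.append(target_model.sample_next(tokens + accepted))
            break

    # At least one token is produced per large-model pass, so this is never
    # slower than standard decoding with the large model alone.
    return accepted
```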
So as a short recap: on hardware, we talked through memory bandwidth, and on software, speculative decoding. That was a little side moment, and now back to the workshop, now that you have all the context that you need.
>> Awesome job.
>> Yeah, thanks Sarah. For those folks who joined late, you can scan the QR code to get the starter code. We had it on an earlier slide, but since we'll be teaching you how to build these sales agents, you can follow along with our code. So, I think in the future most customer interactions will probably be AI powered, but instead of just typing back and forth with a chatbot, the best way to have these customer interactions is through real conversations, which is why voice agents are so powerful.
So before we dive deep into it, what
exactly is a voice agent?
>> Absolutely. So voice agents are stateful, intelligent systems that can run inference while constantly listening to you as you speak, and they can engage in real and very natural conversations. I would like to highlight four key capabilities. First, they understand and respond to spoken language. They don't just spit out answers based on string matching or keywords; they can actually understand the meaning behind what people are saying. Second, this also means that they can handle a lot of complex tasks. Someone might ask, "I'm looking for a product recommendation," and the agent can then look into the user's purchase history and the shop's current stock levels and recommend something they would actually like. You might see this referred to in some places as multi-agent workflows. Third, speech is the fastest way to communicate your intent to any system. We're speaking now, I guess [laughter], but you can just say what you want: no typing, no clicking through menus, and no learning curve. And lastly, none of this would be possible unless the agent could keep track of the state of the conversation. Communication is highly contextual, and your agent needs to have state so it can hold a coherent conversation across time.
As you can imagine, this makes voice agents a perfect fit for a lot of use cases; you see a lot of startups right now, especially in customer service, sales, tech support, and so on. Today we're going to be focusing on the sales agent use case. So first, let's talk about what's actually happening inside a voice agent when you're having a conversation, and break it down.
>> Yeah. As you can see in the diagram on the right, once speech is detected, the voice data is forwarded to STT, speech-to-text. This listens and converts your words to text in real time. The last step in this process is end-of-utterance, or end-of-turn, detection. Being interrupted by the AI every time you pause is very annoying. So, while VAD can help the system know when you are and aren't speaking, it's also very important to analyze what you're saying, the context of your speech, and to predict whether you're done sharing your thought. So we have another, smaller model here that runs quickly on the CPU and instructs the system to wait if it predicts you're still speaking. Once your turn is done, the final text transcription is forwarded to the next layer.
After that phase, we have the thinking phase. Your entire question is now passed to the large language model. This is basically the brain: it understands what you're asking. It might need to look things up, which we'll walk through later. In this case, since we're doing a sales call, we'll want to pull in additional context, like documents and more information about your company.
>> Yeah. And then the third and final step is the speaking phase. As the LLM streams its response back to the agent, the agent immediately starts forwarding those LLM tokens to the TTS engine, text-to-speech. The generated audio from TTS streams back to your client application in real time, and that's why the agent can actually start responding while it's still thinking.
So the final result is that all of these components tied together are what make an AI agent that feels responsive, cohesive, and immediate, even though there's a lot of complex processing happening behind the scenes. There are a lot of moving pieces, and in this case we're going to use LiveKit's Agents SDK to handle all this orchestration for us. It's going to manage the audio streams, keep track of the context, and coordinate all the different AI services we've just talked about.
So, now that we have a little bit of context, you can access the starter code here; we shared it already. If you run the first section right there, it'll install all of the necessary packages, and if you click on it you'll be able to see the output of the packages being downloaded. This is going to use LiveKit Agents with support for Cartesia, Silero for voice activity detection, and OpenAI compatibility.
And we've very briefly talked about Cerebras; it is 50 times faster than GPUs. As a final note, for this workshop we're actually going to be using Llama 3.3. If you look at the chart on the bottom right, this is a chart from Artificial Analysis. Artificial Analysis, if you're unfamiliar, is an independent benchmark that benchmarks a lot of different models and API providers on intelligence, speed, latency, everything. You can see a comparison here of Cerebras, on the very left, in terms of tokens per second against other providers like Nvidia.
Awesome. Going back to our code, hopefully everyone has had a second to install the packages. Now we can also install the LiveKit CLI. This is optional for our workshop today, but if you want to use LiveKit beyond this, here are the commands depending on your system. In general, we're using a Python notebook today, so no one has to battle with their environment while we're getting started. But again, if you want to continuously build and deploy the voice agent, the CLI is probably the easiest way to do it. Just type `lk app create` and you can instantly clone a pre-built agent like this one.
Cool. Let's talk a little bit about what exactly LiveKit is and why we need it for a voice agent. The existing internet isn't exactly designed for building voice agent applications. HTTP stands for Hypertext Transfer Protocol: it was designed for transferring text over a network, and obviously, for what we're building, we need to transfer voice data, not just text, over a network with low latency. LiveKit is a real-time infrastructure platform for doing just that. Instead of HTTP, it uses a different protocol called WebRTC to transport voice data between your client application and the AI models, with less than 100 milliseconds of latency anywhere in the world, which is awesome. It's very resilient, handles a lot of concurrent sessions, and it's fully open source, so you can dig into the code and see how it works, or even host the infrastructure yourself.
Yeah, so you can use LiveKit to build any type of voice agent: ones that can join your meetings, ones answering phone calls in call centers, and, in our case today, an agent that can speak to prospective customers on your website on your behalf. And here you can see this connected back to the original diagram that we showed: the LLM, TTS, STT, and all the AI components we talked about earlier. Now you can see how the actual tools, LiveKit, Cartesia, and your inference provider, all play together to help you create a voice agent. And the final component, as I mentioned, is the actual speech processing: in addition to Cerebras and LiveKit, we'll be using Cartesia to turn the voice into text and, at the end, the text back into voice.
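The key setup itself isn't shown in the transcript, so here's a hedged sketch of the kind of notebook cell the next step assumes. The environment variable names follow each provider's usual convention, but they're assumptions here; use whatever names the workshop notebook actually reads.

```python
# Assumed environment variables for the providers used in this workshop.
# The exact names the notebook reads may differ; treat these as placeholders.
import os

os.environ["LIVEKIT_URL"] = "wss://<your-project>.livekit.cloud"  # placeholder URL
os.environ["LIVEKIT_API_KEY"] = "<livekit-api-key>"
os.environ["LIVEKIT_API_SECRET"] = "<livekit-api-secret>"
os.environ["CARTESIA_API_KEY"] = "<cartesia-api-key>"
os.environ["CEREBRAS_API_KEY"] = "<cerebras-api-key>"
```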
So now that our API keys are set up, step two is all about teaching our AI sales agent about our business. When you train a new employee, you have to give them information and context on your business, and that's what we're going to do now.
>> Yeah. I think the challenge a lot of the time with LLMs is that they know a lot about everything, but they might not know the specific, domain things about your company, and they're only really as good as their training set. So, if we want them to respond with any information that isn't common public knowledge, we should load it into the LLM's context to minimize hallucination or canned responses such as, "I can't help with that."
In this case, we're just going to feed the LLM a document with additional information. For example, we can load our pricing details in case someone asks about pricing, but we can also load information like product descriptions and key benefits. Another big thing we can do is write pre-written responses to common objections. For example, if people commonly say it's too expensive, you can write a pre-written message so that our agent always stays on message and has the correct context. If you look at the notebook, you can see what that context looks like in practice; you don't have to just give it access to a full document. You can see that we've organized all the information our sales agent needs into a very simple, structured format for the AI to understand and reference.
So you can see everything a good salesperson would need, like the product descriptions, and, as we mentioned, it has these pre-written messages as well so that you can control the behavior of your voice agent more closely. A rough sketch of that format is below.
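As an illustration of that structured format, here's a hypothetical sketch. The field names, the example product, and this load_context helper are all made up for illustration; the notebook's real context and its own load_context function are the ones to follow.

```python
# Hypothetical sketch of the structured sales context described above.
# Product, tiers, and field names are invented; see the notebook's version.
import json

SALES_CONTEXT = {
    "product": {
        "name": "Acme Analytics",
        "description": "Real-time analytics for e-commerce teams.",
        "key_benefits": ["5-minute setup", "no data engineering required"],
    },
    "pricing": {
        "starter": "$49/month, up to 3 seats",
        "pro": "$199/month, unlimited seats",
    },
    # Pre-written responses to common objections keep the agent on message.
    "objection_handlers": {
        "too_expensive": (
            "I understand budget matters. Most teams make the cost back in "
            "time saved within the first month; would a starter trial help?"
        ),
    },
}

def load_context() -> str:
    """Flatten the structured context into text for the system prompt."""
    return json.dumps(SALES_CONTEXT, indent=2)
```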
Now we're off to the even more exciting part, step three, where we actually create our sales agent. This is where we take all the components we've just talked about and wire them together into a working system. Before you run anything, let's walk through what is happening in the sales agent class. In the code, you can see we start by loading our context using the load_context function we defined earlier, and this gives our agent access to all the product information, pricing, and objection handlers that we set up.
So finally, I want to look at how we're implementing everything in code to create the actual sales agent. There's much more of the code in the notebook, but at a high level there are four pieces. You start by telling your sales agent that it's communicating by voice and giving it proper rules, like "don't use bullet points," because everything is spoken aloud. So you want to do a bit of prompting, and then, most importantly, tell it to only use information from the context that you provided. You want to be very careful, especially with voice agents, that you're reducing the risk of hallucinations as much as possible. Then the super call is what initializes our agent and passes all of our configuration to the parent agent class, setting up our agent with the LLM, TTS, VAD, and all the instructions working together. And the last thing we do is define an on_enter method, which is what starts the actual conversation. It's triggered as soon as someone joins the conversation with the agent, so instead of ever sitting in silence, the agent immediately generates a greeting, like a good salesperson would. Here's roughly what that looks like in code.
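This is a minimal sketch of what that class can look like, assuming the LiveKit Agents 1.x Python API (the Agent base class, the on_enter hook, and session.generate_reply) and the load_context helper from earlier; the notebook's full class is the authoritative version.

```python
# Sketch of the sales agent class, assuming the LiveKit Agents 1.x API.
# The full version (and the real load_context) lives in the notebook.
from livekit.agents import Agent

class SalesAgent(Agent):
    def __init__(self) -> None:
        context = load_context()  # product info, pricing, objection handlers
        super().__init__(
            # Voice-specific prompting: no bullet points, short sentences,
            # and answer only from the provided context to limit hallucinations.
            instructions=(
                "You are a friendly sales agent speaking aloud with a customer. "
                "Keep answers short and conversational; never use bullet points. "
                "Only use information from the context below.\n\n" + context
            ),
        )

    async def on_enter(self) -> None:
        # Triggered as soon as someone joins: greet immediately instead of
        # sitting in silence, like a good salesperson would.
        self.session.generate_reply(
            instructions="Greet the caller warmly and ask how you can help."
        )
```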
Yeah. And then we're off to step four, where we actually launch the session and run the agent. Think of this entire entrypoint function as the start button for our agent: when someone wants to have a conversation, it kicks everything into gear and gets the agent ready to talk. The entrypoint function is doing three main things. First, it connects the agent to a virtual room where the conversation will happen, like dialing into a conference call. Then it creates an instance of our sales agent with the setup we just configured. And finally, it starts a session that manages the back-and-forth conversation. A sketch of this entrypoint follows, and that's really it for the basic framework of how you would set up a sales agent.
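Here's a hedged sketch of that entrypoint, again assuming the LiveKit Agents 1.x API. The Cerebras OpenAI-compatible base URL, the model ID, and the choice of Cartesia for both STT and TTS mirror what the talk describes, but the exact plugin arguments in the notebook may differ.

```python
# Sketch of the entrypoint "start button", assuming LiveKit Agents 1.x.
# Exact plugin arguments and the Cerebras endpoint details are assumptions.
import os

from livekit.agents import AgentSession, JobContext, WorkerOptions, cli
from livekit.plugins import cartesia, openai, silero

async def entrypoint(ctx: JobContext):
    # 1. Connect the agent to the virtual room, like dialing into a call.
    await ctx.connect()

    # 2. Create the session with the pipeline components we configured.
    session = AgentSession(
        vad=silero.VAD.load(),                      # voice activity detection
        stt=cartesia.STT(),                         # speech -> text
        llm=openai.LLM(                             # "thinking" on Cerebras via
            model="llama-3.3-70b",                  #   its OpenAI-compatible API
            base_url="https://api.cerebras.ai/v1",
            api_key=os.environ["CEREBRAS_API_KEY"],
        ),
        tts=cartesia.TTS(),                         # text -> speech
    )

    # 3. Start the back-and-forth conversation with our SalesAgent.
    await session.start(agent=SalesAgent(), room=ctx.room)

if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```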
But to make this project a little more robust, we're actually going to talk about a few ways that you can expand your sales agent. So here's one example.
Yeah. So one thing you can do is expand our single agent into a multi-agent system. If someone calls asking really deep technical questions about API integrations, you really want them talking to your best technical person and not just your pricing specialist. Again, all LLMs have limited context windows, which means that, similar to people, there are limits on how many things they can actually specialize in. In addition to the single sales agent that the starter code just helped you run, which is our main sales agent that qualifies leads, we propose three other agents in this case: a greeting agent; a technical specialist agent, as you can see on the left, which is specialized in solving technical issues; and finally a pricing specialist agent on the right, which handles budget, ROI, and deal negotiations. The main thing to think about here is that, just like on a real sales team, or in any multi-agent system, you want your agents to be able to do very different things. One of the key pieces in this implementation is the handoff: our greeting agent figures out what the customer actually needs and then routes to the relevant sub-agent, as sketched below.
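As a rough sketch of that handoff, here's one way it can look, assuming LiveKit Agents' pattern of returning a new Agent (optionally with a short message) from a function tool to transfer control. The agent names and instructions are illustrative; the notebook's multi-agent code is the real implementation.

```python
# Hedged sketch of the greeting-agent handoff; assumes LiveKit Agents'
# tool-return handoff pattern. Names and instructions are illustrative.
from livekit.agents import Agent, function_tool

class TechnicalSpecialistAgent(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="Answer deep technical and API integration questions.")

class PricingSpecialistAgent(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="Handle budget, ROI, and deal negotiation questions.")

class GreetingAgent(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions=(
                "Greet the caller, figure out what they need, and hand off to "
                "the right specialist using your tools."
            )
        )

    @function_tool()
    async def transfer_to_technical(self):
        """Use when the caller asks deep technical or API integration questions."""
        return TechnicalSpecialistAgent(), "Let me connect you with our technical specialist."

    @function_tool()
    async def transfer_to_pricing(self):
        """Use when the caller asks about budget, ROI, or deal terms."""
        return PricingSpecialistAgent(), "Let me bring in our pricing specialist."
```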
And the code for all of these different agents is fully fleshed out in the notebook as well. Then the last thing, of course, is adding tool calling. So, for example, when a customer asks about technical details, we can properly route and look things up; this is also implemented in the code notebook, with a small sketch below.
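And here's a small, hypothetical example of a plain data-lookup tool (as opposed to a handoff tool). The PRICING table and the get_pricing name are made up for illustration; the notebook's own tools are the ones to copy.

```python
# Hypothetical data-lookup tool; PRICING and get_pricing are illustrative only.
from livekit.agents import Agent, function_tool

PRICING = {"starter": "$49/month", "pro": "$199/month"}  # made-up figures

class PricingAgent(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="Answer pricing questions using your tools.")

    @function_tool()
    async def get_pricing(self, tier: str) -> str:
        """Look up the monthly price for a pricing tier the caller asks about."""
        return PRICING.get(tier, "I don't have a tier by that name.")
```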
And that is it. Thank you guys so much for coming. Again, all of the instructions and the step-by-step walkthrough are in the notebook that we've provided, and we'll be up here to answer any questions you might have. Thank you guys.
[applause]
>> [music]
This workshop focuses on building a voice sales agent capable of natural conversations with customers by pulling product context from external sources. The session introduces Cerebras's hardware, the Wafer-Scale Engine 3, which offers significant speed advantages over Nvidia GPUs due to its on-chip memory architecture, eliminating memory bandwidth bottlenecks. It also covers speculative decoding for inference acceleration. Participants learn about the core capabilities of voice agents, including understanding spoken language, handling complex tasks, using speech for communication, and maintaining conversational state. The workshop breaks down a voice agent's operation into listening (speech-to-text), thinking (LLM processing with external context), and speaking (text-to-speech). Tools like LiveKit's Agents SDK, Cartesia, and Silero are used to orchestrate these components. A crucial step involves teaching the AI sales agent about specific business information to minimize hallucinations and provide accurate responses, and the workshop concludes by showing how to create and launch the agent, including expanding it into a multi-agent system with specialized roles and tool-calling capabilities.