Making an XDS control plane from scratch - Part 1 - Naive Python version

Transcript

0:01

Hello everyone. This video is

0:04

a sort of tutorial in which I go over

0:07

how to implement a control plane from

0:09

scratch. I think this will be part one

0:12

of a series.

0:14

This part will be quite a

0:17

basic implementation

0:20

and it's not something that I would

0:21

deploy to production. It's just

0:23

something that should help people gain

0:27

an understanding if they follow along

0:29

and do it themselves

0:32

so that you can understand more about

0:33

Envoy proxy and how a control

0:36

plane interacts with it. The video has

0:39

been sliced up into chapters so you can

0:42

skip to certain parts. Parts of the

0:45

video have been sped up; otherwise it

0:48

would be roughly 2 hours. As it stands

0:51

now, the video is about one hour, and

0:53

you could probably speed it up as well,

0:56

depending on what pace you like.

0:59

I'm not sure how many of these types of

1:02

videos I'll do. It's not really

1:04

something that I planned to do. I

1:06

planned more to share knowledge,

1:08

not necessarily do tutorials,

1:11

but I might just, uh, do part one, two,

1:14

and three of this, see how it goes. And

1:18

if it's received positively, then

1:22

I'll possibly continue doing stuff

1:25

like this. So this video is not really

1:28

system design. I know a lot of you

1:31

have been requesting stuff related

1:33

to system design. I might have to think

1:35

about how exactly I'm going to do that.

1:38

Um, I have some ideas, but this is just

1:40

essentially like writing code by hand,

1:43

uh, and walking through like a worked

1:46

example of a control plane.

1:48

I do mention at some point that you

1:50

could probably get this task done with

1:51

an LLM fairly quickly, but you have to

1:54

know what you want before you can

1:56

actually ask the LLM to create it for

1:58

you.

2:00

Uh, so just keep that in mind.

2:03

Anyway,

2:05

uh on to the video and uh hope you

2:07

enjoy.

2:10

All right,

2:12

the video that I have been talking about

2:16

is here.

2:18

We're making a control plane from

2:20

scratch in Python. The first thing we're

2:23

doing is creating a project using the uv

2:28

tooling to add dependencies such as

2:32

FastAPI and uvicorn. That's a web

2:37

framework and a web server to run the

2:40

web framework.

2:42

Just opening a file.

2:45

Going to put down some boilerplate to

2:48

create a FastAPI server here.

2:51

And the first endpoint that we're going

2:53

to put is just going to be something

2:56

that sort of pretends to be a control

2:59

plane. It's just going to answer on a

3:02

particular route with anything that

3:06

comes in with a POST method

3:11

and it's going to respond to that with

3:14

uh just an empty dictionary for the

3:16

moment.

3:18

So we'll create that and it's going to

3:20

be called clusters

3:23

because

3:24

we're just going to pretend to support

3:29

provisioning, or generating, cluster

3:32

configuration for Envoy,

3:34

and then we'll later expand that to be

3:37

a bit more generic.

3:45

So here we're just going to print out

3:46

the JSON that we receive from the

3:48

Envoy proxy just so we have something to

3:51

look at.

3:54

And then we're just going to give back

3:56

that dictionary that's empty for the

3:59

moment.
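
As a rough sketch of this first step, the handler below only logs the request body and hands back an empty dictionary. It is written as a plain function so it runs with the standard library alone; in the video the same logic sits behind a FastAPI POST route. The function name and the path constant are assumptions for illustration (the path follows Envoy's REST discovery convention), not something shown in the transcript.

```python
import json

# Assumed route: Envoy's REST cluster discovery POSTs to this path.
CLUSTERS_PATH = "/v3/discovery:clusters"


def clusters(request_body: dict) -> dict:
    """Pretend control plane: log what Envoy sent, answer with nothing."""
    # Print the discovery request so there is something to look at.
    print(json.dumps(request_body, indent=2))
    # Empty dictionary for the moment; a real response shape comes later.
    return {}
```

Registering this function on a POST route at `CLUSTERS_PATH` is all the web framework needs to do at this stage.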

4:05

Now

4:08

it's good to stop here and just to um

4:10

try to get an actual envoy proxy

4:12

running. The reason for that is we want

4:15

to test that even the most basic bare

4:19

minimum

4:20

thing uh still works.

4:24

So we'll be able to see it hit

4:26

that clusters discovery endpoint and

4:30

it'll get back nothing. But that's kind

4:32

of what we want to see. So right now I'm

4:34

looking through GitHub

4:36

looking for some examples in the envoy

4:39

repository for uh envoy configuration.

4:42

This is called the bootstrap.

4:45

And in there we've got some stuff like

4:47

the static resources, dynamic resources,

4:51

the node, which is information about the

4:53

proxy, and the admin interface at the

4:56

bottom there. Just going to go through

4:58

and rename some of these

5:02

uh cluster details because we're going

5:05

to be accessing

5:07

we're going to be hitting

5:10

the server running on our local

5:11

computer. But if you were say in a

5:15

production environment, you'd probably

5:17

want this to be using DNS to find the

5:22

control plane.

5:24

One thing that's going to be different

5:25

about this control plane that we're

5:27

building is that it's not going to use

5:28

gRPC,

5:31

at least not this version. I wanted to

5:33

do this one with just REST and JSON just

5:35

to show you

5:37

sort of the real

5:41

basic version so that you can get an

5:43

idea of how stuff works and then later

5:45

on you can use gRPC, and I will

5:50

probably do a second part to this where

5:53

I use a Golang library which includes

5:57

compiled protocol buffers to then

6:01

construct all of the configuration that

6:03

comes through. So right now

6:07

I've got the envoy documentation on the

6:09

right hand side and I'm basically just

6:12

using it to figure out what fields are

6:16

available in each part of the

6:17

configuration. So we've got this dynamic

6:20

resources which you can specify a CDS

6:24

config for. That's cluster discovery

6:26

service, and that is a config source, or

6:31

rather it accepts a config source. So in

6:33

here we need to type in

6:36

where is the config source coming from.

6:38

It could be a file on the disk or it

6:42

could be uh from an API. And we're going

6:46

to go with an API here.

7:02

So, we're just going to type that out.

7:04

And we're going to make it a

7:07

REST API.

7:10

So Envoy knows not to use gRPC.

7:15

So, we're going to tell it that the type

7:16

is REST.

7:18

And then we're

7:20

going to specify the cluster name which

7:22

is just this static resource below

7:25

called control plane, which is pointing

7:27

at the localhost IP address, on

7:32

port 8050.
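
To make the bootstrap described above concrete, here is a minimal sketch of it as a Python dictionary that can be dumped to JSON (Envoy accepts a JSON bootstrap file as well as YAML). The node id, node cluster, refresh delay, connect timeout, and admin port are placeholder assumptions; the `control_plane` cluster name, the localhost address, and port 8050 are taken from the narration.

```python
import json

CONTROL_PLANE_HOST = "127.0.0.1"  # localhost, as in the video
CONTROL_PLANE_PORT = 8050         # port mentioned in the narration


def make_bootstrap() -> dict:
    """Build a bare-minimum bootstrap that polls a REST control plane for clusters."""
    return {
        # Information about this proxy (placeholder values).
        "node": {"id": "envoy-1", "cluster": "demo"},
        "dynamic_resources": {
            "cds_config": {
                "resource_api_version": "V3",
                "api_config_source": {
                    "api_type": "REST",  # tells Envoy not to use gRPC
                    "transport_api_version": "V3",
                    "cluster_names": ["control_plane"],
                    "refresh_delay": "5s",  # assumed polling interval
                },
            }
        },
        "static_resources": {
            "clusters": [
                {
                    "name": "control_plane",
                    "type": "STATIC",
                    "connect_timeout": "1s",  # assumed value
                    "load_assignment": {
                        "cluster_name": "control_plane",
                        "endpoints": [
                            {
                                "lb_endpoints": [
                                    {
                                        "endpoint": {
                                            "address": {
                                                "socket_address": {
                                                    "address": CONTROL_PLANE_HOST,
                                                    "port_value": CONTROL_PLANE_PORT,
                                                }
                                            }
                                        }
                                    }
                                ]
                            }
                        ],
                    },
                }
            ]
        },
        # Admin interface; the port is a placeholder.
        "admin": {
            "address": {"socket_address": {"address": "0.0.0.0", "port_value": 9901}}
        },
    }


if __name__ == "__main__":
    print(json.dumps(make_bootstrap(), indent=2))
```

Whether `127.0.0.1` reaches the control plane from inside a container depends on how Docker networking is set up, so treat the address as something to adjust.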

7:36

And so now that we have that, that

7:39

should be the bare minimum we need just

7:41

to run Envoy

7:44

and to run the control plane and to just

7:47

get it doing some really basic stuff. So

7:51

I'm going to try to use Docker to

7:54

start this envoy container

7:58

and then I'm going to run envoy with

8:00

that config file. And we can see it's

8:03

booted up and it gave us a message that

8:06

says that discovery clusters failed.

8:10

That's because we aren't currently

8:11

running the control plane. So we'll do

8:13

that now. Now it's listening on port

8:15

8050. We see in the envoy logs it says

8:19

that it added and updated a cluster. And

8:22

we also see the request come through

8:24

which was printed out by our very basic

8:27

minimal uh control plane

8:31

at the moment.

8:33

I'm going to set up a little Makefile

8:34

just to make it a little bit less

8:36

tedious to run the envoy container.

8:47

And then uh

8:51

now we can maybe look through these

8:54

logs.

8:56

You'll notice that the discovery request

8:58

is really long. Most of these are

9:02

extensions and it's the proxy telling us

9:05

what extensions it supports. That's kind

9:07

of irrelevant, but this highlighted

9:09

information here is a little bit more

9:12

relevant. It tells us the ID, the

9:14

cluster, and a few other pieces of

9:15

information.

9:19

Um, you don't have to pause and look at

9:21

that. It's just kind of uh

9:25

something you'll get familiar with over

9:28

time. Maybe you have lots of different

9:29

proxies and

9:32

they will have different IDs and

9:37

clusters and whatnot.

9:40

And maybe you'll program your control

9:42

plane to

9:45

treat them differently somehow, or

9:46

give different config to different

9:48

proxies.
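
One way this could look, as a sketch: pull the node id and cluster out of the discovery request and use them to pick a configuration profile. The `node.id` and `node.cluster` keys match what the proxy reports about itself; the profile mapping and both function names are hypothetical.

```python
def node_identity(request: dict) -> tuple[str, str]:
    """Return (id, cluster) from the node section of a discovery request."""
    node = request.get("node", {})
    return node.get("id", ""), node.get("cluster", "")


def select_profile(request: dict, profiles: dict[str, str], default: str = "default") -> str:
    """Pick a config profile by the proxy's cluster (hypothetical mapping)."""
    _, cluster = node_identity(request)
    return profiles.get(cluster, default)
```

For example, `select_profile(request, {"edge": "ingress-config"})` would give edge proxies their own profile and everything else the default.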

9:52

So now we're going to put the response

9:54

of the control plane into a shape that's

9:57

more what Envoy expects and this is a

10:00

discovery response

10:02

and it's going to have a version_info,

10:05

which is just

10:07

a version that Envoy is going to take,

10:10

and then a list of the resources in the

10:12

response. So if we look on the right

10:15

hand side, I'm just checking that

10:19

the fields that I'm using are correct.

10:22

It's got the types of the fields there as

10:24

well
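
The response shape described here is small enough to sketch as a helper. The two keys, `version_info` and `resources`, are the fields named in the narration; the helper name and the placeholder version string are assumptions.

```python
def discovery_response(resources: list, version: str = "0") -> dict:
    """Wrap a list of resources in the shape Envoy expects back."""
    return {
        "version_info": version,  # opaque version string that Envoy records
        "resources": resources,   # list of resource objects, possibly empty
    }
```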

10:31

and then I think the next thing we want

10:32

to do is to

10:35

actually hand back some real resources

10:37

so we'll actually give this proxy

10:40

a cluster

10:43

at least one.

10:46

And uh in order to do that, we're

10:48

probably going to have to look at the

10:51

config again. Uh sorry, the

10:52

documentation again. We'll look at the

10:55

cluster and see what kind of fields are

10:57

available there.

10:59

Um I know

11:02

some of these off by heart. For example,

11:03

I know to set the type to static

11:06

because the endpoints that I'm about to

11:09

write are just going to contain

11:14

a static IP address. It's not going

11:17

to be a DNS name. This config

11:20

is pretty deeply nested and tedious. So,

11:23

it's a good idea to write abstractions

11:26

to produce these. For example,

11:30

you could just feed in a set of

11:32

addresses and ports and generate

11:35

those endpoints there.

11:38

The endpoints can also contain

11:41

different metadata and locality

11:44

information. Locality means the

11:47

zone or the region that it resides in.

11:49

And then you can use that information to

11:51

do some uh pretty clever routing of

11:55

traffic.

11:57

The cluster, when we're operating in JSON

12:00

mode, also needs a type URL so that Envoy

12:04

knows how to deserialize that JSON

12:06

into one of the protobuf objects

12:10

that it sort of knows about.
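
A small sketch of the abstraction suggested above: feed in a name plus a list of addresses and ports, and get back a static cluster with its type URL attached. The helper names and the connect timeout value are assumptions; the field layout follows Envoy's v3 cluster message.

```python
CLUSTER_TYPE_URL = "type.googleapis.com/envoy.config.cluster.v3.Cluster"


def make_endpoints(hosts: list[tuple[str, int]]) -> list[dict]:
    """Turn (address, port) pairs into the nested endpoints structure."""
    return [
        {
            "lb_endpoints": [
                {
                    "endpoint": {
                        "address": {
                            "socket_address": {"address": addr, "port_value": port}
                        }
                    }
                }
                for addr, port in hosts
            ]
        }
    ]


def make_cluster(name: str, hosts: list[tuple[str, int]]) -> dict:
    """Build a STATIC cluster; the '@type' key lets Envoy deserialize the JSON."""
    return {
        "@type": CLUSTER_TYPE_URL,
        "name": name,
        "type": "STATIC",  # endpoints are plain IP addresses, not DNS names
        "connect_timeout": "1s",  # assumed value
        "load_assignment": {
            "cluster_name": name,
            "endpoints": make_endpoints(hosts),
        },
    }
```

Calling `make_cluster("example", [("127.0.0.1", 8085)])` hides the deep nesting behind one line, which is the point of writing the abstraction.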

12:15

And then

12:17

we're actually going to do

12:20

we're going to add a little endpoint

12:22

also, a GET endpoint, to this control

12:26

plane. It's not really something you

12:28

would need in a control plane, but we're

12:30

just going to make it pretend like it is

12:32

a backend.

12:34

So that when envoy proxies,

12:38

basically we're giving Envoy something

12:39

to proxy to just to show that it is

12:43

accepting requests, sending them to a

12:45

backend, that backend is responding, and

12:48

then that response is being uh given

12:50

back to the client. And in this case,

12:53

it's maybe a bit of a weird setup where

12:55

we've got the control plane acting as a

12:59

control plane and a back end, but just

13:01

bear with me on that. And if that's

13:03

confusing, just uh let me know and I'll

13:06

help you out.

13:10

So now we're going to run them both and

13:12

we're going to check out

13:15

uh we're going to look to see

13:18

whether Envoy has accepted that cluster.

13:20

And it looks like it has. It says

13:25

add update.

13:27

Oh, it actually says add update zero

13:29

clusters.

13:45

Oh, no. Actually, it says we've added

13:47

one cluster there.

13:51

Um, one thing to note is that it is

13:53

saying that it added and removed a

13:55

cluster every single time it hits the

13:57

control plane. We're going to fix that

13:59

much later when we introduce versioning.

14:02

So now

14:05

it has that cluster. We're going to look

14:08

at the admin interface

14:10

to see what it has there. And we can see

14:13

this example that we've um given. It has

14:16

this little metric that says that it's

14:18

added via the API. And we've got a

14:21

little bit of other information about

14:22

how many requests it's received and

14:24

connections and all this other kind of

14:26

stuff.

14:29

We can also look at the config dump in

14:32

the admin interface.

14:34

And that's going to show us uh that

14:37

we've got some statically uh configured

14:41

clusters, but we also have a dynamically

14:43

configured cluster. So that's telling us

14:45

that the control plane is indeed giving

14:48

that resource to this proxy.

14:55

The next thing we're going to want to do

14:57

is to probably make this setup a little

15:01

bit more generic. So rather than having

15:07

well it depends on how you want to do

15:09

things but the way I'm going to do them

15:10

is I'm going to make it so that when

15:12

Envoy makes a request for a discovery

15:17

whatever the resource type is we're

15:18

going to take that and we're going to

15:20

check what it is before we decide what

15:22

to do.

15:25

So, we've basically copied this clusters

15:27

endpoint and made it resources, which is

15:31

I suppose, more generic. And then we're

15:33

going to check: if the resource

15:35

type is clusters, then we're just going

15:37

to do what we were doing before, which

15:38

is handing back those statically defined

15:42

clusters.

15:46

And otherwise, if it's uh something like

15:49

listeners, we're going to do something

15:51

slightly different. We'll hand back

15:52

different resources.
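
The naive dispatch described here could be sketched as follows. The `resource_type` argument stands for whatever follows `discovery:` in the request path; the example cluster is trimmed down, and the function name is hypothetical.

```python
# Trimmed-down example cluster, standing in for the statically defined one.
EXAMPLE_CLUSTER = {
    "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
    "name": "example",
    "type": "STATIC",
}


def handle_discovery(resource_type: str, request: dict) -> dict:
    """Check the requested resource type and hand back matching resources."""
    if resource_type == "clusters":
        resources = [EXAMPLE_CLUSTER]
    elif resource_type == "listeners":
        resources = []  # to be filled in once a listener is defined
    else:
        resources = []  # unknown types get an empty answer in this sketch
    return {"version_info": "0", "resources": resources}
```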

15:55

This code is pretty naive. Uh, but we're

16:00

just doing it this way so that we can

16:02

get something up quick.

16:06

And just to demonstrate

16:10

uh the interaction,

16:12

you can you can optimize this however

16:14

you like. Um,

16:19

the way I've done this in Sovereign, the

16:21

open source repo that I've talked

16:23

about in the past, is

16:31

the control plane fetches lots of

16:35

different data sources

16:38

and then it aggregates them into one big

16:40

thing

16:41

and then it gives that data to a set of

16:44

templates, and then

16:48

whatever that template generates, it's

16:50

tied one to one to a resource type, and

16:54

that is how that's made to be generic,

16:59

whereas here we're just using I guess a

17:01

bunch of if statements to do that which

17:05

is okay when you're just

17:08

researching and experimenting with how this

17:10

stuff works, which is what we're doing.

17:14

So in a second here, we're going to need

17:16

to add a listener. The reason why we

17:17

need to add a listener is because

17:29

we need Envoy to accept

17:31

traffic. It's fine for it to have

17:34

clusters. That means that it has

17:38

things it can send traffic to, but it

17:40

has nothing on the receiving end. So, we

17:42

need to tell it what port and address to

17:45

listen for traffic on and then how to

17:48

handle the traffic that lands on that uh

17:52

address and port. So, the one that we're

17:55

going to build here is going to be

17:58

something listening on

18:01

either localhost or every address, and

18:06

it's going to contain

18:09

what's called a filter chain uh in

18:12

envoy. That means

18:15

it's basically a container for

18:21

specifying a bunch of filters

18:24

to handle traffic. It's called a filter

18:27

chain because there's a chain of filters

18:29

inside it.

18:31

That might not be the actual reason,

18:33

but that's the way you can think about

18:35

it. Some examples of filters would

18:39

be like handling TLS and then handling

18:41

HTTP

18:43

handling compression, rate limiting,

18:48

authentication, and routing. Let's

18:51

just say that's an example chain of

18:53

filters.

18:54

But our filter chain is going to be

18:56

really super basic. We're going to have

18:59

one filter chain. We're going to have

19:01

one filter in there. And it's just going

19:03

to be a HTTP connection manager.

19:08

And

19:10

that HTTP connection manager is also

19:12

going to contain HTTP filters. And

19:16

there will only be one which will be the

19:20

router filter.

19:22

Another

19:24

thing that the HTTP connection manager

19:27

or HCM

19:29

will have is either

19:33

an inline definition of routes or

19:36

route configuration which contains

19:38

virtual hosts which themselves contain

19:42

routes

19:43

or it can also contain a reference to

19:49

call the control plane

19:52

for the routes. I might

19:55

demonstrate that a little later in

19:58

this video. And the reason why you would

20:01

want to externalize the routes from the

20:04

listener is

20:06

maybe complicated, but I'll go ahead

20:09

and try to explain it. Essentially,

20:11

the reason why you'd do that is

20:16

because if the control plane is serving

20:18

the listeners and routes separately,

20:21

what that means is that it can just

20:24

deliver the routes to the proxy. And

20:27

when Envoy only has to reload the

20:29

route table, that's quite efficient

20:32

compared to reloading a listener. The

20:35

reason for that is because when envoy

20:37

receives a new listener,

20:40

it has to put the old listener into

20:43

draining mode. Or, well, actually, at first

20:46

it creates a new listener and starts

20:48

accepting requests on the

20:51

new one, and then it puts the old one

20:53

into draining mode but it keeps it

20:55

around while those requests are

20:57

draining. So you generally don't want to

20:59

do that often. It can be slightly

21:02

disruptive,

21:04

whereas just reloading routes is

21:08

quite a bit nicer.
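
For comparison with an inline route definition, here is a sketch of what the externalized variant might look like: the HTTP connection manager carries an `rds` block that names a route configuration and says where to fetch it. The `stat_prefix`, refresh delay, function name, and default cluster name are assumptions; the field names follow Envoy's v3 API.

```python
HCM_TYPE_URL = (
    "type.googleapis.com/envoy.extensions.filters.network."
    "http_connection_manager.v3.HttpConnectionManager"
)
ROUTER_TYPE_URL = "type.googleapis.com/envoy.extensions.filters.http.router.v3.Router"


def hcm_with_rds(route_config_name: str, control_plane_cluster: str = "control_plane") -> dict:
    """HTTP connection manager config that fetches its routes from the control plane."""
    return {
        "@type": HCM_TYPE_URL,
        "stat_prefix": "ingress_http",  # assumed prefix for stats
        "rds": {
            # Must match the name of a route configuration the control plane serves.
            "route_config_name": route_config_name,
            "config_source": {
                "resource_api_version": "V3",
                "api_config_source": {
                    "api_type": "REST",
                    "transport_api_version": "V3",
                    "cluster_names": [control_plane_cluster],
                    "refresh_delay": "5s",  # assumed polling interval
                },
            },
        },
        "http_filters": [
            {"name": "envoy.filters.http.router", "typed_config": {"@type": ROUTER_TYPE_URL}}
        ],
    }
```

With this shape the listener itself rarely changes, so route updates do not force a listener drain.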

21:11

So I'm currently writing the

21:14

virtual host and the route config out

21:17

inline, and I'll externalize that a

21:19

bit later. Just going to continue doing

21:22

that.

21:23

And then um after we've got this route

21:26

configuration set up, what we're going

21:29

to do is um just check the admin

21:32

interface again to see whether Envoy has

21:38

uh received these resources and accepted

21:41

them.

21:45

Right now I've got one

21:47

virtual host there called example VH, and

21:50

it has an asterisk in the domains.

21:54

That means that when Envoy receives any

21:57

HTTP traffic,

22:00

it's going to match against every host

22:02

header.

22:04

So unless

22:07

there's one that's more specific than it,

22:10

the traffic will land on this

22:11

virtual host. We're only going to have

22:13

one anyway. So we just put an asterisk

22:16

there. And we're also setting up some

22:18

routes where we can

22:22

name the route. We can add

22:25

match criteria. You can use regex, you

22:28

can use prefixes, whatever.

22:32

And then you've got a decision of how

22:34

to handle the traffic. You could do a

22:36

redirect. You could do a direct

22:39

response.

22:40

In this circumstance, we're just going

22:43

to route to the cluster. And the

22:45

cluster's name is example. And that's

22:47

the cluster that we were returning from

22:50

the control plane a little bit

22:53

earlier.
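
The route configuration just described can be sketched as a small builder: one virtual host matching every host header, with a single named route that forwards everything to a cluster. The virtual host name `example_vh`, the route name, the `/` prefix, and the function name are assumptions; the cluster name `example` comes from the narration.

```python
def make_route_config(name: str, cluster: str) -> dict:
    """Route configuration with one catch-all virtual host and one route."""
    return {
        "name": name,
        "virtual_hosts": [
            {
                "name": "example_vh",
                "domains": ["*"],  # asterisk: match any host header
                "routes": [
                    {
                        "name": "to_example",         # routes can be named
                        "match": {"prefix": "/"},     # prefix match on the path
                        "route": {"cluster": cluster},  # forward to this cluster
                    }
                ],
            }
        ],
    }
```

A redirect or a direct response would replace the `route` key on an individual route entry; this sketch only shows the forwarding case.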

22:57

So now we'll run the proxy again

23:02

and the control plane as well.

23:05

And we're just going to take a little

23:06

look at the admin interface, see what's

23:09

in there.

23:12

And while we're at it, we'll probably

23:14

just send a request over to Envoy just

23:16

to confirm that this listener and route

23:18

config works and that it sends the

23:21

traffic to the cluster.

23:24

We can see that it

23:27

requested the clusters, but I forgot to

23:29

put in the LDS config here. So, it never

23:34

requested the listeners. So, I'll just

23:36

fix that up real quick. We'll run the

23:38

proxy again

23:40

and we should see it request both

23:42

clusters and listeners, and so it

23:46

does.

23:49

And now it's going to keep requesting

23:52

those because it polls them on

23:54

a schedule. But we're just going to take

23:57

a little look see um

24:01

do the config dump again. We can see the

24:03

virtual host is in there.

24:06

It's showing up as a static route

24:08

config, but the listener shows up as a

24:10

dynamic listener.

24:14

We can also take a look at the listener.

24:16

We can confirm that it's listening on

24:17

port 8080.

24:21

So, now is probably a good time to hit

24:23

port 8080 and see what we get back.

24:28

And

24:30

it's kind of unusual. We're running on

24:32

local host, but it's taking a while. I

24:35

didn't expect it to take a while.

24:40

It's kind of weird. It's kind of hanging

24:41

there, which is not what we expect.

24:46

Let's look at the config again. Maybe

24:49

we're missing something in the API.

24:53

It's definitely listening on port 8080.

24:55

I just checked the listening ports

24:57

there. It definitely doesn't have anything

24:59

to do with DNS, but it still seems to

25:03

just be kind of hanging there. It's kind

25:06

of interesting. Maybe there's something

25:07

wrong with the address it's listening

25:08

on.

25:10

We'll just restart the control plane and

25:13

have I got this new listener, but still

25:15

it's not quite working. So

25:19

maybe I missed this field, this connect

25:21

timeout field. I'm not really too sure.

25:24

Let's try to add a new route there.

25:28

Maybe we add a static response. Maybe

25:30

there's something wrong with how it's

25:31

proxying. I'm not exactly sure

25:33

what's going on here. Just check the

25:36

docs. Make sure I'm putting the right

25:38

config there. Restarted the control

25:41

plane. We can see Envoy's got the config.

25:44

Are we running the Docker image in the

25:47

right way? I think we are.

25:52

Yeah, looks okay.

25:55

But it's still kind of hanging

25:57

when the request comes in. That's kind

25:58

of odd. But we can reach the

26:02

admin interface. That's kind of strange.

26:06

Maybe it's the port number. Maybe my

26:08

operating system is doing something

26:09

weird with the port number. But port

26:10

8082 also has the same thing. Maybe

26:14

let's

26:16

let's see. Uh let's do a little packet

26:18

trace. Let's just confirm: are the

26:22

requests actually making it to that

26:24

machine? And we can see maybe you have

26:28

to squint a little bit, but it looks

26:29

like the traffic is going back and

26:32

forth.

26:33

So, I'm quite perplexed here.

26:36

What could be wrong?

26:39

Has anyone guessed it yet?

26:43

Oh, we'll look over the config again. Is

26:44

there anything we've forgotten here? Oh,

26:47

of course.

26:49

There's no HTTP filter for the router.

26:53

So, we're going to look through the docs

26:54

to figure out what we need to add there.

27:02

We need uh one of these typed

27:04

configurations

27:06

for router.

27:09

And it's a pretty easy filter to set up.

27:11

It has literally no config. We can just

27:15

copy paste that type URL there.

27:19

and uh place that there and then

27:24

stuff should work.

27:28

Just cross our fingers

27:32

and hope that it actually works this

27:34

time. So yeah, pretty easy. Just paste

27:38

that there.
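
Putting the fix in context, here is a sketch of a complete listener with the piece that was missing: the router entry under `http_filters`, which carries nothing but its type URL. The `stat_prefix`, the listen address, and the function name are assumptions; the type URLs and filter names follow Envoy's v3 API.

```python
LISTENER_TYPE_URL = "type.googleapis.com/envoy.config.listener.v3.Listener"
HCM_TYPE_URL = (
    "type.googleapis.com/envoy.extensions.filters.network."
    "http_connection_manager.v3.HttpConnectionManager"
)
ROUTER_TYPE_URL = "type.googleapis.com/envoy.extensions.filters.http.router.v3.Router"


def make_listener(name: str, port: int, route_config: dict) -> dict:
    """Listener with one filter chain, one HCM filter, and the router HTTP filter."""
    return {
        "@type": LISTENER_TYPE_URL,
        "name": name,
        "address": {"socket_address": {"address": "0.0.0.0", "port_value": port}},
        "filter_chains": [
            {
                "filters": [
                    {
                        "name": "envoy.filters.network.http_connection_manager",
                        "typed_config": {
                            "@type": HCM_TYPE_URL,
                            "stat_prefix": "ingress_http",  # assumed prefix
                            "route_config": route_config,   # inline routes
                            # Without this entry requests hang, as seen above.
                            "http_filters": [
                                {
                                    "name": "envoy.filters.http.router",
                                    "typed_config": {"@type": ROUTER_TYPE_URL},
                                }
                            ],
                        },
                    }
                ]
            }
        ],
    }
```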

27:40

When you're working with protobufs, you

27:43

don't have to specify the type URLs, but

27:46

like I said, we're doing things

27:51

maybe the hard way. Also going to add

27:53

another make target just to make it

27:56

easier to run the control plane. I don't

27:58

know if it's actually going to make it

27:59

easier.

28:01

I still have to type it out, but

28:04

we've got our control plane running

28:05

again. Going to start up envoy.

28:08

It's going to get all its resources as

28:12

it was before. But this time

28:15

we should be able to see

28:20

we should be able to send a request to

28:22

the proxy and it will then

28:26

hit the uh get endpoint on the control

28:30

plane server that we've built. And it

28:33

should say

28:35

oh well let's just test the static

28:39

response. And we actually get a response

28:40

back.

28:43

But that's coming directly from Envoy.

28:45

That's not uh from

28:48

the server. But that one there, the

28:50

greeting is from our server. We can see

28:54

we're making a request to port 8082,

28:57

which is envoy. It is then sending that

29:00

to the example cluster on port 8085,

29:04

which is the control plane that's

29:06

running.

29:08

So that's pretty neat.

29:15

I suppose to uh recap, we've got our

29:19

resources API endpoint.

29:23

We have our uh greeting

29:27

endpoint.

29:29

In a normal situation, this would be on

29:31

two separate servers, but we've just

29:32

bundled them into the one just to make

29:35

it easier to run these.

29:40

I suppose if you wanted to make

29:41

something slightly more complicated

29:43

with fake backends and whatnot, then you

29:45

might use something like Docker Compose

29:48

and just start up a bunch of different

29:50

containers. And um then you can also use

29:53

DNS in that situation because the Docker

29:56

network will provide um DNS for the

29:59

different container names. So that might

30:02

be

30:04

nice. But yeah, we've got this um

30:08

cluster that we're creating.

30:11

We created a listener. We created the

30:12

route configuration

30:17

and we're providing those back using

30:20

this if statement sort of setup that we

30:22

have that checks the different resource

30:25

types. We can definitely make that

30:26

better. Uh just use your imagination

30:30

really.

30:32

The next thing we probably want to do

30:37

is to

30:40

use a control plane for what it's really

30:42

meant to be used for. What we've set up

30:45

here is kind of useless because

30:51

you might as well just provide the

30:52

config directly to envoy from a file. So

30:57

what we're going to do instead is we're

30:58

going to create this file on the

31:01

disk and we're just going to pretend

31:03

like that file is an external API.

31:07

You can just imagine this fetch external

31:09

data function does something that hits

31:13

something remote from the control plane

31:16

where perhaps it's got data that's

31:18

changing all the time

31:20

from some other system.

31:23

And we're going to pretend that that's

31:24

uh dynamic data.

31:28

And then every time Envoy requests its

31:31

resources, we're going to fetch that

31:33

data. And depending on what's in there,

31:36

we're going to give back different

31:38

configuration.
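
A minimal sketch of that stand-in: the function just reads a JSON file on every call, which is enough to pretend the data comes from a remote system. The file name is hypothetical; the function name mirrors the one mentioned in the narration.

```python
import json
from pathlib import Path

# Hypothetical file name standing in for an external API.
DATA_FILE = Path("external_data.json")


def fetch_external_data(path: Path = DATA_FILE) -> list:
    """Read the 'remote' data fresh on each call so edits show up on the next poll."""
    with path.open() as f:
        return json.load(f)
```

Because the file is re-read each time, editing it while Envoy is polling changes what the next discovery response contains.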

31:39

That is the true power of the control

31:42

plane. And I probably wouldn't create

31:45

one um unless you planned to do

31:49

something like that.

31:52

So, I'm just going to sort of wing this

31:54

dynamic data. We're just going

31:57

to come up with some random

32:00

configuration schema

32:04

where we're going to have a list of

32:05

objects and we're going to specify the

32:09

type of the objects. Maybe

32:12

we're going to configure the listeners

32:13

dynamically. We're going to configure

32:15

some clusters and routes.

32:20

So, we'll just um

32:23

go ahead and set that up. Maybe have a

32:26

think about what exact structure we

32:29

want, what's going to be useful.

32:34

It's something that probably needs a

32:36

good amount of thought. You don't want

32:38

to get too bogged down, but

32:41

you don't want to have a schema that is

32:43

difficult to change. That can be a

32:47

significant pain in the future.

32:50

So I'm deciding here I'm going to have a

32:52

listener type. I'm also going to have a

32:54

service type that doesn't really tie to

32:57

any envoy resource. We're actually going

33:00

to pretend this service type is two

33:03

resources. It's going to abstract over

33:05

two.

33:06

One of which is going to be clusters and

33:08

the other which is going to be let's say

33:10

a virtual host.

33:13

So, we're going to put in information

33:15

that might be needed in both those

33:17

resources in the one place.

33:20

And then when the proxy requests this

33:23

information, we're going to

33:26

somehow

33:28

churn out configuration

33:32

to then serve those two distinct

33:34

resources

33:36

but from a single sort of um data source

33:41

there.
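
To illustrate the idea of one service entry feeding two Envoy resources, here is a sketch with an invented schema. Every key in `EXTERNAL_DATA` (`type`, `name`, `domains`, `hosts`, `port`) and both function names are hypothetical; the video's actual schema is not shown in the transcript.

```python
# Invented example data: one listener object and one service object.
EXTERNAL_DATA = [
    {"type": "listener", "name": "http", "port": 8080},
    {
        "type": "service",
        "name": "example",
        "domains": ["*"],
        "hosts": [{"address": "127.0.0.1", "port": 8085}],
    },
]


def _services(data: list) -> list:
    """Keep only the objects of the 'service' type."""
    return [obj for obj in data if obj.get("type") == "service"]


def clusters_from_services(data: list) -> list:
    """One cluster per service, built from the service's hosts."""
    return [
        {
            "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
            "name": svc["name"],
            "type": "STATIC",
            "connect_timeout": "1s",  # assumed value
            "load_assignment": {
                "cluster_name": svc["name"],
                "endpoints": [{
                    "lb_endpoints": [
                        {"endpoint": {"address": {"socket_address": {
                            "address": h["address"], "port_value": h["port"]}}}}
                        for h in svc["hosts"]
                    ]
                }],
            },
        }
        for svc in _services(data)
    ]


def virtual_hosts_from_services(data: list) -> list:
    """One virtual host per service, routing to the cluster of the same name."""
    return [
        {
            "name": f"{svc['name']}_vh",
            "domains": svc["domains"],
            "routes": [{"match": {"prefix": "/"}, "route": {"cluster": svc["name"]}}],
        }
        for svc in _services(data)
    ]
```

Deriving both the cluster name and the route target from the same `name` field keeps the two resources aligned by construction.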

33:43

And of course, in a real

33:45

scenario you might have a bunch of

33:46

services. In a multi-tenant scenario

33:52

you could separate them by any real

33:55

boundary like maybe you have ingress,

33:58

maybe you have egress, maybe you have

34:00

something else.

34:03

Um,

34:05

you know, you could use this

34:07

to

34:09

apply

34:11

all kinds of logic like

34:14

IP restrictions or security or

34:17

role-based access control or

34:20

really

34:22

it's totally up to your imagination

34:24

how you use this stuff.

34:26

And that's kind of

34:30

I suppose one of the benefits to

34:33

making your own control plane versus

34:34

getting something off the shelf.

34:37

Uh I personally don't know too much

34:39

about the ones that are off the shelf, or

34:40

what they provide. I know Istio is one,

34:43

but I don't know much

34:45

about it.

34:47

But anyway, um now that we have this

34:49

sort of external data um shape,

34:53

we want to change the code to actually

34:56

use it.

34:59

So, we're going to have to go through

35:02

and edit all these usages where it's got

35:05

hardcoded stuff, and instead it's going

35:08

to

35:10

be based off

35:12

the

35:14

uh JSON file that we've created. So,

35:16

we're probably going to

35:20

turn some of our existing code into

35:22

functions, and those functions will

35:24

accept some of the data that comes in

35:26

from that JSON file.

35:29

You can imagine in a production

35:31

scenario, you might have quite rich API

35:35

data that's fed into your control plane.

35:38

And so, you have to be a little bit

35:40

smart about how you do this. You

35:42

don't want to, you know, if you've got a

35:45

thousand entries in your JSON

35:48

file, then you don't want your control

35:50

plane to be generating this stuff every

35:54

single time. So,

35:56

you want to probably have versioning,

36:01

some kind of a mechanism to determine

36:04

the version

36:06

and use that to do

36:10

some sort of caching. I would strongly

36:13

suggest that.

36:15

In the go-control-plane that's made

36:18

by Lyft,

36:19

they provide a cache implementation

36:23

that would probably suit most people. Um

36:26

I'll probably go into that in the second

36:28

part of this series.
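
As a rough illustration of the versioning-plus-caching idea (not the go-control-plane cache, which is a separate and more complete mechanism), the sketch below derives a version from a hash of the input data and only regenerates resources when that hash changes. Both function names and the single-entry cache are assumptions.

```python
import hashlib
import json

# Single-entry cache: maps the current version to its generated resources.
_cache: dict = {}


def compute_version(data) -> str:
    """Derive a stable version string from the content of the input data."""
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


def cached_resources(data, build):
    """Return (version, resources), calling build(data) only when data changed."""
    version = compute_version(data)
    if version not in _cache:
        _cache.clear()  # drop the stale entry; keep one version at a time
        _cache[version] = build(data)
    return version, _cache[version]
```

The returned version can be used as the `version_info` of the response, so identical input data always produces the same version.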

36:33

So we've got some functions now where

36:35

we're going to take in these services,

36:37

this abstraction that we've created

36:41

and we're going to

36:44

generate the route configuration and

36:46

the clusters from the service.

36:49

And then separately for any object that

36:52

is a listener type, we're going to

36:53

create some listeners out of that.

37:00

And this is just the tedious part of

37:02

making all of that work.

37:05

Could you do this with an LLM?

37:08

Yeah, you probably could.

37:12

But if you say had never done this

37:15

before,

37:16

it might be quite difficult for you to

37:18

prompt the exact thing that you want to

37:21

have happen. And that's kind of why I'm

37:24

making this video.

37:27

Because if you follow along

37:30

and then you try to customize this and

37:32

make it your own,

37:35

then you'll likely develop a really

37:38

great understanding of this stuff. And

37:41

then you'll develop your own taste

37:43

and you'll say stuff like, "Well,

37:47

I don't want this or I don't want that."

37:50

And that's essentially what

37:52

helps you to prompt an LLM.

38:02

So right now we are uh

38:05

fetching that data on every request that

38:10

we receive. That's probably something

38:12

that you would want to optimize away.

38:15

Maybe you want to add some concurrency

38:17

or

38:19

uh worker threads to your control plane.

38:22

Maybe you want to have some kind of a

38:24

channel that um sends this data

38:28

to the other threads every time it's

38:30

updated so they have a fresh copy

38:33

somehow.

38:35

It's really up to you.

38:38

And we're just going to compute all

38:41

these resources in probably the most

38:43

inefficient way possible.

38:47

Make sure all the names of the

38:50

resources line up. Another thing there is

38:55

um

38:58

your listener resource could

39:01

accidentally

39:04

tell the proxy to find the wrong

39:07

resource. Uh you could have a listener

39:10

that contains a reference to a route

39:12

configuration, but the names are wrong.

39:14

So you have to make sure in your code

39:17

that these names are aligned somehow.
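
One way to catch a mismatch early is a small check that walks the generated listeners and confirms every referenced route configuration exists. This is a sketch over plain dictionaries; the helper names are invented for this example and it only looks at the `rds` block of each filter's `typed_config`.

```python
def referenced_route_names(listener: dict) -> set:
    """Collect every route_config_name a listener refers to through RDS."""
    names = set()
    for chain in listener.get("filter_chains", []):
        for network_filter in chain.get("filters", []):
            rds = network_filter.get("typed_config", {}).get("rds")
            if rds:
                names.add(rds["route_config_name"])
    return names


def missing_route_configs(listeners: list, route_configs: list) -> set:
    """Return the names that listeners ask for but no route config provides."""
    available = {rc["name"] for rc in route_configs}
    wanted = set()
    for listener in listeners:
        wanted |= referenced_route_names(listener)
    return wanted - available
```

An empty result means the names are aligned; anything else is a reference the proxy would not be able to resolve.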

39:21

Um

39:24

anyway, that was that was a small uh

39:27

little bit of a troubleshooting

39:30

that we have to do here. I think the

39:32

JSON was in the wrong format.

39:38

Some maybe some extra commas, maybe some

39:40

missing commas.

39:43

Just clean that up.

39:53

And uh now

39:58

I think that's looking good.

40:01

However,

40:03

we now have a little missing field

40:06

there. So, we've got to go and add

40:10

um the refresh delay to the

40:14

uh RDS that we've added in here.

40:17

Just going to set that to 5 seconds.

40:21

And that should be that.
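
For reference, the RDS block with a refresh delay might look roughly like this when written as a Python dictionary. The field names follow the Envoy v3 config source schema as I understand it; the cluster name `control_plane` is just a placeholder for whatever static cluster points at your control plane.

```python
def make_rds(route_config_name: str, control_plane_cluster: str = "control_plane") -> dict:
    """Build the `rds` block of an HTTP connection manager for REST polling."""
    return {
        "route_config_name": route_config_name,
        "config_source": {
            "resource_api_version": "V3",
            "api_config_source": {
                "api_type": "REST",
                "transport_api_version": "V3",
                # Placeholder: must match a cluster the proxy already knows about.
                "cluster_names": [control_plane_cluster],
                # How often the proxy polls; durations are written as strings in JSON.
                "refresh_delay": "5s",
            },
        },
    }
```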

40:27

We'll run the control plane again.

40:30

And now the proxy should not

40:33

complain anymore.

40:35

Says that uh clusters were updated.

40:38

However,

40:40

it's now saying that there is a missing

40:42

type URL.

40:48

Uh it says missing type in any it's only

40:53

allowed blah blah blah.

40:56

So yeah, the route config doesn't have a

40:58

type URL and it needs one. So

41:02

we're just going to go grab that from

41:04

the

41:05

documentation here. If I can find it.

41:13

There it is.

41:16

So, we'll add that in. We've got to type

41:18

out the

41:20

type.googleapis.com/envoy,

41:23

which I've memorized.
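
To avoid typing these out by hand each time, a small lookup can attach the type URL to each resource. This sketch assumes the v3 API; the short keys (`clusters`, `listeners`, `routes`) are my own labels, not something Envoy defines.

```python
# Type URLs for the three resource kinds used here (Envoy v3 API assumed).
TYPE_URLS = {
    "clusters": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
    "listeners": "type.googleapis.com/envoy.config.listener.v3.Listener",
    "routes": "type.googleapis.com/envoy.config.route.v3.RouteConfiguration",
}


def with_type(kind: str, resource: dict) -> dict:
    """Return a copy of the resource with its "@type" field filled in."""
    return {"@type": TYPE_URLS[kind], **resource}
```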

41:26

And now we run it again. And I think

41:30

envoy should be happy with that.

41:33

Oh, but it's not.

41:36

It says it's got an unexpected

41:38

character zero. Expected a double quote.

41:43

I think what's happened here is that

41:45

I've specified the route names

41:48

as integers.

41:53

So, I probably need to fix that up. They

41:56

actually need to be a string.

42:00

Uh I've used the index

42:03

and that's why it's an integer.

42:07

But I might just do an f-string,

42:09

which is a format string with that index

42:13

just to make sure it's definitely a

42:15

string.
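
The fix is tiny, but here is a sketch of it. The `route-` prefix is an addition of mine to make the names readable; wrapping the bare index in an f-string is enough to get a string.

```python
import json


def name_routes(routes: list) -> list:
    """Give each route a string name derived from its position in the list."""
    return [
        {**route, "name": f"route-{index}"}
        for index, route in enumerate(routes)
    ]


# An integer name serializes without quotes, which is what the proxy rejected.
bad = json.dumps({"name": 0})
good = json.dumps(name_routes([{"match": {"prefix": "/"}}])[0])
```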

42:21

And now we've got them both running and

42:24

Envoy seems to be pretty happy with

42:27

that.

42:33

And we can see that it's added the

42:35

listener, it's added the clusters, it's

42:38

requested the routes. That's a good

42:39

sign.

42:42

So now I think uh we should be able to

42:46

make the same request that we did

42:47

before. And it's still going to work.

42:53

H.

42:59

Ah, it didn't work because we've

43:02

actually added in these domains here. So

43:04

the virtual host has

43:08

a domain now

43:10

and it only accepts traffic on that

43:12

domain. So we have to specify that as a

43:14

host header. And there we go. We

43:16

actually get the response back this

43:19

time. So this allows us to demonstrate

43:24

the dynamic configuration. We can now

43:27

change this JSON file and the next time

43:30

the envoy proxy requests the resource

43:35

we should see

43:37

that

43:39

updated and the result of that should be

43:41

now that uh

43:44

when I make the same request I don't get

43:46

a response back because it has the wrong

43:49

domain name now but if we update it to

43:51

baz.bar.tld

43:54

we can now see we're getting a response.

43:58

So it has loaded the new version of that

44:01

JSON file and the proxy has updated its

44:05

configuration.
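
If you want to reproduce that test from Python instead of a command-line client, something like the following would do it. The proxy address `http://127.0.0.1:8080/` is an assumption for this sketch; use whatever port your listener is bound to.

```python
import urllib.request


def build_request(host_header: str, proxy_url: str = "http://127.0.0.1:8080/") -> urllib.request.Request:
    """Prepare a request to the proxy with an explicit Host header."""
    # The virtual host only matches when Host is one of its configured domains.
    return urllib.request.Request(proxy_url, headers={"Host": host_header})


def fetch(host_header: str) -> int:
    """Send the request and return the status code (needs a running proxy)."""
    with urllib.request.urlopen(build_request(host_header), timeout=5) as response:
        return response.status
```

Calling `fetch("baz.bar.tld")` after editing the JSON file should succeed once the proxy has polled the new route configuration, while the old domain should stop matching.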

44:07

So you can kind of get an idea

44:08

of how this is quite useful,

44:13

especially when you're using an HTTP

44:17

endpoint instead of a JSON file.

44:27

So

44:28

you might question, well,

44:31

how would that be much different from

44:34

say just having a cron job that updates

44:37

the envoy configuration on a schedule?

44:40

The reason why it's different is because

44:42

your control plane can act as an

44:44

aggregator where it hits many different

44:47

sources of information

44:50

and computes configuration by pulling

44:53

them all together. You could arguably do

44:56

that with a cron job.

44:58

You could publish uh

45:02

a configuration somewhere and have that

45:05

picked up by a cron job. It's really

45:08

about the same. I think you'd still want

45:11

to

45:13

have

45:17

uh

45:19

use protocol buffers to sort of validate

45:21

that you're actually generating valid

45:23

configuration.

45:26

And you might also be able to observe

45:28

certain things from the control plane

45:30

like

45:32

you can look at

45:35

the

45:37

proxies' versions and you can try to

45:39

figure out like is there too much

45:42

variance in a particular cluster of

45:44

proxies? Do they all have the right

45:46

config? Is it up to date? You can't

45:49

really do that as much with a cron job,

45:53

although you could get pretty

45:55

creative. So, the next step here,

46:00

we want Envoy to keep polling, but we

46:03

don't want it to think that the config

46:05

is changing all the time. So, we're

46:07

going to introduce versioning to this.

46:14

And the easiest, most simple, basic way

46:17

I can think to add versioning is

46:20

probably just to increment a number

46:22

every time the data changes. And we know

46:24

the data changes because

46:27

a

46:29

string comparison or a

46:32

dictionary comparison

46:35

should tell us that it's not quite the

46:38

same as the old set of data. So

46:41

that's what we're going to do here.

46:43

We're actually going to create a loop

46:44

and we're going to run that loop in a

46:46

thread

46:48

and we're going to modify some global

46:50

variables that hold the data and the

46:53

version.

46:55

Um,

46:57

this is not how you would want to do it

47:00

in production. Obviously, it's hard

47:04

to know whether that thread is still

47:06

running or if anything's gone wrong. And

47:09

uh you can't really run this uh in a

47:11

distributed fashion. You'd have to just

47:12

have one machine and that won't scale.

47:17

But just for demonstration, we're going

47:18

to have this thing that is running in a

47:21

loop running in a thread and it's going

47:24

to mutate some global memory

47:28

inside the Python uh running process

47:32

and the other um the other parts of that

47:36

program will be able to access that

47:39

memory as it's changed. So that's just

47:42

fine for this circumstance.
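
Here is a minimal sketch of that loop, assuming the data lives in a local JSON file. The names `STATE`, `refresh_once`, `poll_forever` and `start_poller` are made up for this example, and a module-level dictionary stands in for the global variables.

```python
import json
import threading
import time

# Shared, process-wide state: the latest data and a counter that acts as the version.
STATE = {"data": None, "version": 0}


def refresh_once(path: str) -> bool:
    """Reload the file and bump the version only if the content changed."""
    with open(path) as f:
        new_data = json.load(f)
    if new_data == STATE["data"]:
        return False
    STATE["data"] = new_data
    STATE["version"] += 1
    print(f"new data detected, version {STATE['version']}")
    return True


def poll_forever(path: str, interval: float = 1.0) -> None:
    """Keep refreshing; a bad read is logged and retried on the next tick."""
    while True:
        try:
            refresh_once(path)
        except (OSError, json.JSONDecodeError) as exc:
            print(f"refresh failed: {exc}")
        time.sleep(interval)


def start_poller(path: str) -> threading.Thread:
    """Run the loop in a daemon thread so it exits with the main process."""
    thread = threading.Thread(target=poll_forever, args=(path,), daemon=True)
    thread.start()
    return thread
```

The request handlers then read `STATE["data"]` and `STATE["version"]` instead of fetching the file themselves. As said above, nothing here tells you if the thread has died, which is one of the reasons this is demo-only.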

47:49

So we're going to replace uh this uh

47:52

fetch and we're instead just going to

47:55

refer to the global data there.

48:11

Um,

48:13

and once we've got this set up, then we

48:16

are

48:18

going to be able to

48:21

change the data.

48:24

Um, we'll have

48:29

um

48:32

we'll have that running in the in the

48:33

background and it's going to

48:36

uh fetch that new data. We'll print

48:38

something out probably just to show that

48:40

it has changed. And that sort of

48:43

decouples this.

48:46

It doesn't decouple all the computation,

48:48

but it at least decouples the

48:50

fetching part. And it would be a smart

48:53

idea

48:55

in a production scenario to decouple

48:59

both. You want to fetch the data,

49:01

compute the resources

49:04

as much as you can, and then provide

49:06

them when the discovery request comes

49:08

in.

49:10

At least in the polling architecture.
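
To sketch what decoupling both steps could mean: compute every resource once when the data changes, publish the result as one snapshot, and let the request path do nothing but read it. `publish`, `get_snapshot` and the `builders` mapping are hypothetical names for this example.

```python
# Holder for the most recently published snapshot.
_current = {"snapshot": {"version": "0", "resources": {}}}


def publish(data, version: str, builders: dict) -> None:
    """Compute all resources up front and swap them in as one snapshot."""
    snapshot = {
        "version": version,
        "resources": {kind: build(data) for kind, build in builders.items()},
    }
    # A single assignment, so readers see either the old snapshot or the new one.
    _current["snapshot"] = snapshot


def get_snapshot() -> dict:
    """Cheap read used when a discovery request comes in."""
    return _current["snapshot"]
```

The background loop would call `publish` after it detects a change, and the discovery handler would only ever call `get_snapshot`.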

49:14

If you're doing something like delta

49:16

requests, then you'll have something

49:18

quite different

49:20

where you have to track the state

49:23

and the subscriptions on resources. And

49:26

it's much more complicated.

49:31

uh slightly

49:34

um more difficult to set up.

49:41

So, we've got this um this version

49:43

number. We're going to increment that.

49:47

And what this allows us to do is when

49:49

the envoy proxy requests configuration,

49:52

it tells us what version it currently

49:54

has. And if that version is not

49:57

different from the version that the

49:59

control plane has, then there is no

50:02

reason to give it back a 200 response

50:06

because when you give it a 200 response,

50:09

it thinks that those resources have

50:10

changed and it might do things like

50:13

drain clusters or drain listeners or

50:16

reload route tables and it's kind of

50:19

just wasted.

50:21

So, now that we've got that set up,

50:23

we're going to go ahead and change the

50:25

the data file

50:28

that we have.

50:32

And uh we should see that the control

50:36

plane is picking up that new data. Yep,

50:38

we see that there: version two.

50:41

Now, when we run the proxy,

50:46

it's going to request the resources.

50:48

Then it's going to poll again

50:51

and it's going to get a different

50:53

response hopefully. Uh, not quite yet.

50:58

We are not comparing the

51:01

versions yet. So, we'll probably have to

51:04

go fix that.

51:06

So, what we want to do is we want to get

51:08

the version number from the proxy and

51:10

then do a comparison. And if the version

51:13

numbers are the same,

51:16

then the resources have not changed,

51:19

which means we give it back a different

51:21

response,

51:22

which essentially tells it to do

51:24

nothing.

51:27

So, we're going to grab that out of the

51:30

request body, which is

51:33

uh in JSON, which the proxy sends to us.

51:47

And then we can sort of

51:51

we can sort of uh neaten up this code a

51:55

little bit. Might do that in a second.

52:01

Uh we want to basically

52:05

we don't need to specify the same thing.

52:07

Okay, there's a bit of duplication going

52:08

on here. So, we want to return the

52:11

version info and the resources at the

52:13

end.

52:15

So, we'll just figure out what the

52:16

resources are in those if statements

52:22

and then we'll do a little check of the

52:25

version.

52:27

Just a simple comparison.

52:29

And if it is the same, we're going to

52:32

actually give back a 304

52:35

response code.

52:37

with no body. And it's going to tell

52:39

Envoy that nothing's changed.
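
Stripped of the web framework, the comparison comes down to something like this. It is a sketch that returns a status code and a body as a tuple; the `version_info` key in the request is assumed to be how the proxy reports its current version in the JSON body.

```python
def respond(request_body: dict, current_version: str, resources: list) -> tuple:
    """Return (status, body): 304 with no body when the proxy is up to date."""
    if request_body.get("version_info") == current_version:
        return 304, None
    # Otherwise hand back the full set along with the version it corresponds to.
    return 200, {"version_info": current_version, "resources": resources}
```

In a handler you would turn that tuple into whatever response object your framework uses.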

52:53

So, we can also clean up this if

52:56

statement while we're here. Yeah, we can

52:58

probably just chuck these

53:01

resources into a mapping.

53:06

It's still pretty janky, but

53:10

we're just doing an MVP here.

53:13

In a real situation, maybe you get this

53:15

data from a database or or something

53:20

else. you probably want to cache them

53:22

every time the data changes and then

53:25

pull them out when uh when you need

53:29

them. But in the meantime, we're just

53:31

going to make a hashmap, throw those uh

53:35

those objects in there, and then

53:38

uh we're going to try to get the

53:43

the resource type out of that hashmap.

53:45

And if it's not in there, then

53:49

we will hand back a 400 or something

53:51

like that.

53:58

So, yep, we have a key error. It means

54:02

whatever resource type was requested is

54:04

not in that map. So, we'll just get back

54:06

400 there.
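
The mapping and the key error handling could be sketched like this, again without a framework. The keys of the mapping are whatever resource type names your endpoint accepts; the ones in the usage line are only examples.

```python
def lookup(resource_type: str, resources_by_type: dict) -> tuple:
    """Return (200, resources) for a known type, or (400, error) otherwise."""
    try:
        return 200, resources_by_type[resource_type]
    except KeyError:
        # The proxy asked for something this control plane does not serve.
        return 400, {"detail": f"unknown resource type: {resource_type}"}
```

For example, `lookup("clusters", {"clusters": [...], "listeners": [...], "routes": [...]})` replaces the chain of if statements with a single dictionary access.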

54:16

And then we're going to try to run this

54:18

again.

54:21

We should see it request the resources.

54:24

And then

54:29

um

54:30

and then on the second time, oh, okay,

54:32

there's something wrong here.

54:36

The version info wasn't sent in the

54:39

discovery request. That's kind of

54:41

unusual.

54:45

It shouldn't be missing. I think it's

54:48

required.

54:52

But anyway, we can probably just handle

54:54

that in the Python code by

54:58

uh if it's none, then we'll set it to

55:01

something

55:04

like a sentinel value.
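
A sketch of that fallback is below. The empty string as the sentinel is my choice for this example; any value the control plane never uses as a real version would work. One plausible reason the field is missing on the first request is that an empty string is simply left out when the message is serialized to JSON, but I have not confirmed that here.

```python
# Sentinel for "the proxy has no version yet"; must never equal a real version.
NO_VERSION = ""


def proxy_version(request_body: dict) -> str:
    """Read the proxy's reported version, falling back to the sentinel."""
    version = request_body.get("version_info")
    return NO_VERSION if version is None else version
```

Because the sentinel never matches the control plane's version, the first request always gets a full 200 response.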

55:12

And that should probably get us going.

55:17

So, we should see it request the resources.

55:20

Then it's going to request them again.

55:22

But because the version hasn't changed,

55:24

we'll see a different status code here.

55:29

And we see there is a 304.

55:34

And it's going to continue to be a 304

55:36

until the data changes, which is great.

55:38

It means that

55:40

when the proxy is reaching out to us,

55:43

we're not transferring stuff

55:45

unnecessarily. We're not telling it that

55:47

it has new resources. It's not reloading

55:50

stuff. It's not logging a bunch of stuff

55:53

for no reason. So that's all uh pretty

55:57

good positives.

55:59

So now when we change this data, we

56:02

should see those status codes

56:05

change to 200s. And we see we got new

56:08

data detected and Envoy's now fetched

56:12

the new resources. It's logged that it

56:14

fetched and updated some stuff, but then

56:18

it just goes straight back to

56:21

uh stuff hasn't changed,

56:25

which is

56:27

neat.

56:35

So, up until this point, you probably

56:37

thought, or maybe you've been

56:39

thinking,

56:41

well, at least I would be thinking that

56:44

I'm not really comfortable

56:46

messing about with a lot of dictionaries

56:48

and

56:50

uh the lack of typing

56:53

really puts me off. How do I know I'm

56:56

not going to give something to Envoy

56:58

that is going to be

57:01

the wrong configuration? I'd have to

57:04

have extensive tests, which would be

57:06

good either way,

57:09

but I want to be able to catch it in my

57:11

editor if I can. That would be ideal.

57:15

When I specify a particular field for a

57:19

route configuration, I want to know that

57:21

I've specified the correct field. I can

57:25

look at the documentation,

57:27

but uh there's certain things that maybe

57:31

they're not as clear

57:34

um as to what exactly you can place in

57:36

those

57:38

uh in those objects.

57:42

So

57:44

because Envoy is built around protocol

57:47

buffers for its configuration

57:51

um a lot of that stuff is pretty easy

57:55

because if you compile the protocol

57:57

buffers in your language it gives you

58:01

hopefully uh something that you can type

58:04

check and validate depending on the

58:07

library that you use to compile the

58:09

protocol buffers.
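
To show the idea without depending on any particular protobuf library, here is a stand-in built with standard-library dataclasses. These classes are NOT the compiled Envoy types and do not mirror any library's API; they only illustrate how typed objects let an editor or type checker flag a wrong field before the proxy does.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class RouteMatch:
    prefix: str = "/"


@dataclass
class RouteAction:
    cluster: str = ""


@dataclass
class Route:
    name: str
    match: RouteMatch
    route: RouteAction


@dataclass
class VirtualHost:
    name: str
    domains: list
    routes: list = field(default_factory=list)


# A misspelled field such as Route(nmae="r0", ...) raises a TypeError here,
# and a type checker would flag Route(name=0, ...) before the code even runs.
vhost = VirtualHost(
    name="foo",
    domains=["foo.bar.tld"],
    routes=[Route(name="route-0", match=RouteMatch(), route=RouteAction(cluster="foo"))],
)
as_dict = asdict(vhost)  # nested plain dicts, ready to be serialized
```

Compiled protobuf classes give you the same kind of feedback, with the real field names and types taken from Envoy's own definitions.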

58:11

I personally have made my own library

58:14

called envoy_data_plane and that uses

58:16

betterproto2 as a dependency. So we're

58:21

just going to add that to the project

58:23

and then we can start the

58:26

somewhat tedious process

58:30

of replacing all of these. I've sped up

58:33

this footage because otherwise it would

58:35

take forever.

58:37

But I'm just going to go through

58:40

and um

58:43

replace all of these Python dictionaries

58:45

that are untyped and slightly dangerous

58:50

with

58:52

these compiled protobuf objects.

58:55

And along the way, my language server in

58:59

Neovim is going to tell me when I've put

59:03

something wrong,

59:05

and it's going to tell me what type it

59:06

needs to be. And then I can type out

59:08

the name of that type and then

59:12

I can automatically import it,

59:15

which is all pretty standard stuff these

59:18

days,

59:21

but it's the kind of thing that

59:24

makes it much easier to write Envoy

59:27

configuration. Even if it's dynamic

59:29

stuff that's sort of an abstraction,

59:32

you want to have

59:34

um maybe some parts of the configuration

59:37

abstracted away, but the rest of it,

59:40

you've got to type it out. So, you've

59:42

got to make sure that that's

59:44

valid and checked and all that kind

59:48

of good stuff.

59:50

So, I'm going through here, replacing

59:52

the listener,

59:56

making sure everything's fine.

60:00

And uh now when we rerun this all, we

60:04

can see it's doing same as before. And

60:08

when we change the

60:11

data,

60:13

it should be changing,

60:16

but it is not. That's kind of weird.

60:30

Puzzling.

60:31

Let's maybe try again.

60:35

See if we can

60:38

change something else. Okay, it picked

60:40

up that change, but maybe not the route

60:44

change that I made. That's kind of odd.

60:53

Let's try hitting the proxy again.

60:57

Put the correct host name. Okay.

61:02

Huh. That's weird.

61:04

Well, it seems to be proxying. It's

61:06

hitting the uh server, giving back a

61:08

response.

61:11

So, now we have these typed objects. That

61:14

allows us to write this configuration

61:16

way easier. Makes sure that everything's

61:19

valid.

61:21

Um

61:23

well, to an extent, it might not catch

61:26

everything, but that's okay.

61:30

So, from this point,

61:33

I guess I can talk about what I'm going

61:35

to do next, what's going to be part two.

61:37

I sort of alluded to it

61:40

uh at different stages, but essentially

61:45

part two will be me doing pretty much

61:48

the same as this, but it's going to be a

61:50

lot shorter because I'm going to use the

61:52

Go control plane,

61:55

which is the sort of framework made by

61:57

Lyft for this purpose. And you'll see

62:00

it's quite a bit easier because they've

62:03

already figured out

62:06

all of the stuff for caching and

62:08

versioning and

62:11

um you've got kind of first class

62:14

support with protobuf. You don't have

62:16

to use my library for example.

62:19

And uh it should be quite a bit easier.

62:22

And with that part as well, we will be

62:24

able to use gRPC

62:27

and possibly delta discovery requests

62:31

and that's going to make your control

62:33

plane let's say production grade. It's

62:37

more scalable.

62:41

Uh you may need to attach a database to

62:44

do the delta discovery requests. I'm not

62:47

too sure, but I'll explore that when I

62:52

uh record the video.

62:57

So, that's about it.

63:01

If you made it this far,

63:02

congratulations. If you followed along

63:06

and created it while watching, um you're

63:10

a legend.

63:13

I hope it was valuable and see you

63:16

later.

Interactive Summary

This tutorial series provides a hands-on introduction to building a control plane for Envoy Proxy from scratch. The video demonstrates how to set up a basic Python-based control plane using FastAPI that delivers configuration resources—such as clusters, listeners, and route configurations—to an Envoy instance. The instructor explains the polling mechanism Envoy uses to retrieve configurations, introduces versioning to avoid unnecessary reloads, and highlights the transition from untyped JSON dictionaries to typed Protocol Buffer objects for more robust configuration management. The goal is to provide a foundational understanding of how a control plane functions before moving into more advanced, production-ready approaches in subsequent videos.
