Making an XDS control plane from scratch - Part 1 - Naive Python version
Hello everyone. This video is
a sort of tutorial in which I go over
how to implement a control plane from
scratch. I think this will be part one
of a series.
This part will be quite a
basic implementation
and it's not something that I would
deploy to production. It's just
something that should help people gain
an understanding if they follow along
and do it themselves
so that you can understand more about
Envoy proxy and how a control
plane interacts with it. The video has
been sliced up into chapters so you can
skip to certain parts. Parts of the
video have been sped up; otherwise it
would be roughly 2 hours. As it stands
now, the video is about one hour, and
you could probably speed it up as well,
depending on what pace you like.
I'm not sure how many of these types of
videos I'll do. It's not really
something that I planned to do. I
mostly planned to share knowledge,
not necessarily do tutorials,
but I might just do parts one, two,
and three of this and see how it goes.
If it's received positively, then
I'll possibly continue doing stuff
like this. So this video is not really
system design. I know a lot of you
have been requesting stuff related
to system design. I might have to think
about how exactly I'm going to do that.
I have some ideas, but this is
essentially just writing code by hand
and walking through a worked
example of a control plane.
I do mention at some point that you
could probably get this task done with
an LLM fairly quickly, but you have to
know what you want before you can
actually ask the LLM to create it for
you.
So just keep that in mind.
Anyway,
on to the video. I hope you
enjoy.
All right,
the video that I have been talking about
is here.
We're making a control plane from
scratch in Python. The first thing we're
doing is creating a project using the uv
tooling to add dependencies such as
FastAPI and uvicorn. That's a web
framework and a web server to run the
web framework.
Just opening a file.
Going to put down some boilerplate to
create a FastAPI server here.
And the first endpoint that we're going
to put is just going to be something
that sort of pretends to be a control
plane. It's just going to answer on a
particular route, for anything that
comes in with a POST method,
and it's going to respond to that with
just an empty dictionary for the
moment.
So we'll create that and it's going to
be called clusters
because
we're just going to pretend to support
generating cluster
configuration for Envoy,
and then we'll later expand that to be
a bit more generic.
So here we're just going to print out
the JSON that we receive from the
Envoy proxy just so we have something to
look at.
And then we're just going to give back
that dictionary that's empty for the
moment.
Now
it's good to stop here and
try to get an actual Envoy proxy
running. The reason for that is we want
to test that even the most basic, bare
minimum
thing still works.
So we'll be able to see it hit
that cluster discovery endpoint and
it'll get back nothing. But that's kind
of what we want to see. So right now I'm
looking through GitHub
looking for some examples in the Envoy
repository of Envoy configuration.
This is called the bootstrap.
And in there we've got some stuff like
the static resources, dynamic resources,
the node, which is information about the
proxy, and the admin interface at the
bottom there. Just going to go through
and rename some of these
cluster details because
we're going to be hitting
the server running on our local
computer. But if you were say in a
production environment, you'd probably
want this to be using DNS to find the
control plane.
One thing that's going to be different
about this control plane that we're
building is that it's not going to use
gRPC,
at least not this version. I wanted to
do this one with just REST and JSON just
to show you
the really
basic version so that you can get an
idea of how stuff works. Then later
on you can use gRPC, and I will
probably do a second part to this where
I use a Golang library which includes
compiled protocol buffers to then
construct all of the configuration that
comes through. So right now
I've got the envoy documentation on the
right hand side and I'm basically just
using it to figure out what fields are
available in each part of the
configuration. So we've got this dynamic
resources section, in which you can specify a CDS
config. That's the cluster discovery
service, and it accepts a config source. So in
here we need to type in
where the config source is coming from.
It could be a file on the disk or it
could be from an API. And we're going
to go with an API here.
So, we're just going to type that out.
And we're going to make it a
REST API.
So Envoy knows not to use gRPC.
So we're going to tell it that the type
is REST.
And then we're
going to specify the cluster name, which
is just this static resource below
called control plane, which is pointing
at the localhost IP address on
port 8050.
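A minimal sketch of the bootstrap described so far, written as a Python dict so it can be dumped to JSON for Envoy. The node ID, node cluster, admin port, connect timeout and refresh delay are placeholder assumptions, not values taken from the video. The static cluster name `control_plane` and port 8050 follow the description above.

```python
import json

# Port the control plane listens on, as mentioned in the video.
CONTROL_PLANE_PORT = 8050

bootstrap = {
    # Identity of this proxy; both values are placeholders.
    "node": {"id": "envoy-1", "cluster": "demo"},
    # Admin interface; 9901 is an assumed port.
    "admin": {
        "address": {"socket_address": {"address": "0.0.0.0", "port_value": 9901}}
    },
    "dynamic_resources": {
        "cds_config": {
            "resource_api_version": "V3",
            "api_config_source": {
                "api_type": "REST",  # poll over HTTP/JSON instead of gRPC
                "transport_api_version": "V3",
                "cluster_names": ["control_plane"],  # static cluster below
                "refresh_delay": "5s",  # assumed polling interval
            },
        }
    },
    "static_resources": {
        "clusters": [
            {
                "name": "control_plane",
                "type": "STATIC",
                "connect_timeout": "1s",
                "load_assignment": {
                    "cluster_name": "control_plane",
                    "endpoints": [
                        {
                            "lb_endpoints": [
                                {
                                    "endpoint": {
                                        "address": {
                                            "socket_address": {
                                                # Only reachable if Envoy shares
                                                # the host's network.
                                                "address": "127.0.0.1",
                                                "port_value": CONTROL_PLANE_PORT,
                                            }
                                        }
                                    }
                                }
                            ]
                        }
                    ],
                },
            }
        ]
    },
}


def render_bootstrap() -> str:
    """Return the bootstrap as a JSON document."""
    return json.dumps(bootstrap, indent=2)


if __name__ == "__main__":
    print(render_bootstrap())
```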
And so now that we have that, that
should be the bare minimum we need just
to run Envoy
and to run the control plane and to just
get it doing some really basic stuff. So
I'm going to try to use docker to
start this envoy container
and then I'm going to run envoy with
that config file. And we can see it's
booted up and it gave us a message that
says that discovery clusters failed.
That's because we aren't currently
running the control plane. So we'll do
that now. Now it's listening on port
8050. We see in the envoy logs it says
that it added and updated a cluster. And
we also see the request come through
which was printed out by our very basic,
minimal control plane
at the moment.
I'm going to set up a little Makefile
just to make it a little bit less
tedious to run the Envoy container.
And then
now we can maybe look through these
logs.
You'll notice that the discovery request
is really long. Most of these are
extensions and it's the proxy telling us
what extensions it supports. That's kind
of irrelevant, but this highlighted
information here is a little bit more
relevant. It tells us the ID, the
cluster, and a few other pieces of
information.
You don't have to pause and look at
that. It's just
something you'll get familiar with over
time. Maybe you have lots of different
proxies and
they will have different IDs and
clusters and whatnot.
And maybe you'll program your control
plane to
treat them differently somehow, or
give different config to different
proxies.
So now we're going to put the response
of the control plane into a shape that's
more what Envoy expects and this is a
discovery response
and it's going to have a version info,
which is just
a version that Envoy is going to take,
and then a list of the resources in the
response. So if we look on the right
hand side, I'm just checking that
the fields that I'm using are correct.
It's got the types of the fields there as
well.
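As a rough sketch, the discovery response shape described here is just a version string plus a list of resources. The helper name `discovery_response` is made up for illustration.

```python
def discovery_response(version_info: str, resources: list) -> dict:
    """Wrap resources in the shape of an xDS DiscoveryResponse (JSON form)."""
    return {
        "version_info": version_info,  # opaque version string Envoy stores
        "resources": resources,        # list of resource dicts
    }


if __name__ == "__main__":
    # An empty response: valid shape, but gives Envoy nothing to configure.
    print(discovery_response("0", []))
```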
and then I think the next thing we want
to do is to
actually hand back some real resources
so we'll actually give this proxy
a cluster
at least one.
And in order to do that, we're
probably going to have to look at the
documentation again. We'll look at the
cluster and see what kind of fields are
available there.
I know
some of these off by heart. For example,
I know to set the type to static
because the endpoints that I'm about to
write are just going to contain
a static IP address. It's not
going to be a DNS name. This config
is pretty deeply nested and tedious. So
it's a good idea to write abstractions
to produce these. For example,
you could just feed in a set of
addresses and ports and generate
those endpoints there.
The endpoints can also contain
different metadata and locality
information. Locality means the
zone or the region that it resides in.
And then you can use that information to
do some pretty clever routing of
traffic.
The cluster, when we're operating in JSON
mode, also needs a type URL so that Envoy
knows how to deserialize that JSON
into one of the protobuf objects
that it knows about.
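One possible abstraction along those lines, as a sketch: a helper that takes a name and a list of (address, port) pairs and produces the nested cluster dict, including the `@type` key that carries the type URL. The function name and the `connect_timeout` value are assumptions for illustration.

```python
CLUSTER_TYPE_URL = "type.googleapis.com/envoy.config.cluster.v3.Cluster"


def make_cluster(name: str, endpoints: list) -> dict:
    """Build a STATIC cluster from (address, port) pairs."""
    lb_endpoints = [
        {
            "endpoint": {
                "address": {
                    "socket_address": {"address": address, "port_value": port}
                }
            }
        }
        for address, port in endpoints
    ]
    return {
        "@type": CLUSTER_TYPE_URL,  # tells Envoy which protobuf message this is
        "name": name,
        "type": "STATIC",           # endpoints are plain IPs, not DNS names
        "connect_timeout": "1s",    # assumed value
        "load_assignment": {
            "cluster_name": name,
            "endpoints": [{"lb_endpoints": lb_endpoints}],
        },
    }


if __name__ == "__main__":
    print(make_cluster("example", [("127.0.0.1", 8050)]))
```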
And then
we're going to add a little endpoint,
a GET endpoint, to this control
plane. It's not really something you
would need in a control plane, but we're
just going to make it pretend like it is
a backend.
Basically we're giving Envoy something
to proxy to, just to show that it is
accepting requests, sending them to a
backend, that backend is responding, and
then that response is being given
back to the client. In this case,
it's maybe a bit of a weird setup where
we've got the control plane acting as a
control plane and a backend, but just
bear with me on that. And if that's
confusing, just let me know and I'll
help you out.
So now we're going to run them both and
we're going to look to see
whether Envoy has accepted that cluster.
And it looks like it has. It says
add update.
Oh, it actually says add update zero
clusters.
Oh, no. Actually, it says we've added
one cluster there.
One thing to note is that it is
saying that it added and removed a
cluster every single time it hits the
control plane. We're going to fix that
much later when we introduce versioning.
So now
it has that cluster. We're going to look
at the admin interface
to see what it has there. And we can see
this example that we've given. It has
this little metric that says that it's
added via the API. And we've got a
little bit of other information about
how many requests it's received and
connections and all this other kind of
stuff.
We can also look at the config dump in
the admin interface.
And that's going to show us that
we've got some statically configured
clusters, but we also have a dynamically
configured cluster. So that's telling us
that the control plane is indeed giving
that resource to this proxy.
The next thing we're going to want to do
is to probably make this setup a little
bit more generic.
It depends on how you want to do
things, but the way I'm going to do it
is I'm going to make it so that when
Envoy makes a discovery request,
whatever the resource type is, we're
going to take that and we're going to
check what it is before we decide what
to do.
So, we've basically copied this clusters
endpoint and made it resources, which is
I suppose more generic. And then we're
going to check if the resource
type is clusters, then we're just going
to do what we were doing before, which
is handing back those statically defined
clusters.
And otherwise, if it's something like
listeners, we're going to do something
slightly different. We'll hand back
different resources.
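A sketch of that if-statement dispatch as a plain function. With Envoy's v3 REST polling the resource type arrives in the request path (for example `/v3/discovery:clusters` or `/v3/discovery:listeners`), so a web framework handler could pass that last path segment in. The builder functions and their contents are placeholders, not the code from the video.

```python
def build_clusters() -> list:
    # Placeholder: would return the statically defined cluster dicts.
    return [{"name": "example"}]


def build_listeners() -> list:
    # Placeholder: would return listener dicts once they exist.
    return []


def handle_discovery(resource_type: str) -> dict:
    """Pick resources based on the type Envoy asked for."""
    if resource_type == "clusters":
        resources = build_clusters()
    elif resource_type == "listeners":
        resources = build_listeners()
    else:
        # Unknown type: answer with an empty resource list.
        resources = []
    return {"version_info": "0", "resources": resources}


if __name__ == "__main__":
    print(handle_discovery("clusters"))
    print(handle_discovery("listeners"))
```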
This code is pretty naive, but we're
just doing it this way so that we can
get something up quick
and just to demonstrate
the interaction.
You can optimize this however
you like.
The way I've done this in Sovereign, the
open source repo that I've talked
about in the past, is
the control plane fetches lots of
different data sources
and then it aggregates them into one big
thing
and then it gives that data to a set of
templates. Then
whatever each template generates
is tied one to one to a resource type, and
that is how that's made to be generic,
whereas here we're just using I guess a
bunch of if statements to do that, which
is okay when you're just
researching and experimenting with how this
stuff works, which is what we're doing.
So in a second here, we're going to need
to add a listener. The reason why we
need to add a listener is because
we need Envoy to accept
traffic. It's fine for it to have
clusters. That means that it has
things it can send traffic to, but it
has nothing on the receiving end. So we
need to tell it what port and address to
listen for traffic on and then how to
handle the traffic that lands on that
address and port. So the one that we're
going to build here is going to be
something listening on
either localhost or every address, and
it's going to contain
what's called a filter chain in
Envoy. That means
it's basically a container for
specifying a bunch of filters
to handle traffic. It's called a filter
chain because there's a chain of filters
inside it.
That might not be the actual reason,
but that's the way you can think about
it. Some examples of filters would
be handling TLS, handling
HTTP,
handling compression, rate limiting,
authentication, and routing. Let's
just say that's an example chain of
filters.
But our filter chain is going to be
really super basic. We're going to have
one filter chain. We're going to have
one filter in there. And it's just going
to be a HTTP connection manager.
And
that HTTP connection manager is also
going to contain HTTP filters. And
there will only be one, which will be the
router filter.
Another
thing that the HTTP connection manager,
or HCM,
will have is either
an inline
route configuration, which contains
virtual hosts, which themselves contain
routes,
or a reference telling Envoy to
call the control plane
for the routes. I might
demonstrate that a little later in
this video. The reason why you would
want to externalize the routes from the
listener is
maybe complicated, but I'll go ahead
and try to explain it.
The reason why you'd do that is
because if the control plane is serving
the listeners and routes separately,
it can just
deliver the routes to the proxy. And
when Envoy only has to reload the
route table, that's quite efficient
compared to reloading a listener. The
reason for that is because when Envoy
receives a new listener,
it first creates the new listener and starts
accepting requests on
the new one, and then it puts the old one
into draining mode, but it keeps it
around while those requests are
draining. So you generally don't want to
do that often. It can be slightly
disruptive,
whereas just reloading routes is
quite a bit nicer.
So I'm currently writing the
virtual host and the route config out
inline, and I'll externalize that a
bit later. Just going to continue doing
that.
And then after we've got this route
configuration set up, what we're going
to do is just check the admin
interface again to see whether Envoy has
received these resources and accepted
them.
Right now I've got one
virtual host there called example VH, and
it has an asterisk in the domains.
That means that when Envoy receives any
HTTP traffic,
it's going to match against every host
header.
So unless
there's one that's more specific than it,
the traffic will land on this
virtual host. We're only going to have
one anyway, so we just put an asterisk
there. And we're also setting up some
routes where
we can name the route. We can add
match criteria. You can use regex, you
can use prefixes, whatever.
And then you've got a decision of how
to handle the traffic. You could do a
redirect. You could do a direct
response.
In this circumstance, we're just going
to route to the cluster. And the
cluster's name is example. And that's
the cluster that we were returning
from the control plane a little bit
earlier.
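A sketch of the inline route configuration described above: one virtual host that matches every host header and one prefix route that forwards to the `example` cluster. The helper names, the `example_vh` spelling, and the route and route-config names are assumptions for illustration.

```python
def make_virtual_host(name: str, domains: list, cluster: str) -> dict:
    """One virtual host with a single catch-all prefix route."""
    return {
        "name": name,
        "domains": domains,  # ["*"] matches any Host header
        "routes": [
            {
                "name": f"{name}-default",     # route names must be strings
                "match": {"prefix": "/"},      # match every path
                "route": {"cluster": cluster},  # forward to this cluster
            }
        ],
    }


def make_route_config(name: str, virtual_hosts: list) -> dict:
    """Route configuration suitable for inlining in the HCM."""
    return {"name": name, "virtual_hosts": virtual_hosts}


if __name__ == "__main__":
    vh = make_virtual_host("example_vh", ["*"], "example")
    print(make_route_config("local_routes", [vh]))
```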
So now we'll run the proxy again
and the control plane as well.
And we're just going to take a little
look at the admin interface, see what's
in there.
And while we're at it, we'll probably
just send a request over to Envoy just
to confirm that this listener and route
config works and that it sends the
traffic to the cluster.
We can see that it
requested the clusters, but I forgot to
put in the LDS config here. So, it never
requested the listeners. So, I'll just
fix that up real quick. We'll run the
proxy again
and we should see it request both
clusters and listeners, and so it
does.
And now it's going to keep requesting
those because it polls them on
a schedule. But we're just going to take
a little look and
do the config dump again. We can see the
virtual host is in there.
It's showing up as a static route
config, but the listener shows up as a
dynamic listener.
We can also take a look at the listener.
We can confirm that it's listening on
port 8080.
So, now is probably a good time to hit
port 8080 and see what we get back.
And
it's kind of unusual. We're running on
local host, but it's taking a while. I
didn't expect it to take a while.
It's kind of weird. It's kind of hanging
there, which is not what we expect.
Let's look at the config again. Maybe
we're missing something in the API.
It's definitely listening on port 8080.
I just checked the listening ports
there. definitely doesn't have to do
anything with DNS, but it still seems to
just be kind of hanging there. It's kind
of interesting. Maybe there's something
wrong with the address it's listening
on.
We'll just restart the control plane, and
Envoy has got this new listener, but still
it's not quite working. So
maybe I missed this field, this connect
timeout field. I'm not really too sure.
Let's try to add a new route there.
Maybe we add a static response. Maybe
there's something wrong with how it's
proxying. I'm not exactly sure
what's going on here. Just check the
docs. Make sure I'm putting the right
config there. Restarted the control
plane. We can see Envoy's got the config.
Are we running the Docker image in the
right way? I think we are.
Yeah, looks okay.
But it's still kind of hanging
when the request comes in. That's kind
of odd. But we can reach the
admin interface. That's kind of strange.
Maybe it's the port number. Maybe my
operating system is doing something
weird with the port number. But port
8082 also has the same thing.
Let's see.
Let's do a little packet
trace. Let's just confirm: are the
requests actually making it to that
machine? And we can see, maybe you have
to squint a little bit, but it looks
like the traffic is going back and
forth.
So, I'm quite perplexed here.
What could be wrong?
Has anyone guessed it yet?
Oh, we'll look over the config again. Is
there anything we've forgotten here? Oh,
of course.
There's no HTTP filter for the router.
So, we're going to look through the docs
to figure out what we need to add there.
We need one of these typed
configurations
for the router.
And it's a pretty easy filter to set up.
It has literally no config. We can just
copy and paste that type URL
and place that there, and then
stuff should work.
Just cross our fingers
and hope that it actually works this
time. So yeah, pretty easy. Just paste
that there.
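A sketch of a listener like the one described, with one filter chain, one HTTP connection manager, and the router as the only HTTP filter. The router entry carries nothing but its type URL. The function name, the listen address `0.0.0.0`, and the use of the listener name as `stat_prefix` are assumptions for illustration.

```python
LISTENER_TYPE_URL = "type.googleapis.com/envoy.config.listener.v3.Listener"
HCM_TYPE_URL = (
    "type.googleapis.com/envoy.extensions.filters.network."
    "http_connection_manager.v3.HttpConnectionManager"
)
ROUTER_TYPE_URL = (
    "type.googleapis.com/envoy.extensions.filters.http.router.v3.Router"
)


def make_listener(name: str, port: int, route_config: dict) -> dict:
    """Listener with a single HCM filter and an inline route config."""
    hcm = {
        "@type": HCM_TYPE_URL,
        "stat_prefix": name,           # prefix used for this HCM's metrics
        "route_config": route_config,  # inline routes (no RDS yet)
        "http_filters": [
            {
                # Without the router filter, requests have nothing to
                # forward them upstream.
                "name": "envoy.filters.http.router",
                "typed_config": {"@type": ROUTER_TYPE_URL},
            }
        ],
    }
    return {
        "@type": LISTENER_TYPE_URL,
        "name": name,
        "address": {
            "socket_address": {"address": "0.0.0.0", "port_value": port}
        },
        "filter_chains": [
            {
                "filters": [
                    {
                        "name": "envoy.filters.network.http_connection_manager",
                        "typed_config": hcm,
                    }
                ]
            }
        ],
    }


if __name__ == "__main__":
    routes = {"name": "local_routes", "virtual_hosts": []}
    print(make_listener("http", 8080, routes))
```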
When you're working with protobufs, you
don't have to specify the type URLs, but
like I said, we're doing things
maybe the hard way. Also going to add
another make target just to make it
easier to run the control plane. I don't
know if it's actually going to make it
easier.
I still have to type it out, but
we've got our control plane running
again. Going to start up envoy.
It's going to get all its resources as
it was before. But this time
we should be able to send a request to
the proxy, and it will then
hit the GET endpoint on the control
plane server that we've built. And it
should say...
oh well, let's just test the static
response. And we actually get a response
back.
But that's coming directly from Envoy.
That's not from
the server. But that one there, the
greeting is from our server. We can see
we're making a request to port 8082,
which is envoy. It is then sending that
to the example cluster on port 8085,
which is the control plane that's
running.
So that's pretty neat.
I suppose to recap, we've got our
resources API endpoint.
We have our greeting
endpoint.
In a normal situation, this would be on
two separate servers, but we've just
bundled them into the one just to make
it easier to run these.
I suppose if you wanted to make
something slightly more complicated
with fake backends and whatnot, then you
might use something like Docker Compose
and just start up a bunch of different
containers. Then you can also use
DNS in that situation because the Docker
network will provide DNS for the
different container names. So that might
be
nice. But yeah, we've got this
cluster that we're creating.
We created a listener. We created the
route configuration
and we're providing those back using
this if statement sort of setup that we
have that checks the different resource
types. We can definitely make that
better. Uh just use your imagination
really.
The next thing we probably want to do
is to
use a control plane for what it's really
meant to be used for. What we've set up
here is kind of useless because
you might as well just provide the
config directly to Envoy from a file. So
what we're going to do instead is
create this file on the
disk, and we're just going to pretend
that file is an external API.
You can just imagine this fetch external
data function does something that hits
something remote from the control plane
where perhaps it's got data that's
changing all the time
from some other system.
And we're going to pretend that that's
dynamic data.
And then every time Envoy requests its
resources, we're going to fetch that
data. And depending on what's in there,
we're going to give back different
configuration.
That is the true power of the control
plane. And I probably wouldn't create
one unless you planned to do
something like that.
So, I'm just going to sort of wing
this dynamic data. We're just going
to come up with some random
configuration schema
where we're going to have a list of
objects, and we're going to specify the
type of the objects. Maybe
we're going to configure the listeners
dynamically. We're going to configure
some clusters and routes.
So, we'll just
go ahead and set that up. Maybe have a
think about what exact structure we
want, what's going to be useful.
It's something that probably needs a
good amount of thought. You don't want
to get too bogged down, but
you don't want to have a schema that is
difficult to change. That can be a
significant pain in the future.
So I'm deciding here I'm going to have a
listener type. I'm also going to have a
service type that doesn't really tie to
any envoy resource. We're actually going
to pretend this service type is two
resources. It's going to abstract over
two.
One of which is going to be clusters and
the other which is going to be let's say
a virtual host.
So, we're going to put in information
that might be needed in both those
resources in the one place.
And then when the proxy requests this
information, we're going to
somehow
churn out configuration
to serve those two distinct
resources,
but from a single data source
there.
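A sketch of that idea: one "service" entry in the external data fans out into both a cluster and a virtual host. The schema below (field names, the `foo.bar.tld` domain, the ports) is hypothetical, since the actual JSON file is not shown in the transcript.

```python
# Hypothetical external data; stands in for the JSON file on disk.
EXTERNAL_DATA = [
    {"type": "listener", "name": "http", "port": 8080},
    {
        "type": "service",
        "name": "example",
        "domains": ["foo.bar.tld"],
        "endpoints": [{"address": "127.0.0.1", "port": 8050}],
    },
]


def services(data: list) -> list:
    """Select only the entries of type 'service'."""
    return [item for item in data if item["type"] == "service"]


def clusters_from_services(data: list) -> list:
    """One Envoy cluster per service."""
    return [
        {
            "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
            "name": svc["name"],
            "type": "STATIC",
            "connect_timeout": "1s",  # assumed value
            "load_assignment": {
                "cluster_name": svc["name"],
                "endpoints": [
                    {
                        "lb_endpoints": [
                            {
                                "endpoint": {
                                    "address": {
                                        "socket_address": {
                                            "address": ep["address"],
                                            "port_value": ep["port"],
                                        }
                                    }
                                }
                            }
                            for ep in svc["endpoints"]
                        ]
                    }
                ],
            },
        }
        for svc in services(data)
    ]


def virtual_hosts_from_services(data: list) -> list:
    """One virtual host per service, routing to the cluster of the same name."""
    return [
        {
            "name": f"{svc['name']}_vh",
            "domains": svc["domains"],
            "routes": [
                {"match": {"prefix": "/"}, "route": {"cluster": svc["name"]}}
            ],
        }
        for svc in services(data)
    ]


if __name__ == "__main__":
    print(clusters_from_services(EXTERNAL_DATA))
    print(virtual_hosts_from_services(EXTERNAL_DATA))
```

Using the service name for both the cluster and the route target keeps the two resources aligned by construction.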
And of course, in a real
scenario you might have a bunch of
services. In a multi-tenant scenario
you could separate them by any real
boundary. Maybe you have ingress,
maybe you have egress, maybe you have
something else.
You could use this
to
apply
all kinds of logic, like
IP restrictions or security or
role-based access control.
Really,
it's totally up to your imagination
how you use this stuff.
And that's
I suppose one of the benefits of
making your own control plane versus
getting something off the shelf.
I personally don't know too much
about what the off-the-shelf ones
provide. I know Istio is one,
but I don't know much
about it.
But anyway, now that we have this
external data shape,
we want to change the code to actually
use it.
So, we're going to have to go through
and edit all these usages where it's got
hardcoded stuff, and instead it's going
to
be based off
the
JSON file that we've created. So
we're probably going to
turn some of our existing code into
functions, and those functions will
accept some of the data that comes in
from that JSON file.
You can imagine in a production
scenario, you might have quite rich API
data that's fed into your control plane.
And so, you have to be a little bit
smart about how you do this.
If you've got a
thousand entries in your JSON
file, then you don't want your control
plane to be generating this stuff every
single time. So
you probably want to have versioning,
some kind of mechanism to determine
the version,
and use that to do
some sort of caching. I would strongly
suggest that.
In go-control-plane, which is made
by Lyft,
they provide a cache implementation
that would probably suit most people.
I'll probably go into that in the second
part of this series.
So we've got some functions now where
we're going to take in these services,
this abstraction that we've created
and we're going to
generate the route configuration and
the clusters from the service.
And then separately for any object that
is a listener type, we're going to
create some listeners out of that.
And this is just the tedious part of
making all of that work.
Could you do this with an LLM?
Yeah, you probably could.
But if you say had never done this
before,
it might be quite difficult for you to
prompt the exact thing that you want to
have happen. And that's kind of why I'm
making this video.
Because if you follow along
and then you try to customize this and
make it your own,
then you'll likely develop a really
great understanding of this stuff. And
then you'll develop your own taste
and you'll say stuff like, "Well,
I don't want this or I don't want that."
And that's essentially what
helps you to prompt an LLM.
So right now we are
fetching that data on every request that
we receive. That's probably something
that you would want to optimize away.
Maybe you want to add some concurrency
or
worker threads to your control plane.
Maybe you want to have some kind of
channel that sends this data
to the other threads every time it's
updated so they have a fresh copy
somehow.
It's really up to you.
And we're just going to compute all
these resources in probably the most
inefficient way possible.
Make sure all the names of the
resources line up. Another thing there is
that
your listener resource could
accidentally
tell the proxy to find the wrong
resource. You could have a listener
that contains a reference to a route
configuration, but the names are wrong.
So you have to make sure in your code
that these names are aligned somehow.
Anyway, that was a small
bit of troubleshooting
that we had to do here. I think the
JSON was in the wrong format.
Maybe some extra commas, maybe some
missing commas.
Just clean that up.
And now
I think that's looking good.
However,
we now have a little missing field
there. So we've got to go and add
the refresh delay to the
RDS config that we've added in here.
Just going to set that to 5 seconds.
And that should be that.
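A sketch of what the RDS reference might look like once the routes are externalized. This dict would sit under the HCM's `rds` key in place of the inline `route_config`. The helper name and the `control_plane` cluster default are assumptions. The 5 second refresh delay is the value mentioned above.

```python
def rds_source(
    route_config_name: str,
    control_plane_cluster: str = "control_plane",
    refresh_delay: str = "5s",
) -> dict:
    """Tell the HCM to fetch a named route config from the control plane."""
    return {
        # Must match the "name" of the route configuration that the
        # control plane serves, otherwise Envoy never finds its routes.
        "route_config_name": route_config_name,
        "config_source": {
            "resource_api_version": "V3",
            "api_config_source": {
                "api_type": "REST",
                "transport_api_version": "V3",
                "cluster_names": [control_plane_cluster],
                "refresh_delay": refresh_delay,  # polling interval
            },
        },
    }


if __name__ == "__main__":
    print(rds_source("example_routes"))
```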
We'll run the control plane again.
And now the proxy should not
complain anymore.
It says that clusters were updated.
However,
it's now saying that there is a missing
type URL.
It says missing type in Any, it's only
allowed, blah blah blah.
So yeah, the route config doesn't have a
type URL and it needs one. So
we're just going to go grab that from
the
documentation here. If I can find it.
There it is.
So, we'll add that in. We've got to type
out the
type.googleapis.com/envoy prefix,
which I've memorized.
And now we run it again. And I think
Envoy should be happy with that.
Oh, but it's not.
It says it's got an unexpected
character zero. Expected a double quote.
I think what's happened here is that
I've specified the route names
as integers.
So I probably need to fix that up. They
actually need to be strings.
I've used the index,
and that's why it's an integer.
But I might just do an f-string,
which is a format string, with that index
just to make sure it's definitely a
string.
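A sketch of both fixes together: the route configuration gets its own type URL, and each route name is built with an f-string so it serializes as a JSON string rather than a number. The `route-{index}` naming scheme and the shape of the `services` input are assumptions for illustration.

```python
ROUTE_CONFIG_TYPE_URL = (
    "type.googleapis.com/envoy.config.route.v3.RouteConfiguration"
)


def make_route_configuration(name: str, services: list) -> dict:
    """Build an RDS resource with one virtual host per service."""
    virtual_hosts = []
    for index, service in enumerate(services):
        virtual_hosts.append(
            {
                "name": f"{service['name']}_vh",
                "domains": service["domains"],
                "routes": [
                    {
                        # f"route-{index}" is a str; a bare index would be
                        # an int, which Envoy rejects for a string field.
                        "name": f"route-{index}",
                        "match": {"prefix": "/"},
                        "route": {"cluster": service["name"]},
                    }
                ],
            }
        )
    return {
        "@type": ROUTE_CONFIG_TYPE_URL,
        "name": name,
        "virtual_hosts": virtual_hosts,
    }


if __name__ == "__main__":
    demo = [{"name": "example", "domains": ["foo.bar.tld"]}]
    print(make_route_configuration("example_routes", demo))
```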
And now we've got them both running and
Envoy seems to be pretty happy with
that.
And we can see that it's added the
listener, it's added the clusters, it's
requested the routes. That's a good
sign.
So now I think we should be able to
make the same request that we did
before. And it's still going to work.
Hm.
Ah, it didn't work because we've
actually added in this domains list here. So
the virtual host has
a domain now
and it only accepts traffic on that
domain. So we have to specify that as a
host header. And there we go. We
actually get the response back this
time. So this allows us to demonstrate
the dynamic configuration. We can now
change this JSON file and the next time
the envoy proxy requests the resource
we should see
that
updated. The result of that should be
that
when I make the same request I don't get
a response back because it has the wrong
domain name now. But if we update it to
baz.bar.tld,
we can now see we're getting a response.
So it has loaded the new version of that
JSON file and the proxy has updated its
configuration.
So you can kind of get an idea
of how this is quite useful,
especially when you're using an HTTP
endpoint instead of a JSON file.
So
you might question, well,
how would that be much different from
say just having a cron job that updates
the envoy configuration on a schedule?
The reason why it's different is because
your control plane can act as an
aggregator where it hits many different
sources of information
and computes configuration by pulling
them all together. You could arguably do
that in a cron job.
You could publish
a configuration somewhere and have that
picked up by a cron job. It's really
about the same. I think you'd still want
to
use protocol buffers to validate
that you're actually generating valid
configuration.
And you might also be able to observe
certain things from the control plane.
For example,
you can look at
the
proxies' versions and you can try to
figure out: is there too much
variance in a particular cluster of
proxies? Do they all have the right
config? Is it up to date? You can't
really do that as much with a cron job,
although you could get pretty
creative. So, the next step here:
we want Envoy to keep polling, but we
don't want it to think that the config
is changing all the time. So, we're
going to introduce versioning to this.
And the easiest, most simple, basic way
I can think to add versioning is
probably just to increment a number
every time the data changes. And we know
the data has changed because
a string comparison or a
dictionary comparison
should tell us that it's not quite the
same as the old set of data. So
that's what we're going to do here.
We're actually going to create a loop
and we're going to run that loop in a
thread
and we're going to modify some global
variables that hold the data and the
version.
Um,
this is not how you would want to do it
in in production. Obviously, it's hard
to know whether that thread is still
running or if anything's gone wrong. And
uh you can't really run this uh in a
distributed nature. you'd have to just
have one machine and that won't scale.
But just for demonstration, we're going
to have this thing that is running in a
loop running in a thread and it's going
to mutate some global memory
inside the running Python process,
and the other parts of that
program will be able to access that
memory as it changes. So that's just
fine for this circumstance.
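A minimal sketch of that background loop, assuming the data lives in a JSON file on disk. The `poll_file` and `start_poller` names, the `STATE` dictionary, and the polling interval are all assumptions for illustration.

```python
import json
import threading
from pathlib import Path

# Hypothetical shared state that the request handlers would read from.
STATE = {"data": None, "version": 0}


def poll_file(path: Path, interval: float, stop: threading.Event) -> None:
    """Re-read the file in a loop and bump the version when its content changes."""
    while not stop.is_set():
        try:
            new_data = json.loads(Path(path).read_text())
        except (OSError, json.JSONDecodeError):
            new_data = STATE["data"]  # keep the last good data on a bad read
        if new_data != STATE["data"]:
            STATE["data"] = new_data
            STATE["version"] += 1
            print(f"new data detected, version {STATE['version']}")
        stop.wait(interval)  # sleeps, but wakes early if stop is set


def start_poller(path: Path, interval: float = 1.0):
    """Start the loop in a daemon thread; return the thread and its stop event."""
    stop = threading.Event()
    thread = threading.Thread(
        target=poll_file, args=(path, interval, stop), daemon=True
    )
    thread.start()
    return thread, stop
```

The stop event is only there so the loop can be shut down cleanly. A bare `while True` with `time.sleep` would demonstrate the same thing.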
So we're going to replace this
fetch and we're instead just going to
refer to the global data there.
And once we've got this set up, then we
are
going to be able to
change the data.
We'll have that running in the
background and it's going to
fetch that new data. We'll print
something out, probably just to show that
it has changed. And that sort of
decouples this.
It doesn't decouple all the computation,
but it at least decouples the
fetching part. And it would be a smart
idea
in a production scenario to decouple
both. You want to fetch the data,
compute the resources
as much as you can, and then provide
them when the discovery request comes
in.
At least in the polling architecture.
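One way to picture decoupling both steps: rebuild every resource list once, when the data changes, and let the request path do nothing but a lookup. This is only a sketch; the `CACHE` layout, the function names, and the shape of the input data are assumptions, not what the video uses.

```python
# Hypothetical cache that is filled when data changes and read on every request.
CACHE = {"version": 0, "resources": {}}


def build_resources(data: dict) -> dict:
    """Turn raw data into per-type resource lists (shapes are illustrative only)."""
    return {
        "clusters": [{"name": name} for name in data.get("clusters", [])],
        "listeners": [{"name": name} for name in data.get("listeners", [])],
    }


def on_data_changed(data: dict) -> None:
    """Do the expensive work once, when the data changes."""
    CACHE["resources"] = build_resources(data)
    CACHE["version"] += 1


def serve(resource_type: str):
    """Cheap path for a discovery request: no fetching, no computing."""
    return CACHE["version"], CACHE["resources"].get(resource_type, [])
```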
If you're doing something like delta
requests, then you'll have something
quite different
where you have to track the state
and the subscriptions on resources. And
it's much more complicated and
more difficult to set up.
So, we've got this um this version
number. We're going to increment that.
And what this allows us to do is when
the envoy proxy requests configuration,
it tells us what version it currently
has. And if that version is not
different from the version that the
control plane has, then there is no
reason to give it back a 200 response
because when you give it a 200 response,
it thinks that those resources have
changed and it might do things like
drain clusters or drain listeners or
reload route tables, and that work is
just wasted.
So, now that we've got that set up,
we're going to go ahead and change the
data file
that we have.
And we should see that the control
plane is picking up that new data. Yep,
we see that there: version two.
Now, when we run the proxy,
it's going to request the resources.
Then it's going to poll again
and it's going to get a different
response hopefully. Uh, not quite yet.
We are not comparing the
versions yet. So, we'll probably have to
go fix that.
So, what we want to do is we want to get
the version number from the proxy and
then do a comparison. And if the version
numbers are the same,
then the resources have not changed,
which means we give it back a different
response,
which essentially tells it to do
nothing.
So, we're going to grab that out of the
request body, which is
in JSON, which the proxy sends to us.
And then
we can neaten up this code a
little bit. Might do that in a second.
Basically,
we don't need to specify the same thing
twice.
Okay, there's a bit of duplication going
on here. So, we want to return the
version info and the resources at the
end.
So, we'll just figure out what the
resources are in those if statements
and then we'll do a little check of the
version.
Just a simple comparison.
And if it is the same, we're going to
actually give back a 304
response code
with no body. And that's going to tell
Envoy that nothing's changed.
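Stripped of the web framework, the check could be sketched as a plain function that returns a status code and an optional body. The function name and the tuple return shape are assumptions made to keep the example self-contained. `version_info` and `resources` are the field names from Envoy's discovery request and response.

```python
from typing import Optional, Tuple


def discovery_response(
    request_body: dict, version: str, resources: list
) -> Tuple[int, Optional[dict]]:
    """Return (status_code, body) for a polled discovery request."""
    # The proxy reports the version it already holds in version_info.
    if request_body.get("version_info") == version:
        return 304, None  # Not Modified: no body, the proxy keeps its config
    return 200, {"version_info": version, "resources": resources}
```

In a FastAPI handler the same decision would presumably turn into returning an empty response with status 304 instead of the JSON body.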
So, we can also clean up this if
statement while we're here. Yeah, we can
probably just chuck these
resources into a mapping.
It's still pretty janky, but
we're just doing an MVP here.
In a real situation, maybe you get this
data from a database or something
else. You probably want to cache the
resources every time the data changes and then
pull them out when you need
them. But in the meantime, we're just
going to make a hashmap, throw those
objects in there, and then
we're going to try to get the
resource type out of that hashmap.
And if it's not in there, then
we will hand back a 400 or something
like that.
So, yep, we have a key error. It means
whatever resource type was requested is
not in that map. So, we'll just get back
400 there.
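A sketch of that lookup is below. It assumes the mapping is keyed by the xDS type URL from the request; the video may key it differently (for example by URL path), and the resource contents here are placeholders.

```python
# Type URLs for the v3 xDS resource types used in this walkthrough.
CLUSTER = "type.googleapis.com/envoy.config.cluster.v3.Cluster"
LISTENER = "type.googleapis.com/envoy.config.listener.v3.Listener"
ROUTE = "type.googleapis.com/envoy.config.route.v3.RouteConfiguration"

# Hypothetical mapping from resource type to the resources we would serve.
RESOURCES_BY_TYPE = {
    CLUSTER: [{"name": "example_cluster"}],
    LISTENER: [{"name": "example_listener"}],
    ROUTE: [{"name": "example_routes"}],
}


def resources_for(type_url: str):
    """Return (status_code, resources); an unknown type becomes a 400."""
    try:
        return 200, RESOURCES_BY_TYPE[type_url]
    except KeyError:
        return 400, []
```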
And then we're going to try to run this
again.
We should see it request the resources.
And then
on the second time... oh, okay,
there's something wrong here.
The version info wasn't sent in the
discovery request. That's kind of
unusual.
It shouldn't be missing. I think it's
required.
But anyway, we can probably just handle
that in the Python code:
if it's None, then we'll set it to
something
like a sentinel value.
And that should probably get us going.
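A minimal sketch of the sentinel, with a made-up value and function name. One likely explanation for the missing key is that the first request from a proxy carries an empty version, and JSON encodings of protobuf messages commonly leave empty fields out.

```python
# Hypothetical sentinel; any value the control plane never uses as a version works.
NO_VERSION = "no-version-yet"


def proxy_version(request_body: dict) -> str:
    """Return the proxy's reported version, or the sentinel if it sent none."""
    # `or` covers both a missing key (None) and an empty string.
    return request_body.get("version_info") or NO_VERSION
```

Because the sentinel never equals a real version, the comparison fails on the first request and the proxy gets a full response.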
So, we should see it request the resources.
Then it's going to request them again.
But because the version hasn't changed,
we'll see a different status code here.
And we see there is a 304.
And it's going to continue to be a 304
until the data changes, which is great.
It means that
when the proxy is reaching out to us,
we're not transferring stuff
unnecessarily. We're not telling it that
it has new resources. It's not reloading
stuff. It's not logging a bunch of stuff
for no reason. So those are all pretty
good positives.
So now when we change this data, we
should see those status codes
change to 200s. And we see we got new
data detected and Envoy's now fetched
the new resources. It's logged that it
fetched and updated some stuff, but then
it just goes straight back to
"stuff hasn't changed",
which is
neat.
So, up until this point, maybe
you've been
thinking,
well, at least I would be thinking, that
I'm not really comfortable
messing about with a lot of dictionaries
and
the lack of typing
really puts me off. How do I know I'm
not going to give something to Envoy
that is going to be
the wrong configuration? I'd have to
have extensive tests, which would be
good either way,
but I want to be able to catch it in my
editor if I can. That would be ideal.
When I specify a particular field for a
route configuration, I want to know that
I've specified the correct field. I can
look at the documentation,
but there are certain things that maybe
are not as clear
as to what exactly you can place
in those objects.
So,
because Envoy is built around protocol
buffers for its configuration,
a lot of that stuff is pretty easy,
because if you compile the protocol
buffers in your language, it gives you,
hopefully, something that you can type
check and validate, depending on the
library that you use to compile the
protocol buffers.
I personally have made my own library
called envoy_data_plane, and that uses
betterproto2 as a dependency. So we're
just going to add that to the project
and then we can start the
somewhat tedious process
of replacing all of these. I've sped up
this footage because otherwise it would
take forever.
But I'm just going to go through
and
replace all of these Python dictionaries
that are untyped and slightly dangerous
with
these compiled protobuf objects.
And along the way, my language server in
Neovim is going to tell me when I've put
something wrong,
and it's going to tell me what type it
needs to be. And then I can type out
the name of that type and then
I can automatically import it,
which is all pretty standard stuff these
days,
but it's the kind of thing that
makes it much easier to write Envoy
configuration. Even if it's dynamic
stuff that's sort of an abstraction,
you want to have
maybe some parts of the configuration
abstracted away, but the rest of it
you've got to type out. So, you've
got to make sure that that's
valid and checked and all that kind
of good stuff.
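To show the general benefit without guessing at the real library's API, here is a sketch that uses a standard-library dataclass as a stand-in for a compiled protobuf type. The `Cluster` class below is a simplified illustration, not the envoy_data_plane class, and its fields are only loosely modelled on Envoy's cluster message.

```python
from dataclasses import asdict, dataclass


@dataclass
class Cluster:
    """Simplified stand-in for a typed cluster message (not the real Envoy type)."""

    name: str
    connect_timeout: str  # for example "5s"


# A plain dict accepts a misspelled key without complaint.
untyped_cluster = {"name": "backend", "conect_timeout": "5s"}

# The typed version rejects the same typo as soon as the object is built,
# and an editor with a type checker can flag it before the code runs.
try:
    Cluster(name="backend", conect_timeout="5s")
    typo_caught = False
except TypeError:
    typo_caught = True

typed_cluster = Cluster(name="backend", connect_timeout="5s")
```

With the dictionary, the typo would only surface when Envoy rejects or ignores the field. With the typed object it surfaces in the editor or at construction time.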
So, I'm going through here, replacing
the listener,
making sure everything's fine.
And now when we rerun all of this, we
can see it's doing the same as before. And
when we change the
data,
it should be changing,
but it is not. That's kind of weird.
Puzzling.
Let's maybe try again.
See if we can
change something else. Okay, it picked
up that change, but maybe not the route
change that I made. That's kind of odd.
Let's try hitting the proxy again.
Put the correct host name. Okay.
Huh. That's weird.
Well, it seems to be proxying. It's
hitting the server, giving back a
response.
So, now we have these typed objects. That
allows us to write this configuration
way easier. It makes sure that everything's
valid.
Well, to an extent. It might not catch
everything, but that's okay.
So, from this point,
I guess I can talk about what I'm going
to do next, what's going to be part two.
I sort of alluded to it
at different stages, but essentially
part two will be me doing pretty much
the same as this, but it's going to be a
lot shorter because I'm going to use the
Go control plane,
which is the sort of framework made by
Lyft for this purpose. And you'll see
it's quite a bit easier because they've
already figured out
all of the stuff for caching and
versioning, and
you've got kind of first-class
support for protobuf. You don't have
to use my library, for example.
And it should be quite a bit easier.
And with that part as well, we will be
able to use gRPC
and possibly delta discovery requests
and that's going to make your control
plane let's say production grade. It's
more scalable.
You may need to attach a database to
do the delta discovery requests. I'm not
too sure, but I'll explore that when I
record the video.
So, that's about it.
If you made it this far,
congratulations. If you followed along
and created it while watching, you're
a legend.
I hope it was valuable and see you
later.
This tutorial series provides a hands-on introduction to building a control plane for Envoy Proxy from scratch. The video demonstrates how to set up a basic Python-based control plane using FastAPI that delivers configuration resources—such as clusters, listeners, and route configurations—to an Envoy instance. The instructor explains the polling mechanism Envoy uses to retrieve configurations, introduces versioning to avoid unnecessary reloads, and highlights the transition from untyped JSON dictionaries to typed Protocol Buffer objects for more robust configuration management. The goal is to provide a foundational understanding of how a control plane functions before moving into more advanced, production-ready approaches in subsequent videos.