Infra that fixes itself, thanks to coding agents

Infra that fixes itself, thanks to coding agents — Mahmoud Abdelwahab, Railway

Transcript

0:00

your app's infrastructure should fix

0:01

itself. Let me show you. So, right now

0:03

I'm on the Railway dashboard and I have a

0:05

bunch of services that are deployed and

0:07

all of these services have one thing in

0:09

common. They all have bugs and problems.

0:11

So, for example, this service has a

0:13

memory leak. If I click on it, go to

0:16

metrics, we can just see memory utilization

0:19

keeps growing high and very quickly.

0:22

This is just a sign of a memory leak and

0:24

pretty sure the service would eventually

0:26

crash. If I look over at the amount of

0:29

requests, we have a high number of 500s.

0:32

So, the server is failing to respond. We

0:35

have a high request error rate of 94%.

0:39

And we also have an extremely high

0:41

response time of like multiple seconds

0:44

uh for like the service to respond,

0:45

which is not ideal. Like if this was a

0:48

service running in production,

0:50

everything would be on fire. You'd be

0:51

getting paged and you'd just try to bring

0:54

the service back up quickly. But

0:57

the thing is not all problems are this

1:00

obvious. For example, this service all

1:03

it does is query a Postgres

1:04

database. And if we go to metrics, we'll

1:07

just see that well CPU utilization seems

1:10

fine. Memory usage is also fine. Sure,

1:13

it's a bit spiky, but okay, whatever. We

1:16

have some fails, but okay, nothing too

1:20

alarming. Request error rate is somewhat

1:23

high. So that also should make us kind

1:26

of like want to investigate, but the

1:28

response time is extremely high. The

1:31

thing is this is because the service

1:33

makes queries that are super slow. And

1:36

the thing is if you're an end user

1:38

that's trying to use this experience,

1:40

you would just suffer. You would need

1:41

like 30 seconds for like a page to load,

1:44

which would be a nightmare. So the thing

1:47

is when you deploy your app to

1:49

production maybe some you know bugs or

1:52

issues make their way to production

1:54

things happen and kind of like the

1:57

typical way of dealing with these things

1:58

is maybe you set up a bunch of

2:00

thresholds and when these thresholds are

2:03

met for let's say CPU or memory utilization

2:07

maybe uh you want to have a threshold

2:10

for the request error rate it shouldn't

2:13

exceed a certain amount well what will

2:15

happen is you're going to get alerted

2:17

and you'll be aware that there is an

2:18

issue, but you still have to do the

2:20

investigation yourself. You have to dig

2:22

through logs, metrics, and traces to try

2:25

to paint a picture in your head and try

2:27

to piece things together so that you can

2:29

ship a fix. Now, what I'm proposing is

2:33

you should have a coding agent that

2:34

monitors the state of your project and

2:38

your application's infrastructure. And

2:40

if any issue is detected, so you know

2:43

any of the thresholds we define are met,

2:46

we should just have a fix shipped,

2:48

right? So like instead of, you know,

2:51

getting the alert and investigating, you

2:53

just review a pull request and you're

2:55

like, uh, looks good to me. You ship it

2:58

and then everything is good and crisis

3:00

averted. So today I'm going to show you

3:02

what I have in terms of demo that kind

3:06

of paints a picture of how this could be

3:09

achieved. So at a high level I want to

3:11

have a series of workflows that will

3:13

kick in that will help me go from issue

3:16

detected on Railway, my deployment

3:18

provider to a pull request being open in

3:22

my GitHub repo. And this is what I have

3:24

in mind. Uh the first workflow that I

3:26

want to have is a workflow that runs on

3:28

a schedule. So let's say it runs every

3:30

10 minutes, 15 minutes, 30 minutes. And

3:32

what this workflow will do is one, it

3:35

should fetch the application's

3:37

architecture. We should have an

3:39

understanding of what services are

3:41

deployed, which you know like frontends,

3:43

backends, crons, queues are live in my

3:46

project. And I then want to fetch each

3:50

service's resource metrics, so CPU and

3:52

memory utilization. And I also want to

3:55

fetch each service's HTTP metrics. I want

3:57

to see the request error rate, the

4:00

number of failed requests for, you know,

4:02

500 and 400 errors. And once that's done, I

4:07

will want to then see which services

4:10

have exceeded which thresholds. And then

4:12

I just want to return a list of the

4:14

affected services. So this would be

4:17

essentially the goal. Now you might be

4:19

wondering, well, why not make this an

4:21

alert-based system? So maybe we configure

4:24

something like webhooks for alerts and

4:26

then that would kick off uh essentially

4:28

this workflow instead. I would argue

4:31

that it's probably better to be able to

4:33

analyze a slice of time rather than just

4:36

having a threshold being met because it

4:39

can get pretty noisy. Like imagine you

4:41

have a spiky workload uh and you know

4:44

you reach the 80% resource utilization

4:47

for like your CPU but things are still

4:50

fine and that's not like in my mind this

4:53

is enough to investigate, but it might

4:57

not like it might mean that there just

4:59

aren't issues when we try to look at

5:02

like the bigger picture and all the

5:04

details.
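
To make the "analyze a slice of time" idea a bit more concrete, here's a minimal sketch in Python. It only flags a service when a metric stays above its limit for a large enough share of the window, so one spike doesn't count. The metric names, the limits, and the sample data are all made up for illustration; they aren't from the demo project.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str              # hypothetical metric name, e.g. "cpu_pct"
    limit: float             # a sample above this value counts as a breach
    min_breach_ratio: float  # share of the window that has to be in breach


# Hypothetical thresholds; tune these for your own workload.
THRESHOLDS = [
    Threshold("cpu_pct", 80.0, 0.5),
    Threshold("memory_pct", 80.0, 0.5),
    Threshold("error_rate_pct", 5.0, 0.3),
]


def breached(samples: list[float], threshold: Threshold) -> bool:
    """Return True when enough of the window sits above the limit."""
    if not samples:
        return False
    over = sum(1 for value in samples if value > threshold.limit)
    return over / len(samples) >= threshold.min_breach_ratio


def affected_services(window: dict[str, dict[str, list[float]]]) -> dict[str, list[str]]:
    """Map each affected service to the list of metrics it breached."""
    result: dict[str, list[str]] = {}
    for service, metrics in window.items():
        hits = [t.metric for t in THRESHOLDS if breached(metrics.get(t.metric, []), t)]
        if hits:
            result[service] = hits
    return result


# Made-up samples: one spiky but healthy service, one that is clearly in trouble.
window = {
    "spiky-api": {
        "cpu_pct": [20, 85, 30, 25, 90, 22],
        "error_rate_pct": [0, 1, 0, 0, 2, 0],
    },
    "leaky-api": {
        "memory_pct": [70, 82, 88, 93, 97, 99],
        "error_rate_pct": [40, 80, 94, 94, 95, 94],
    },
}

flagged = affected_services(window)
print(flagged)
```

The spiky service goes over 80% CPU twice, but that's only a third of the window, so it's left alone. The other one gets returned with the metrics it breached.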

5:06

Now once we have this list of impacted

5:09

services, we essentially

5:11

want to pull in even more context for

5:13

them. So like at a high level we want to

5:15

see project health all the services is

5:17

everything operating as expected. Oh we

5:20

have this thing that we're suspicious

5:21

about. Let's actually pull all of you

5:24

know additional context for the service

5:26

because imagine again you have like high

5:28

resource utilization. Maybe you're just

5:30

successful. You have high usage. Uh but

5:32

then when you pull the logs it's like oh

5:33

everything seems fine. there aren't any

5:36

errors. Well, you're good. And you can

5:39

imagine that we can even pull even more

5:41

context. Like imagine maybe we scan the

5:44

code in the repo and based on that we

5:46

infer the upstream providers that the

5:49

repo relies on and then we can

5:51

automatically check the status pages of

5:53

these services. Imagine like a payment

5:54

processor goes down. Well, that's kind

5:57

of how you can know and then the coding

6:00

agent will be able to maybe tell you

6:02

like, hey, you should just like wait out

6:04

this issue.
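
As a rough illustration of why the extra context matters, here's a small Python sketch of the decision being described: busy but clean logs means you're fine, errors during an upstream outage means wait it out, and errors without an outage go to the coding agent. The `ServiceContext` fields and the three labels are assumptions made up for this sketch, not something from the demo code.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceContext:
    """Hypothetical bundle of context pulled for one suspicious service."""
    name: str
    high_utilization: bool
    error_log_count: int
    upstream_outages: list[str] = field(default_factory=list)


def triage(ctx: ServiceContext) -> str:
    """Pick a next action from the pulled context."""
    if ctx.error_log_count == 0:
        # High usage with clean logs: probably just a successful, busy service.
        return "healthy"
    if ctx.upstream_outages:
        # Errors while a provider we depend on reports an outage: wait it out.
        return "wait_out"
    # Errors and nothing upstream to blame: hand it to the coding agent.
    return "needs_fix"


busy = ServiceContext("web", high_utilization=True, error_log_count=0)
payments = ServiceContext("checkout", False, 120, ["payment processor"])
leaky = ServiceContext("api", True, 340)

for ctx in (busy, payments, leaky):
    print(ctx.name, triage(ctx))
```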

6:06

And once we have all this information,

6:07

we can just write a detailed plan. So like

6:09

we can look at, oh, we have a high

6:12

number of 500 requests. We see that we

6:16

have very high resource utilization for

6:19

memory. and we see that we have you know

6:23

um just errors specifying that a

6:26

specific endpoint is failing. Well, this

6:29

is enough information that we can write

6:31

a detailed plan of hey this is my

6:33

application's architecture. These are

6:35

the affected services. We just then give

6:37

this plan to an agent and then the agent

6:40

will just follow the process of hey let

6:42

me clone this repo. I'll just create a

6:44

to-do list based on the plan you gave

6:45

me. I'll implement all the fixes and

6:48

I'll just create a pull request. And

6:50

this is kind of how we go from issue

6:52

detected to an open pull request. So

6:55

let's actually see this in practice. So

6:57

because we have the idea of workflows,

7:00

what I want to do is actually use what

7:02

is known as durable execution. So the

7:05

idea of durable workflows has been

7:06

around for a while and it's really one

7:09

of my favorite abstractions because it

7:11

can help you simplify complex logic

7:13

while making it more reliable. So for

7:16

example here we have this workflow. So

7:19

this actually is Inngest, but there are

7:21

lots of solutions out there that pretty

7:22

much do the same thing and we have this

7:25

function that's, you know, called process

7:27

video upload. It listens on an event of

7:30

video uploaded and we essentially want

7:33

to do three things. We first want to

7:35

generate a transcript and we do this by

7:37

making an API call to a third party API.
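
The workflow being walked through here (transcribe, summarize, store, with failed steps retried and successful results cached) can be sketched without any particular library. This is a toy stand-in in plain Python, not Inngest's API: the `DurableRun` class, the step names, and the fake database flag are all assumptions for illustration, and a real engine would keep the cached results in persistent storage.

```python
import time
from typing import Any, Callable


class DurableRun:
    """Toy durable-execution runner: step results are memoized by step name."""

    def __init__(self, max_attempts: int = 3, base_delay: float = 0.0) -> None:
        self.results: dict[str, Any] = {}  # a real engine would persist this
        self.max_attempts = max_attempts
        self.base_delay = base_delay

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        if name in self.results:
            return self.results[name]  # already succeeded, don't redo the work
        for attempt in range(1, self.max_attempts + 1):
            try:
                self.results[name] = fn()
                return self.results[name]
            except Exception:
                if attempt == self.max_attempts:
                    raise
                # Exponential backoff between attempts.
                time.sleep(self.base_delay * 2 ** (attempt - 1))


calls = {"transcribe": 0, "summarize": 0, "store": 0}
db = {"up": False}  # pretend the database is down at first


def transcribe(video_id: str) -> str:
    calls["transcribe"] += 1
    return f"transcript of {video_id}"


def summarize(transcript: str) -> str:
    calls["summarize"] += 1
    return f"summary of ({transcript})"


def store(transcript: str, summary: str) -> str:
    calls["store"] += 1
    if not db["up"]:
        raise ConnectionError("database unavailable")
    return "stored"


def process_video_upload(run: DurableRun, video_id: str) -> str:
    transcript = run.step("transcribe", lambda: transcribe(video_id))
    summary = run.step("summarize", lambda: summarize(transcript))
    return run.step("store", lambda: store(transcript, summary))


run = DurableRun(max_attempts=2)
try:
    process_video_upload(run, "vid_1")  # the store step fails on both attempts
except ConnectionError:
    pass

db["up"] = True
result = process_video_upload(run, "vid_1")  # resumes at the store step
print(result, calls)
```

When the workflow is run again, the transcribe and summarize steps aren't called a second time. Only the store step runs, which is the "continue where we left off" behavior.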

7:40

Once we get that transcript, we want to

7:42

generate a summary by also making a

7:45

request to an LLM provider. And once we

7:47

have the transcript and the summary, we

7:48

want to store them in the database. The

7:50

thing is all of these steps, they are

7:55

not 100% guaranteed to work. Uh they are

7:58

prone to failure. And what's neat about

8:00

this pattern is by default, these steps

8:03

will be automatically retried. You don't

8:05

even have to think about it. But if you

8:07

let's say want to modify this behavior,

8:09

maybe you want the retry to happen uh

8:12

like on a certain schedule like you know

8:14

exponential backoff uh maybe you want

8:16

to define another thing that should

8:20

happen in the case of failure you'll be

8:22

able to do it. But what's neat is each

8:25

step when it succeeds uh the result is

8:28

cached. So if for example we are able to

8:31

transcribe the video correctly, we

8:32

summarize the transcript correctly, but

8:34

we failed to write to the database. If

8:36

we were to retry this workflow, we just

8:39

continue where we left off. Uh we don't

8:41

we won't really repeat any work, which

8:43

is one awesome because it's faster, but

8:45

also it's more cost effective. So at a

8:47

high level, this is the thing that I'll

8:49

be relying on in my code because I'll be

8:51

making API calls to the Railway API to

8:54

be able to fetch the project

8:55

architecture, all the resource metrics

8:58

um as well as you know the HTTP metrics

9:00

and whatnot. So yeah, uh this is kind of

9:04

like the first thing that um we need to

9:07

talk about. The second thing is the

9:09

coding agent. And for the coding agent,

9:12

I'll be using OpenCode. OpenCode is an

9:15

AI agent that's built for the terminal.

9:16

You can think of it as an alternative to

9:18

something like Claude Code, but the main

9:20

difference is OpenCode is fully open

9:22

source and you can choose any LLM

9:25

provider or uh you know model that you

9:28

like, which is pretty nice. Uh you have

9:30

this nice terminal UI, but honestly

9:33

what's so cool about the project is how

9:35

it's architected. So if you go to their

9:37

docs, they actually have a server

9:40

implementation. You can have a

9:44

headless server that runs that exposes

9:47

an API for you to essentially interact

9:49

with an agent. So the way it works is

9:52

when you run the command opencode,

9:54

which is what starts up the agent in

9:56

your terminal, it doesn't just run a

9:58

single app. It actually starts a

9:59

terminal UI and a server. And because

10:02

the terminal UI here is the client, we

10:05

can essentially bring our own client and

10:07

talk to the server, which is awesome. uh

10:09

because now we can run OpenCode on a

10:12

server, which in this case would be on Railway,

10:15

and we can just have this server have

10:18

all the tools that the agent would need.

10:20

So we'd install all of the necessary you

10:22

know tools we can configure git and then

10:25

the agent will be able to open pull

10:27

requests and you know go through the

10:30

file system and do everything. Let me

10:32

show you how easy it is to

10:34

essentially have this deployed on

10:35

Railway. So if you go to the code uh here

10:38

right now this is my project it's called

10:40

railway autofix I know great name uh I

10:43

have essentially two directories one is

10:45

for my API, the other one is for

10:48

OpenCode. And for OpenCode, really we just have a

10:51

single server running using Bun and all

10:54

we're doing is we're just calling a

10:57

function uh that is called

11:00

createOpencodeServer. So if I actually stop this

11:02

here you can see it runs on port

11:06

4096, and this is pretty much all we need

11:11

and I have a Dockerfile and in this

11:13

Dockerfile we're essentially defining

11:15

that environment. So, we're installing a

11:17

bunch of tools. You can see we're

11:18

installing curl, jq, bash, all the other

11:20

tools, even git. Uh, we're installing

11:22

the GitHub CLI, which is what will allow

11:25

us to open pull requests against a given

11:27

repo. We're then installing OpenCode in

11:30

the environment. We're configuring git

11:32

and at the end, we're just exposing the

11:35

port and we're just authenticating the

11:37

GitHub CLI, which is pretty neat. Uh, by

11:39

the way, the code will be linked

11:41

somewhere down below. But that's really

11:43

it for OpenCode. And when it comes to

11:46

the actual API, let me actually run it.

11:49

So now this is the OpenCode server

11:52

that's running. And if I go here, I have

11:55

my actual API running on localhost:3000.

11:58

And I have a UI that is provided by

12:01

Inngest, which is very useful for

12:03

debugging. So if I go here and I go to

12:07

functions, essentially each function

12:09

here is a workflow and it has a bunch of

12:12

steps. So let's actually try to run it

12:14

to see what happens. Um now in

12:17

production when this is live this

12:20

monitor project health workflow should

12:22

run on a schedule and if an issue is

12:25

detected we will call the pull service

12:26

context and then pull service context

12:30

will call the workflow for generating a

12:33

fix. So if we actually just kick things

12:34

off this is how the flow of things will

12:38

happen. So if I actually have now I have

12:40

this function run. We called monitor

12:42

project health. Then we called pull

12:44

service context and now we're actually

12:45

calling generate fix because we detected

12:47

an issue and we're just setting um like

12:51

the Railway-specific variables as

12:54

environment variables. And all of these

12:55

are actually available uh on Railway.

12:58

They're just set automatically which is

12:59

pretty neat. So if I actually go to

13:01

monitor project health you'll see we

13:03

have a bunch of steps. Uh the first one

13:06

is getting the project architecture and

13:08

this step right here this is we can

13:10

actually see its output. So we can see

13:12

all of the databases that I have in my

13:14

project. I just have one. Uh we can see

13:17

also a list of all the services as well

13:20

as their configuration. We can see which

13:23

like where's the repo for them and we

13:26

just now have a high-level overview of our

13:30

application's infrastructure. Uh we also

13:32

see that we have any kind of like

13:33

volumes that are there which is cool.

13:36

And then we have a series of steps that

13:38

are actually running in parallel. So

13:40

like you know things are efficient. So

13:42

we're getting the database resources. We

13:44

can see on average well what's the max

13:46

CPU? Uh and it's like 0.9 CPU. Okay.

13:50

Same thing for memory. And we actually

13:52

have a summary. And this summary

13:54

essentially is us formatting these

13:56

results so that we can then pass it to

13:58

the coding agent. So you can see CPU

13:59

usage average 0.93 vCPU

14:03

and you know this is the max and memory

14:05

usage as well. Now this is actually high

14:09

uh and we'll be able to kind of

14:11

understand that because it's like oh

14:13

memory usage here is 31.96 GB out of a

14:17

max which is 32 gigs. Uh and then we

14:19

just pull even more um like resources.

14:22

So like because we have multiple

14:24

services we will call each step for it.
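
Here's a small Python sketch of that shape: pull metrics for every service concurrently, then format each result into a one-line summary that could be passed to the coding agent. `fetch_metrics` is a hypothetical stand-in for a real metrics API call, and the numbers are invented apart from the 0.93 vCPU and 31.96 GB out of 32 GB figures shown in the demo.

```python
import asyncio

# Hypothetical canned responses; a real version would query the provider's API.
FAKE_METRICS = {
    "database": {"cpu_avg": 0.93, "cpu_max": 0.98, "mem_max_gb": 31.96, "mem_limit_gb": 32.0},
    "api": {"cpu_avg": 0.12, "cpu_max": 0.40, "mem_max_gb": 0.50, "mem_limit_gb": 8.0},
}


async def fetch_metrics(service: str) -> dict[str, float]:
    """Stand-in for one network call per service."""
    await asyncio.sleep(0)  # yield control, like a real I/O call would
    return FAKE_METRICS[service]


def summarize(service: str, m: dict[str, float]) -> str:
    """Format raw numbers into one line of text for the agent."""
    mem_pct = m["mem_max_gb"] / m["mem_limit_gb"] * 100
    return (
        f"{service}: CPU avg {m['cpu_avg']:.2f} vCPU (max {m['cpu_max']:.2f}); "
        f"memory max {m['mem_max_gb']:.2f} GB of {m['mem_limit_gb']:.0f} GB ({mem_pct:.1f}%)"
    )


async def collect(services: list[str]) -> list[str]:
    # gather() runs the fetches concurrently and keeps results in input order.
    metrics = await asyncio.gather(*(fetch_metrics(s) for s in services))
    return [summarize(s, m) for s, m in zip(services, metrics)]


summaries = asyncio.run(collect(["database", "api"]))
for line in summaries:
    print(line)
```

Putting the limit next to the usage in the summary is what lets a reader, or a model, see that 31.96 GB is high here.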

14:27

Right? So like we will pull the HTTP

14:28

metrics for each of the three services

14:31

that we have deployed for example. But

14:33

also for this one for the HTTP metrics

14:35

we can see the error rate percentage for

14:37

400s for 500s. We see like the latency

14:42

um and we just have like a status count.

14:44

So we can also have a summary and then

14:46

we can say, hey, this is the

14:50

um, like, request error rate. These

14:52

are the latencies and this way when we

14:56

actually at the end of like this

14:58

workflow so I go to runs go here again

15:01

towards the end we will actually give

15:05

this uh pull service context function

15:08

just all of this information in a nicely

15:11

formatted way. So if I actually go now

15:13

to this function run, we will see here

15:16

that we're fetching the HTTP logs, the

15:19

build logs, the deployment logs for like

15:21

all the services that are affected. And

15:24

we can see here like this is the

15:25

function payload. Uh so this is the

15:28

stuff that we passed from the other

15:29

function. And we can see we just have

15:32

all this info. We also have an

15:34

architecture summary. So this actually

15:36

we can expand this. Uh the architecture

15:38

summary is just a nicely formatted uh

15:41

text saying like this is the project

15:42

architecture. We have three services we

15:46

are running in the production

15:47

environment. We have one database. We

15:49

have all these volumes and we just have

15:51

all of this information. It's just

15:53

harder to read 'cause it's, like, in one line, but

15:55

for the um coding agent we'll just give

15:58

it to it as like markdown. So now that

16:02

we have that go to runs again. Now that

16:05

we have that, we are just going to make

16:07

a call to another workflow which is

16:09

generate fix. And for this one, what it

16:12

does is one, it will analyze with AI. So

16:16

this is the actual output in terms of

16:18

like the input. It's a bit large to

16:21

render here. Uh but we analyze it with

16:23

AI. So like you can imagine we give a

16:26

large language model saying like hey

16:28

this is my project architecture. This is

16:30

the data. This is how things are

16:32

performing. And then we take all of this

16:35

information and now we actually come up

16:38

with a plan. So you can see here

16:40

debugging steps. We want to see

16:42

reproduce locally with the same load.

16:44

Maybe we want to run it. We want to see

16:46

what will happen if we see that the

16:48

agent is like oh I ran into an error.

16:50

Then it's going to fix it. And then we

16:53

have like recommendations. So like this

16:55

is the plan that we'll then just pass to

16:57

our coding agent. And then we have a

16:59

step to create a session. So on the

17:02

coding agent you can imagine each

17:04

session being its own chat. So this will

17:07

run like imagine you have multiple repos

17:09

each repo will have its own session. The

17:11

coding agent will work and then at the

17:14

end it should you know if as expected it

17:17

should open a pull request. So yeah

17:20

that's pretty much it. This is how it

17:22

works. Now if everything works as

17:24

expected we should see a pull request on

17:27

the project. And here we go. We have a

17:30

pull request that is open with all of

17:32

our changes. If we go to the

17:34

conversation, we'll actually be able to see

17:36

that we have a summary of all the

17:37

changes, uh, an analysis summary, the

17:40

root causes, what was fixed. So, we

17:43

should be able to just review this. If

17:46

everything looks good, we merge and

17:47

we're good to go. And that's it. I hope

17:49

you enjoyed this talk as much as I

17:51

enjoyed making it. If you have any

17:52

questions, feel free to reach out to me

17:54

on X or Twitter. This is where I mostly

17:56

hang out. Also, the repo for this

17:59

project will be available somewhere down

18:01

below. So, make sure to check it out.

18:03

And with that, thank you so much for

18:05

watching and I'll see you in the next

Interactive Summary

The video demonstrates an automated infrastructure self-healing system that detects service issues and automatically generates pull requests to fix them. By utilizing durable execution workflows, the system monitors metrics, analyzes logs and project architecture, and employs an AI coding agent to identify root causes, draft detailed plans, and implement solutions, thereby reducing the manual investigation burden for developers.