Compilers in the Age of LLMs — Yusuf Olokoba, Muna

Transcript

0:00

If you're an AI engineer right now, your

0:02

day-to-day probably looks something like

0:05

this. You've got an OpenAI client in your

0:07

codebase. You've got a few Hugging Face

0:10

tabs open. You've got three different

0:12

repos with the word playground in them.

0:15

And you've got at least one agentic

0:17

workflow that's really just stringing

0:18

together a bunch of HTTP calls. Right

0:21

now, everyone is talking about voice

0:23

agents, MCP, and these are pretty cool

0:26

technologies, but when you peel back the

0:28

hype a little bit, what I hear when I

0:31

talk to a lot of engineering teams is

0:32

that they're usually grappling with much

0:35

more fundamental and boring problems.

0:38

How do I use more models in more places

0:41

without having to rebuild or extend my

0:43

infrastructure every single time?

0:46

So say you want to go try out a new open

0:49

source model that just dropped on

0:51

Hugging Face today. That usually means

0:53

you've got to go write a Dockerfile, spin

0:55

up a Docker container, and then get that

0:57

running on infrastructure that you own

1:00

or that you rent from a third party

1:02

provider. And if you're wiring this into

1:05

an AI agent, well, that's another tool

1:07

that you have to put into the context

1:09

and perhaps expose, either through MCP or

1:12

something similar. A lot of this is just

1:15

complexity that creeps in and only grows

1:17

further the more time you spend. What

1:20

developers actually want is something

1:22

way simpler. Just give me an OpenAI-style

1:25

client that just works. Let me point it

1:28

to any model at all. It doesn't matter

1:30

if it's running locally, if it's running

1:32

remotely, if it's llama.cpp or TensorRT.

1:35

I just want something that works

1:37

with minimal code changes. In this talk,

1:41

I'll walk you through how we decided to

1:43

build a compiler for Python that enables

1:47

developers to write simple plain Python

1:49

code and then convert that into a tiny

1:52

self-contained binary that can then run

1:55

anywhere at all. It could be the cloud,

1:58

it could be Apple silicon, it could be

1:59

anything else in between. Further, I'll

2:02

show you how we use LLMs within that

2:03

compiler pipeline: a few things we

2:06

tried, what worked, what didn't work,

2:08

how we fenced them with verification and

2:10

LLM-powered testing, and how this

2:13

infrastructure gives us the ability to

2:15

not just run any AI model at all, but we

2:18

can now run it in so many more places

2:20

beyond just server side.

2:22

So before we start getting our hands

2:24

dirty with an example, I wanted to

2:26

provide some motivation on why we

2:28

thought building a Python compiler was

2:30

the best way to solve AI deployment in

2:33

the long run.

2:34

First, we needed an extremely simple and

2:37

standardized way for developers to bring

2:40

their AI models, whether the ones that

2:42

they've built internally or models that

2:44

they found open source on Hugging Face or on

2:46

GitHub, and then get something that they

2:49

could execute very easily in their

2:52

codebase.

2:54

So, when a new OpenAI model comes out,

2:56

for example, all you have to do is

2:58

simply just change the model argument

3:00

pointing it to the new model that OpenAI

3:02

just dropped. We wanted to recreate

3:04

something that tracked this experience

3:06

as closely as possible. Conceptually,

3:10

this would have to look like something

3:11

that ingested code, Python inference

3:13

code, and then spat out some other thing

3:16

that knew how to get executed in our

3:18

users' execution

3:20

environments. Second, we wanted to

3:23

prepare for what we strongly believe to

3:25

be the future of AI deployment, hybrid

3:28

inference. We expect that in the future

3:31

we will see smaller models typically

3:34

much closer to users either locally on

3:36

their devices or in edge locations

3:38

working in tandem with cloud AI models

3:41

that are much larger and have greater

3:43

reasoning abilities and we expect that

3:46

this is going to be the future of how a

3:47

lot of people consume AI in their

3:49

day-to-day lives. As such, this means

3:51

that developers have to move away from,

3:53

you know, the cages of Python code

3:56

and Docker containers into something

3:58

that is a lot more low-level, closer to

4:01

the hardware, and a lot more responsive.

4:05

So, let's get our hands dirty. This is a

4:08

Python function that runs Google's

4:11

EmbeddingGemma 270-million-parameter

4:14

model. It's a very simple text embedding

4:16

model that takes in a list of

4:19

sentences, just plain text, and then

4:22

runs a model that is able to generate an

4:24

embedding vector or a list of embedding

4:26

vectors, an embedding matrix. You will

4:29

typically use models like this

4:32

in text search, in retrieval-augmented

4:34

generation, and in other frameworks

4:36

where you need to be able to retrieve

4:38

documents or retrieve subsections of

4:40

documents. This model from Google is

4:42

small enough at only 270 million

4:44

parameters that not only can it run very

4:46

easily on GPUs in the cloud, it can

4:49

also run very quickly on consumer

4:51

hardware. And today we will be

4:54

figuring out how to take this Python

4:56

function that runs the embedding model,

4:59

generate equivalent C++ and Rust code

5:02

that is much lower level and is now able

5:04

to run anywhere at all. And then we will

5:07

compile a binary that contains this

5:10

model and all the dependencies it needs.

5:12

And finally we will consume this model

5:14

using the familiar OpenAI

5:17

client.embeddings.create

5:19

experience.
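
The function on screen is not captured in the transcript, so the following is only a rough sketch of its shape, assuming a prefix lookup, a tokenizer call, and a model call as described later in the talk. `TASK_PREFIX_MAP`'s contents, the `task` parameter, and the stub `tokenizer` and `model` are all hypothetical stand-ins, not the real EmbeddingGemma objects.

```python
# Hypothetical prefix table. The talk mentions a global constant prefix map,
# but these keys and strings are placeholders.
TASK_PREFIX_MAP = {"query": "query: ", "document": "document: "}


def tokenizer(prompts: list[str]) -> list[list[int]]:
    """Stub tokenizer (assumption): one integer ID per character."""
    return [[ord(ch) for ch in prompt] for prompt in prompts]


def model(token_ids: list[list[int]]) -> list[list[float]]:
    """Stub model (assumption): a 2-d 'embedding' of [length, mean ID]."""
    return [[float(len(ids)), sum(ids) / len(ids)] for ids in token_ids]


def embed(texts: list[str], task: str = "query") -> list[list[float]]:
    # Step 1: prepend a prefix to every input sentence.
    prompts = [TASK_PREFIX_MAP[task] + text for text in texts]
    # Step 2: turn the prompts into token IDs.
    token_ids = tokenizer(prompts)
    # Step 3: run the model and return one vector per input (the embedding matrix).
    return model(token_ids)


matrix = embed(["hello world", "compilers"])
print(len(matrix), "vectors of length", len(matrix[0]))
```

The point of the sketch is only the three-step structure (prefix, tokenize, run the model), which is what the rest of the talk traces and lowers.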

5:20

The very first step is taking our

5:23

function and generating a graph

5:26

representation that describes everything

5:28

that happens within that function. We

5:31

call this tracing.

5:33

Initially, our first prototype of

5:36

a symbolic tracing solution

5:39

was actually built off of PyTorch 2,

5:42

which introduced torch.compile along

5:45

with torch.fx

5:47

for this purpose. So the way that

5:49

torch.fx works is it'll take in Python

5:52

source code and then run it with fake

5:54

inputs that don't allocate any memory

5:57

and then give you a description, a graph,

5:59

of everything that happened within that

6:01

function. We actually tried to use this,

6:03

but we faced two major issues that

6:06

caused us to build our own tracing

6:08

infrastructure. The first was that

6:11

PyTorch, or rather its tracer, is

6:14

very focused on only PyTorch code. And

6:17

so in order to trace arbitrary code,

6:19

where your functions will usually have

6:21

to rely on things like NumPy operations

6:24

or OpenCV or something else, we would

6:27

have had to figure out a way to add

6:29

support for those data types into

6:31

PyTorch.

6:32

The second reason why we didn't stick

6:34

with PyTorch was that in order for the tracer

6:37

to work, it had to be run on fake

6:40

inputs. And so, you know, creating a

6:42

fake tensor is trivial. You just, you

6:44

know, give it the same description and

6:46

don't allocate any data. But it's a lot

6:49

harder to create a fake image or a fake

6:52

dictionary or a fake, you know, whatever

6:54

type that we might encounter in the

6:55

wild. And so, we simply decided that we

6:58

were going to build something in house.

7:00

Our first attempt was actually using LLMs

7:03

as a way to generate traces because LLMs

7:06

for quite some time now have had this

7:08

capability of structured outputs. This

7:11

is where you can give an LLM a prompt and

7:13

some data, whether it be an image, text,

7:15

or audio, and ask it to respond to you

7:18

with a specific schema that you have

7:19

given to the model. This actually turned

7:22

out to work pretty well. It had

7:25

almost a 100% accuracy rate in

7:27

our own testing. The only limitation was

7:30

it simply took way too much time. And so

7:33

eventually we decided we're just going

7:34

to do it old school. We would build a

7:36

tracer by first analyzing the code

7:38

looking at the AST, the

7:41

abstract syntax tree of the Python code

7:43

and then using a bunch of internal

7:45

heuristics to build our own internal

7:47

representation, or IR, of the user's

7:50

function. So for this function that

7:52

we've written up, the IR is actually

7:54

incredibly simple. I'm not going to show

7:56

you the entire thing, but I'll just show

7:57

you the parts that are relevant to look

7:59

at. As you can see, there's input nodes

8:02

for the actual inputs to the

8:04

function. So that's the list of

8:06

strings. There's a function call

8:08

calling out to the tokenizer, and another

8:10

one calling out to the model. And then

8:12

we return those outputs so that the user

8:14

can then get their embedding vectors.
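
To make the idea of an AST-based tracer concrete, here is a minimal sketch built on Python's standard `ast` module. It is not the tracer described in the talk: the `Node` shape, the four node kinds, and the traced source (with its `PREFIX`, `tokenizer`, and `model` names) are assumptions chosen to mirror the nodes just listed.

```python
import ast
from dataclasses import dataclass, field

# Hypothetical function to trace. It is only parsed, never executed, so the
# free names PREFIX, tokenizer, and model do not need to exist.
SOURCE = '''
def embed(texts: list[str]) -> list[list[float]]:
    prompts = [PREFIX + text for text in texts]
    token_ids = tokenizer(prompts)
    embeddings = model(token_ids)
    return embeddings
'''


@dataclass
class Node:
    op: str       # "input", "comprehension", "call", or "return"
    target: str   # name the node defines (or the expression returned)
    detail: str = ""  # annotation, callee name, or element expression
    args: list[str] = field(default_factory=list)


def trace(source: str) -> list[Node]:
    """Walk the top-level statements of one function and emit IR nodes."""
    func = ast.parse(source).body[0]
    assert isinstance(func, ast.FunctionDef)
    nodes: list[Node] = []
    # One input node per parameter, carrying its type annotation if present.
    for arg in func.args.args:
        annotation = ast.unparse(arg.annotation) if arg.annotation else ""
        nodes.append(Node("input", arg.arg, annotation))
    for stmt in func.body:
        if isinstance(stmt, ast.Assign) and isinstance(stmt.targets[0], ast.Name):
            target = stmt.targets[0].id
            value = stmt.value
            if isinstance(value, ast.Call) and isinstance(value.func, ast.Name):
                args = [ast.unparse(a) for a in value.args]
                nodes.append(Node("call", target, value.func.id, args))
            elif isinstance(value, ast.ListComp):
                nodes.append(Node("comprehension", target, ast.unparse(value.elt)))
        elif isinstance(stmt, ast.Return) and stmt.value is not None:
            nodes.append(Node("return", ast.unparse(stmt.value)))
    return nodes


ir = trace(SOURCE)
for node in ir:
    print(node)
```

A real tracer would need heuristics for control flow, attribute calls, nested expressions, and so on; this sketch only handles the straight-line pattern in the example.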

8:16

Now that we have a high-level

8:18

intermediate representation of our

8:20

Python function, the next step is to

8:22

figure out how to translate that somehow

8:25

into lower-level C++ or Rust code. But

8:29

before jumping into that, I wanted to

8:30

talk about one major difference between

8:33

Python as a language and C++ or other

8:36

lower-level languages that we will run

8:38

into and have to solve. Python is a very

8:41

dynamic language. So one variable X

8:46

could be assigned to an integer and then

8:48

immediately after assigned to say a

8:50

string. There is full dynamism, and

8:53

anything goes. Whereas in lower-level

8:55

languages like C++ and Rust, if you

8:57

declare a variable you must give it a

8:59

type and that type can never change.

9:02

This gives us quite a bit of a challenge

9:04

because we need to figure out how to

9:07

attach or constrain the types in the

9:10

code that we will be generating from our

9:12

high-level Python code.

9:15

So let's look at the first line of our

9:17

function, the very first node, if you

9:20

call it that, of our IR. As you can see,

9:23

prompts is this list that is being

9:25

generated by a comprehension statement.

9:28

And we're effectively just adding a set

9:30

of prefixes for every sentence that has

9:33

been passed in by the user. And so let's

9:36

just focus in on that addition operation

9:38

that's happening within that

9:40

comprehension. As you can see, well, we

9:43

know that every item in text is a string

9:46

because we have pretty much annotated

9:48

our function as such, right? The input

9:50

text is a list of strings. And we also

9:53

know that the task prefix map just

9:55

contains a bunch of strings. Each prefix

9:58

is itself a string. And so the question

10:00

then becomes how do we know or how do we

10:02

figure out the C++ type on the output of

10:06

that operation. And this is where the

10:08

compiler comes in, specifically a

10:10

technique we call type propagation. And

10:13

so here we will take one string, the

10:16

prefix, and the other string, the actual input

10:18

text that was provided to the function.

10:21

And we now know that there is some

10:23

addition operation happening to these

10:25

two. So we can simply write or generate

10:30

a C++ function that takes in two strings

10:33

and performs the operator.add operation

10:36

from Python.

10:38

The output of that function that we

10:40

generate in C++, as you can see here, well,

10:43

it's just a string, and that's how we

10:45

know that the output of this

10:47

addition operation we're doing

10:50

must itself be a string. So in that way,

10:54

zooming out, we've been able to take the

10:57

input information, the input type

10:59

information from just the signature of

11:01

our Python function, along with the C++

11:04

type information, or the native type

11:06

information, of this global constant task

11:10

prefix map, and then we've been able to

11:12

use that to propagate into the output of

11:16

the concatenation of these two things.
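
The propagation step can be sketched as a table lookup: each elementary operation has a native implementation with a fixed signature, so knowing the operand types gives the result type. This is a simplified illustration, not the compiler's actual data structures. The `SIGNATURES` table, the `Op` record, and the choice of C++ type names as plain strings are all assumptions.

```python
from dataclasses import dataclass

# Hypothetical table of elementary operations:
# (operation, operand types) -> result type.
# Each entry stands for one native function written (or generated) once.
SIGNATURES = {
    ("add", ("std::string", "std::string")): "std::string",
    ("add", ("int64_t", "int64_t")): "int64_t",
    ("add", ("double", "double")): "double",
}


@dataclass
class Op:
    name: str                # operation name, e.g. "add" for operator.add
    inputs: tuple[str, ...]  # names of the operand variables
    output: str              # name of the variable the result is bound to


def propagate(known: dict[str, str], ops: list[Op]) -> dict[str, str]:
    """Flow types forward through the ops, in order."""
    types = dict(known)
    for op in ops:
        operand_types = tuple(types[name] for name in op.inputs)
        key = (op.name, operand_types)
        if key not in SIGNATURES:
            # No native implementation exists for this combination of types.
            raise TypeError(f"no native implementation for {key}")
        types[op.output] = SIGNATURES[key]
    return types


# The prefix type comes from the global constant; the text type comes from
# the function signature (a list of strings, so each item is a string).
known = {"prefix": "std::string", "text": "std::string"}
ops = [Op("add", ("prefix", "text"), "prompt")]
types = propagate(known, ops)
print(types["prompt"])
```

Because each result type is added back into `types`, later operations that consume `prompt` can be resolved the same way, which is how the information flows through every intermediate variable.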

11:18

We now know that if I concatenate one

11:20

prefix with one input string, the result

11:22

itself is a string. And so we can then

11:25

do this propagation for every

11:27

intermediate variable or every operation

11:30

within our original Python function. And

11:32

that's how we can flow type

11:34

information through. And so at this

11:36

point you might be wondering: well, in your

11:39

compiler, if you're doing this

11:41

propagation thing that requires

11:42

manually implementing some operation in

11:44

C++ or in Rust code, you would have to

11:47

literally rewrite this for every unique

11:52

function call or operation that you ever

11:54

encounter in Python. And you'd be

11:56

correct. You'd be 100% correct. That is

11:58

in fact what we would have to do. But

12:00

that is now tractable and it's an easier

12:03

problem to solve now for two reasons.

12:07

The first reason is that all the variety

12:10

you'll ever see in source code in the

12:12

wild is not because there's such a giant

12:15

volume of these operations. The volume

12:18

is actually because you can combine

12:19

operations in so many different ways.

12:22

You can permute them in so many

12:23

different ways, and each of these

12:24

permutations is what forms a unique

12:27

Python function or Python code. And so

12:30

we really only need to cover that base

12:33

level or that base number of elementary

12:35

functions. And we could just stack them

12:37

or combine them in different ways in C++

12:39

the same way we do in Python. But you

12:42

might even say to that: wait, that

12:45

elementary set of functions, it's still

12:46

pretty large. And you would be 100%

12:48

right. We need to cover everything from

12:51

adding two things, to

12:53

subtracting them, to exponentiation,

12:56

to things that live

12:58

in native libraries, like NumPy

13:00

operations or PyTorch operations. So

13:02

yeah, you have a perfectly valid

13:04

point. The only reason why that's

13:06

tractable now is, well, we don't have to

13:08

sit down and write the equivalent native

13:11

code that does the same thing as the Python

13:13

anymore. We can simply have LLMs

13:16

generate all the code that we need that

13:19

translates a function from Python right

13:22

into C++ and Rust. And so this gives us

13:25

the ability to basically mass-produce a

13:28

lot of the operations that we would

13:29

otherwise have had to manually rewrite

13:32

ourselves in native code. And so now

13:35

that we've been able to propagate type

13:37

information through our Python IR graph,

13:41

we basically have all we need to simply

13:44

generate actual C++ code that is correct

13:48

and will compile. So here's what it

13:51

actually looks like side by side. As you

13:53

can see, I'm just walking through, and

13:55

you can see where we're doing that, you

13:57

know, list comprehension to add the

13:59

prefixes to each string. You can see

14:01

where we are running the tokenizer to

14:03

tokenize those input texts into IDs. And

14:06

you can now see we're running the model

14:08

and returning the output embedding

14:10

vectors or the embedding matrix. At this

14:13

point, because we now have C++ source

14:15

code, we can now compile this to run

14:18

natively on any device or platform that

14:21

we would ever want to run on. Simply

14:24

because every piece of technology that

14:25

you've ever touched has a C or C++

14:29

compiler. This is what gives us the

14:31

ability to take high-level Python code

14:33

and convert it into a form that is

14:35

self-contained and that can now run

14:37

anywhere at all. So let's go ahead and

14:40

do that. And then what we're going to

14:41

end up with on the other end is simply a

14:43

dynamic library, a shared object if

14:47

you call it that, that we can then load

14:49

into a process and execute like any

14:51

other code. Now comes the fun part.

14:54

Let's figure out how to actually invoke

14:56

or use our compiled embedding model from

14:59

any language on any device. We're going

15:02

to go with JavaScript running on Node.js

15:04

for this example. And so the very first

15:07

step we want to do is figure out how to

15:09

call in to our compiled library from

15:13

JavaScript in Node.js. We can use FFI

15:16

for this purpose. And so this

15:18

is where you're able to effectively

15:20

design bindings and declare that, hey, I'm

15:23

loading this native library which has

15:26

been compiled for my system and my

15:28

architecture. It has this function with

15:31

some name (in our case we already have a

15:33

function name), and that function, that

15:36

native function, has this signature. And

15:39

so we're able to write a bunch of

15:40

scaffolding code. We figured out a

15:42

way to standardize this across

15:44

different compiled functions to make

15:46

it very easy for ourselves, but this is

15:48

pretty open-ended. Once you do, you can

15:51

basically point Node.js or your

15:53

JavaScript application to the location

15:55

of that compiled library, load it in, and

15:58

simply just invoke it like any other

16:00

thing. When we do, guess what? We get our

16:03

embedding matrix right there. And for the

16:06

final piece of the puzzle, let's take it

16:08

back to the top. Let's figure out how to

16:10

expose our compiled embedding model

16:12

through our OpenAI style client. So what

16:16

we're going to do is create a class,

16:18

just call it client. Within it, we'll

16:20

create a nested class called embeddings.

16:22

And within that, we will create a create

16:24

function mirroring the official OpenAI

16:27

client.embeddings.create

16:29

path. And so within that function, when

16:32

the user passes in the model name, all

16:34

we're going to do is simply just go from

16:36

the name of that model to a path to the

16:40

compiled binary that we just created

16:42

from our C++ code generation. And with

16:46

that, with the rest of the FFI

16:48

that we just implemented, we now have a

16:50

way of taking the model, resolving it to

16:53

a path to the library, loading that

16:54

library in, and simply just

16:56

executing it to get out our embedding

16:58

matrix.

17:00

The final step is to simply massage the

17:02

outputs so that they look just like the

17:05

outputs that the official OpenAI client

17:07

gives you. And with this entire system

17:09

in place, we have just recreated the

17:13

official OpenAI client, but given it

17:15

access to any open-source model that we

17:18

can get into a Python function.
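
As a rough illustration of the client described above, here is a minimal Python sketch (the talk's own example is JavaScript on Node.js). The model name, the library path, and `fake_load` are hypothetical. `fake_load` stands in for the FFI step and returns dummy vectors, so nothing is actually loaded from disk. The response mirrors only the `object`, `data`, `index`, `embedding`, and `model` fields of an OpenAI embeddings response.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable

# Hypothetical mapping from model names to compiled libraries.
# Both the name and the path are placeholders, not real artifacts.
MODEL_LIBRARIES = {"embedding-model": Path("lib/embedding_model.so")}

Predictor = Callable[[list[str]], list[list[float]]]


def fake_load(path: Path) -> Predictor:
    """Stand-in for loading the shared library over FFI (assumption).

    Returns a function that yields one dummy 2-d vector per input text.
    """
    def predict(texts: list[str]) -> list[list[float]]:
        return [[float(len(text)), 0.0] for text in texts]
    return predict


@dataclass
class Embedding:
    index: int
    embedding: list[float]
    object: str = "embedding"


@dataclass
class EmbeddingResponse:
    data: list[Embedding]
    model: str
    object: str = "list"


class Client:
    class Embeddings:
        def __init__(self, load: Callable[[Path], Predictor]):
            self._load = load

        def create(self, *, model: str, input: list[str]) -> EmbeddingResponse:
            # Resolve the model name to the path of its compiled library.
            path = MODEL_LIBRARIES[model]
            # Load the library and run it to get the embedding matrix.
            predict = self._load(path)
            matrix = predict(input)
            # Massage the raw matrix into an OpenAI-style response.
            data = [Embedding(index=i, embedding=row) for i, row in enumerate(matrix)]
            return EmbeddingResponse(data=data, model=model)

    def __init__(self, load: Callable[[Path], Predictor] = fake_load):
        self.embeddings = Client.Embeddings(load)


client = Client()
response = client.embeddings.create(model="embedding-model", input=["hello", "hi"])
print(response.data[0].embedding)
```

Passing the loader in as a parameter keeps the name-to-path resolution and the response shaping separate from the FFI layer, so a real loader could be swapped in without touching `create`.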

Interactive Summary

The video discusses an approach to simplifying AI deployment by building a Python compiler that converts high-level Python AI code into low-level C++ or Rust, which is then compiled into a small, self-contained binary. This enables models to run efficiently on diverse hardware, including edge devices, while maintaining a consistent developer experience similar to the OpenAI client API. The presenter explains the process of tracing Python code, propagating types, and leveraging LLMs to automate the generation of necessary native code, ultimately allowing developers to run any open-source model through a unified, familiar interface.
