I Replaced My AI Server With A Browser Tab (WebGPU 2026 Setup)
When I tell people that they can run
most AI systems directly in their
browser, they don't believe me. Now, a
year ago, I wouldn't have believed it
either, but right now, I have AI
chatting with me in real time inside
Chrome without a server. And that's just
one of five AI models in this video's
project. There's speech to text, image
classification, object detection,
real-time tracking, and more. All of it
running on your machine through the
browser directly with no server or
complex expensive API keys. If you watch
this video, I promise you'll understand
the secret of running local AI systems
directly in your browser. Welcome to my
browser AI website. And this is a full
website that you can run yourself to try
out many different AI models like image
classification, LLMs, computer vision,
all running locally in your browser. And
I'm going to prove to you today that it
runs locally by going through all these
use cases. So let's start with image
classification. The way that this works
is that this 80 megabyte model will be downloaded and cached inside your browser. This is necessary because
we don't have some kind of server in the
middle that's serving the AI model. So
in this case, I've already got it cached
and it loaded very quickly. And I'm able
to just drop in an image. So in my case,
I'm just going to drop in an image of a
lion to see if it can classify it
properly. We're going to double click on
it. And there you go. It's very
confident that this is a lion, which
makes a lot of sense. And I could try
one more example here with this cat
specifically. It's an Egyptian cat. So
very nice. This model is quite capable at image classification. But of
course the point is that it's running
fairly quickly. It's able to recognize
this cat in 230 milliseconds and it's
all happening locally. And to prove that
it works locally, I think it's easier to
show an LLM chat. This LLM chat is using
a local Llama 3.2 model. So it will take
a little bit of time to download. In my
case though, I've got it all cached and
I can just press one of these example
prompts like writing a haiku about AI.
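Under the hood, a chat like this talks to the web-llm engine through an OpenAI-style API. Here's a minimal hedged sketch in TypeScript: the ChatEngine interface below is my own reduction to the one method we use, so the helper can be exercised with a mock; in the real app the engine would come from web-llm's CreateMLCEngine, and the exact model name and system prompt are assumptions.

```typescript
// Minimal sketch of an OpenAI-style chat call, as exposed by @mlc-ai/web-llm.
// The engine is injected via a narrow interface so the helper is testable;
// in the browser you would obtain it from CreateMLCEngine(modelId) instead.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

interface ChatEngine {
  chat: {
    completions: {
      create(req: { messages: ChatMessage[] }): Promise<{
        choices: { message: { content: string } }[];
      }>;
    };
  };
}

// Send one user prompt with a fixed system prompt and return the reply text.
async function ask(engine: ChatEngine, prompt: string): Promise<string> {
  const messages: ChatMessage[] = [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: prompt },
  ];
  const response = await engine.chat.completions.create({ messages });
  return response.choices[0].message.content;
}
```

The point of the narrow interface is that the same call shape works whether the engine is local (web-llm in the browser) or a cloud API, which is exactly the symmetry described later in this video.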
And you can see here that many tokens
per second get generated. This was so
quick that, you know, I couldn't even
show you an increase in GPU utilization
here. So instead, what I'm going to do
is I'm going to copy quite a large part
of this Wikipedia article. I'm going to
paste it into this chat and then say
summarize this article because now I
will have to use the GPU for a bit of a
longer time to process this much larger
prompt. And you can see now that my GPU
is basically being fully utilized and it
is now generating a much longer
response. And the moment that the
response is done generating, the GPU
utilization drops. It's still being used
a little bit because I'm recording this
video right now, but clearly this is all
running on my local GPU. There's a huge
advantage to using WebGPU because it
allows you to build a seamless web app
without the need to build a complex
installable desktop application, but you
can still use someone's real local
hardware to run AI models properly. You
have to find a bit of a balance here
though because some of these models are
really big, 700 megabytes. You don't
want your end users to have to pull
those models. So, let's have a look at a
much slimmer but still impressive
example, computer vision. By the way,
you can find this entire web application
for free in the link in the description
below. So, definitely check that out
once we're done with this video. This
model is just 5 megabytes, but it
actually allows me to do real-time hand
tracking. Let's see how it works. I'm
going to go ahead and start this. It's
going to load that model. And now I'm
going to have to use one of my other
cameras I'm not recording with. Here we
go. I'm going to go ahead and pull this
up a little bit. So, let's see. Here we
are. That's looking good. And now we're going to click "Allow while visiting this site" to grant camera access. Here we go. This is me. So now what I
can do is I can just do a couple of
gestures and it will be able to figure
out what gestures I'm making. If I move
the camera down like this, you can see
that it still does a pretty good job of
recognizing my hands. I can move this
alongside and you can see that it sees
that it's an open palm and eventually it
stops recognizing it once I hide behind
my microphone. So this is a very small 5
megabyte model, but it's very competent.
And the great part is that you can
extend this model for your own use case
and have it run simply on this web page.
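If you wire something like this up yourself, the recognizer (MediaPipe's tasks-vision package is a common choice for in-browser hand tracking; which library this demo actually uses is my assumption) emits a list of scored gesture categories per video frame. A small hedged helper to decide which label to display might look like this:

```typescript
// Each frame, a gesture recognizer emits candidate categories with confidence scores.
type GestureCategory = { categoryName: string; score: number };

// Return the highest-scoring gesture, or null when nothing clears the threshold
// (e.g. when the hand disappears behind the microphone).
function pickGesture(
  categories: GestureCategory[],
  minScore = 0.5
): string | null {
  let best: GestureCategory | null = null;
  for (const c of categories) {
    if (best === null || c.score > best.score) best = c;
  }
  return best !== null && best.score >= minScore ? best.categoryName : null;
}
```

The threshold is what makes the label drop out cleanly instead of flickering between low-confidence guesses when the hand is partially occluded.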
Even a pretty old device should be able
to handle this AI model pretty well. So
now we're going to have a look at this
speech to text model that I have. And
for this one, I'm using a slightly
larger Moonshine-based model because
this one is super good. I'm going to
show you right now. I can simply load
the model and start recording. And now
you will see this waveform of my speech
and it will start transcribing the
entire thing the moment that I press
stop recording. So I think we've talked
for long enough now. I'm going to press
stop recording and let's see how it
does. So very quickly it transcribes "and now you will see this waveform off my speech". There's a slight mistake here: it wrote "off my speech" instead of "of my speech". But semantically this is basically
everything that I talked about and it
was able to transcribe 12 seconds of
audio in just 567 milliseconds. So this
is very quick as well. Lastly, I wanted
to show you semantic search. And this is
a very important model to run locally
because it will allow you to make more
complex AI models know your data. For
example, you can use semantic search to
search through your own documents to
customize your own AI agents. Let me
just show you what I mean by that. I'm
going to load this model, which has stored a lot of embeddings: vector representations of different concepts. So if I for example
say preparing food, it will find a lot
of information from that embedding model
that relates to the concept of preparing
food. For example, fresh herbs should be
added at the end of cooking to preserve
their aroma. And this is very quick. It
was able to find all of this in just 6
milliseconds. Now, the great part about
this is that by introducing semantic
search like this, you can make sure that
the smaller language models that run locally are grounded in truth and have the latest up-to-date information about whatever solution you're
building right now. And so all of these
AI systems can work together to create a
more complex system that all runs in
someone's browser. Now, just a quick
note before we move on. If you want to
stay up to date with the latest and
greatest in AI engineering, make sure to
subscribe to this channel because I will
keep you informed every single week with
projects like these. So, how does it
actually work from a code perspective?
Let's have a little bit of a look at
that as well. If we check out our
codebase, you can see that we only have
a front-end TypeScript project. There is
no backend here. So, then how are these
AI models actually called? Well, we can
have a look in the source folder and
then we can have a look at a couple of
hooks that will check whether WebGPU is
enabled because not every browser has
access to it. You need GPU acceleration
for some of the AI models like large
language models, but other models might
work just fine on the CPU. It just
depends on what you're trying to do.
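A feature-detection helper along these lines is easy to sketch. The exact hook in the project may differ; this hedged version takes the navigator-like object as a parameter so the check can be exercised outside the browser:

```typescript
// WebGPU support is detected by the presence of navigator.gpu
// (available in Chromium-based browsers; coverage elsewhere varies).
// Taking the navigator-like object as a parameter keeps the check testable.
function hasWebGPU(nav: { gpu?: unknown }): boolean {
  return nav.gpu !== undefined && nav.gpu !== null;
}

// In the browser: const webGPUAvailable = hasWebGPU(navigator);
// Models that need GPU acceleration (like LLMs) can then warn or fall back to CPU.
```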
Now, we can also have a look at the
components because in here we can find
all the different demos. For this
example, let's have a look at the LLM
chat. You can see here that we have an
LLM chat .tsx file which describes basically just the front-end components that get rendered in the web app. But
it's more interesting to have a look at
the worker because this one actually
interacts with the Llama 3.2 engine to
generate an answer. Now what you see in
here is something that if you've ever
done any kind of AI engineering before
actually is not that different from what
you would create in a server. Here on
line 35, you can see that we call engine.chat.completions.create to create a
response for all of the messages that
are in the chat right now. And to do
that, we're using the MLC engine API
that is included in this web application
as part of @mlc-ai/web-llm. So all of
these nice web packages really are
aligned with similar API structures that
you might find when you're just creating
an API request to OpenAI or a different
cloud provider. But now you're doing all
of that locally inside of the web
application that calls a model that's
stored in the browser storage in the
cache. And similarly, if we have a look
at the image classification, all we need
to do here is create a classifier
object, which is using the Hugging Face
transformers library. And then the only
thing we do here is call await classifier with the image URL to find
the top five predictions. And that's it.
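The classifier returns label/score pairs, and selecting the top five is just a sort. A hedged sketch: the pure helper below runs anywhere, while the commented lines show the Transformers.js-style pipeline flow described above (the exact option names used in the demo are not shown here):

```typescript
// Shape of one prediction as returned by an image-classification pipeline.
type Prediction = { label: string; score: number };

// Sort a copy by descending confidence and keep the first k entries;
// the input array is left untouched.
function topK(predictions: Prediction[], k: number): Prediction[] {
  return [...predictions].sort((a, b) => b.score - a.score).slice(0, k);
}

// In the worker (browser), the demo's flow is roughly:
//   import { pipeline } from "@huggingface/transformers";
//   const classifier = await pipeline("image-classification");
//   const predictions = await classifier(imageUrl);
//   const top5 = topK(predictions, 5);
```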
The actual code to call the AI model is
super simple because the community has
done a lot of great work to abstract all
of that difficult code away from you. So
you can just continue working on your
use case instead of having to build
hundreds of lines of Python code, for
example, to interact with an AI model.
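The semantic-search demo from earlier follows the same pattern: an embedding model turns text into vectors, and search reduces to comparing vectors. A minimal sketch of the comparison step (the embedding call itself is omitted; cosine similarity is the standard metric for this, though which metric the demo uses is my assumption):

```typescript
// Cosine similarity between two embedding vectors: 1 means identical direction,
// 0 means unrelated. Vectors must have the same length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Search = embed the query, then rank stored document embeddings by similarity.
function bestMatch(query: number[], docs: number[][]): number {
  let bestIndex = 0;
  for (let i = 1; i < docs.length; i++) {
    if (
      cosineSimilarity(query, docs[i]) >
      cosineSimilarity(query, docs[bestIndex])
    ) {
      bestIndex = i;
    }
  }
  return bestIndex;
}
```

This is also why the 6-millisecond lookup is plausible: once the embeddings are precomputed, each query is just a handful of dot products.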
Now, I do want to warn you that this is
not for every use case. Sometimes it is
not really appropriate to let your end
users download AI models in their cache,
especially if you're talking about an AI
model that's 100 megabytes in size. But
it can help you work towards an easy
proof of concept that you can share with
other people without you having to just
create a back-end server and
deploy that to the cloud. Now, the great
part about this is that you can deploy
this kind of web application pretty much
for free because you don't need a
complex backend just to serve the AI
model. If you're interested in building
out this project for yourself, you're in
luck because you can find it for free in
the link in the description below. If
you want more expert advice on deploying
it, extending it, and becoming a highly paid AI engineer, check out my program, Aanative Engineers.