I Replaced My AI Server With A Browser Tab (WebGPU 2026 Setup)
When I tell people that they can run
most AI systems directly in their
browser, they don't believe me. Now, a
year ago, I wouldn't have believed it
either, but right now, I have AI
chatting with me in real time inside
Chrome without a server. And that's just
one of five AI models in this video's
project. There's speech to text, image
classification, object detection,
real-time tracking, and more. All of it
running on your machine through the
browser directly with no server or
complex expensive API keys. If you watch
this video, I promise you'll understand
the secret of running local AI systems
directly in your browser. Welcome to my
browser AI website. And this is a full
website that you can run yourself to try
out many different AI models like image
classification, LLMs, computer vision,
all running locally in your browser. And
I'm going to prove to you today that it
runs locally by going through all these
use cases. So let's start with image
classification. The way that this works
is that this 80 megabyte model will be downloaded and cached inside your browser. This is necessary because
we don't have some kind of server in the
middle that's serving the AI model. So
in this case, I've already got it cached
and it loaded very quickly. And I'm able
to just drop in an image. So in my case,
I'm just going to drop in an image of a
lion to see if it can classify it
properly. We're going to double click on
it. And there you go. It's very
confident that this is a lion, which
makes a lot of sense. And I could try
one more example here with this cat
specifically. It's an Egyptian cat. So
very nice. This model is quite capable at image classification. But of
course the point is that it's running
fairly quickly. It's able to recognize
this cat in 230 milliseconds and it's
all happening locally. And to prove that
it works locally, I think it's easier to
show an LLM chat. This LLM chat is using
a local Llama 3.2 model. So it will take
a little bit of time to download. In my
case though, I've got it all cached and
I can just press one of these example
prompts like writing a haiku about AI.
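Under the hood, a chat like this talks to the web-llm engine through an OpenAI-style API. Here's a minimal hedged sketch in TypeScript: the ChatEngine interface below is my own reduction to the one method we use, so the helper can be exercised with a mock; in the real app the engine would come from web-llm's CreateMLCEngine, and the exact model name and system prompt are assumptions.

```typescript
// Minimal sketch of an OpenAI-style chat call, as exposed by @mlc-ai/web-llm.
// The engine is injected via a narrow interface so the helper is testable;
// in the browser you would obtain it from CreateMLCEngine(modelId) instead.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

interface ChatEngine {
  chat: {
    completions: {
      create(req: { messages: ChatMessage[] }): Promise<{
        choices: { message: { content: string } }[];
      }>;
    };
  };
}

// Send one user prompt with a fixed system prompt and return the reply text.
async function ask(engine: ChatEngine, prompt: string): Promise<string> {
  const messages: ChatMessage[] = [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: prompt },
  ];
  const response = await engine.chat.completions.create({ messages });
  return response.choices[0].message.content;
}
```

The point of the narrow interface is that the same call shape works whether the engine is local (web-llm in the browser) or a cloud API, which is exactly the symmetry described later in this video.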
And you can see here that many tokens
per second get generated. This was so
quick that, you know, I couldn't even
show you an increase in GPU utilization
here. So instead, what I'm going to do
is I'm going to copy quite a large part
of this Wikipedia article. I'm going to
paste it into this chat and then say
summarize this article because now I
will have to use the GPU for a bit of a
longer time to process this much larger
prompt. And you can see now that my GPU
is basically being fully utilized and it
is now generating a much longer
response. And the moment that the
response is done generating, the GPU
utilization drops. It's still being used
a little bit because I'm recording this
video right now, but clearly this is all
running on my local GPU. There's a huge
advantage to using WebGPU because it
allows you to build a seamless web app
without the need to build a complex
installable desktop application, but you
can still use someone's real local
hardware to run AI models properly. You
have to find a bit of a balance here
though because some of these models are
really big, 700 megabytes. You don't
want your end users to have to pull
those models. So, let's have a look at a
much slimmer but still impressive
example, computer vision. By the way,
you can find this entire web application
for free in the link in the description
below. So, definitely check that out
once we're done with this video. This
model is just 5 megabytes, but it
actually allows me to do real-time hand
tracking. Let's see how it works. I'm
going to go ahead and start this. It's
going to load that model. And now I'm
going to have to use one of my other
cameras I'm not recording with. Here we
go. I'm going to go ahead and pull this
up a little bit. So, let's see. Here we
are. That's looking good. And now we're going to click "Allow while visiting this site" to grant camera access. Here we go. This is me. So now what I
can do is I can just do a couple of
gestures and it will be able to figure
out what gestures I'm making. If I move
the camera down like this, you can see
that it still does a pretty good job of
recognizing my hands. I can move this
alongside and you can see that it sees
that it's an open palm and eventually it
stops recognizing it once I hide behind
my microphone. So this is a very small 5
megabyte model, but it's very competent.
And the great part is that you can
extend this model for your own use case
and have it run simply on this web page.
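If you wire something like this up yourself, the recognizer (MediaPipe's tasks-vision package is a common choice for in-browser hand tracking; which library this demo actually uses is my assumption) emits a list of scored gesture categories per video frame. A small hedged helper to decide which label to display might look like this:

```typescript
// Each frame, a gesture recognizer emits candidate categories with confidence scores.
type GestureCategory = { categoryName: string; score: number };

// Return the highest-scoring gesture, or null when nothing clears the threshold
// (e.g. when the hand disappears behind the microphone).
function pickGesture(
  categories: GestureCategory[],
  minScore = 0.5
): string | null {
  let best: GestureCategory | null = null;
  for (const c of categories) {
    if (best === null || c.score > best.score) best = c;
  }
  return best !== null && best.score >= minScore ? best.categoryName : null;
}
```

The threshold is what makes the label drop out cleanly instead of flickering between low-confidence guesses when the hand is partially occluded.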
Even a pretty old device should be able
to handle this AI model pretty well. So
now we're going to have a look at this
speech to text model that I have. And
for this one, I'm using a slightly
larger Moonshine-based model because
this one is super good. I'm going to
show you right now. I can simply load
the model and start recording. And now
you will see this waveform of my speech
and it will start transcribing the
entire thing the moment that I press
stop recording. So I think we've talked
for long enough now. I'm going to press
stop recording and let's see how it
does. So very quickly it transcribes "and now you will see this waveform off my speech". There's a slight mistake here: it wrote "off my speech" instead of "of my speech". But semantically this is basically
everything that I talked about and it
was able to transcribe 12 seconds of
audio in just 567 milliseconds. So this
is very quick as well. Lastly, I wanted
to show you semantic search. And this is
a very important model to run locally
because it will allow you to make more
complex AI models know your data. For
example, you can use semantic search to
search through your own documents to
customize your own AI agents. Let me
just show you what I mean by that. I'm
going to load this model, which has stored a lot of embeddings: vector representations of different concepts. So if I for example
say preparing food, it will find a lot
of information from that embedding model
that relates to the concept of preparing
food. For example, fresh herbs should be
added at the end of cooking to preserve
their aroma. And this is very quick. It
was able to find all of this in just 6
milliseconds. Now, the great part about
this is that by introducing semantic
search like this, you can make sure that
the smaller language models that run locally are grounded in truth and have the latest up-to-date information about whatever solution you're
building right now. And so all of these
AI systems can work together to create a
more complex system that all runs in
someone's browser. Now, just a quick
note before we move on. If you want to
stay up to date with the latest and
greatest in AI engineering, make sure to
subscribe to this channel because I will
keep you informed every single week with
projects like these. So, how does it
actually work from a code perspective?
Let's have a little bit of a look at
that as well. If we check out our
codebase, you can see that we only have
a front-end TypeScript project. There is
no backend here. So, then how are these
AI models actually called? Well, we can
have a look in the source folder and
then we can have a look at a couple of
hooks that will check whether WebGPU is
enabled because not every browser has
access to it. You need GPU acceleration
for some of the AI models like large
language models, but other models might
work just fine on the CPU. It just
depends on what you're trying to do.
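A feature-detection helper along these lines is easy to sketch. The exact hook in the project may differ; this hedged version takes the navigator-like object as a parameter so the check can be exercised outside the browser:

```typescript
// WebGPU support is detected by the presence of navigator.gpu
// (available in Chromium-based browsers; coverage elsewhere varies).
// Taking the navigator-like object as a parameter keeps the check testable.
function hasWebGPU(nav: { gpu?: unknown }): boolean {
  return nav.gpu !== undefined && nav.gpu !== null;
}

// In the browser: const webGPUAvailable = hasWebGPU(navigator);
// Models that need GPU acceleration (like LLMs) can then warn or fall back to CPU.
```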
Now, we can also have a look at the
components because in here we can find
all the different demos. For this
example, let's have a look at the LLM
chat. You can see here that we have an
LLM chat .tsx file which describes basically just the front-end components that get rendered in the web app. But
it's more interesting to have a look at
the worker because this one actually
interacts with the Llama 3.2 engine to
generate an answer. Now what you see in
here is something that if you've ever
done any kind of AI engineering before
actually is not that different from what
you would create in a server. Here on
line 35, you can see that we call engine.chat.completions.create to create a
response for all of the messages that
are in the chat right now. And to do
that, we're using the MLC engine API
that is included in this web application
as part of @mlc-ai/web-llm. So all of
these nice web packages really are
aligned with similar API structures that
you might find when you're just creating
an API request to OpenAI or a different
cloud provider. But now you're doing all
of that locally inside of the web
application that calls a model that's
stored in the browser storage in the
cache. And similarly, if we have a look
at the image classification, all we need
to do here is create a classifier
object, which is using the Hugging Face
transformers library. And then the only
thing we do here is call await classifier with the image URL to find
the top five predictions. And that's it.
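The classifier returns label/score pairs, and selecting the top five is just a sort. A hedged sketch: the pure helper below runs anywhere, while the commented lines show the Transformers.js-style pipeline flow described above (the exact option names used in the demo are not shown here):

```typescript
// Shape of one prediction as returned by an image-classification pipeline.
type Prediction = { label: string; score: number };

// Sort a copy by descending confidence and keep the first k entries;
// the input array is left untouched.
function topK(predictions: Prediction[], k: number): Prediction[] {
  return [...predictions].sort((a, b) => b.score - a.score).slice(0, k);
}

// In the worker (browser), the demo's flow is roughly:
//   import { pipeline } from "@huggingface/transformers";
//   const classifier = await pipeline("image-classification");
//   const predictions = await classifier(imageUrl);
//   const top5 = topK(predictions, 5);
```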
The actual code to call the AI model is
super simple because the community has
done a lot of great work to abstract all
of that difficult code away from you. So
you can just continue working on your
use case instead of having to build
hundreds of lines of Python code, for
example, to interact with an AI model.
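The semantic-search demo from earlier follows the same pattern: an embedding model turns text into vectors, and search reduces to comparing vectors. A minimal sketch of the comparison step (the embedding call itself is omitted; cosine similarity is the standard metric for this, though which metric the demo uses is my assumption):

```typescript
// Cosine similarity between two embedding vectors: 1 means identical direction,
// 0 means unrelated. Vectors must have the same length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Search = embed the query, then rank stored document embeddings by similarity.
function bestMatch(query: number[], docs: number[][]): number {
  let bestIndex = 0;
  for (let i = 1; i < docs.length; i++) {
    if (
      cosineSimilarity(query, docs[i]) >
      cosineSimilarity(query, docs[bestIndex])
    ) {
      bestIndex = i;
    }
  }
  return bestIndex;
}
```

This is also why the 6-millisecond lookup is plausible: once the embeddings are precomputed, each query is just a handful of dot products.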
Now, I do want to warn you that this is
not for every use case. Sometimes it is
not really appropriate to let your end
users download AI models in their cache,
especially if you're talking about an AI
model that's 100 megabytes in size. But
it can help you work towards an easy
proof of concept that you can share with
other people without you having to just
create a back-end server and
deploy that to the cloud. Now, the great
part about this is that you can deploy
this kind of web application pretty much
for free because you don't need a
complex backend just to serve the AI
model. If you're interested in building
out this project for yourself, you're in
luck because you can find it for free in
the link in the description below. If
you want more expert advice on deploying
it, extending it, and becoming a highly paid AI engineer, check out my program, Aanative Engineers.