The Unbeatable Local AI Coding Workflow (Full 2026 Setup)
You'll learn the best local AI coding workflow for 2026. In this video, we'll be using the latest and greatest Qwen models, routing our local models through Claude Code, and even running any AI model you want on weak laptops using LM Studio's link feature. You don't want to miss out on this, so let's get right into it. Welcome to my local setup. This is my Linux machine with an RTX 5090 and 32 GB of VRAM. I'm going to be using a couple of models throughout this video, and the first one is the new Qwen 3.5 model with 35 billion parameters. You can see that my GPU
blazes through this Python code at more than 100 tokens per second. That's because, while this model is quite big, not all of its parameters are active when you ask a question. This is the benefit of a mixture-of-experts model, which is very common among modern local AI systems. You can even see it hit 140 tokens a second, and it's already done with this question. What's important to realize here is that if you cannot fit the entire model on your GPU, as I'm about to show, the performance of the AI model is going to be much worse.
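As a rough back-of-envelope check, you can estimate whether a model's weights alone fit in VRAM. This is a sketch with assumed numbers: 4-bit quantized weights at roughly half a byte per parameter, ignoring the KV cache and runtime overhead (which also need room), and note that a mixture-of-experts model still loads all of its weights even though only some are active per token:

```python
# Rough VRAM estimate for a 35B-parameter model -- illustrative only.
# Assumptions: 4-bit quantization (~0.5 bytes per parameter); KV cache
# and runtime buffers are ignored, although they also consume VRAM.
params = 35e9
bytes_per_param = 0.5
weights_gb = params * bytes_per_param / 1024**3
print(f"weights alone: ~{weights_gb:.1f} GB")   # ~16.3 GB

vram_gb = 32  # e.g. a 32 GB card
# Leave ~20% headroom for the KV cache and buffers.
print("fits on GPU" if weights_gb < vram_gb * 0.8 else "will spill into system RAM")
```

If the weights estimate is anywhere near your total VRAM, expect the spillover behavior described next.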
In this case, I'm actually loading some of the model's parameters into system RAM instead, which means a lot of data has to be shuttled back and forth, and you'll see that performance is much poorer. So just because you can fit a model on your system by offloading some of the parameters to system RAM doesn't mean it's actually usable in practice. Especially for agentic coding, you'll be working with very big context windows, where the compute cost scales roughly quadratically with context length. So you really have to experiment and see which model truly fits on your GPU at acceptable speeds if you want to code proper solutions with it. Next, I want to expose
this local model to my MacBook, which is my main development environment. I could run a similar model there, but it would be much slower compared to this GPU. So we're going to use LM Studio's new linking functionality, which establishes an encrypted connection between two devices, so I can effectively run this model "locally" on the MacBook. It's very easy to set up: all I have to do is open LM Studio on the two devices. I've logged into LM Studio on the Ubuntu machine, and now we'll hop over to the MacBook and open it up here.
Here on the MacBook, I can browse to the linking functionality, and I immediately see that Ubuntu machine pop up. Indeed, the Qwen 3 Coder model already shows as loaded there, and it's going to be super seamless to ask this local model a question. I can just select it here as a linked model, start a new chat, and ask it to generate some Python code. So now you can see that the model from my Ubuntu machine is available here as a linked model. And just to prove that it's running locally, on the top right you can see my GPU starting to spike up; it only does so for a short moment, because I just asked for a very simple Python script. But we do have this connection set up properly now, which is very nice. So what's next? Well, LM Studio is a nice chat interface, but it's not really a good interface for coding complex solutions. So
I'm going to connect Claude Code to LM Studio. The first step is to enable the local server so that I can point Claude Code at it, because since a couple of months ago, you can connect Claude Code to language models of your choice rather than relying only on Anthropic's models. Now, I don't know how to do this off the top of my head, so in this chat I was basically asking the model to research how to change its own settings so that it could point itself at the LM Studio API. It's good to know that LM Studio exposes an API with multiple endpoints. There's one compatible with the OpenAI API standard, but there's also one compatible with the Anthropic API, which is probably the easiest to use here because Claude Code expects that. If I go
to LM Studio, you can see those different supported endpoints. There's a chat endpoint, which is the LM Studio API, and there's an OpenAI-compatible endpoint as well. But more importantly, there's also an Anthropic-compatible endpoint, v1/messages, which is the one we want.
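As a sketch of what talking to that endpoint looks like: this builds an Anthropic-style messages request without sending it. The base URL and model id are assumptions — LM Studio's server defaults to localhost:1234, and "qwen3-coder" is a placeholder id that must match whatever model you actually have loaded:

```python
import json

# Anthropic-style /v1/messages request, roughly what Claude Code sends.
# BASE_URL assumes LM Studio's default port; "qwen3-coder" is a
# hypothetical model id -- substitute the id of your loaded model.
BASE_URL = "http://localhost:1234"
payload = {
    "model": "qwen3-coder",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Write a Python hello world."}],
}
body = json.dumps(payload).encode()
print(f"POST {BASE_URL}/v1/messages ({len(body)} bytes)")

# To actually send it (requires the LM Studio server to be running):
# import urllib.request
# req = urllib.request.Request(f"{BASE_URL}/v1/messages", data=body,
#                              headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read().decode())
```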
So I can basically tell Claude Code that we have that endpoint available, so it can recommend the right command to connect to it. While the AI is thinking, I want to make sure you've subscribed to this channel, because most people watching are not subscribed, and if you don't subscribe, you'll miss out on the latest in AI engineering. So make sure to click the button below. After a little while,
it asks us to export two environment variables to override the Anthropic base URL and API key. There might be other ways to get this done, but this will work for me. So I'm overriding the Anthropic base URL and key now and just saying hello to Claude over the command line. You can indeed see that the request is being sent to my local GPU, and it's actually taking a while to respond. The reason is that Claude Code injects quite a lot of context into its system prompt, and it's very easy to miss that detail and think everything will be as fast as an empty chat, but that's not true at all.
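The two overrides can be sketched like this. Assumptions: LM Studio listens on its default port 1234, `ANTHROPIC_BASE_URL` and `ANTHROPIC_API_KEY` are the variables Claude Code reads, and the key is a dummy value since the local server doesn't validate it:

```python
import os

# Environment overrides that redirect Claude Code to the local server.
# Assumptions: LM Studio's default port 1234; the API key is a
# non-empty placeholder (the local server ignores its value).
env = {
    **os.environ,
    "ANTHROPIC_BASE_URL": "http://localhost:1234",
    "ANTHROPIC_API_KEY": "lm-studio",
}
print(env["ANTHROPIC_BASE_URL"])

# Launch Claude Code with the overrides in place:
# import subprocess
# subprocess.run(["claude"], env=env)
```

Equivalently, you can `export` the same two variables in your shell before running the CLI directly.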
This is what a lot of YouTube videos actually miss: your AI model will be much slower when you connect it through Claude Code. It's not really a free lunch. You can see right here that it takes a long time to process the prompt, because Claude Code simply gives it a huge system prompt with all kinds of directives on how to code properly. With all these videos promoting Claude Code with local models, I feel like most of the people promoting this aren't using it themselves, because unless you have a very powerful machine, this becomes extremely slow as your repository grows. It's simply one of the limitations of local AI coding. Regardless, you can customize this prompt and trim some things out of it, or use a different CLI provider with a leaner prompt. In my case it just takes a while, and now it's finally starting to generate that answer. Again, though, I'm using a coding model that doesn't fit entirely on my GPU, so we'll optimize that later. For now, you can see that we get the response: "How can I help you?" It took a long time, but that's because of the context window being filled by Claude Code's system prompt. Next, we want to optimize this a bit, because we're obviously not going to be able to code if it takes two minutes to get any kind of response. So now we're switching to the Qwen 3.5 model, which is not specifically made for programming, but it's still a very competent language model, so it will do a pretty good job, and it fits on my GPU entirely.
Now, I'm making one mistake on purpose: I'm using LM Studio's default settings with only a 4,000-token context window. You'll see that if I try this request again, it hangs indefinitely. Because the Claude Code system prompt is thousands of tokens long, we immediately hit the limit of my local model, and there's no clear error message indicating this. So that's another thing to look out for when combining CLI tools that expect a huge context window with a server you haven't configured properly. Now we'll increase the context window to something closer to 80,000 tokens. This is also necessary because as you ingest a lot of files to answer code questions or design new API endpoints, you need a long context window. And now you can
see that it's actually responding a lot
faster. But one thing that's a little weird about this answer is that it says it's Sonnet. How come? Well, again, Claude Code injects its system prompt into the language model, and even though we're using a Qwen model, because that system prompt says it's Claude Sonnet, the model believes it as well. That's another important thing to realize about language models: they don't always have self-awareness of which model they actually are; the system prompt they're fed really dictates their behavior. Now, if I add these environment variables to my terminal, I can launch the regular Claude Code UI and it will use my local AI model.
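To see why the jump from 4,000 to 80,000 tokens matters for memory, here's a rough sketch of KV-cache growth. The architecture numbers are assumptions for illustration, not the actual shape of any particular Qwen model: per token, each transformer layer caches one key and one value vector.

```python
# Rough KV-cache size per context length -- illustrative numbers only,
# not the real architecture of any specific model.
layers, kv_heads, head_dim = 48, 8, 128  # assumed model shape
bytes_per_elem = 2                       # fp16 cache entries
# Per token: one key and one value vector in every layer.
per_token = layers * 2 * kv_heads * head_dim * bytes_per_elem
for ctx in (4_000, 80_000):
    print(f"{ctx:>6} tokens -> {ctx * per_token / 1024**2:,.0f} MB of KV cache")
```

With these assumed numbers, a 20x longer context means 20x more cache memory, which is VRAM the model weights can no longer use.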
We're getting a bit of an auth conflict here, but for now it's fine to ignore that. So we can finally start building something. To test the model in detail, we're going to build a small full-stack application that interacts with the LM Studio API. Why not? So we can just say: plan out a sample repo with a Next.js TypeScript project to showcase your ability to create a full-stack app. Be a little creative with the concept and don't just recommend a lame to-do list app, because we've seen a million of those already. In fact, we're showcasing LM Studio's ability to share models between PCs, so it would be nice if you could mimic their UI that shows the health of the server with loaded models, by exploring the API available at... and then in LM Studio I'm just going to paste the documentation of the API, because I'm searching here for the right endpoint. There's a REST API document you can get to. There we go: open documentation. I could manually copy-paste this, but there's a simple "copy as markdown" button, so I'll paste the entire description to ground the model in LM Studio's API, because it probably doesn't know it.
Now we'll go ahead and use plan mode with my local model. You can see it actually starts responding pretty quickly, because with everything I've set up now, it's optimized to run directly on my 5090 and can handle quick responses. It's going to explore the codebase, which is not too exciting because there's nothing in it yet, and then create the plan based on all of that. Now it starts asking me questions. This shows that the local model, even though it's not Claude Opus, the latest and greatest, actually uses tool calling pretty well, because it's asking me things like the primary focus of this demo application. I'm just going to say it should be a simple dashboard as proof of concept, with the Next.js back end sitting in the middle as a proxy. We could just have a simple HTML page that talks to the API directly, but I want to prove that this system can build a full-stack app, so we'll have the Next.js back end pass requests from the front end to the LM Studio API.
Hence, I'm just going to call it a proxy for now. In terms of interactivity, I want to keep this simple: just a basic connection to the LM Studio API and a simple dashboard. The AI/ML component we'll leave out for now and keep things simple. After a bit of planning, you can see that I've used 45,000 of my 200,000 tokens, but that doesn't really reflect the real local AI model.
Because this is just Claude Code, it thinks I'm talking to Claude Sonnet 4.6, so this might not reflect the maximum number of tokens of your local model configuration. But it is nice to see how many tokens are being used by, for example, the system prompt, which is indeed already 3,000 tokens, as well as all the messages you've sent so far. That tells you when you might have to summarize a conversation, or clear out and start with a fresh conversation history. After a while, time-skipping ahead, we've got seven different spec files that we can implement in order to build out this full-stack dashboard. So we're really working through the entire agentic flow, where we first create our specs.
Then we'll have the AI agent work them out, and we should end up with a pretty nice result. Checking on some of the code samples it's writing, you can see there's some scaffolding code here, where the back end is going to call the v1 API on the LM Studio side. After all of this planning, we ended up using 65,000 tokens. So what happens when I try to fill the context window? Well, I'm pasting a bunch of extra stuff in here to show you that it's still able to respond, no problem at all.
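LM Studio's context-overflow handling is configurable; one of its options, truncating the middle of the conversation, works roughly like this sketch. This is a simplified illustration of the strategy, not LM Studio's actual implementation:

```python
# Simplified "truncate the middle" overflow strategy: keep the earliest
# messages (system prompt, initial codebase exploration) and the most
# recent turns, and drop the middle of the history.
def truncate_middle(messages, limit, keep_head=2):
    if len(messages) <= limit:
        return messages
    keep_tail = limit - keep_head
    return messages[:keep_head] + messages[-keep_tail:]

msgs = [f"turn {i}" for i in range(10)]
print(truncate_middle(msgs, limit=5))
# → ['turn 0', 'turn 1', 'turn 7', 'turn 8', 'turn 9']
```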
The moment I do this, the behavior depends on how you've configured LM Studio, because you can configure what happens when the model's context window is completely full. In my case, I've set the context overflow setting to truncate the middle of the conversation history. This keeps some of the initial history, where you explore the codebase, but discards much of what happened in the middle of the conversation. That of course reduces your LLM's memory, but it frees up space to continue the conversation. Sometimes, though, Claude Code will handle this on its own and proactively summarize the conversation history. It's just good to be aware that there are different ways of compressing your conversation history so you can keep chatting even with a limited context window. Next, we want to
implement the full solution. I like to use the bypass-all-permissions mode in Claude Code so I don't have to press enter for every single small change. The way I do that safely is by running inside a dev container; I've got many videos explaining how that works. It isolates my development environment so I can run Claude Code in bypass-all-permissions mode. And of course, I'm now going to set my model's context to 200,000 tokens. The main reason I'm doing it this way is that I don't mind the decrease in speed: with bypass-all-permissions on, I can just walk away from my PC, come back later, and it's totally fine if it takes 20 minutes longer to work out this full-stack application. Now,
I'm going to ask it to work out each spec sequentially. One thing that's important to note is that I'm explicitly asking it to create sub-agents for each task. This means it will spin up a new instance of Claude Code with a fresh context window to work on one piece of work and then report back to the main agent. This way I get much more out of the limited context window I might have with a local model, so I definitely recommend working with sub-agents more than ever if you're doing local AI coding. So after a while,
30 minutes or so to be exact, we have a dashboard that seems to be working, but I ran into quite a few bugs. And to be honest, there was some hard-coded information here, like this Nvidia RTX 3080 GPU: that's just made up on the spot by the LLM. It's pretty typical, right? If you don't specify everything in detail, it makes things up. But for this demonstration, I want the real models loaded in LM Studio to be shown. To get that done, I had to pass in more documentation about how the API works and have it fix the code written so far. To be honest, this is the same with state-of-the-art models: you have to keep coding, you have to fix bugs. But it is good to be aware that local models simply aren't as good as, for example, the latest Claude Opus model, so you have to be realistic and expect more bugs simply because your models aren't as strong.
One great way to fix bugs is to make sure the LLM agent can directly call the backend APIs you're integrating with, because that way it can self-assess whether it's calling them properly. So in this case I'm giving it instructions for calling the LM Studio API on its own, so it can align the API's output format with the code it's writing, making it much more accurate and bug-free. Given some extra time, you can now see that we have a nice overview of the models, and it indeed knows that the Qwen 3.5 model is installed. Even here, there are a couple of odd hard-coded details, like this 256k context window,
which is the model's maximum context window, but not the actual limit I configured. Even so, you can see we're still working on some of the endpoints, but at the very least the models endpoint is returning a valid response from the server. Granted, we still have some work to do here, but clearly we're able to create a real full-stack application using a local model connected via Claude Code.
And in fact, that model isn't even running on my MacBook; it's running on the Linux machine using LM Studio's link feature. I hope you enjoyed this new way to work with local models, and you should definitely try this workflow for yourself, because it's much more powerful than what was possible two years ago. It's still not the same as using the best state-of-the-art cloud models, but if you're a privacy enthusiast, you should definitely get into this, because local AI coding has never been better. If you want to learn more like this, subscribe to the channel, and also check out my AI engineering community via the link in the description below and sign up for my free resources.