
The Unbeatable Local AI Coding Workflow (Full 2026 Setup)

Transcript


0:00

You'll learn the best local AI coding workflow for 2026. In this video, we will be using the latest and greatest Qwen models, routing our local models through Claude Code, and even using any AI model that you want on weak laptops locally, using LM Studio Link. You don't want to miss out on this. So, let's get right into it.

Welcome to my local setup. This is my Linux machine with my RTX 5090 with 32 GB of VRAM. I'm going to be using a couple of models throughout this video. And the first one is a new Qwen 3.5 model with 35 billion parameters. And you can see that my GPU blazes through this Python code at more than 100 tokens per second. And this is because this model is quite big, but not all of the parameters are active when you're asking a question. This is the benefit of a mixture-of-experts model, which is very common with modern local AI systems. So you can even see it: 140 tokens a second, and it's already done with this question. What's important to realize here is that if you cannot fit the entire model on your GPU, like I'm going to show now, the performance of the AI model is going to be much worse.
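A rough way to check up front whether a model's weights even fit in VRAM (an illustrative rule of thumb, not an exact formula — real usage varies with quantization format and runtime overhead):

```python
def weights_vram_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.1) -> float:
    """Rough VRAM needed for the model weights alone (no KV cache).

    `overhead` accounts for runtime buffers; 1.1 is a guess, not a measured value.
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 1e9

# A 35B model at 4-bit quantization vs. a 32 GB card:
print(f"{weights_vram_gb(35, 4):.1f} GB")  # well under 32 GB
# The same model at 8-bit would be roughly double, and no longer a comfortable fit.
```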

1:04

In this case, I'm actually loading some of the parameters of the model into my system RAM instead. And that's going to lead to a lot of data having to be transported back and forth. And you will see that the performance will be much poorer. So just because you can fit a model on your system by putting some of the parameters in your system RAM doesn't mean that it's actually going to be usable in practice. Especially for agentic coding, you're going to be using very big context windows, where the compute cost scales roughly quadratically with context length. So you really have to experiment and see which model can truly fit on your GPU at acceptable speeds if you want to really code proper solutions with it.

Next, I want to be exposing this local model to my MacBook, which is my main development environment. I could be running a similar model there, but it will be much slower compared to this GPU. So, we're going to be using LM Studio's new linking functionality to expose an encrypted connection between two devices, so I can effectively run this model "locally" on the MacBook. It's very easy to set up. All I have to do is basically open LM Studio on these two different devices. So, I've logged into LM Studio on the Ubuntu device. And now we're going to just hop over to the MacBook and open it up here. And then here on the MacBook, I can already just browse to the linking functionality. And then I will see that Ubuntu machine pop up immediately.

2:26

And indeed, the Qwen 3 Coder model is already shown as loaded in here. And it's going to be super seamless to ask a question to this local model. Now, I can just select it in here as a linked model. And then I can do a new chat and ask it to generate some Python code. So now you can see that the model is available as a linked model from my Ubuntu machine. And just to prove that it's running locally: here on the top right, you can see that my GPU is starting to spike up, and it's only doing it for a short moment because I just asked it to generate a very simple Python script. But we do have this connection set up properly now, which is very nice.

So now what's next? Well, LM Studio is a nice chat interface, but it's not really a good interface for truly coding some complex solutions. So, I'm going to be connecting Claude Code to LM Studio. And the first step is to enable the local server so that I can point Claude Code to it, because since a couple of months ago, you can connect Claude Code to LLMs of your choice. You don't just have to rely on the models by Anthropic anymore. So, I don't know how to do this off the top of my head. And in this chat, I was just basically asking it to research how to change its own settings so that it could point itself to the LM Studio API. Now, it's good to know that LM Studio exposes an API that has multiple endpoints.

3:43

There is an API that's compatible with the OpenAI API standard, but there's also a specific one that's compatible with the Anthropic one, which is probably the easiest one to use here because Claude Code expects that. If I go to LM Studio, you will be able to see those different supported endpoints. We've got, you know, a chat interface, which is the LM Studio API, but there's an OpenAI-compatible endpoint as well. But more importantly, there is also an Anthropic-compatible endpoint, v1/messages, which is the one that we want.
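If you want to sanity-check that endpoint by hand, a minimal Anthropic-style request body looks roughly like this (a sketch: the model name is a placeholder, the port assumes LM Studio's default local server, and the exact path and required headers should be checked against LM Studio's docs):

```python
import json
# urllib.request would send this; the call is commented out so the sketch is self-contained.
# import urllib.request

# Minimal Anthropic-style /v1/messages body; the model id is whatever LM Studio has loaded.
payload = {
    "model": "qwen3.5-35b",   # placeholder — use your loaded model's identifier
    "max_tokens": 128,
    "messages": [{"role": "user", "content": "Say hello in one line."}],
}
body = json.dumps(payload).encode()

# req = urllib.request.Request(
#     "http://localhost:1234/v1/messages",   # 1234 is LM Studio's default port
#     data=body,
#     headers={"content-type": "application/json", "x-api-key": "lm-studio"},
# )
# print(urllib.request.urlopen(req).read().decode())
```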

4:12

So I can basically just tell Claude Code that we have that endpoint available, so it can give us the right recommendation for the command to connect to it. While the AI is thinking, I want to make sure that you've already subscribed to this channel, because most of the people watching my channel are not subscribed. And if you don't subscribe, you will miss out on the latest in AI engineering. So make sure to click the button below.

So after a little while, it basically asks us to export these two environment variables to override the Anthropic base URL and API key. There might be many other ways to get this done, but this will work for me. So I'm overriding the Anthropic base URL and key now and just saying hello to Claude over the command line. And you can indeed see that that command is now being sent to my local GPU. It's actually taking a while to respond. The reason for this is that Claude Code injects quite a lot of context into its system prompt, and it's very easy to miss out on this detail and think that everything is going to be as fast as an empty chat, but that's not true at all.
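For reference, the two overrides mentioned above can be set like this (a sketch — the port assumes LM Studio's default local server, and the key value is arbitrary since LM Studio doesn't validate it):

```shell
# Point Claude Code at LM Studio's Anthropic-compatible endpoint
# (localhost:1234 is LM Studio's default; use the Linux box's address if going over the LAN).
export ANTHROPIC_BASE_URL="http://localhost:1234"
# Claude Code expects some key to be present; LM Studio ignores the value.
export ANTHROPIC_API_KEY="lm-studio"
# Then launch Claude Code from this same shell:
# claude
```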

5:10

And this is what a lot of YouTube videos are actually missing. You will see that your AI model will be much slower when you connect it over Claude Code. It's not really a free lunch. You can see right here that it takes a long time for it to process the prompt, because Claude Code simply gives it a huge system prompt with all kinds of directives on how to code properly. As for all these videos promoting Claude Code with local models: I feel like most of the people promoting this are not using it themselves, because unless you have a very powerful machine, this is going to be extremely slow as your repository grows in size. It's simply one of the limitations of local AI coding. Regardless though, you're able to customize this prompt and cut some things out of it, or use a different CLI provider that has a leaner prompt. But in my case, it just takes a while, and now it's finally starting to generate that answer. Again though, I'm using a coding model that doesn't fit entirely on my GPU. So, we're going to be optimizing that later. For now, you'll be able to see that we get that response: "How can I help you?" Well, it took a long time to get that response, but this is because of that context window that's being filled by the system prompt of Claude Code.

Next, what we want to do is optimize this a little bit, because we're obviously not going to be able to code if it takes 2 minutes to get any kind of response. So now what we're going to be doing is switching to the Qwen 3.5 model, which is not specifically made for programming, but it's still a very competent language model, so it will do a pretty good job, and this will fit on my GPU entirely.
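Weights are only part of the VRAM story, though: the KV cache for a long context window takes memory too. A back-of-the-envelope sketch (the layer count, KV-head count, and head size below are assumed, typical-looking values — not Qwen's actual dimensions):

```python
def kv_cache_gb(context_tokens: int, layers: int = 48, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """Rough KV-cache size: 2 (K and V) * layers * kv_heads * head_dim bytes per token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token / 1e9

# An 80,000-token window with these assumed dimensions takes several GB on top of the weights;
# a 4,000-token window is under 1 GB:
print(f"{kv_cache_gb(80_000):.1f} GB vs {kv_cache_gb(4_000):.2f} GB")
```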

6:37

Now, I'm making one mistake on purpose. I'm using the default settings of LM Studio with only a 4,000-token context window. And you will see that if I try this request again, it will hang indefinitely. Because the Claude Code system prompt is thousands of tokens long, we are actually going to be hitting the limit of my local model immediately, and there's no clear error message indicating this. So this is another tip to look out for when you are trying to combine CLI tools that expect a huge context window when you haven't set it up properly yourself. So now we're going to be increasing the context window to something closer to 80,000 tokens. And this is also necessary because, as you ingest a lot of files to be able to answer code questions or to be able to come up with new API endpoints, you need to have a long context window. And now you will see that it's actually responding a lot faster.

But one thing that's a little bit weird about this answer is that it says that it's Sonnet. How come? Well, again, Claude Code is injecting the system prompt into the language model. And even though we're using a Qwen model, because that system prompt says that it's Claude Sonnet, it thinks this as well. It's another very important thing to realize about language models: they don't always have self-awareness of the model that they actually are. The system prompt that they are fed really dictates their behavior very much. Now, if I add these environment variables to my terminal, I can launch the regular Claude UI and it will use my local AI model.

8:05

We're getting a bit of an auth conflict here, but for now, it's fine to ignore that. So, we can finally start building something. And to test out my model in detail, we're going to be building a bit of a full-stack application to interact with the LM Studio API. Why not? So, we can just say: plan out a sample repo that has a Next.js TypeScript project to showcase your ability to create a full-stack app. Be a little bit creative with the concept and don't just, you know, recommend a lame to-do-list app, because we have seen a million of those already. In fact, we are showcasing LM Studio's ability to share models between PCs. It would be nice if you can mimic their UI that shows the health of the server with loaded models by exploring the API available at... and then, in LM Studio, I'm just going to paste the documentation of the API, because I'm just searching here for the right endpoint. There's a REST API document that you can get to. There we go. Open documentation. And I could manually copy-paste this, but there's a simple "copy as markdown" button. So, I'll paste in the entire description to just ground the model in the API of LM Studio, because it probably doesn't know that.
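That grounding matters because the shape of the REST responses is exactly what the dashboard has to parse. Roughly, the models endpoint returns entries like this (an illustrative sample — the field names follow LM Studio's REST API docs as I recall them, so double-check against your version):

```python
import json

# Illustrative response from GET /api/v0/models on the LM Studio REST API;
# exact fields may differ between LM Studio versions.
sample = json.loads("""
{
  "data": [
    {
      "id": "qwen3.5-35b",
      "type": "llm",
      "state": "loaded",
      "max_context_length": 262144
    }
  ]
}
""")

# The dashboard mainly needs two things: which models exist, and which are loaded.
loaded = [m["id"] for m in sample["data"] if m["state"] == "loaded"]
print(loaded)  # ['qwen3.5-35b']
```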

9:21

And now we're going to go ahead and use plan mode with my local model. And you can see it actually starts to respond pretty quickly, because with everything I've set up now, I've optimized it to run on my 5090 directly, and it's able to take care of quick responses. So, it's just going to go ahead and explore the codebase, which is not too exciting because there's nothing in the codebase as of yet. And then it's going to create the plan based on all of that. And now it starts to ask me questions. So this shows you that the local model, even though it's not, you know, Claude Opus, the latest and greatest, it does actually use tool calling pretty well, because now it's asking me a couple of questions, like the primary focus of this demo application. So I'm just going to say that it should just be a simple dashboard as proof of concept, and the Next.js back end will sit in the middle as a proxy. So we could just have a simple HTML page that would interact with the API directly. But I want to prove that this system can build a full-stack app. So we're just going to have the Next.js back end pass the requests from the front end to the LM Studio API.

10:22

Hence, I'm just going to call it a proxy for now. And then in terms of interactivity, well, I want to keep this simple. We're just going to have a simple connection to the LM Studio API, and it's just going to be, you know, a simple dashboard. And then in terms of the AI/ML component, we're just going to leave that out for now. We're just going to keep it simple.

Now, after a bit of planning, you can see here that I've used 45,000 tokens out of my 200,000 tokens, but that doesn't really represent the real local AI model. Because this is just Claude Code, it thinks I'm using Claude Sonnet 4.6. So, this might not represent the maximum amount of tokens depending on the local model configuration you have, but it is nice to see how many of the tokens are being used by, for example, the system prompt, which is indeed already 3,000 tokens, as well as all the messages you've sent so far. So you know when you maybe have to summarize a conversation or clear out and start with a fresh conversation history. And after a while, by just time-skipping here, we've got seven different spec files that we can implement in order to build out this full-stack dashboard. So we're really just working on this entire, you know, agentic flow, where we're first creating our specs.

11:28

Then we'll have the AI agent work them out, and we should end up with a pretty nice end result here. And just checking up on some of the code samples that it's writing, you can see that there is some scaffolding code here, where we're going to have a back end that's going to call that v1 API on the LM Studio side. And after all of this planning, we ended up using 65,000 tokens. So what happens when I try to fill the context window? Well, I'm just pasting a bunch of extra stuff in here to show you that it is still able to respond, no problem at all.
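Different overflow strategies exist, as explained next; conceptually, the truncate-the-middle option behaves something like this sketch (my simplified illustration of the idea, not LM Studio's actual code):

```python
def truncate_middle(messages: list[str], keep_head: int = 4, keep_tail: int = 8) -> list[str]:
    """Drop the middle of a too-long history, keeping early context and recent turns."""
    if len(messages) <= keep_head + keep_tail:
        return messages
    return messages[:keep_head] + ["[... earlier turns dropped ...]"] + messages[-keep_tail:]

history = [f"turn {i}" for i in range(30)]
print(len(truncate_middle(history)))  # 13: 4 head turns + 1 marker + 8 tail turns
```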

11:58

And the moment that I do this, you will actually have different behaviors depending on how you configured LM Studio, because you can configure the behavior for when the context window of the model has been hit fully. For example, in this case, what I've done in my settings is I've set the context overflow to truncate the middle of the conversation history. This keeps some of that initial conversation history, where you explore the codebase, but it will basically get rid of a lot of things that happened in the middle of the conversation, which does of course reduce the memory of your LLM, but it frees up space for you to continue the conversation. Sometimes, though, Claude Code will take care of this on its own, and sometimes it will proactively summarize the conversation history. Again, it's just good to be aware that there are different ways to go about compressing your conversation history so you can keep chatting even if you have a limited context window.

Next, we want to implement the full solution. And I like to use the bypass-all-permissions mode for Claude Code so I don't have to press, you know, enter for every single small change. The way I'm going to do that safely is I'm going to run inside of a dev container. I've got many videos explaining how that works. It will basically isolate my development environment so I'm able to run Claude Code in bypass-all-permissions mode. And of course, I'm going to now set my model's context window to 200,000 tokens. And the main reason why I'm doing it this way is because I don't mind the decrease in speed. I have bypass-all-permissions mode on, so I can just walk away from my PC, come back later, and it's totally fine if it takes 20 minutes longer to work out this full-stack application. Now, I'm going to ask it to sequentially work out each spec. One thing that's important to note is that I'm explicitly asking it to create subagents for each task. This means that it will create a new instance of Claude Code with a fresh context window to work on one piece of work and then report back to the main agent. This way I'm able to get much more out of the limited context window that I might have for a local model. So I definitely recommend you to work with subagents more than ever if you're doing local AI coding.

So after a while, 30 minutes or so to be exact, we have a dashboard that seems to be working, but I ran into quite a couple of bugs. And to be honest, there was some hard-coded information here, like this Nvidia RTX 3080 GPU being used. That's just, you know, made up on the spot by the LLM.

14:13

It's pretty typical, right? If you don't specify everything in detail, it's going to make things up. But for the purpose of this demonstration, I want the real models that are loaded in LM Studio to be shown. And to get that done, I had to pass more documentation about how the API worked and get it to actually fix the code that it had written so far. To be honest, this is the same for state-of-the-art models. You have to keep coding. You have to fix bugs. But it is good to be aware that the local models simply aren't as good as what you get from, for example, the latest Claude Opus model. So, you do have to be realistic and realize that you're probably going to get more bugs simply because your models aren't as strong. One great way to fix bugs is actually to make sure that the LLM agent is able to call the backend APIs directly that you're trying to integrate with, because that way it's able to self-assess whether it's calling the APIs properly. So in this case I'm giving it instructions for how it can call the LM Studio API on its own, so it can align the output format of the API with the code that it's writing. So it'll be much more accurate and bug-free.

Given some extra time, you can now see that we have a nice overview of the models. And it indeed also knows that the Qwen 3.5 model has been installed. Even here, we can see there's a couple of weird details that are hard-coded, like this 256k context window, which is the maximum context window for that model, but it's not the actual, you know, limit that I configured. But even so, you can see here that we're still working on some of the endpoints, but at the very least, the models one is returning a valid response from the server. So granted, we still have some work to do here, but clearly we're able to create a real full-stack application using a local model connected via Claude Code.

15:55

And in fact, that model is not even running on my MacBook; it's running on the Linux machine using the link feature of LM Studio. So I hope that you enjoyed this new way to work with local models, and you should definitely try this workflow for yourself, because it is much more powerful than what was possible 2 years ago. It's still not the same as using the best state-of-the-art cloud models, but if you are a privacy enthusiast, you should definitely get into this, because local AI coding has never been better. If you want to learn more like this, definitely subscribe to the channel, but also check out my AI engineering community in the link in the description below and sign up for my free resources to learn

Interactive Summary

This video demonstrates a 2026 local AI coding workflow, utilizing LM Studio to run powerful models like the new Qwen 3.5 (35B parameters) on a Linux machine with an RTX 5090 GPU. The presenter showcases how to link this powerful setup to a MacBook, effectively running the local AI "locally" on the less powerful device. A significant portion covers integrating Claude Code with LM Studio's local server via an Anthropic-compatible API. Challenges with model performance due to large Claude Code system prompts and limited context windows are discussed, along with solutions like optimizing models and increasing context size. The workflow is demonstrated by building a full-stack Next.js application that interacts with the LM Studio API, highlighting the importance of sub-agents and extensive API documentation for debugging and improving local model accuracy. While local models may not match cloud-based state-of-the-art models in quality, the workflow offers a powerful and privacy-focused alternative for AI engineering.
