Z.ai GLM 4.6: What We Learned From 100 Million Open Source Downloads

Z.ai GLM 4.6: What We Learned From 100 Million Open Source Downloads — Yuxuan Zhang, Z.ai

Transcript

0:01

Hello everyone. I'm Yuxuan Zhang from Z.ai, and I'm very happy to be here to talk about our latest model series, the GLM-4.6 series. Let's jump right in.

0:18

First, I will introduce the GLM model series. GLM-4.6 is not our first open-source model. Since 2022, starting from the very first GLM-130B, we have been quite serious about open-sourcing our work. Over the years, we have released a whole family of models, such as ChatGLM-6B for language modeling, CogVLM for vision understanding, CogView for image generation, CogVideo for video generation, and many more across different domains.

0:55

On this slide you can see a map of our open-source models so far, grouped by color. White is for the language models, the GLM series. Pink is for multimodal understanding, such as CogVLM, which is now called GLM-V. Green is for image generation, and yellow is for video generation.

1:20

2025 is our open-source year. This year we added even more models, including the GLM-4-0414 dense models, such as 9B and 32B, and the GLM-4.5 and GLM-4.6 MoE series, which is actually our first MoE model family. Up to now we have released over 65 models in total, and across platforms like Hugging Face, ModelScope, and others we have already passed 100 million downloads. If you search for GLM or CogVideo on GitHub, you will find 1,500 community projects built on top of them, so it is very much a community-driven ecosystem. Now let's move to GLM-4.6. I will introduce it now.

2:20

GLM-4.6 is our latest flagship model. On many public benchmarks, especially in math and coding, GLM-4.6 shows a clear gain over GLM-4.5. It also outperforms open-source models released in the same period, like DeepSeek V3.2, and even beats commercial models such as Claude Sonnet 4 on several benchmarks. Of course, if we compare to Claude Sonnet 4.5, there is still a noticeable gap. So we are not winning at everything, but we are getting closer and closer.

2:59

But what makes us especially happy is LMArena. This benchmark is closer to real user preference, and on LMArena GLM-4.6 is tied for number one together with GPT-5 and Claude Sonnet 4.5, and it is the only open-source model there. So I'm very happy and grateful, and I want to thank the developers who tried our model and voted for it. Now let's move to CC-Bench.

3:36

Besides the user-preference benchmark, we also built our own dataset called CC-Bench. Here we want to test agent-style coding in the real world, not just isolated problems. So we built an agentic coding test platform based on Claude Code, and on top of that we created CC-Bench version 1.1. Compared with version 1.0, the new version adds 22 hard coding tasks, and we systematically evaluated Claude Sonnet 4, GLM-4.5, Kimi K2, and DeepSeek-V3.1-Terminus.

4:18

In total, CC-Bench has 74 tasks covering front-end development, internal tool development, data analysis, and algorithm implementation. For every model we record the full agent trajectory: the query, the planning steps, the tool calls, the code edits, and the execution. We fully open-sourced this benchmark, so you can check the link below on Hugging Face. GLM-4.6 made a clear jump over GLM-4.5, and its overall performance is close to Claude Sonnet 4, with about a 68.6% win rate, while being significantly better than the other open-source baselines. So where does the performance come from? Let's talk about GLM-4.6 training.

5:08

We will start from the data and training design. The first part is general pre-training. We start with about 15 trillion tokens of general-purpose data, including web pages, books, Wikipedia, multilingual content, and so on. This stage is about building a strong all-rounder base model. The context length here is 4,000 tokens.

5:41

The next step is what we call reasoning continued training. On top of that base, we train with about 7 trillion tokens of extra code and reasoning data. Part of this comes from high-quality open-source repos, and another part is math, science, and contest problems with full step-by-step reasoning.

6:05

Then we come to mid-training. We move to repo-level code, which includes multiple files, issues, pull requests, and diffs from the same project, all packed into one long context. The goal is to teach the model to follow code across files, understand the changes in pull requests, and read the real project structure end to end. At this stage we extend the context to 32,000 tokens, so the model can basically see the key files of a medium-sized repo in one shot.

6:47

Then there is synthetic reasoning data. We added about 500 billion tokens of synthetic reasoning data. It covers math, science, and algorithms with explicit thinking traces. It is meant to lay the groundwork for future agent behavior, like breaking down a task, reflecting on mistakes, and doing long-chain reasoning.

7:08

The next step is long-context and agent data. Finally, we use about 100 billion tokens of long-context and agent data. Here the sequence length is pushed further to 128,000 tokens, and for GLM-4.6 it is 200,000. So the model can handle full documents, whole codebases, and very long chats. At the same time we feed in lots of agent trajectories, including multi-step tool calls, search, code execution, and so on. This phase improves the model's long-context capability and agent capability.
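The stages described above can be collected into a small table to see how the token budgets add up. This is only a bookkeeping sketch: the stage names, the `Stage` class, and the helper are illustrative, and any figure the talk does not state (for example the mid-training token count) is left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    name: str
    tokens: Optional[int]          # None where the talk gives no figure
    context_length: Optional[int]  # None where the talk gives no figure


# Figures as stated in the talk; names are illustrative labels only.
STAGES = [
    Stage("general pre-training", 15_000_000_000_000, 4_000),
    Stage("reasoning continued training", 7_000_000_000_000, None),
    Stage("repo-level mid-training", None, 32_000),
    Stage("synthetic reasoning data", 500_000_000_000, None),
    # 128,000 is the figure given first; the talk says GLM-4.6 uses 200,000.
    Stage("long-context and agent data", 100_000_000_000, 128_000),
]


def total_stated_tokens(stages):
    """Sum only the token counts that were actually stated."""
    return sum(s.tokens for s in stages if s.tokens is not None)


print(f"{total_stated_tokens(STAGES) / 1e12:.1f} trillion tokens stated")
```

Under these numbers the stated budgets sum to 22.6 trillion tokens, dominated by the first two stages, while the later stages are small in tokens but are where the context window grows.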

7:51

Also in this slide we introduce slime. It is our reinforcement learning framework, built on the SGLang inference stack. In practice, we designed this in-house training framework and we also open-sourced it. We found that different tasks need very different system designs.

8:15

For short reasoning tasks like math or code completion, the best setup is a colocated, synchronous architecture. We run training and inference on the same GPUs. After one batch updates the weights, the next batch is immediately sampled from the latest policy. This squeezes the most out of GPU memory and compute. But agent tasks, for example real software engineering, usually have many steps, such as opening a browser, hitting a backend API, waiting for an external response, and so on. So if we force every worker to stay in sync, the fast ones get dragged down by the slowest ones and the GPUs sit idle.

9:03

So in slime we designed a hybrid architecture that supports both synchronous and asynchronous modes. If you look at the diagram, the blue part is the Megatron-based training engine, which reads from the data buffer and updates the weights, and the green part is the high-throughput inference cluster, with a router to dispatch requests. In the middle, the data buffer acts as the shared layer between them: one side connects to training and the other side to the different agent environments.

9:37

For regular reinforcement learning tasks, we keep training and inference on the same GPU pool, using the synchronous mode with dynamic sampling, instant weight updates, and maximum throughput. Once we switch to complex agent tasks, we move to a decoupled, asynchronous mode. The rollout part talks directly to real environments, generates trajectories, and writes them into the buffer. Then the training side consumes the data at its own pace, updates the model, and periodically pushes new weights.
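The decoupled mode described above is essentially a producer and consumer sharing a buffer. The following is a minimal sketch of that pattern using Python's `asyncio`, not slime's actual API: the function names, the delays standing in for slow environments, and the batch size are all assumptions made for illustration.

```python
import asyncio


async def rollout_worker(worker_id, delay, buffer, num_trajectories):
    """Hypothetical rollout worker: each trajectory takes `delay` seconds."""
    for step in range(num_trajectories):
        await asyncio.sleep(delay)  # stand-in for browser/API/tool latency
        await buffer.put({"worker": worker_id, "step": step})


async def trainer(buffer, batch_size, total):
    """Consume trajectories at its own pace; never waits for a specific worker."""
    consumed, updates = [], 0
    while len(consumed) < total:
        need = min(batch_size, total - len(consumed))
        batch = [await buffer.get() for _ in range(need)]
        consumed.extend(batch)
        updates += 1  # stand-in for a gradient step plus a weight push
    return consumed, updates


async def run_decoupled(delays=(0.001, 0.002, 0.02), per_worker=4, batch_size=4):
    buffer = asyncio.Queue()  # plays the role of the shared data buffer
    total = len(delays) * per_worker
    workers = [
        asyncio.create_task(rollout_worker(i, d, buffer, per_worker))
        for i, d in enumerate(delays)
    ]
    result = await trainer(buffer, batch_size, total)
    await asyncio.gather(*workers)
    return result
```

Calling `asyncio.run(run_decoupled())` performs three updates over twelve trajectories. The trainer takes whichever trajectories arrive first, so the slow worker delays only its own data rather than every update.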

10:14

So the nice thing is that even if some tasks are super slow, they don't block the whole training pipeline. On top of that, we have done a bunch of efficiency optimizations. For example, the main training still runs in BF16 for stability, but after each policy update we do block-wise FP8 quantization on the latest weights and send the FP8 version to the rollout workers. So the most expensive part, the data generation, runs in FP8 with much higher throughput, while training keeps BF16 precision. In practice, we get the benefit of both accuracy and speed in this framework.
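To make the block-wise FP8 step concrete, here is a small pure-Python simulation. It is a sketch under stated assumptions, not the framework's kernel: it assumes the E4M3 variant of FP8 (maximum finite value 448, 3 mantissa bits), a tiny block size of 4 for readability, and it simulates rounding in ordinary floats instead of using a real FP8 dtype.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude in the E4M3 format


def round_to_e4m3(x):
    """Simulate rounding to E4M3 precision (3 mantissa bits, min exponent -6)."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    exponent = max(math.floor(math.log2(abs(x))), -6)
    quantum = 2.0 ** (exponent - 3)  # spacing between representable values
    return round(x / quantum) * quantum


def quantize_blockwise(weights, block_size=4):
    """Return (quantized values, per-block scales); one scale per block."""
    quantized, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        amax = max(abs(w) for w in block)
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        quantized.extend(round_to_e4m3(w / scale) for w in block)
    return quantized, scales


def dequantize_blockwise(quantized, scales, block_size=4):
    """Map simulated FP8 values back to real magnitudes using block scales."""
    return [q * scales[i // block_size] for i, q in enumerate(quantized)]
```

Because each block gets its own scale, a block of small weights is not crushed by a large outlier elsewhere in the tensor, which is the usual motivation for scaling per block rather than per tensor.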

11:00

Now let's zoom in on reasoning RL, in this slide with some plots. The first one is about the two-stage curriculum we use. We don't train on one fixed dataset from start to finish. Instead, we use a two-stage difficulty curriculum. In stage one, we use medium-difficulty problems. In each batch, some answers are right and some are wrong, so the rewards have variance and the gradients are meaningful. As the model gets stronger, we switch to extremely hard problems in stage two, ones where with 512 samples you can still occasionally get a correct solution. You can see in the plot that the blue curve is our method: after switching to hard problems, the curve keeps going up. By contrast, staying on medium difficulty all the way, shown by the red curve, does not keep improving.

11:47

The next picture is about single-stage reinforcement learning at 64,000 tokens. Some previous works use multi-stage reinforcement learning, for example 16K, then 32K, then 48K, and finally 64K. But we found that for a model that has already been trained with 64,000-token SFT, those shorter RL stages actually make it forget its long-context ability, so the average output length shrinks, and the final 64K-token stage can't fully recover the loss. The red curve here is our approach. We start directly with 64,000 tokens and train in one single stage, and it clearly outperforms the blue multi-stage curve.
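The two-stage difficulty curriculum described above can be sketched as a filter over measured pass rates. The thresholds below (`low`, `high`, `max_correct`) are assumptions chosen for illustration; the talk only says that stage one uses medium-difficulty problems and that stage two uses very hard problems that are still occasionally solved within 512 samples.

```python
def reward_variance(pass_rate):
    """Variance of a 0/1 reward when a fraction `pass_rate` of samples is correct."""
    return pass_rate * (1.0 - pass_rate)


def select_stage_one(pass_rates, low=0.2, high=0.8):
    """Medium difficulty: some samples right, some wrong (thresholds assumed)."""
    return sorted(pid for pid, rate in pass_rates.items() if low <= rate <= high)


def select_stage_two(correct_out_of_512, max_correct=8):
    """Hard but solvable: at least one, yet only a few, correct in 512 samples."""
    return sorted(
        pid for pid, n in correct_out_of_512.items() if 1 <= n <= max_correct
    )


# Hypothetical measurements for four problems.
rates = {"a": 0.0, "b": 0.5, "c": 1.0, "d": 0.25}
print(select_stage_one(rates))                     # problems with mixed outcomes
print(reward_variance(0.0), reward_variance(0.5))  # no signal vs. maximal signal
```

A problem that is always solved or never solved gives zero reward variance, so it contributes no learning signal. That is why both stages keep only problems with mixed outcomes at the model's current strength.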

12:51

The picture below is about code. On the bottom left we compare two ways of computing the loss for code. The blue one is the classic sequence-mean loss, where each sequence has one loss value, and the red one is our token-weighted mean loss, which averages over tokens instead of sequences. The token-weighted version converges faster and more steadily, and it reduces the chance of generating very short template answers just to get the reward.
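The difference between the two averaging schemes is easy to see on a toy batch. This sketch uses made-up per-token loss values and plain Python lists; it illustrates only the averaging, not the actual RL objective used in training.

```python
def sequence_mean_loss(per_token_losses):
    """Average within each sequence first, then across sequences."""
    per_sequence = [sum(seq) / len(seq) for seq in per_token_losses]
    return sum(per_sequence) / len(per_sequence)


def token_mean_loss(per_token_losses):
    """Average over all tokens in the batch, regardless of sequence."""
    total = sum(sum(seq) for seq in per_token_losses)
    count = sum(len(seq) for seq in per_token_losses)
    return total / count


# Hypothetical batch: one short answer (2 tokens) and one long answer (8 tokens).
batch = [[2.0, 2.0], [1.0] * 8]
print(sequence_mean_loss(batch))  # the short sequence carries half the weight
print(token_mean_loss(batch))     # every token carries equal weight
```

With the sequence mean, each of the two tokens in the short answer carries a weight of 1/4, against 1/16 for each token of the long answer. With the token mean, all ten tokens weigh 1/10, so short outputs no longer dominate the gradient.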

13:24

And on the right you can see the data experiment. We ran science reinforcement learning with GPQA Diamond as the benchmark, and the message is almost the opposite of "more data is better". The red curve is trained only on a small set of expert-verified, high-quality multiple-choice questions, and the blue curve uses mixed-quality data. The result shows that a small but clean dataset gives much better performance. So for scientific reasoning, data quality really matters more than raw size.

14:06

After talking about the GLM-4.6 language model, we move to the multimodal model. GLM-4.5V supports both image and video understanding. It is our latest visual understanding model, and on grounding and image understanding benchmarks it shows strong performance and clear advantages over other open-source models released around the same time.

14:35

Architecturally, there are three main parts here. The first is a vision transformer encoder, then an MLP projector, and finally the GLM-4.5 base model as the decoder. We try hard to keep the visual input as original as possible, so the model can see the image at its native resolution and aspect ratio instead of forcing everything into a fixed square. This matters a lot for screenshots, long vertical images, and PowerPoint slides. For video, we also insert a time index token after each frame, basically telling the model which second each frame comes from. This helps it understand temporal order and sequence, which is crucial for action understanding and step-by-step procedures.
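The time index idea above can be sketched as interleaving a timestamp marker after each frame's tokens. The placeholder strings and the `<t=...s>` marker format here are hypothetical, chosen only to show the ordering; the model's real special tokens are not given in the talk.

```python
def interleave_time_tokens(frame_placeholders, fps):
    """Insert a time marker after each frame, derived from its index and frame rate."""
    tokens = []
    for index, frame in enumerate(frame_placeholders):
        tokens.append(frame)
        tokens.append(f"<t={index / fps:.1f}s>")  # hypothetical marker format
    return tokens


frames = ["<frame_0>", "<frame_1>", "<frame_2>"]
print(interleave_time_tokens(frames, fps=2))
```

With frames sampled at 2 per second, the three frames are tagged 0.0 s, 0.5 s, and 1.0 s, so the position of each frame in time is explicit in the sequence and does not have to be inferred from token order alone.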

15:32

We also use a method from our earlier research on CogAgent. The GUI agent capability is now also supported in GLM-4.5V. So it can help you control a computer or a website: it can use the mouse, the keyboard, or touch to interact with your browser, computer, or mobile environment.

16:02

So how do you use the GLM-4.6 or GLM-4.5V models? The first way is to use the open-source weights. As we know, both of these models are open source, so you can use SGLang, vLLM, or other frameworks for inference. Along with the weights, on release day we already had SGLang and vLLM integration ready, and we also work with many third-party open-source frameworks like LLaMA-Factory and ms-swift. So thank you to the community. You can choose any framework you want to try our model.

16:39

But GLM-4.6 is a large model, with 355 billion parameters. So if you don't have H100s or other GPUs like that, there is an easier way to use our model. In this slide we show the deploy commands using SGLang or vLLM.
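A rough back-of-the-envelope estimate shows why a 355-billion-parameter model is hard to self-host. The sketch below counts weight memory only and assumes 80 GB accelerators (the H100's common configuration). It ignores KV cache, activations, and framework overhead, so real deployments need more headroom than this lower bound.

```python
import math


def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed just to hold the weights, in gigabytes (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(num_params, bytes_per_param, gpu_memory_gb=80):
    """Lower bound on GPU count; ignores KV cache and activations."""
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / gpu_memory_gb)


PARAMS = 355_000_000_000  # parameter count stated in the talk

for name, nbytes in [("BF16", 2), ("FP8", 1)]:
    print(name, weight_memory_gb(PARAMS, nbytes), "GB,",
          min_gpus_for_weights(PARAMS, nbytes), "GPUs minimum")
```

Under these assumptions the weights alone take 710 GB in BF16 and 355 GB in FP8, that is, at least nine or five 80 GB GPUs respectively before any serving overhead is counted.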

17:09

On the next slide, you can use GLM on Z.ai. This is a website where you can try the model directly. You can use it to write code, to generate PowerPoint slides, and so on. In this demo, a single instruction is used to write a Google-style search page, so you can just talk to it.

17:43

GLM is also well known for its coding capability. So we also provide the GLM Coding Plan, which connects GLM with tools and plugins such as Claude Code and other coding development tools, to provide a very strong coding assistant experience. We also have a short demo video that shows how to replace the model in Claude Code with GLM-4.6, and you can watch it on YouTube.

18:23

Then there are our community activities. Beyond today's talk, we regularly host events both online and offline. Whenever we release a new model, we usually run several community sessions afterwards. The first one here is an AMA on Reddit, and we also have offline, on-site technology sharing sessions, so you can join us.

18:55

The final slide has some important links you may want to know. The first is the website, as I mentioned before, where you can try GLM models at Z.ai, and our API is here as well. We also provide the GLM-4.6 technical blog and the GLM-4.5 tech report, which you can check out. If you want to join our community, here is the Discord link, and the GitHub link is below, with the open-source models and a README on how to deploy them with open-source frameworks. That's all for today. Thank you very much.

Interactive Summary

The presentation introduces the GLM model series, highlighting the flagship GLM-4.6 for math and coding and GLM-4.5V for multimodal understanding. The speaker details the training methodology, including the staged pre-training data, the slime reinforcement learning framework, the difficulty curriculum, and data quality findings. The presentation also explains how developers can access and deploy these open-source models through frameworks such as SGLang and vLLM, community tools, and Z.ai.