Profit or Poverty: Translation Look-aside Buffer (TLB)

Transcript

0:00

Hello friends. So, I've got another

0:02

technical blog that I've written

0:04

relevant to the topic of low-latency

0:06

systems. The reason I started

0:08

researching the TLB, which stands for a

0:11

translation lookaside buffer, is because

0:14

I spoke to someone that works in the

0:16

trading industry and they mentioned the

0:19

TLB when they were quizzing me for my

0:21

knowledge and I had no idea what it was.

0:23

I looked up what it is and it's a part

0:27

of the processor. It's actually on the

0:29

chip in the processor and its role is to

0:32

map virtual memory addresses to the

0:35

physical memory address on the RAM. So,

0:39

as part of looking into the TLB and how

0:42

to set it up so that you avoid latency,

0:46

I had to do a brief deep dive into how

0:50

memory actually works, why that

0:52

structure exists on the CPU chip, how it

0:55

works in practice, and a few of the

0:57

related things along the way. So, I'm

0:59

just going to go through this blog

1:01

that I've written, which is going to be

1:03

linked in the description as well, and

1:05

it starts off talking about the illusion

1:07

of memory. And I mean, you can pause and

1:10

just read it yourself, but I'll just go

1:12

through this and try to compress it even

1:15

more while I'm speaking. To start off

1:18

with, I've written here that

1:19

applications don't allocate memory

1:21

directly to physical addresses. The

1:23

operating system gives them virtual

1:26

addresses. Why is that? And I've written

1:27

here that it's for the purpose of

1:29

pretending that the application has a

1:31

contiguous block of memory,

1:33

when in reality its memory is scattered

1:34

across the RAM in fixed-size chunks

1:37

called pages, which are then tracked in

1:39

a page table. Now, why does the

1:41

operating system do that? Let's take a

1:43

little detour here for a moment. If a

1:45

process started and it requested 500 MB

1:49

of memory and you have 1 GB of memory

1:52

available, and then you allocate that

1:53

memory, and then another process comes

1:55

along and wants 300 megabytes and then a

1:59

third process comes along and wants 100

2:01

megabytes and then the one that needed

2:03

just 300 was killed and no longer needs

2:06

that memory. And then another process

2:08

comes along and it wants 400.

2:10

Technically, you have 400 available, but

2:13

it's not in a contiguous block. So, when

2:16

the process tries to

2:19

allocate to that memory, it could

2:21

accidentally allocate over the top of

2:24

the third process's memory. If you think

2:27

about just the

2:29

basics of how an array works,

2:32

you're doing arithmetic to find the

2:34

index, the location of an item in that

2:37

array.

2:38

But if your memory is split across

2:41

fragmented blocks of memory, i.e. it's

2:44

not contiguous, then how do you do that

2:47

operation? And thus, my gut feel is that

2:50

this concept of virtual memory was

2:52

devised in order to deal with perhaps

2:54

that problem and many other problems.

2:56

So, as I've detailed in that first

2:58

paragraph, the process doesn't know

3:01

necessarily that it's not directly

3:03

accessing physical memory. The operating

3:05

system tells it that it has a contiguous

3:07

block, but that contiguous block is

3:09

actually broken down into pages and they

3:12

are fixed size on Linux. I think the

3:15

default is 4 kilobytes. So, when a

3:18

process tries to access memory at a

3:21

particular address, there is a

3:23

calculation that is done to figure out

3:26

where that virtual address resides in

3:30

the physical RAM. The page table itself

3:33

lives in RAM. So, when the processor is

3:36

trying to access data at a particular

3:39

address, it has to first talk to memory

3:42

to figure out the mapping of the virtual

3:44

address that is being managed by the

3:46

operating system in order to get to the

3:48

physical memory, then load that and

3:51

perform an operation on it.
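The calculation mentioned above can be sketched in a few lines. This is a minimal illustration assuming 4 KiB pages, so the low 12 bits of an address are the offset inside the page and the remaining bits are the page number. The names `split_address` and `to_physical` are made up for this sketch.

```cpp
#include <cstdint>

// Assumption: 4 KiB pages, so the low 12 bits of an address are the offset.
constexpr std::uint64_t kPageShift = 12;
constexpr std::uint64_t kPageSize  = 1ULL << kPageShift;  // 4096 bytes

struct SplitAddress {
    std::uint64_t page_number;  // which page the address falls in
    std::uint64_t offset;       // byte position inside that page
};

// Splits an address into its page number and its offset within the page.
constexpr SplitAddress split_address(std::uint64_t addr) {
    return {addr >> kPageShift, addr & (kPageSize - 1)};
}

// Translation swaps the virtual page number for a physical frame number
// and keeps the offset unchanged.
constexpr std::uint64_t to_physical(std::uint64_t frame_number,
                                    std::uint64_t offset) {
    return (frame_number << kPageShift) | offset;
}
```

For example, the virtual address `0x12345678` splits into page number `0x12345` and offset `0x678`. If the page table says that page lives in frame `0xABC`, the physical address is `0xABC678`.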

3:53

Anyway, let's just continue on with the

3:55

blog. When a process starts, the

3:58

operating system allocates a range of

4:00

virtual addresses in the form of pages,

4:02

but it leaves them blank until they're

4:05

accessed, which means that they are lazy

4:07

loaded. What does that effectively mean?

4:10

It means that when that process tries to

4:12

access or write to one of those pages,

4:15

since it's lazy loaded, it will

4:16

encounter a minor page fault, and that

4:19

is slow. So, when the process tries to

4:21

access memory, the operating system

4:24

follows its virtual memory address to

4:26

the physical memory address and hands

4:28

back the data from the physical memory

4:30

address. The translation of virtual to

4:33

physical is not performed by the OS. It

4:36

is performed by the MMU, helped by the TLB. And the TLB

4:39

resides in the MMU, which is the memory

4:42

management unit. This is an actual

4:45

hardware component on the processor

4:48

chip, which, if you looked at it in

4:50

physical reality, would just be,

4:53

most likely,

4:55

you know, microscopic electrical

4:57

components.
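The lazy mapping described a moment ago can be observed from inside a process. The sketch below is a rough illustration, assuming Linux: it maps anonymous memory with `mmap`, touches one byte every 4 KiB, and reads the minor fault counter (`ru_minflt`) from `getrusage` before and after. The function names are made up for this sketch, and the count depends on the system. For example, transparent huge pages or a larger base page size will reduce it.

```cpp
#include <sys/mman.h>
#include <sys/resource.h>
#include <cstddef>

// Minor page faults taken by this process so far, as reported by the kernel.
long minor_faults() {
    struct rusage usage{};
    getrusage(RUSAGE_SELF, &usage);
    return usage.ru_minflt;
}

// Maps `bytes` of anonymous memory, writes one byte every 4 KiB, and returns
// how many minor faults those first touches caused. Returns -1 if mmap fails.
long faults_from_first_touch(std::size_t bytes) {
    void* mem = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;

    volatile char* p = static_cast<volatile char*>(mem);
    const long before = minor_faults();
    for (std::size_t i = 0; i < bytes; i += 4096) {
        p[i] = 1;  // first write to a page forces the kernel to map it
    }
    const long after = minor_faults();

    munmap(mem, bytes);
    return after - before;
}
```

Calling `faults_from_first_touch` a second time on a fresh mapping pays the faults again, because each new mapping starts out unmapped.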

4:58

And as probably most people that would

5:01

be interested in this topic would

5:03

already know, that sometimes when RAM is

5:05

not accessed, it is swapped out and

5:08

written to disk. Let's say some data

5:10

that is infrequently

5:12

accessed is swapped out to disk. Or

5:15

maybe if the system is running out of

5:17

memory, you can also swap to disk. So,

5:19

when the process tries to access that

5:21

memory again, that is a major page

5:24

fault, which means that the page is not

5:27

loaded in the RAM. So, it means that the

5:29

operating system has to find the

5:32

data that's been swapped to disk, load

5:34

that into memory, and then update the

5:36

page table before the process can

5:39

read that memory. So, you can imagine

5:42

for a major page fault, for example, all

5:44

that time where the operating system is

5:46

retrieving that data from the disk,

5:48

putting it into memory, that is time

5:50

that is perhaps imperceptible to humans,

5:53

but it makes the process sit there

5:56

and wait for that data to be available.

5:58

So, there's a slight lag there for that

5:59

memory access. Whereas a minor page

6:02

fault, which we've already kind of

6:04

talked about, is when the data is in

6:07

RAM, but the page table entry hasn't been

6:09

mapped to its physical location yet. So,

6:13

we have this system where memory is

6:16

behind this abstraction where the

6:18

process thinks it has a contiguous

6:20

block, but the operating system is

6:21

managing it using page tables to map

6:23

those virtual addresses to physical

6:25

locations on the RAM. And every time you

6:29

are doing certain operations like

6:31

reading variables,

6:33

loading instructions, or saving

6:35

data, the operating system and the

6:37

processor and the memory are working

6:39

together, you know, billions of times a

6:41

second or whatever it is to perform

6:43

those actions. If your system had to do

6:46

this translation effort on every single

6:50

memory access, I actually looked into a

6:53

sort of detailed step-by-step of what it

6:56

would actually have to do. And this

6:59

paragraph here about a hardware page

7:01

table walk vaguely goes into that detail

7:05

where the MMU is looking at a register

7:08

(it's a special register, sort of

7:11

like a hard-coded register sort of

7:13

thing) to find the physical address of

7:16

the root page table on the RAM. And then

7:19

(I sort of alluded to this

7:21

calculation that happens with the

7:23

address) it takes some of the bits of

7:27

the address, goes to RAM, reads a page

7:29

table at that index, where

7:33

those bits form the index.
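To make the "bits form an index" idea concrete, here is a small sketch. It assumes the common x86-64 layout with four levels of page tables and 4 KiB pages, where each level is indexed by 9 bits of the virtual address and the low 12 bits are the offset. That layout is an assumption of the sketch, not something the blog specifies, and the function names are made up.

```cpp
#include <cstdint>

// Assumption: x86-64 style 4-level paging with 4 KiB pages.
// Bits 47..39 index the root table, 38..30 the next, 29..21 the next,
// 20..12 the last table, and bits 11..0 are the offset within the page.

// Index into the table at `level` (0 = root table, 3 = last level).
constexpr unsigned table_index(std::uint64_t vaddr, unsigned level) {
    return static_cast<unsigned>((vaddr >> (39 - 9 * level)) & 0x1FF);
}

// Byte offset inside the final 4 KiB page.
constexpr unsigned page_offset(std::uint64_t vaddr) {
    return static_cast<unsigned>(vaddr & 0xFFF);
}
```

For the address `0x8080604005`, the four table indices are 1, 2, 3, and 4, and the offset is 5. Each index costs one read from RAM during the walk.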

7:36

Then it takes the result from that look

7:37

up, goes back to the RAM, finds an entry

7:40

point to an intermediate table, then

7:42

another intermediate table, and then

7:44

another intermediate table, and a bunch

7:46

of other steps until it gets a physical

7:49

page frame number. And then it takes

7:52

that number,

7:54

combined with a few of the remaining

7:56

bits from the original address to find

7:59

the exact byte in memory. So, imagine if

8:01

this had to happen on every memory

8:03

access. I mean, if you

8:06

didn't know anything different, it would

8:07

be slow, but, you know, very

8:11

intelligent people have come up with a

8:12

solution to this, which is the TLB.

8:16

Otherwise known as the translation

8:18

lookaside buffer. It's a cache in the

8:21

MMU. It's in sort of

8:24

the same location as the L1, L2, L3

8:27

cache, but its purpose is more

8:30

specialized. As I've written here, it

8:33

remembers the recent mappings of a

8:35

virtual memory address to a physical

8:37

memory address to save the processor

8:40

from having to go to main memory to

8:42

check the page table and do that whole

8:44

page walk on every single memory access.
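A toy model helps show why caching mappings matters and why the page size changes the picture. The sketch below is deliberately simplified and is not how real hardware is organized: it assumes a direct-mapped cache where each slot remembers one virtual page number, whereas real TLBs are multi-level and set-associative. `ToyTlb` and `sweep_misses` are names made up for this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy direct-mapped TLB: each slot remembers one virtual page number.
class ToyTlb {
public:
    ToyTlb(std::size_t entries, unsigned page_shift)
        : slots_(entries, ~std::uint64_t{0}), page_shift_(page_shift) {}

    // Returns true on a hit. On a miss, records the page in its slot,
    // standing in for the page table walk the hardware would perform.
    bool access(std::uint64_t vaddr) {
        const std::uint64_t vpn = vaddr >> page_shift_;
        std::uint64_t& slot = slots_[vpn % slots_.size()];
        if (slot == vpn) return true;
        slot = vpn;
        ++misses_;
        return false;
    }

    std::size_t misses() const { return misses_; }

private:
    std::vector<std::uint64_t> slots_;  // all-ones marks an empty slot
    unsigned page_shift_;               // 12 for 4 KiB pages, 21 for 2 MiB
    std::size_t misses_ = 0;
};

// Touches one byte every 4 KiB across `bytes`, `passes` times over,
// and returns the number of misses in the toy TLB.
std::size_t sweep_misses(std::size_t entries, unsigned page_shift,
                         std::uint64_t bytes, int passes) {
    ToyTlb tlb(entries, page_shift);
    for (int p = 0; p < passes; ++p) {
        for (std::uint64_t a = 0; a < bytes; a += 4096) {
            tlb.access(a);
        }
    }
    return tlb.misses();
}
```

With 64 entries and 4 KiB pages, sweeping a 1 MiB buffer twice misses on every one of the 512 accesses, because the 256 pages keep evicting each other. With a page shift of 21 (2 MiB pages) the same sweep misses once, since the whole buffer sits in a single page.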

8:47

So, it's caching these mappings. I've

8:49

written here that the consequence of a

8:51

TLB miss is that the memory operation

8:54

can take 100 to 200 times longer. That's

8:57

probably in a bad case or perhaps a

9:00

worst case. And in absolute terms,

9:04

this would be something like 100

9:06

nanoseconds, which is tiny. It's a tiny

9:09

amount of time, but if you consider the

9:11

volume, billions of times per second or

9:13

whatever it happens to be on your

9:15

system, 100 nanoseconds quickly becomes

9:19

a noticeable lag. And that's not to

9:21

mention that if you can swap an

9:24

operation that takes 100 nanoseconds for

9:26

one that takes one nanosecond, that is

9:28

very appealing. So, how do we utilize

9:32

this information and this structure on

9:35

the hardware to lower the latency and

9:38

increase our

9:40

chances to make money in these low

9:43

latency environments? Well, we have this

9:45

structure, the TLB, and it's caching

9:48

mappings. And when we have a cache miss,

9:50

it's not good. So, we want to not have

9:54

cache misses. Now, obviously, this

9:56

hardware component is so low-level that

9:58

there are no other

10:00

alternatives really unless we build

10:03

something else that caches these

10:05

mappings in a different way. So, we're

10:06

sort of constrained to tune and optimize

10:09

the TLB to work better on a given

10:12

system. And the way we can do that is to

10:15

make the cache more effective.

10:17

And the way we can make it more

10:18

effective is to perhaps reduce the

10:21

amount of things it needs to cache. And

10:23

that seems to be the typical strategy

10:25

that people employ in low latency

10:27

environments. It's notable that the TLB

10:30

has a hard limit on the number of

10:33

mappings that it can hold. It is not

10:36

something that can hold tens of thousands or

10:38

millions of mappings. It can typically

10:40

hold somewhere between a few hundred and a few thousand

10:43

mappings. Now, I mentioned earlier that

10:45

a typical page would be 4 kilobytes on a

10:47

Linux system by default. So, if you

10:49

imagine your system with

10:52

32 gigs of memory, you would need 8

10:54

million cache entries to map every page

10:58

in the TLB. So, it just so happens

11:01

that there is a way to increase the page

11:04

sizes. And page sizes can be increased

11:07

to either 2 megabytes or 1 gigabyte on x86-64. I

11:11

think that page sizes can be tuned up to

11:15

a certain number of kilobytes or certain

11:17

architectures may have different page

11:19

sizes. For example, Apple Silicon

11:22

has 16 kilobyte page sizes. But, let's

11:25

just say that those are still too small

11:28

and we want to use huge pages. And

11:30

perhaps we want to use 1 GB page sizes.

11:34

If we're using 1 GB page sizes, then you

11:37

can imagine on that same system with 32

11:39

gigs of memory, you would only need 32

11:42

entries in the TLB. And that quite

11:44

neatly fits into its constraints. So,

11:48

how do we actually enable huge pages?

11:50

Well, there is a sysctl that you can

11:53

set, and I believe that will set it for

11:55

the system, but perhaps you might want

11:57

to set it from your application so that

12:00

only the process that you care about is

12:03

using huge pages, and the other

12:05

processes on that system can just use

12:07

the default 4 KB page size. And I've

12:10

written here that in C++, when you're

12:13

calling mmap, there is a flag

12:18

that you can set to specify the huge

12:22

page size for that

12:26

memory allocation. And similarly in

12:28

Rust, you would use some crates, and I

12:31

believe that they do pretty much the

12:33

same as libc in C++.
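As a rough sketch of the per-allocation approach in C++, the code below asks `mmap` for huge pages using the Linux `MAP_HUGETLB` flag and falls back to regular pages if that fails. It is not the code from the blog. It assumes Linux, and it assumes huge pages have been reserved on the system beforehand (for example through the `vm.nr_hugepages` sysctl). Otherwise the first attempt fails and the fallback is used. `Mapping`, `map_buffer`, and `unmap_buffer` are names made up for this sketch.

```cpp
#include <sys/mman.h>
#include <cstddef>

struct Mapping {
    void* addr;       // nullptr if both attempts failed
    std::size_t len;  // length that was mapped
    bool huge;        // true if the mapping is backed by explicit huge pages
};

// Tries to map `len` bytes backed by huge pages of the system's default huge
// page size, then falls back to regular pages. `len` should be a multiple of
// the huge page size for the first attempt to succeed.
Mapping map_buffer(std::size_t len) {
    const int prot  = PROT_READ | PROT_WRITE;
    const int flags = MAP_PRIVATE | MAP_ANONYMOUS;

    void* p = mmap(nullptr, len, prot, flags | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED) return {p, len, true};

    p = mmap(nullptr, len, prot, flags, -1, 0);
    if (p != MAP_FAILED) return {p, len, false};

    return {nullptr, 0, false};
}

// Releases a mapping returned by map_buffer.
void unmap_buffer(const Mapping& m) {
    if (m.addr != nullptr) munmap(m.addr, m.len);
}
```

Checking the `huge` field after the call tells you whether the fast path was actually taken, which is worth logging at startup. The mmap manual also documents `MAP_HUGE_2MB` and `MAP_HUGE_1GB` for requesting a specific size where the kernel and hardware support it.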

12:37

Now, the other thing that you may want,

12:39

in fact, you most likely want, is to pin

12:42

the memory. The reason for this is

12:45

because you may have enabled huge pages,

12:48

but you can still encounter a page fault

12:51

if you access memory that hasn't been

12:52

mapped yet. The operating system does

12:54

not eagerly map your allocated memory.

12:57

There is another system call, mlockall.

13:00

In the manual for mlockall, it's advised

13:04

that you should call it before a

13:07

critical part of the program, which will

13:09

guarantee that you don't run into page

13:12

faults, thus avoiding the few hundred

13:14

nanosecond latency hit. So, let's say

13:17

you're about to begin a routine or a

13:20

call that needs to be real-time, then

13:24

you can call mlockall. You can also call

13:27

mlock. You can also unlock them

13:30

afterwards. But it should be noted that

13:32

if you unlock the memory, it will unlock

13:35

all locks even from multiple parts of

13:39

the program. Using this system call also

13:42

pins the memory so that the operating

13:43

system cannot swap it to disk, which

13:46

can obviously be very very handy. I've

13:48

got some examples here again in C++ and

13:51

Rust of how you would use mlockall in

13:55

your program. It's much like what I've

13:57

said verbally. You enter part of the

13:59

program, you import libc. In C++

14:04

it's this header file here. Then you

14:06

lock the memory and you can unlock it if

14:08

so desired.
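A minimal sketch of that lock and unlock pattern in C++ might look like the following. It is an illustration rather than the blog's own example, and the wrapper names are made up. It assumes Linux or another POSIX system, and that the process is allowed to lock memory. The call can fail if the `RLIMIT_MEMLOCK` limit is too low, so the return value should be checked.

```cpp
#include <sys/mman.h>
#include <cstddef>

// Locks every page the process has mapped now (MCL_CURRENT) and every page it
// maps later (MCL_FUTURE) into RAM. Returns false if the kernel refuses.
bool pin_all_memory() {
    return mlockall(MCL_CURRENT | MCL_FUTURE) == 0;
}

// Locks only the pages covering [addr, addr + len).
bool pin_range(const void* addr, std::size_t len) {
    return mlock(addr, len) == 0;
}

// Releases every lock the process holds, no matter which part of the
// program requested it.
bool unpin_all_memory() {
    return munlockall() == 0;
}
```

A typical use is to call `pin_all_memory()` once during startup, before the latency-critical loop begins, and to treat a `false` result as a configuration problem rather than silently carrying on.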

14:10

So, in conclusion, the illusion of

14:13

memory that we talked about earlier is a

14:15

double-edged sword. I know that's

14:17

something that's overused. Everything is

14:19

technically a double-edged sword, but in

14:21

this case, virtual memory does solve

14:24

problems around process isolation in

14:27

terms of their memory not overwriting

14:30

some other random process's memory and

14:32

so on. However, because of that

14:34

abstraction, the

14:36

management of page tables was introduced

14:39

and performing multi-level page table

14:42

walks is a costly action. And in most

14:45

cases, it's not noticeable by the

14:47

overwhelming majority of users on any

14:50

given Linux system, but for low latency

14:52

environments and real-time applications,

14:55

it is something that introduces

14:58

non-determinism into the operation,

15:01

which is undesirable.

15:03

So, with this information that I've just

15:05

given you, you can use huge pages and

15:08

memory pinning with the examples that

15:10

I've got here to potentially reduce

15:13

latency, memory latency specifically, by

15:15

up to 100 to 200 times in certain

15:18

scenarios. And that's the figure I

15:21

was talking about earlier,

15:22

reducing a potentially 100 nanosecond

15:25

operation down to 1 nanosecond. So, if

15:27

you're in a low latency environment,

15:30

this is probably something that you've

15:31

already tuned because you know that

15:34

having this jitter introduced by memory

15:37

accesses and page faults is completely

15:40

unacceptable and it can be the

15:42

difference between profit and poverty.

15:44

Your competitors can eat

15:46

your lunch in the few hundred

15:48

nanoseconds that you lost because your

15:51

system, for example, wasn't tuned. So,

15:53

thanks very much for watching and

15:55

listening and see you soon.

Interactive Summary

The video discusses the role of the Translation Lookaside Buffer (TLB) in CPU architecture, specifically in the context of memory management and achieving low latency. The speaker explains the concept of virtual memory, page tables, and how the OS manages memory addresses. It highlights how TLB misses and page faults can cause latency and introduces strategies like using huge pages and memory pinning (mlockall) to optimize performance in sensitive environments.