Profit or Poverty: Translation Look-aside Buffer (TLB)
Hello friends. So, I've got another
technical blog that I've written
relevant to the topic of low-latency
systems. The reason I started
researching the TLB, which stands for
translation lookaside buffer, is because
I spoke to someone who works in the
trading industry and they mentioned the
TLB when they were quizzing me on my
knowledge and I had no idea what it was.
I looked up what it is and it's a part
of the processor. It's actually on the
chip in the processor and its role is to
map virtual memory addresses to
physical memory addresses in RAM. So,
as part of looking into the TLB and how
to set it up so that you avoid latency,
I had to do a brief deep dive into how
memory actually works, why that
structure exists on the CPU chip, how it
works in practice, and a few of the
related things along the way. So, I'm
just going to go through this blog
that I've written, which is going to be
linked in the description as well, and
it starts off talking about the illusion
of memory. And I mean, you can pause and
just read it yourself, but I'll just go
through this and try to compress it even
more while I'm speaking. To start off
with, I've written here that
applications don't get to use physical
memory addresses directly. The
operating system gives them virtual
addresses. Why is that? And I've written
here that it's for the purpose of
pretending that the application has a
contiguous block of memory,
when in reality its memory is scattered
across the RAM in fixed-size chunks
called pages, which are then tracked in
a page table. Now, why does the
operating system do that? Let's take a
little detour here for a moment. Say
you have 1 GB of memory available and a
process starts and requests 500 MB, and
you allocate that memory. Then another
process comes along and wants 300
megabytes, and then a third process
comes along and wants 100 megabytes.
Then the one that needed 300 is killed
and no longer needs that memory. And
then another process comes along and it
wants 400.
Technically, you have 400 available, but
it's not in a contiguous block. So, if
the process were handed a single 400 MB
range starting in the freed gap, it
would run over the top of the third
process's memory. If you think about
the basics of how an array works,
you're doing arithmetic on an address
to find the location of an item in that
array.
But if your memory is split across
fragmented blocks, i.e. it's
not contiguous, then how do you do that
operation? And thus, my gut feel is that
this concept of virtual memory was
devised in order to deal with perhaps
that problem and many other problems.
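A rough sketch of that fragmentation example follows. The `Region` and `FreeSpace` types and the function names are made up for illustration, and 1 GB is treated as 1000 MB so the numbers line up with the 400 MB figure mentioned above.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One region of a toy 1000 MB physical memory, measured in MB.
struct Region {
    std::size_t start_mb;
    std::size_t size_mb;
    bool in_use;
};

struct FreeSpace {
    std::size_t total_mb;         // sum of all free regions
    std::size_t largest_hole_mb;  // biggest single contiguous free region
};

// Report total free space versus the largest contiguous hole.
// Assumes adjacent free regions have already been merged in `layout`.
FreeSpace measure_free(const std::vector<Region>& layout) {
    FreeSpace fs{0, 0};
    for (const Region& r : layout) {
        if (!r.in_use) {
            fs.total_mb += r.size_mb;
            fs.largest_hole_mb = std::max(fs.largest_hole_mb, r.size_mb);
        }
    }
    return fs;
}

// The layout from the example, after the 300 MB process has been killed.
std::vector<Region> example_layout() {
    return {
        {0, 500, true},     // first process
        {500, 300, false},  // hole left by the killed process
        {800, 100, true},   // third process
        {900, 100, false},  // never allocated
    };
}
```

With this layout, 400 MB is free in total but the largest single hole is only 300 MB, so a contiguous 400 MB request cannot be satisfied without some form of indirection such as paging.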
So, as I've detailed in that first
paragraph, the process doesn't
necessarily know that it's not directly
accessing physical memory. The operating
system tells it that it has a contiguous
block, but that contiguous block is
actually broken down into pages, and they
are a fixed size. On Linux, I think the
default is 4 kilobytes. So, when a
process tries to access memory at a
particular address, there is a
calculation that is done to figure out
where that virtual address resides in
the physical RAM. The page table itself
lives in RAM. So, when the processor is
trying to access data at a particular
address, it has to first talk to memory
to figure out the mapping of the virtual
address that is being managed by the
operating system in order to get to the
physical memory, then load that and
perform an operation on it.
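The calculation mentioned above can be sketched with a toy single-level page table. This assumes 4 KiB pages, and the `PageTable` alias and `translate` function are hypothetical names for illustration. Real page tables are multi-level structures managed by the kernel, not a hash map.

```cpp
#include <cstdint>
#include <unordered_map>

constexpr std::uint64_t kPageShift = 12;                 // 4 KiB pages
constexpr std::uint64_t kPageSize = 1ULL << kPageShift;  // 4096 bytes
constexpr std::uint64_t kOffsetMask = kPageSize - 1;     // low 12 bits

// Toy page table: virtual page number -> physical frame number.
using PageTable = std::unordered_map<std::uint64_t, std::uint64_t>;

// Translate a virtual address. Returns false if the page is unmapped,
// which on real hardware would raise a page fault.
bool translate(const PageTable& table, std::uint64_t vaddr, std::uint64_t* paddr) {
    const std::uint64_t vpn = vaddr >> kPageShift;     // which page
    const std::uint64_t offset = vaddr & kOffsetMask;  // which byte in the page
    const auto it = table.find(vpn);
    if (it == table.end()) {
        return false;
    }
    *paddr = (it->second << kPageShift) | offset;
    return true;
}
```

The offset within the page passes through unchanged. Only the page number is swapped for a frame number.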
Anyway, let's just continue on with the
blog. When a process starts, the
operating system allocates a range of
virtual addresses in the form of pages,
but it leaves them without physical
memory behind them until they're
accessed, which means that they are lazily
mapped. What does that effectively mean?
It means that when that process first
tries to read or write one of those
pages, it will encounter a minor page
fault, and that
is slow. So, when the process tries to
access memory, the hardware
follows its virtual memory address to
the physical memory address and hands
back the data from the physical memory
address. The translation of virtual to
physical is not performed by the OS on
each access. It is performed by the MMU,
which is the memory management unit, and
the TLB is a cache that resides in the
MMU. This is an actual
hardware component on the processor
chip, which, if you looked at it in
physical reality, would most likely just
be microscopic electrical
components.
And as most people who would
be interested in this topic probably
already know, sometimes when memory
hasn't been accessed for a while, it is
swapped out and
written to disk. Let's say some data
that is infrequently
accessed is swapped out to disk. Or
maybe the system is running out of
memory, so it swaps pages to disk. So,
when the process tries to access that
memory again, that is a major page
fault, which means that the page is not
loaded in the RAM. So, it means that the
operating system has to find the
data that's been swapped to disk, load
that into memory, and then update the
page table before the process can
read that memory. So, you can imagine
for a major page fault, for example, all
that time where the operating system is
retrieving that data from the disk and
putting it into memory is time
that is perhaps imperceptible to humans,
but it makes the process sit there
and wait for that data to be available.
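One way to observe page faults from inside a process is `getrusage`, which reports minor and major fault counts. This is a minimal sketch, assuming a POSIX system. The `FaultCounts` struct and function names are made up for illustration.

```cpp
#include <cstddef>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

struct FaultCounts {
    long minor;  // page was resolved without disk I/O
    long major;  // page had to be read from disk
};

// Snapshot the fault counters for the current process.
FaultCounts current_faults() {
    struct rusage usage = {};
    getrusage(RUSAGE_SELF, &usage);
    return {usage.ru_minflt, usage.ru_majflt};
}

// Map `pages` fresh anonymous pages, write one byte to each, and return
// how many minor faults that caused. Returns -1 if the mapping fails.
long minor_faults_when_touching(std::size_t pages) {
    const std::size_t page_size = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    const std::size_t bytes = pages * page_size;
    void* mem = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) {
        return -1;
    }
    const FaultCounts before = current_faults();
    volatile unsigned char* p = static_cast<volatile unsigned char*>(mem);
    for (std::size_t i = 0; i < pages; ++i) {
        p[i * page_size] = 1;  // first touch of each page
    }
    const FaultCounts after = current_faults();
    munmap(mem, bytes);
    return after.minor - before.minor;
}
```

On a typical Linux system the result tends to be close to the number of pages touched, since each first touch has to be backed by a physical page, but the exact count depends on the kernel and its configuration.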
So, there's a slight lag there for that
memory access. Whereas a minor page
fault, which we've already kind of
talked about, is when the data is in
RAM, but the page table entry hasn't been
mapped to its physical location yet. So,
we have this system where memory is
behind this abstraction where the
process thinks it has a contiguous
block, but the operating system is
managing it using page tables to map
those virtual addresses to physical
locations on the RAM. And every time you
are doing certain operations like
reading variables,
loading instructions, or saving
data, the operating system, the
processor, and the memory are working
together, you know, billions of times a
second or whatever it is to perform
those actions. If your system had to do
this translation effort on every single
memory access, what would that involve? I
actually looked into a
sort of detailed step-by-step of what it
would have to do. And this
paragraph here about a hardware page
table walk vaguely goes into that detail,
where the MMU looks at a special
register, sort of
like a hard-coded register sort of
thing, to find the physical address of
the root page table in RAM. And then
as I sort of alluded to with this
calculation that happens with the
address, it takes some of the bits of
the address, goes to RAM, and reads a page
table entry at that index. In other words,
those bits form an index.
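A sketch of how the address bits form those indices follows, assuming x86-64 four-level paging with 4 KiB pages: four 9-bit table indices followed by a 12-bit page offset. The architecture is an assumption, since it isn't named here, and `WalkIndices` and `split_virtual_address` are hypothetical names.

```cpp
#include <array>
#include <cstdint>

struct WalkIndices {
    std::array<std::uint16_t, 4> table_index;  // root table first, leaf table last
    std::uint16_t page_offset;                 // byte within the 4 KiB page
};

// Split a 48-bit virtual address into the pieces a four-level walk uses.
WalkIndices split_virtual_address(std::uint64_t vaddr) {
    WalkIndices out{};
    constexpr unsigned shifts[4] = {39, 30, 21, 12};  // bit position of each index
    for (int level = 0; level < 4; ++level) {
        // Each index is 9 bits wide, so each table has 512 entries.
        out.table_index[level] =
            static_cast<std::uint16_t>((vaddr >> shifts[level]) & 0x1FF);
    }
    out.page_offset = static_cast<std::uint16_t>(vaddr & 0xFFF);
    return out;
}
```

Each index selects one entry in one table, and that entry holds the physical address of the next table, so a full walk costs one memory read per level.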
Then it takes the result from that
lookup, goes back to the RAM, finds an entry
pointing to an intermediate table, then
another intermediate table, and then
another intermediate table, and so on
until it gets a physical
page frame number. And then it takes
that number
combined with the remaining
bits from the original address to find
the exact byte in memory. So, imagine if
this had to happen on every memory
access. I mean, if you
didn't know anything different, it would
just be slow, but very
intelligent people have come up with a
solution to this, which is the TLB,
otherwise known as the translation
lookaside buffer. It's a cache in the
MMU. It's in sort of
the same location as the L1, L2, L3
cache, but its purpose is more
specialized. As I've written here, it
remembers the recent mappings of a
virtual memory address to a physical
memory address to save the processor
from having to go to main memory to
check the page table and do that whole
page walk on every single memory access.
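To make the caching idea concrete, here is a toy software model of a TLB: a small fixed-capacity cache of page-number-to-frame mappings with least-recently-used eviction. The LRU policy and the `ToyTlb` class are assumptions for illustration. Real TLBs are hardware structures, usually set-associative, with their own replacement policies.

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>
#include <utility>

class ToyTlb {
public:
    explicit ToyTlb(std::size_t capacity) : capacity_(capacity) {}

    // Returns true on a hit and writes the frame number; false on a miss.
    bool lookup(std::uint64_t vpn, std::uint64_t* frame) {
        auto it = index_.find(vpn);
        if (it == index_.end()) {
            ++misses_;
            return false;
        }
        entries_.splice(entries_.begin(), entries_, it->second);  // mark most recent
        *frame = it->second->second;
        ++hits_;
        return true;
    }

    // Called after a page table walk to remember the mapping.
    void insert(std::uint64_t vpn, std::uint64_t frame) {
        auto it = index_.find(vpn);
        if (it != index_.end()) {
            it->second->second = frame;
            entries_.splice(entries_.begin(), entries_, it->second);
            return;
        }
        if (capacity_ == 0) {
            return;
        }
        if (entries_.size() == capacity_) {
            index_.erase(entries_.back().first);  // evict least recently used
            entries_.pop_back();
        }
        entries_.emplace_front(vpn, frame);
        index_[vpn] = entries_.begin();
    }

    std::size_t hits() const { return hits_; }
    std::size_t misses() const { return misses_; }

private:
    using Entry = std::pair<std::uint64_t, std::uint64_t>;  // (vpn, frame)
    std::size_t capacity_;
    std::size_t hits_ = 0;
    std::size_t misses_ = 0;
    std::list<Entry> entries_;  // front = most recently used
    std::unordered_map<std::uint64_t, std::list<Entry>::iterator> index_;
};
```

A caller would try `lookup` first and only fall back to the full page table walk, followed by `insert`, on a miss. The fewer distinct pages a program touches, the more often the lookup hits.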
So, it's caching these mappings. I've
written here that the consequence of a
TLB miss is that the memory operation
can take 100 to 200 times longer. That's
probably in a bad case or perhaps a
worst case. And in absolute terms,
this would be something like 100
nanoseconds, which is tiny. It's a tiny
amount of time, but if you consider the
volume, billions of times per second or
whatever it happens to be on your
system, 100 nanoseconds quickly becomes
a noticeable lag. And that's not to
mention that if you can swap an
operation that takes 100 nanoseconds for
one that takes one nanosecond, that is
very appealing. So, how do we utilize
this information and this structure on
the hardware to lower the latency and
increase our
chances of making money in these low
latency environments? Well, we have this
structure, the TLB, and it's caching
mappings. And when we have a cache miss,
it's not good. So, we want to not have
cache misses. Now, obviously, this
hardware component is so low-level that
there are no
alternatives really, unless we build
something else that caches these
mappings in a different way. So, we're
sort of constrained to tune and optimize
how the TLB works on a given
system. And the way we can do that is to
improve the cache hit rate.
And the way we can make it more
effective is to perhaps reduce the
number of things it needs to cache. And
that seems to be the typical strategy
that people employ in low latency
environments. It's notable that the TLB
has a hard limit on the number of
mappings that it can hold. It is not
something that can hold
tens of thousands of mappings. Depending
on the CPU, it holds somewhere between a
few hundred and a few thousand
mappings. Now, I mentioned earlier that
a typical page would be 4 kilobytes on a
Linux system by default. So, if you
imagine a system with
32 gigs of memory, you would need about 8
million cache entries to map every page
in the TLB. So, it just so happens
that there is a way to increase the page
size. Page sizes can be increased
to either 2 megabytes or 1 gigabyte on
x86-64. I think that the base page size
can also differ, because certain
architectures have different page
sizes. For example, Apple Silicon
uses 16 kilobyte pages. But, let's
just say that those are still too small
and we want to use huge pages. And
perhaps we want to use 1 GB page sizes.
If we're using 1 GB page sizes, then you
can imagine on that same system with 32
gigs of memory, you would only need 32
entries in the TLB. And that quite
neatly fits into its constraints. So,
how do we actually enable huge pages?
Well, there is a sysctl that you can
set, and I believe that will set it for
the system, but perhaps you might want
to set it from your application so that
only the process that you care about is
using huge pages, and the other
processes on that system can just use
the default 4 KB page size. And I've
written here that in C++, when you're
calling mmap, which is part of
libc, there are flags that you can set to
specify the huge
page size for that
memory allocation. And similarly in
Rust, you would use some crates, and I
believe that they do pretty much the
same as libc in C++.
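A minimal sketch of both ideas follows: the entry-count arithmetic from earlier, and an `mmap` call that asks for huge pages. It assumes Linux, where the `MAP_HUGETLB` flag requests the system's default huge page size (commonly 2 MB on x86-64) and the huge page pool is usually sized beforehand with the `vm.nr_hugepages` sysctl. The helper names are made up for illustration, and the fallback to regular pages is a design choice of this sketch, not something taken from the blog's own examples.

```cpp
#include <cstddef>
#include <cstdint>
#include <sys/mman.h>

constexpr std::uint64_t kKiB = 1024;
constexpr std::uint64_t kMiB = 1024 * kKiB;
constexpr std::uint64_t kGiB = 1024 * kMiB;

// Number of page mappings needed to cover `memory_bytes`.
constexpr std::uint64_t pages_needed(std::uint64_t memory_bytes,
                                     std::uint64_t page_bytes) {
    return (memory_bytes + page_bytes - 1) / page_bytes;
}

static_assert(pages_needed(32 * kGiB, 4 * kKiB) == 8388608, "about 8 million");
static_assert(pages_needed(32 * kGiB, 1 * kGiB) == 32, "only 32 entries");

struct Mapping {
    void* addr;       // nullptr if both attempts failed
    std::size_t len;
    bool huge;        // true if the huge page request succeeded
};

// Try to map `len` bytes backed by huge pages, falling back to regular
// pages. `len` should be a multiple of the huge page size.
Mapping map_prefer_huge(std::size_t len) {
    const int prot = PROT_READ | PROT_WRITE;
    const int flags = MAP_PRIVATE | MAP_ANONYMOUS;
#ifdef MAP_HUGETLB
    void* p = mmap(nullptr, len, prot, flags | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED) {
        return {p, len, true};
    }
#endif
    void* q = mmap(nullptr, len, prot, flags, -1, 0);
    if (q == MAP_FAILED) {
        return {nullptr, 0, false};
    }
    return {q, len, false};
}

void unmap(const Mapping& m) {
    if (m.addr != nullptr) {
        munmap(m.addr, m.len);
    }
}
```

If no huge pages have been reserved on the system, the first `mmap` call is expected to fail and the sketch quietly uses regular pages. A latency-sensitive program would more likely treat that as a startup error.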
Now, the other thing that you may want,
in fact, you most likely want, is to pin
the memory. The reason for this is
because you may have enabled huge pages,
but you can still encounter a page fault
if you access memory that hasn't been
mapped yet. The operating system does
not eagerly map your allocated memory.
There is another system call, mlockall.
In the manual for mlockall, it's advised
that you should call it before a
critical part of the program, which will
guarantee that you don't run into page
faults, thus avoiding the few hundred
nanosecond latency hit. So, let's say
you're about to begin a routine or a
call that needs to be real-time, then
you can call mlockall. You can also call
mlock on a specific range. You can also
unlock the memory
afterwards. But it should be noted that
if you unlock the memory, it will release
all locks, even ones taken in other parts of
the program, because locks don't stack.
Using mlockall also
pins the memory so that the operating
system cannot swap it to disk, which
can obviously be very handy. I've
got some examples here again in C++ and
Rust of how you would use mlockall in
your program. It's much like what I've
said verbally. You enter part of the
program, you import libc. In C++
it's this header file here. Then you
lock the memory and you can unlock it if
so desired.
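The blog's own examples are not reproduced in this transcript, so the following is only a sketch of what such a call could look like in C++, assuming a POSIX system with `<sys/mman.h>`. The wrapper names are hypothetical. `MCL_CURRENT | MCL_FUTURE` asks for both already-mapped pages and future mappings to be locked.

```cpp
#include <cerrno>
#include <sys/mman.h>

// Lock all current and future pages of this process into RAM.
// Returns 0 on success, or the errno value reported by mlockall.
int lock_all_memory() {
    if (mlockall(MCL_CURRENT | MCL_FUTURE) == 0) {
        return 0;
    }
    return errno;
}

// Release every memory lock held by this process.
// Returns 0 on success, or the errno value reported by munlockall.
int unlock_all_memory() {
    if (munlockall() == 0) {
        return 0;
    }
    return errno;
}
```

The lock call can fail, for example when the process's locked-memory limit (`RLIMIT_MEMLOCK`) is too low or it lacks the needed privilege, so the return value is worth checking before entering the time-critical section.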
So, in conclusion, the illusion of
memory that we talked about earlier is a
double-edged sword. I know that's
something that's overused. Everything is
technically a double-edged sword, but in
this case, virtual memory does solve
problems around process isolation in
terms of one process's memory not overwriting
some other random process's memory and
so on. However, because of that
abstraction, the
management of page tables was introduced
and performing multi-level page table
walks is a costly action. And in most
cases, it's not noticeable by the
overwhelming majority of users on any
given Linux system, but for low latency
environments and real-time applications,
it is something that introduces
non-determinism into the operation,
which is undesirable.
So, with this information that I've just
given you, you can use huge pages and
memory pinning with the examples that
I've got here to potentially reduce
latency, memory latency specifically, by
up to 100 to 200 times in certain
scenarios. And that's the figure I
was talking about earlier, where you're
reducing a potentially 100 nanosecond
operation down to 1 nanosecond. So, if
you're in a low latency environment,
this is probably something that you've
already tuned because you know that
having this jitter introduced by memory
accesses and page faults is completely
unacceptable and it can be the
difference between profit and poverty.
Your competitors can eat
your lunch in the few hundred
nanoseconds that you lost because your
system, for example, wasn't tuned. So,
thanks very much for watching and
listening and see you soon.
The video discusses the role of the Translation Lookaside Buffer (TLB) in CPU architecture, specifically in the context of memory management and achieving low latency. The speaker explains the concept of virtual memory, page tables, and how the OS manages memory addresses. It highlights how TLB misses and page faults can cause latency and introduces strategies like using huge pages and memory pinning (mlockall) to optimize performance in sensitive environments.