The Cost of Concurrency Coordination with Jon Gjengset

Transcript

0:01

(audience applauding)

0:04

Thank you.

0:05

Hi, my name is Jon, as you heard,

0:08

that's who I am, described in a few beautiful words.

0:11

So I'm here to like dig into some pretty nerdy, deep stuff,

0:15

and I'm hoping that you're all here for it.

0:17

This is not gonna be Rust specific,

0:19

even though a lot of the stuff that I do is Rust.

0:21

But it happens to intersect quite well with people

0:24

who care about Rust, tend to care about things

0:26

like concurrency, and parallelism,

0:27

and speed, and data systems, and algorithms.

0:29

And I have a feeling that much of this audience

0:31

cares about similar things,

0:33

even if it's not because of the Rust programming language.

0:35

So I'm hoping this will apply more broadly.

0:38

You'll see a couple of slides that have tiny bits

0:39

of Rust code on them, but I think you'll be able

0:41

to recognize and map it sort of mentally

0:43

to whatever is your language of preference.

0:46

This talk is titled "Are Mutexes Slow?".

0:49

And there's a, I think it's Betteridge's law of headlines

0:52

that says you can answer any such question with no.

0:54

And of course, the answer to this is also no,

0:57

perhaps unsurprisingly, but the reasons why are interesting

1:00

because we have all observed,

1:02

or at least most of us have, that a mutex was slow,

1:05

that a program was slow because it was stuck taking a mutex

1:08

or mutexes were taking forever to acquire.

1:10

So how can it be that I claim that they're not slow

1:13

if I've observed that they are slow?

1:16

And in fact, you know, if you dig into and ask people

1:20

what are mutexes for,

1:21

you'll hear things like mutexes are slow.

1:23

You'll hear that reader-writer locks

1:25

are the thing that you should use instead.

1:26

If you have read-heavy workloads,

1:28

just use a reader-writer lock,

1:29

and then you can have as many readers as you want,

1:31

and the universe is happy, and everything is fast.

1:33

You'll also hear things like lock-free data structures

1:36

are the best, they're the fastest, right?

1:38

And you might wonder, well, what does that mean?

1:41

What is a lock-free data structure?

1:43

But also what is a lock?

1:44

Like a lock is a mutex, okay, but what is a mutex,

1:47

like really what's going on at the CPU level?

1:49

And that's the kinda stuff we're gonna dig into today

1:51

to try to give you some intuition

1:53

for if you take a given concurrent algorithm,

1:56

how is it going to perform?

1:57

Why is it going to be slow, if it is indeed slow?

2:00

And what does slow even mean?

2:03

And the reality sort of to give a preview,

2:07

the reality is it's complicated, it's nuanced.

2:10

There's like stuff further down

2:12

that influences why these are slow.

2:14

The test we're gonna use here

2:15

is actually a very straightforward one.

2:17

So we're gonna have a counter,

2:18

and we're gonna read that counter.

2:21

Nothing is going to update the counter.

2:22

We're just going to read it.

2:24

And we're gonna try to read it from many, many threads.

2:28

This is in Rust code, but the basic idea is,

2:30

you have some counter, it's shared.

2:32

In this case, it's behind an atomic reference count (an Arc),

2:34

but it's a way that you can have many threads

2:36

that point to the same one, the same counter.

2:38

And then in the loop

2:39

we're going to take a lock on that counter

2:43

and then we're going to read that counter.

2:45

That's the *guard here.

2:46

The reason I have black box here is to make sure

2:49

that the compiler will actually do the read.

2:51

'Cause otherwise, it would just optimize this code

2:53

to this loop is a loop that does nothing.

2:55

So I don't even take the lock,

2:56

and then, you know, I'm not measuring anything interesting.

2:59

And then I'm just gonna run lots of these in parallel,

3:03

and, like, there are no writes.

3:06

So surely this should be fast.

3:09

And let's try and see what happens

3:11

if we try to use a mutex here.

3:13

So there's gonna be a mutex over a counter

3:15

where we just read.

3:16

This first graph, very easy.

3:18

It is a graph that starts very high

3:21

when there is one thread.

3:23

It gets about 250 million operations per second,

3:26

250 million reads per second.

3:28

Which if you have like a 2.5 gigahertz CPU,

3:31

that means you take about 10 instructions

3:33

to take the lock and do the read.

3:35

That's pretty good.

3:36

When you have two cores, it drops to about a 10th of that,

3:40

drops to about 25 million

3:42

instead of 250 million operations per second.

3:45

So the graph looks like this.

3:48

And then once you add more threads after that

3:50

it looks almost flat, it goes down a little bit more,

3:53

but then it just stays flat, very, very low.

3:57

And so this might make you conclude,

3:58

well, okay, mutexes are slow.

4:01

This is a code that just does reads.

4:03

What is the mutex doing?

4:05

Why is it so slow just because I added more threads?

4:08

What you might assume is that a mutex,

4:10

which is supposed to be mutual exclusion

4:12

only allows one thread to make progress at a time.

4:15

So you would not expect it to get better with more threads,

4:18

but you kind of expect it to stay the same, right?

4:20

The mutex should allow one operation to happen,

4:23

be happening at all times.

4:25

That is not what the graph looks like.

4:29

And instead, more threads give you worse performance.

4:32

Now, what does this look like for reader-writer locks?

4:36

Well, here there's another graph,

4:37

but this one's also very easy to explain,

4:39

because in this graph, there are two lines

4:41

that sit on top of each other.

4:43

The reader-writer lock,

4:44

when you again only take the read lock

4:47

and all you do is read, as you add more threads,

4:50

the performance starts out really good,

4:53

about the same as a mutex and then gets about 10x worse,

4:57

it ends up doing slightly better than the mutex.

5:00

But it also, as you'll see on the graph,

5:03

the reader-writer lock actually gets worse over time.

5:07

So as you add more readers,

5:09

the reader-writer lock ends up being worse than a mutex

5:13

as you keep going.

5:14

The mutex sort of drops and then stays flat.

5:18

And what I'd like to sort of probe you for here is

5:21

to see if you have intuition for why that might be.

5:23

And my hope or the intention of this talk

5:26

is that the answer is I don't really know why that would be.

5:28

That's weird.

5:29

And that I can hopefully try to tell you

5:31

why it behaves that way.

5:33

And why that's actually kind of what we would expect.

5:36

So clearly, there's something else going on here.

5:38

And so we need to talk about CPU caches

5:40

in order to understand the sort of underlying problem here.

5:44

So in this graph you'll see, as I said,

5:46

it starts out really high, and then it gets worse,

5:48

and then it stays about flat.

5:50

Then we will see a graph that shows both the mutex

5:54

and the reader-writer lock.

5:56

The blue line here is the reader-writer lock.

5:58

And again, as you see,

5:59

like it starts slightly worse than the mutex,

6:02

but it doesn't drop quite as much.

6:04

But over time, as you add more readers,

6:06

the reader-writer lock gets worse

6:08

than the mutex gets worse, the mutex sort of stabilizes.

6:12

And so this is where we need to talk about CPU caches,

6:15

and indeed CPU caches are these things

6:18

that CPUs have built into them to try to make it cheaper

6:21

to access the stuff that's stored in RAM.

6:23

Because the stuff that's stored in RAM

6:25

is really far away from the CPU.

6:27

It's far away in the sense that like the copper wire is far.

6:31

And so it can take you like hundreds

6:34

of instructions to reach main memory.

6:36

And hundreds of instructions are an eternity for a CPU.

6:39

And so you want to keep some memory much closer to the CPU,

6:43

but closer to the CPU, you also have less space.

6:46

So you don't get to put as much memory there.

6:48

You have to be very diligent

6:49

about your actual like physical space usage.

6:53

And so CPUs actually have this hierarchy of caches.

6:56

So you'll see this depends a little bit on the CPU.

6:59

So different CPUs have different architectures for this,

7:02

but it's very common that there are three levels of cache.

7:05

L1 cache is very close to the CPU,

7:08

it's like soldered onto the CPU.

7:11

L3 cache on the other hand, is pretty close to RAM.

7:14

It's not as far as RAM, but it is closer to RAM.

7:18

The L3 cache is usually shared

7:20

amongst all of the cores of your CPU.

7:22

The L1 cache is not shared.

7:25

The L2 kind of depends on the CPU,

7:27

it's just somewhere in the middle between L1 and L3.

7:30

Its latency is somewhere between L1 and L3,

7:33

and these also increase in size as you go up.

7:35

So the L1 cache is very small,

7:37

the L2 cache is a little larger,

7:39

and the L3 cache is decently large.

7:41

And then RAM, of course, is humongous as we all know.

7:45

Now, what happens if one CPU has some bit of memory

7:50

like the counter we were talking about in its local cache

7:55

and then some other CPU core wants to read that memory,

7:58

what happens?

7:59

Well, there needs to be

8:00

some kind of cache coherence protocol

8:03

to negotiate between them who's allowed to write it when

8:06

and to make sure that the right value

8:08

makes it back to RAM eventually.

8:11

And that protocol is called MESI, M-E-S-I.

8:15

There are a bunch of variants of it.

8:17

So there's also MSI and there's MOESI,

8:21

and there's like a bunch of different

8:24

various combinations of letters protocols,

8:26

but most of them are MESI-like.

8:30

MESI stands for modified, exclusive, shared, and invalid.

8:34

And those are the four states

8:36

that something can have in the cache.

8:38

The cache keeps an entry for,

8:41

you can think of it logically as like every piece of memory,

8:44

like every address in RAM.

8:46

The reality is it's not actually every address in RAM,

8:48

it's every chunk of RAM.

8:50

So the CPU has this notion of cache lines.

8:54

So it takes all of your main memory,

8:56

it divides it into 64-byte chunks,

9:00

and those 64-byte chunks are a cache line.

9:03

And so you took RAM, and you just like write it out

9:06

like the first 64 bytes is one cache line,

9:08

the next 64 bytes is one cache line, and so on.
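
The chunking just described can be sketched in a few lines; `line_of` is an illustrative helper, not a real API, and assumes the common 64-byte line size.

```rust
/// The CPU tracks memory in 64-byte cache lines: two byte addresses sit on
/// the same line exactly when they agree in all but the low 6 bits (64 = 2^6).
const CACHE_LINE: usize = 64;

/// Which cache line a byte address falls on.
fn line_of(addr: usize) -> usize {
    addr / CACHE_LINE // equivalently, addr >> 6
}

fn main() {
    // Bytes 0..=63 are one line, 64..=127 the next, and so on.
    assert_eq!(line_of(0), line_of(63));
    assert_ne!(line_of(63), line_of(64));
    println!("byte 200 is on cache line {}", line_of(200)); // 200 / 64 = 3
}
```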

9:11

And the cache keeps track of memory

9:14

on the sort of boundaries of cache lines,

9:17

and keeps track of which CPU cache has exclusive access

9:21

to which cache lines or which CPU cores share access

9:25

to a given cache line.

9:28

And this is where we get into some really interesting stuff.

9:32

So when the state is modified,

9:34

it means that I have the only copy.

9:37

So if it's in my cache as a CPU core,

9:39

and it's marked as modified, I have the only copy,

9:43

and it is dirty, as in it differs from what's in RAM.

9:47

So that means if anyone else were to read it,

9:49

not only would I need to give up my permissions,

9:52

but I would also need to write it back to a place

9:53

that they could read it from.

9:55

Exclusive means that I have the only copy

9:58

but it's not dirty.

9:59

So if someone else wants it,

10:00

they can just take it without having to talk to me first

10:03

or they would need to make sure that I move mine to invalid.

10:06

But they don't need any data from me,

10:09

because that data hasn't changed.

10:11

Then we have shared, which is multiple caches

10:13

have entries for these and those cache entries

10:15

are all the same.

10:16

They all have the same value,

10:17

otherwise shared would not be a legal state.

10:20

And then we have invalid, which is,

10:21

I don't have a copy of this cache line.

10:24

This could be that, like, it's still in my cache

10:27

in the sense that I had it previously,

10:28

but I'm not allowed to use that cache line entry anymore.

10:32

Because it's been given away

10:33

to someone else to be exclusive, for example.

10:36

If you, for example, are a core,

10:38

and you have a particular cache line in the shared state,

10:41

and you want to write to it,

10:43

the protocol requires that you first make sure

10:46

that no one else has it in the shared state,

10:49

so that only you do.

10:50

And then that everyone realizes that you have now moved it

10:53

to the exclusive state.

10:54

And at that point, you're allowed to read and write it

10:56

as you wish.

10:57

You don't have to tell anyone

10:58

that you're now modifying it,

11:00

even if you did multiple modifications.

11:02

And then, eventually, if someone else

11:04

wants to modify or read that cache line,

11:06

they have to tell you to convert it back into shared

11:09

so that they can then get a copy

11:10

that they can read from but cannot write to.
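
The transitions just walked through can be sketched as a toy state machine. This is illustrative only: a real protocol has more transitions, and a Modified entry also forces a writeback of the dirty data before it is dropped.

```rust
/// The four MESI states a cache line can be in within one core's cache.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Mesi {
    Modified,  // only copy anywhere, and dirty (differs from RAM)
    Exclusive, // only copy anywhere, but clean (same as RAM)
    Shared,    // other caches may hold identical, read-only copies
    Invalid,   // no usable copy in this cache
}

/// What happens to *this* core's entry when another core asks for write
/// permission on the line: whatever we held must be given up.
fn on_remote_write(_local: Mesi) -> Mesi {
    Mesi::Invalid
}

/// To write, a core must first invalidate everyone else's copies (and, if
/// starting from Invalid, fetch the line); its own entry becomes Modified.
fn local_write(_local: Mesi) -> Mesi {
    Mesi::Modified
}

fn main() {
    assert_eq!(on_remote_write(Mesi::Shared), Mesi::Invalid);
    assert_eq!(local_write(Mesi::Shared), Mesi::Modified);
    println!("writes to shared data always require invalidating other cores");
}
```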

11:15

And the interesting observation here

11:18

is that writes to shared data

11:22

requires coordination between cores.

11:25

If I have a shared version of a cache line

11:29

and I want to modify it,

11:30

I must talk to basically all the other cores,

11:34

because I need to make sure they know not to modify it,

11:37

and they know not to read it either.

11:40

And this is where some of that cost comes from.

11:41

There's a bunch of cross-core communication.

11:45

And in fact, if we look at a reader-writer lock,

11:48

then a reader-writer lock, the implementation

11:51

of taking the read lock is actually

11:54

that you increment the reader count by one.

11:57

So there's a counter inside of the lock,

11:59

not the counter that we're, you know, benchmarking here,

12:02

but inside of the lock there's a counter

12:04

that keeps track of how many reader threads are there.

12:07

And that counter, in order to take the read lock

12:10

you have to write to.

12:12

And so if you have a hundred threads

12:14

that all wanna take the read lock,

12:15

they're all doing a write on a shared field.

12:19

This means that every core needs to, at some point,

12:23

get that one in exclusive mode, update it,

12:26

and then make it shared again.

12:27

And then the next core that wants to read the lock

12:29

needs to get an exclusive mode, modify it,

12:31

and then share it again.

12:32

And so on through all one hundred cores.

12:34

And so in a way, you can think of this

12:36

as there's a sequential part of every reader-writer lock.

12:41

But after you've done this,

12:43

now you don't have to talk to the other cores,

12:45

but you do have to do it for the setup.
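
The point that acquiring a read lock is itself a write can be shown with a toy reader count. This sketches only the hot path of the shared counter; a real reader-writer lock also coordinates with writers, blocks threads, and so on, and the names here are illustrative.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Toy reader count: both acquire and release are *writes* (fetch_add /
/// fetch_sub) to one field shared by every reader thread.
struct ToyReadCount {
    readers: AtomicUsize,
}

impl ToyReadCount {
    fn new() -> Self {
        ToyReadCount { readers: AtomicUsize::new(0) }
    }

    /// "Take the read lock": the core must get this cache line in the
    /// exclusive state just to bump the count, even though the caller
    /// only intends to read the protected data.
    fn read_acquire(&self) -> usize {
        self.readers.fetch_add(1, Ordering::Acquire) + 1
    }

    /// "Release the read lock": a second write to the same shared line.
    fn read_release(&self) -> usize {
        self.readers.fetch_sub(1, Ordering::Release) - 1
    }
}

fn main() {
    let c = ToyReadCount::new();
    c.read_acquire();
    c.read_acquire();
    c.read_release();
    println!("readers now: {}", c.readers.load(Ordering::Relaxed));
}
```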

12:48

So let's walk through what that actually looks like.

12:51

Let's look at if we have two readers

12:53

that are both trying to acquire a read lock.

12:55

So here we have Core 0 and Core 1,

12:58

the S here is the cache line state, for the cache line

13:01

that they're operating on, whatever that might be.

13:03

And then the note is just to explain what's happening.

13:05

So initially, let's say that Core 0 has exclusive access

13:09

to this cache line and Core 1 does not have a valid entry.

13:13

And then Core 0 wants to do the fetch add, right,

13:16

because it wants to acquire the read lock.

13:17

Well, that's easy, it has exclusive access,

13:20

so it just marks it now as modified,

13:21

and then updates the one, no problem.

13:23

There's no cross-core traffic.

13:25

Now, Core 1 wants to acquire the lock.

13:29

Well, it observes that this entry in the cache

13:34

is already modified by Core 0.

13:37

And so there needs to be now a writeback

13:40

from Core 0 to Core 1's cache

13:42

to inform it of the updated value,

13:44

'cause the value got updated at step one.

13:46

So that edit needs to be communicated to Core 1.

13:50

When that's happened, there's sort of a,

13:52

the protocol has some various kinds of optimizations

13:54

including things like you don't need to go

13:56

via the shared state.

13:57

So you can just give the data over

13:59

and say, I'm sort of giving you ownership

14:01

alongside the data.

14:03

And so at this point now, Core 0 has invalid

14:06

and Core 1 has modified, because it got to do its fetch add.

14:11

And then Core 0 wants to release its read lock.

14:14

Well, the whole process happens again.

14:16

So now, I need Core 1 to write back to Core 0

14:20

so that Core 0 can get it in exclusive state,

14:23

and then decrement the reader count.

14:26

And so what we see is if you have one instance

14:30

of like I want to acquire the lock and release the lock,

14:33

you actually require two cache line transfers,

14:37

one to get it so you can do the add

14:41

and one later, so you can get it to do the release,

14:44

the subtraction.

14:47

And each of these transitions costs about 30 nanoseconds.

14:51

So there's sort of a,

14:53

this cache line ping-pong ends up being

14:56

a really expensive operation

14:57

because talking to different parts of memory

15:00

costs different amounts of time,

15:01

because there are different distances from the cores.

15:04

You'll see here that the,

15:06

like if you want to read something

15:08

or access something in L1 cache, it's one nanosecond,

15:11

it's like no time at all.

15:13

But if you have to go to cross-core communication,

15:16

this is what you need to do

15:18

in order to do that cache line transfer,

15:20

that takes about 30 nanoseconds.

15:23

And now imagine that I have to do two of those,

15:24

one's to acquire and one's to release.

15:27

So that's 60 nanoseconds in the good case.

15:30

Accessing main memory is a hundred nanoseconds,

15:32

that's the thing that we think takes forever for CPUs,

15:35

and we're already over halfway there

15:38

just by acquiring the read lock alone.

15:41

And the other thing that I think is really interesting here

15:45

is that for mutexes, it's not quite as bad.

15:50

For mutexes, when you acquire the lock, you hold the lock,

15:54

and you know that no one else

15:55

is going to try to increment that counter from you

15:58

because well, they can't acquire the lock.

16:01

So the problem in a way is reader locks,

16:05

because if you have an exclusive lock when you release,

16:07

you know that you're still the owner of it.

16:09

Someone else might have gotten a shared copy

16:11

but they will not have an exclusive copy,

16:13

they can't have anything to write back,

16:14

'cause all they can have done is read.

16:18

And so reader locks almost scale worse than mutexes here.

16:22

And this is one of the reasons

16:23

why that reader lock thing gets worse

16:25

is because as you get more readers

16:27

they contend more and more with each other.

16:29

Whereas if you have many mutexes,

16:31

they don't really contend with each other

16:33

because they're fine with going one at a time in a sequence

16:37

and not causing all this cache line ping-pong.

16:39

The reason why people don't think about this problem

16:42

that much is because, usually, when you interact with locks

16:46

you have a bunch of code in there.

16:47

Like after you take the lock, you run a bunch of code,

16:50

and then you release the lock.

16:51

And so the cost of acquiring the lock

16:54

is not really something that matters that much to you.

16:57

But this is not true if you have short, critical sections

17:00

like short amounts of code between acquiring the lock

17:03

and releasing it because if that's the case,

17:06

you're going to find that the time it took

17:09

for the cache miss is more than the time it took

17:12

for you to run your code.

17:15

And so in general, you'll find that the overhead

17:17

of using a lock goes up the more you need that lock

17:21

to be fast, sort of the inverse proportion

17:24

of what you want here.

17:25

For longer things, mutexes are totally fine,

17:27

use 'em as much as you wish.

17:29

It's primarily when you hit those hot, critical path loops

17:32

where you also know you have contention

17:34

where you need a mutex because there's contention

17:37

that a mutex doesn't really work very well.

17:41

And this is also why real-world code

17:42

tends to not worry too much about these problems

17:44

and it's only when you start writing pretty specialized code

17:47

in those deep, hot loops that you need to start knowing

17:49

about things like MESI.

17:53

And so I want to now take a sort of side step to,

17:56

well, what can you do instead?

17:57

So if you can't use mutexes,

17:59

and you can't use reader-writer locks, what can you use?

18:02

And I'm not gonna propose that this

18:04

is the correct algorithm, as I'll get to later,

18:07

this is a bunch of trade-offs too.

18:08

I just wanna sort of inject in your mind

18:10

other kinds of alternatives to sort of let you think about

18:13

what else could you do

18:14

that doesn't necessarily have this problem

18:16

but might have other problems.

18:18

So I wanted to talk to you

18:19

about the left-right data structure.

18:21

The left-right data structure is something

18:23

that I used in my PhD research

18:25

where I had enormous amounts of readers and very few writes.

18:30

So I basically hit this problem.

18:32

And the thing that I did when I had the read lock

18:35

was a hash map lookup and then nothing else.

18:38

So the read was very, very short,

18:40

and so the overhead of the reader lock was just killing me,

18:43

and so I needed something else.

18:47

So what's interesting here

18:49

is it was trying to solve this problem

18:51

of why do all the readers need to access

18:54

a shared cache line.

18:56

It feels like if the readers are supposed to be independent,

18:59

they need to have their own cache lines, not a shared one.

19:02

So how can you design a concurrency primitive

19:05

that is still safe for having there be many reads

19:08

and a writer but is safe in such a way

19:11

that like the readers are cheaper

19:12

and the writer has to do more work.

19:15

Because my writes happened relatively infrequently.

19:18

So I was willing to pile more work onto there

19:20

if the readers could be highly parallel.

19:23

So the way left-right works is that you keep two copies

19:26

of your data, you keep a left copy and a right copy.

19:30

You have a pointer in the center,

19:33

sort of an atomic pointer that every reader

19:35

and the writer has access to.

19:36

And that pointer points to the side

19:39

that the readers should read from,

19:42

it points to the copy that the reader should read from.

19:44

The writer will always be modifying the one

19:46

that's not pointed to by the pointer.

19:49

And then when the writer has made a bunch of modifications

19:53

to the writer-owned right copy here,

19:56

then it switches the atomic pointer over

19:59

to now point to the other side.

20:02

When it points to the other side,

20:04

then now, all of the readers will start reading

20:06

the right-hand side copy and the writer can eventually

20:11

start modifying the left-hand side copy.
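
The two-copies-plus-pointer shape just described can be sketched like this. It is a simplified model, not the real left-right crate's API: the real implementation also keeps an operation log so the writer can replay its writes onto the other copy, and mutating the off side needs interior mutability, both omitted here.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Sketch of the left-right core: two copies of the data and an atomic
/// index telling readers which copy to use.
struct LeftRight<T> {
    copies: [T; 2],
    read_idx: AtomicUsize, // 0 = left copy, 1 = right copy
}

impl<T> LeftRight<T> {
    fn new(left: T, right: T) -> Self {
        LeftRight { copies: [left, right], read_idx: AtomicUsize::new(0) }
    }

    /// Readers follow the pointer to whichever side it currently names.
    fn read_copy(&self) -> &T {
        &self.copies[self.read_idx.load(Ordering::Acquire)]
    }

    /// The writer flips which side readers see; after waiting for old
    /// readers to drain (not shown), it may modify the other side.
    fn swap_sides(&self) {
        self.read_idx.fetch_xor(1, Ordering::AcqRel);
    }
}

fn main() {
    let lr = LeftRight::new("left map", "right map");
    println!("readers see: {}", lr.read_copy());
    lr.swap_sides();
    println!("readers now see: {}", lr.read_copy());
}
```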

20:14

And the intent here was the readers

20:17

might not need to coordinate at all

20:19

because the writer doesn't really care

20:21

what the readers are doing with the exception

20:24

of it needs to know when it's safe

20:26

for it to start modifying the copy it's switched away from.

20:30

So imagine you have a bunch of read threads

20:32

that are reading the left copy

20:34

sort of before we swap the pointer,

20:35

and then the writer swaps the pointer.

20:37

Those read threads are still in the left copy,

20:40

and so something

20:43

has to tell the writer,

20:46

all the readers have read the updated pointer value now,

20:49

they're in the right copy, and so it's fine for you

20:51

to just now start modifying the left copy.

20:54

So some mechanism is needed here,

20:55

some synchronization primitive.

20:58

And it turns out that you can do this

21:00

by giving every reader their own cache line.

21:04

They have their own counter that basically keeps track

21:06

of how many times they've read the pointer.

21:09

And then the writer's job is going to be look at the counter

21:13

for every reader.

21:14

It has a giant foreloop where it loops over

21:17

all of the readers that exist,

21:19

reads all of their individual counters,

21:22

and observes that all of them have changed

21:25

by at least one since when the pointer was swapped.

21:30

The reason this works is because when you switch the,

21:32

so you switch the pointer,

21:34

then you read all of the counters,

21:37

and then when you read them again,

21:39

if they've changed by at least one,

21:41

it means that that reader has read the updated pointer,

21:44

it's now doing a second operation.

21:46

And if it's read the pointer,

21:47

it has read the changed pointer,

21:49

and therefore it's no longer in the original map.

21:52

The interesting thing about this design

21:54

is the readers do not share states with each other.

21:59

The readers just keep their own counter

22:01

for how many times they have read the pointer.

22:03

They do not need to talk to each other.

22:05

In other words, if we think back to the MESI protocol,

22:07

this means that the core

22:09

that's running any given reader thread

22:11

can just keep that cache line in the exclusive state,

22:14

never has to be shared except for when a writer

22:17

wants to sort of do the pointer switch.

22:20

Because then the writer thread,

22:22

the core that runs the writer

22:24

has to read the counter from every reader.

22:27

And then you know, turn it into the shared state,

22:30

read the value, and then it goes back

22:32

to the reader again to update.

22:34

But only when the writer wants to contend with the readers

22:37

does anything interfere with what the reader wants to do.

22:40

Otherwise, the readers are completely independent.

22:44

And readers become even better than that.

22:47

They now don't have to wait for anything,

22:50

they don't have to wait at all,

22:51

even if there is a concurrent write,

22:53

they can just keep reading their copy.

22:55

So the reads end up being lock-free.

22:58

The reads do not pause for anything.

23:00

In fact, they are wait-free.

23:02

They do not wait at all.

23:04

They don't take any locks, they don't have any loops.

23:07

They just read the pointer, bump their counter,

23:10

and then access the data, and that's all they need to do.

23:13

And the net result of this is you get a graph

23:15

that looks like this maybe, yeah.

23:18

And so this graph is linear, it goes up and to the right,

23:22

and exactly the way you would hope

23:24

when you increase the number of threads.

23:26

And again, we now have the intuition to understand why,

23:30

it is because the reader threads,

23:32

the cores that run the reader threads

23:34

do not have to coordinate,

23:35

and therefore, it can just run fully in parallel.

23:41

And, yeah.

23:42

If one of the threads-

23:44

Here, take this.

23:46

If one of the threads is dormant

23:48

for some reason or maybe doing a lot of work,

23:50

then they just still hold the old reference

23:53

and don't increment it, is that an issue?

23:54

So the algorithm is slightly more complicated

23:58

than I told you.

24:00

So the algorithm is actually that a reader

24:02

will increment its value, its counter value

24:08

before it starts an operation

24:11

and again at the end of its operation.

24:14

And so what the writer needs to look for

24:16

is any reader where the number has gone up

24:18

or where it's even.

24:21

And that way you know,

24:22

that like dormant readers are not a problem.

24:24

Readers that continue to hold onto their access is a problem

24:28

because they prevent the writer

24:29

from starting to modify the other one.

24:31

But it doesn't prevent readers from continuing to read.

24:34

It just prevents the updates to the map.
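
The even/odd scheme just described can be sketched with per-reader epoch counters. This is a sketch of the idea, not left-right's actual API; the names `Epoch` and `wait_for_readers` are illustrative, and the 64-byte alignment anticipates the false-sharing fix discussed later in the talk.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Per-reader epoch counter, one per reader thread, padded onto its own
/// cache line so readers never write to a line any other reader touches.
#[repr(align(64))]
struct Epoch(AtomicUsize);

impl Epoch {
    fn new() -> Self { Epoch(AtomicUsize::new(0)) }
    /// Bump on entering a read: the count is odd while a read is running.
    fn enter(&self) { self.0.fetch_add(1, Ordering::AcqRel); }
    /// Bump again on leaving: the count is even whenever no read is running.
    fn leave(&self) { self.0.fetch_add(1, Ordering::AcqRel); }
}

/// After swapping the pointer, the writer waits until every reader is
/// either not mid-read (even count) or has moved past the snapshot (it
/// must have re-read the swapped pointer). Either way, that reader can
/// no longer be inside the old copy. Dormant readers sit at an even
/// count, so they never block the writer.
fn wait_for_readers(epochs: &[Epoch]) {
    let snapshot: Vec<usize> =
        epochs.iter().map(|e| e.0.load(Ordering::Acquire)).collect();
    for (e, &seen) in epochs.iter().zip(snapshot.iter()) {
        loop {
            let now = e.0.load(Ordering::Acquire);
            if now % 2 == 0 || now != seen {
                break;
            }
            std::hint::spin_loop();
        }
    }
}

fn main() {
    let epochs = vec![Epoch::new(), Epoch::new()];
    // Simulate one reader doing a complete read.
    epochs[0].enter();
    epochs[0].leave();
    // All counts are even, so the writer's wait returns immediately.
    wait_for_readers(&epochs);
    println!("safe to modify the old copy");
}
```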

24:37

So the next thing you'll observe here

24:39

is like the number here is pretty high,

24:42

like this is 3 billion operations per second

24:46

that we get to read.

24:47

But again, it's just because we have, hand-waving a bit,

24:51

about 10 cores and so it should be about 10 times

24:55

what we had at one core, which is not what you get

24:57

with a reader-writer lock,

24:58

even though the name sort of implies you might.

25:02

And the key here, though, is that this only works

25:05

if writes are rare.

25:08

If writes are rare then this performs really well,

25:11

really, really well, because only writes

25:15

can cause contention in this system.

25:18

On the other hand, if writes are not rare,

25:21

it doesn't perform any better than what you get

25:23

with a reader-writer lock.

25:24

And in fact, if you start flipping,

25:26

I didn't include that in this graph,

25:27

if you start flipping the ratios,

25:29

you do more writes than reads, this does worse

25:32

because the writer has to do a bunch more work,

25:35

and you don't really get the speed performance win

25:37

from the readers being lower contention.

25:42

And the intuition here for why the reader-writer lock

25:45

gets worse is again the same one we talked about earlier

25:48

for why the reader-writer lock graph goes down.

25:51

It's that the reader-writer lock gets punished more

25:54

as you add readers.

25:55

So here, there are nine times more readers

25:58

than writers, right?

25:59

So in that case you, you would expect to get a slowdown,

26:03

because they contend more with each other.

26:05

And I wanna give you a debugging story here

26:07

because there's a useful insight as part of this.

26:11

So while benchmarking left-right

26:12

to produce the previous graph

26:14

that went up and to the right linearly,

26:16

I discovered that it did not go up and to the right linearly.

26:19

There was sort of this weird dip where I was expecting it

26:22

to keep going linearly and instead it dropped significantly,

26:26

it dropped by like 10x when I got to four cores.

26:29

And I was like, why is this happening?

26:31

This shouldn't be happening.

26:32

It does not match my intuition for how CPUs work.

26:36

And it turned out that the problem here

26:40

was not like crossing between NUMA architectures

26:44

or like you ended up going

26:45

to like a different CPU package that's further away.

26:48

Like I went through all of these like ideas

26:50

of maybe it's this, maybe it's that.

26:52

Turned out it's actually pretty simple

26:53

and it's something that you like,

26:56

it's the one of the first things you learn

26:57

when you deal with concurrency at this level.

26:59

It was something called false sharing.

27:01

False sharing happens because, remember how I said,

27:05

the cache doesn't operate on individual memory addresses,

27:08

it operates on cache lines, 64-byte lines.

27:12

Well, a counter, you can have more than one counter

27:16

in 64 bytes.

27:18

And so if you get unlucky, the counters for,

27:22

so this is, like, not the counter we're benchmarking

27:25

but the counter inside of left-right.

27:27

If multiple threads have their counters

27:29

on the same cache line, then it doesn't matter

27:33

that one is only modifying this counter

27:35

and the other is only modifying that counter,

27:37

they're on the same cache line and the ownership,

27:40

the MESI ownership is determined based on the cache line.

27:43

So when one writes and then the other writes,

27:46

the cache line will still bounce

27:47

even though they update distinct parts of that cache line.

27:51

And so the fix was like a one-line fix,

27:53

it was to add 64-byte alignment

27:57

to the type that holds the counters

27:59

because now they must be on different cache lines

28:02

and the problem went away.

28:04

Like it was just that.

28:06

There's now a new release of left-right

28:08

that has this fix in it.

28:09

So now, it should go up and to the right for you as well.

28:11

I contributed upstream to myself.

28:16

But there's an important distinction here

28:18

which is that lock-free does not mean contention-free.

28:22

Just because there are no locks

28:24

does not mean that the CPU isn't gonna punish you

28:27

because you made memory move around a bunch more.

28:30

Like moving things across copper wires is expensive

28:32

and if you can avoid doing it,

28:34

you will have a faster program.

28:36

This is also why it's important to know about this stuff

28:38

to have some amount of the intuition

28:40

for why did this thing perform that way

28:43

or why did it not perform the way that I wanted to,

28:45

and try to like debug in a way

28:47

where if you threw this at, like, perf

28:50

or something, it wouldn't really tell you.

28:52

Because the thing that's going wrong

28:53

is like the CPUs have to talk too much to each other,

28:57

which is sort of a layer beyond the abstraction layer

29:01

that we usually reason at.

29:03

And what I want the sort of takeaway of this talk to be

29:06

is that I want you to start choosing a little more wisely.

29:10

If you have to write code that needs to be highly parallel,

29:13

highly concurrent, and is sort of sitting

29:15

in these critical loops, then it's not as though

29:19

you should just take the first algorithm you find,

29:22

even if it's one that's been benchmarked,

29:24

it gets really good performance reviews

29:26

or like you look on the OCaml version of crates.io

29:31

for the fastest mutex or the fastest reader-writer lock

29:35

and you just use that one.

29:37

Because the reality is you have to pick the algorithm

29:40

that best suits the sort of data transfer patterns

29:45

that your application has.

29:47

Because the way we make concurrent programs faster

29:50

is that we pick algorithms

29:51

that align with what your program wants to do.

29:54

You make use of every single like weird oddity

29:58

that your workload has.

30:00

All of those can be exploited

30:01

to make the algorithm slightly better for your use case.

30:05

And I mean left-right here is a really good example.

30:07

It is not a drop-in replacement.

30:09

There are a bunch of things that left-right does

30:11

that you will not want to do for most applications.

30:14

Like obviously, it keeps two copies of all your data.

30:16

That's pretty expensive.

30:18

Readers can see stale data, right?

30:21

So the writer gets to modify the stuff,

30:23

and thinks it's like modified the stuff,

30:25

but the readers don't get to see it

30:27

until the next time they read after a pointer swap.

30:30

Left-right has a mechanism where the writer can say,

30:32

I now want to wait, like don't return

30:35

until all the readers have moved on.

30:37

So you can do that, but that then comes

30:39

at additional costs to the writes.

30:42

It's also single writer only.

30:44

You cannot have multiple writers in left-right.

30:46

It assumes that there's only one, if you want multiple,

30:49

you have to put the write side of it in a mutex.

30:52

And then now you have a mutex

30:54

on top of a concurrent data structure

30:55

and you're gonna get sadness.

30:57

It's possible but it's not what it's designed for.

31:00

Operations also have to be deterministic

31:02

so that you can take an arbitrary data structure,

31:04

and turn it into a left-right data structure.

31:07

The reason for this

31:08

is because when you have this left-right split,

31:11

imagine that you do a bunch of operations

31:12

to the, say, left copy.

31:17

And now you switch the pointer over.

31:19

Now you need to take the right copy

31:21

and apply the same changes to that right copy

31:24

as you applied to the left copy while it was the active one

31:28

because the right copy is sort of stale now,

31:30

it didn't see those updates.

31:32

That means you need to be able to log,

31:34

like keep a log of the updates that have been applied here

31:37

but haven't been applied here.

31:39

And then you need to be able to apply them to the other one,

31:41

and they need to have the same result.

31:44

Otherwise, you would end up with the two copies

31:46

having different state

31:47

and then readers will get super confused.

31:50

And so you can only do this for data structures

31:53

where you're able to keep an operational log

31:55

that's deterministic in how it affects the data structure.

31:58

Not all data structures have that property.

32:01

Writers also have to wait for all readers to exit,

32:04

which like may or may not be a problem for you.

32:06

Like maybe you actually want them

32:08

to be able to just run a bunch of writes

32:11

while the reads are going, we don't really care.

32:12

Maybe that's the application you want.

32:13

You want weaker semantics.

32:15

You could do that and probably build

32:16

an even faster data structure.

32:18

But the nuances of what you need

32:20

is what gives you the performance.

32:23

There's a bunch of other stuff

32:24

like it's not linearizable,

32:25

so it's only eventually consistent,

32:28

which may or may not be a problem for you.

32:30

For example, I would not use left-right

32:31

for financial transactions but I would use it

32:34

for things like caches or lookups or those kinds of things.

32:39

So like your mileage may vary.

32:42

The reality is that every solution

32:43

has trade-offs here, right?

32:44

Mutexes are the correct choice

32:47

for a bunch of concurrent algorithms.

32:49

Just like reader-writer locks are the correct choice

32:51

for a bunch of algorithms, and left-right

32:52

is the correct choice for a bunch of them.

32:54

But it depends on what your needs are.

32:58

And that's what I really want you to take away from this.

33:01

Some of the questions I would recommend

33:02

that you try to ask yourself here

33:03

is like what is my ratio between reads and writes?

33:05

That tends to really affect what algorithm you use.

33:09

You should ask yourself how long is the code

33:12

that I want to execute in a synchronized way?

33:15

'Cause if it's short versus long,

33:17

that changes the sort of budget you have

33:19

for the synchronization.

33:21

Think about how many threads you're gonna have.

33:22

If you're gonna have three threads,

33:24

none of this matters, right?

33:26

If you care about like it should be faster at two threads

33:29

than one, then okay, maybe it matters a little,

33:31

but like, you know, you saw the dramatic graph,

33:34

it can matter.

33:35

But usually, that's not the kind of scale or concurrency

33:38

where this stuff really becomes relevant.

33:42

Can I tolerate eventual consistency?

33:44

Like do I need strong consistency

33:45

in my reads versus my writes?

33:47

Do I need, for example, to be able to read my own writes?

33:50

So imagine that I, as the writer,

33:53

I make a modification to the data structure,

33:55

and then I try to read from the data structure.

33:57

Should I be guaranteed

33:58

that I read the modification I just wrote?

34:00

Left-right does not give you that guarantee,

34:02

'cause you might read from another map

34:04

that's like the other one than where you were writing.

34:07

But if you need that,

34:08

well, then you need to take that into account

34:10

in your design of the algorithm as well.

34:12

Do you need linearizability?

34:13

Like do you need even stricter guarantees

34:15

about what two threads can possibly see

34:19

in relation to one another,

34:20

like which interleavings are possible?

34:24

And I'm not gonna teach you all of those things now.

34:26

It's more that these are the kinds of things

34:29

you should ask yourself if you want to go

34:31

the sort of step beyond of just using

34:33

an off-the-shelf concurrent algorithm.

34:36

Always measure before you start optimizing these things.

34:39

And the real lesson is that it's not about locks,

34:41

it was never about locks, right?

34:43

Coordination is expensive because of cache coherence,

34:46

because of what the CPU needs to do

34:48

to guarantee your program executes correctly.

34:51

The locks are just the interface you happen to be using.

34:54

And there are a bunch of other ways in which CPU caches

34:56

can burn you as well, but this is the one you're most likely

34:58

to run into without necessarily realizing

35:00

that that's the thing that bit you in the ass.

35:03

A couple of other articles that are really useful

35:05

that talk about more of the details here.

35:07

This top one I really like, which is a breakdown

35:10

of like how long do things take in the CPU?

35:14

Like how long does it take to read from disk

35:17

versus talk to your L2 cache, versus send a TCP packet,

35:21

versus send a TCP packet to the other side of the world.

35:24

Like what is the relative timescale of these things?

35:28

And some of the other ones are also just interesting

35:31

other data structures you might be able to make use of.

35:34

So for example, scalable reader-writer locks

35:37

talks about a way where you take a reader-writer lock

35:39

and instead of having one counter for the number of readers,

35:43

you have one counter per core per reader.

35:47

So there's like, you don't have as many,

35:49

you don't have per reader, you only have per core,

35:52

and therefore most operations are core-local,

35:54

and the writer walks across all of them.
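A rough sketch of that per-core-counter idea, with explicit slot indices standing in for a real per-core lookup (this is an illustration of the shape of the technique, not the paper's actual implementation): each reader touches only its own padded slot, so reader arrivals stay core-local, while the writer walks every slot to decide whether readers are present.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// One counter per slot, each on its own cache line so reader
// arrivals on different cores don't bounce a shared line.
#[repr(align(64))]
struct Slot(AtomicUsize);

struct ReadIndicator {
    slots: Vec<Slot>,
}

impl ReadIndicator {
    fn new(n_slots: usize) -> Self {
        Self { slots: (0..n_slots).map(|_| Slot(AtomicUsize::new(0))).collect() }
    }
    // A reader bumps only its own slot; in a real implementation
    // the slot would be chosen by the core the thread runs on.
    fn arrive(&self, slot: usize) {
        self.slots[slot].0.fetch_add(1, Ordering::SeqCst);
    }
    fn depart(&self, slot: usize) {
        self.slots[slot].0.fetch_sub(1, Ordering::SeqCst);
    }
    // The writer is the only one who pays the cost of walking
    // across all the slots.
    fn readers_present(&self) -> bool {
        self.slots.iter().any(|s| s.0.load(Ordering::SeqCst) > 0)
    }
}

fn main() {
    let ind = ReadIndicator::new(4);
    assert!(!ind.readers_present());
    ind.arrive(2);
    assert!(ind.readers_present());
    ind.depart(2);
    assert!(!ind.readers_present());
    println!("ok");
}
```

The trade-off is exactly the one the talk keeps returning to: reads get cheaper because they stop sharing a cache line, and writes get more expensive because the writer now has to visit every slot.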

35:58

And so this also can be the kind of thing

36:00

that like maybe that fits your use case better,

36:03

but ultimately, I hope you give some of these a read

36:05

to give you a better idea

36:06

of what the trade-off space looks like

36:08

and what options you have if you wanted to write

36:11

some really concurrent code.

36:13

That's all I have for you.

36:14

Thank you.

36:15

(audience applauding)

36:19

Questions?

36:22

Do you have questions now

36:23

because you had questions earlier?

36:26

Sure, I'll ask one more.

36:27

I'm not sure how helpful this will be,

36:29

but have there been any interesting hardware innovations

36:31

for parallel programming?

36:33

Like somebody released some new operation

36:35

that made like mutexes like 10 times faster

36:37

because it was a native operation.

36:40

So I wouldn't say

36:42

there's been like revolutionary new things

36:46

where like just whole workloads are now 10x faster.

36:49

But the MESI protocol started out as the MSI protocol,

36:53

and then they added E, like the exclusive state,

36:56

'cause they found that certain workloads

36:59

could avoid like half of their cache line contention

37:03

by having this.

37:04

And I think they're up to like seven or eight letters now

37:07

in some of the more modern CPU architectures.

37:10

Part of the problem

37:10

is these are often proprietary protocols.

37:13

Like we know that they exist

37:15

because they are documented in the spec for the CPU,

37:18

but they don't tell us like what does the CPU do internally?

37:21

They just sort of tell us how to think about it.

37:24

Some of them are like earlier versions were open,

37:26

but later versions we don't know exactly

37:28

what the CPUs are doing,

37:29

but we know that they work roughly like this.

37:32

But nothing where we've been like, this is crazy.

37:36

I think the closest I would say

37:37

is there's been some work

37:40

on like when they build CPU packages

37:42

to build them more in 3D.

37:44

And the reason 3D is cool

37:46

is because you can have shorter distances to things, right?

37:49

So you can put your cache above your CPU,

37:52

and so now that cache can be bigger,

37:54

because it has more space to grow.

37:57

And so now you have a bigger L3 cache,

37:59

and you can maybe share it among more cores.

38:01

This is like the, for AMD processors,

38:04

they have this thing called 3D V-Cache,

38:07

and you can buy, like, their CPU

38:09

or the same CPU with X3D in the name.

38:12

And the X3D one is like, they stuck an L3 cache on top

38:15

and so you have more L3 cache now.

38:17

And that ends up helping for some of this stuff, right?

38:20

Because it means you don't need to go as far

38:23

when you need to access data

38:25

that another core has written, for instance.

38:28

But not really anything that's been mind-blowing.

38:31

There has also been some cool work on fetch add.

38:36

So one of the things that's neat about fetch add

38:39

is that it's fully commutative, right?

38:44

The fetch add operation is add one

38:46

and tell me what the number was when you added one.

38:50

And what you can do is you can actually have lots of CPUs

38:53

sort of pool their fetch adds

38:55

and then one core runs all of them

38:58

and then sends the results back to the other cores.

39:01

And so that way you don't have to bounce

39:02

the cache line around as much.

39:04

I don't know of any current platforms

39:06

that do this optimization, but I know for a period they did.

39:10

And again, part of what makes the answer here unsatisfying

39:15

is that so many of these systems are proprietary.

39:18

And so it's just we don't necessarily know.

39:21

We know what was published many years ago

39:24

and we don't know how much of it is still in use.

39:25

So a lot of it is also speculation.

39:30

Thank you.

39:31

Yeah.


39:35

How would a performance

39:37

of left-right change if you had many writers

39:39

or many readers joining and leaving over time?

39:43

For which data structure?

39:44

Oh, for left-right.

39:45

For left-right,

39:46

if you have many readers joining and leaving over time.

39:48

So the way left-right is currently set up

39:52

is that the list of readers is held in a mutex.

39:58

And so if a reader wants to join

40:00

it has to take that mutex, allocate its own counter,

40:04

stick it into the list inside that mutex,

40:06

and then unlock the mutex.

40:08

And the writer uses that same list

40:10

to figure out all the readers to read from.

40:12

But the readers once created, never have to access the mutex

40:15

unless they want to go away.

40:17

And in fact, even if they go away,

40:18

they could just leave their stuff there

40:20

because then their count stays even, and so the writer ignores them,

40:22

but they should clean it up.

40:26

If this ends up becoming a bottleneck,

40:27

then you could easily make that list

40:31

be a lock-free linked list instead.

40:35

Because you only ever need to add to it or walk it.

40:38

You never need index lookups.

40:40

And so you could make it, like you could allow readers

40:43

to join in parallel as well.

40:45

The only thing you'd have to be a little bit careful about

40:47

is to make sure that the writer reads all of the counters

40:53

after it does the swap.

40:55

If it reads any of them before the swap,

40:57

then it can get confused.

40:58

But should be doable.

41:00

It's just, I haven't needed that optimization

41:02

because readers joining has been not the bottleneck.

41:10

Kind of getting at a similar thing

41:13

to the hardware question,

41:15

how much variety is there in implementation

41:17

of mutexes on a software level

41:19

and how much of a difference does that make?

41:21

On the software level, mutexes are

41:25

a little bit of weird beasts.

41:26

So there was a period of time where

41:30

there were a lot of like every language

41:32

had their own implementation,

41:34

and then a bunch of companies wrote faster implementations

41:37

they wanted to use, and then eventually,

41:39

we got better operating system support for mutexes.

41:42

So futexes, for example, are fast user-space mutexes

41:46

in Linux, and you can usually just use those,

41:49

like you don't really need a lot of packaging on top of it.

41:52

And then in later years,

41:53

we found people who've tried to do

41:56

like different kinds of optimizations on the mutexes.

41:59

So they're not necessarily about decreasing the contention

42:03

'cause that's gotten to the point

42:04

where like it's just the baseline contention.

42:06

You can't really do any better,

42:08

but they try to decrease the size of the mutex.

42:12

So there are data structures, for example,

42:13

where I think the size of a mutex is one byte,

42:17

but they can still make it work.

42:19

Is it two bits now?

42:20

We're down to two bits.

42:21

You get down to two bits.

42:22

Yeah.

42:23

But the problem, of course,

42:24

is then if you start packing many of them together,

42:27

then now you hit the cache line problem.

42:29

And so now they're contending with each other,

42:31

and so it doesn't really help to make them smaller

42:33

unless you strategically place mutexes so that they

42:36

don't contend with each other on the same cache line,

42:39

and you can see how deep this rabbit hole goes.

42:42

I see.

42:43

But in terms of quality of implementation,

42:44

I would say they're all roughly equivalent now.

42:47

The exception is when you start looking

42:49

at things like reader-writer locks,

42:51

where different operating systems

42:53

have very different mechanisms for them.

42:55

And they need to, just like mutexes,

42:57

they have to hook into the operating system

42:58

to make sure that a thread

43:00

that can't take the lock right now,

43:02

can be put to sleep. For example, a reader comes

43:05

but a writer is currently holding the lock.

43:07

That thread wants to go to sleep,

43:09

so it needs to talk to the operating system

43:11

and then the operating system needs to have a mechanism

43:12

to wake it back up.

43:14

And sort of how reader-writer locks should work

43:17

in the operating system,

43:18

I think there's still some iteration on.

43:20

Also in user space.

43:25

Like how can you take the reader-writer mechanism

43:28

the operating system gives you

43:30

and make it more scalable, for example,

43:32

by keeping multiple copies, one per core.

43:36

Where what like a given thread will do

43:38

is look up what core it's on, pick that reader lock,

43:41

and then take that one.

43:43

And so that way you can like build better things

43:46

on top of the operating system

43:46

so that there's sort of more design space

43:50

for reader-writer locks than there is for mutexes.

43:52

Nice, thanks.

43:53

Got a question back there?

43:54

No.

44:03

You mentioned in your earliest code example

44:07

that you needed to provide a little compiler hint

44:11

to prevent the compiler from optimizing away

44:13

your entire benchmark setup.

44:15

I'm curious about like what sorts of optimizations exist

44:18

in compilers around mutex handling

44:21

and if there are any particularly surprising

44:23

or shocking ones you've seen?

44:25

So compilers tend to be kind of careful around mutexes,

44:29

because users tend to assume

44:32

that they will operate in certain ways.

44:35

But the way they do that is usually by having barriers.

44:39

So there's like, I don't wanna get too deep into like

44:42

the different kinds of barriers

44:44

and like the various like SeqCst

44:47

versus acquire versus release and stuff,

44:49

that's gonna get so painful.

44:51

But the compiler basically injects

44:54

these little special sparkly bits in your assembly,

44:59

in the generated code, where sometimes

45:01

it's for the compiler, sometimes it's for the CPU,

45:03

sometimes it's for both, that says you're not allowed

45:07

to move reads before this point

45:10

or you're not allowed to move writes after this point,

45:14

and vice versa.

45:15

To try to basically ensure that, for example,

45:17

if you take a lock and then you read something

45:20

after you took the lock, the CPU can't do

45:23

out of order execution and do the read

45:24

before it took the lock.

45:25

'Cause that would be not okay.

45:29

But it's not really like a compiler thing.

45:35

It's more that the implementation of the data structure

45:38

needs to know to inject those sparkly bits

45:43

to make this work correctly.

45:44

And if you don't, you would end up with a mutex

45:46

that looks like it works correctly,

45:48

but sometimes you read the value before you got the lock.

45:51

Compilers do tend to be like very good at optimization,

45:56

a little bit better than we would like them to be.

45:59

Right?

46:00

But we would rather that than the other way around,

46:02

probably, maybe.

46:03

There has been an argument in like recent years

46:06

for dumber compilers, 'cause they're like,

46:08

CPUs have gotten faster, so make the compilers dumber

46:11

so that we can do more stupid shit

46:12

and have it still do the correct thing.

46:15

But I wouldn't say like compilers are,

46:19

they're not like deeply tied to how mutexes work at all.

46:28

I think we're good.

46:30

All right.

46:31

Thank you, everyone.

46:32

(audience applauding)

Interactive Summary

The video discusses the performance of mutexes and reader-writer locks, challenging the common assumption that mutexes are inherently slow. It delves into the underlying reasons for performance issues in concurrent programming, focusing on CPU cache coherence protocols like MESI and their impact on performance. The presenter introduces the left-right data structure as an alternative for read-heavy workloads, highlighting its advantages and trade-offs, such as the potential for stale data and single-writer limitation. The discussion also touches upon hardware innovations and software implementations of synchronization primitives, emphasizing that the choice of algorithm should be based on specific application needs and data access patterns. The key takeaway is that coordination between CPU cores, driven by cache coherence, is the primary source of performance bottlenecks in concurrent systems, rather than the locks themselves.
