Cosmos DB Optimization
Hey everyone, in this video I want to talk about optimizing your Cosmos DB architecture, specifically around optimizing your cost. Now, sometimes when we look at Cosmos DB there is this initial reaction of "this is expensive," but really there are a lot of different dimensions and options you have to understand and configure correctly: the particular type of service, the amount of RUs (which we're going to talk about), your data model, and how you're actually going to interact with the data. So I want to walk through some of the key options so we can make the right decisions.
Now, firstly, there are a number of different SKUs available, so let's walk through those. When I create a Cosmos DB resource, what I'm actually creating is an account. So when I think about my Cosmos DB, the actual resource is an account. This is the ARM resource. Now, the account contains n number of databases, and each database contains n number of containers, into which we put the various documents that make up our data set. We're going to come back to some specifics around this later on, because there are a number of different options we can set at the account level and the container level, and sometimes the database level, and they're really important when I think about optimization and the associated costs.
Now, one of the terms I'm going to use a lot is request units, or RUs. You think about a certain number of request units per second, and you can really think of an RU as the compute unit of Cosmos DB. Different types of interaction cost a certain number of RUs, and there's a really nice Microsoft site we'll look at quickly that demonstrates this. A specific point read of an item costs about one RU, but an insert, an upsert, a delete, a query, well, they're going to cost a certain number of RUs, not just a single RU, and that number is going to depend on the type of interaction, how the data has been distributed, the number of regions, and a lot more. So there are some different things we have to consider when I start to think about the number of RUs we're going to use.
Now, a really important point to understand is that when we set a number of request units for a container, this is the number of request units we can consume per second. We don't multiply that number by 60 and then by 60 again to get an hourly cost; the price is for having that per-second capacity available for the hour. And it really scales up to: if I continue consuming that number of RUs every second for the month, that is my monthly cost. Otherwise, yes, it would seem really, really expensive. So if I actually go and look at cost for a second, what you'll see in the pricing is it's telling me, hey, 100 RUs per second for a single region is $5.84 per month. So every single second of every minute of every hour that month, I can use that number of RUs. If my workload was completely flat and I only ever needed 100 RUs per second, and that's all I configured, that's what it would cost. Now, the thing is, how we assign that number of RUs is going to vary greatly, and that is going to be the key focus for this video. I'm also going to pay for the amount of storage capacity I consume.
Now, if I think about the SKUs that are available to us, well, the first one is we do have a free tier. With the free tier, what I get is a fixed number of RUs: 1,000 RUs per second, and 25 gigabytes of storage. So every single second I can consume 1,000 RUs and I don't pay anything. This is great when I think about development, maybe some testing, maybe some prototyping; it's really good for those scenarios. I can have one of these per subscription. So in every subscription I have, I can have one free-tier Cosmos DB account and consume those 1,000 RUs per second. It's a nice way to go and play and not spend any money.

When we start to move into real production scenarios, then obviously we get into the paid tiers, and where we're going to focus here is the idea of provisioned throughput: I'm going to set a certain number of request units. So I'm going to think about provisioned, and within provisioned there are two different types.

Now, the first one we're going to think about is manual. Sometimes you'll hear it called "provisioned," which is inaccurate, because the other type is also a provisioned amount. This is manual, and this is where I set a fixed number of RUs. What I'm configuring is X amount of RUs per second. I just set that number, and I'm going to pay for it whether I use that amount or not. So over some period of time, this is what I'm paying for: I'm setting that number of RUs, and it's a fixed amount. Even if my actual workload varied in how many RUs I'm using each second, across different partitions, across regions, I'm still paying for that fixed amount of RUs. Because of that, I need a really, really consistent amount of use to make this a good fit. So when I think about using manual provisioned, what I want is a super predictable and constant amount of work for it to make sense. I'm also going to pay for the storage capacity I'm consuming, the amount of data. And if I have additional regions, I would pay for those too, so there's obviously going to be a certain multiplication by however many regions I want this available in. And I have a configurable consistency.
And this is actually a really big deal. Consistency is one of Cosmos DB's superpowers: I can have, if I wanted to, multiple writable instances across many different regions, and it will work out how to make them consistent. Maybe consistency over a session is enough, or maybe I need it absolutely consistent across all of them, which means I'm going to pay a latency penalty when I actually write. But I have the ability to have many, many copies in many, many regions with a configurable consistency. And obviously I'm going to pay for the RUs assigned in each of those regions. Now, what's interesting here is they don't have to be the same: for each region I add, I can assign whatever RUs I want for that particular region. The storage capacity is going to be roughly the same for all of them, and there's going to be some network egress as well for replicating the data. But obviously I'm going to pay where I basically have copies of the data and additional sets of compute to serve and respond to the requests hitting that particular region.
But really, the key point here is that with manual provisioned, I am setting a flat amount of RUs that I pay for regardless of my usage. For that to make sense, the usage really needs to be pretty much up at that line. So I want a super predictable, constant, uniform amount of compute usage for manual provisioned to actually make sense. Now, there are things that can help make it make sense: Azure reservations. If I know I need a certain amount for a really long period of time, we're thinking one or three years, then with the discount I get, I can accept some variance in that usage; it could be a lower amount of use and still make financial sense. If we look at the documentation, I can go and look at reservations, and yes, there's the idea of a bucket of 100 RU/s that I can just keep applying, and I get a 20% or 30% discount. But if I'm willing to commit to much larger amounts, you can get some crazy discounts when you start getting into really high numbers. And obviously, at that point, if I'm committing to super huge amounts, then hey, maybe I'm willing to accept some more idle capacity because I'm still saving a lot of money. But really, what we're thinking about here is a very predictable, very constant amount of work.
Now, the next thing we think about in terms of the different options available. Yes, we had manual, but I'm going to draw this one as, for the most part, the superstar, the one that 99% of the time you're probably going to end up using: it's still provisioned, but this time it is autoscale. The big deal here is that this is the default going forwards, because 99% of the time it's going to be the best option for you, but it is dynamic in nature in terms of the RUs you actually get billed for. What happens is, yes, I still set X amount of RUs, but the actual amount you get billed for depends on the amount you're consuming. Whatever you set is the 100%, the maximum it's allowed to use, but it can go all the way down to 10% of that number of RUs, and my bill, those dollars, is based on what it's actually consuming. Because of this autoscale nature, there's a premium on the number of RUs it's essentially consuming for the various interactions, and the premium is essentially 1.5x. So I'm paying a 50% premium to get this.
And we can see this in the pricing. If I jump back over to the pricing page and this time change it to autoscale provisioned throughput and look at the pricing, it tells us it's 1.5x. So we pay a premium, because of that idea that we only pay for what we use. Now, on top of that, I can still have any number of regions, and the nice thing here is that the amount it bills me is based on independent utilization per partition, per region. It is per-partition (which is going to make more sense later on), per-region utilization. That is, if I'm using 90% in one region, I don't get billed 90% for all of the regions. If one region's at 90%, one region's at 20%, and one's at 40%, each region independently bills me for the utilization in that particular region, at the partition level. So it's a really good option. And as you can see, I'm paying more per RU, but because it only charges you for what you actually use, all the way down to 10% (that's the lowest it will ever charge you), in nearly every single scenario you will end up paying less. And that's why autoscale is the default.
The break-even between these is: if my average utilization across every partition in every region is above 66%, then sure, manual would make more sense. But if you look at the reality of nearly every single deployment, that is never the case. So there is this idea, and I'll even write it down: break-even is 66%. Because of the 1.5x price premium, you can work it out from that. If every single partition in every single region is not above 66% utilization, then autoscale is going to make way more sense.
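That break-even math can be sketched like this (assuming the 1.5x autoscale premium and the 10% billing floor described above, and ignoring regional price differences by using a normalized price of 1.0 per RU/s-month):

```python
# Compare manual vs. autoscale billing for the same max RU/s.
# Normalized price: 1.0 per RU/s-month for manual, 1.5x premium for autoscale.
AUTOSCALE_PREMIUM = 1.5
AUTOSCALE_FLOOR = 0.10  # autoscale never bills below 10% of the max

def manual_cost(max_rus: float) -> float:
    return max_rus  # you pay the full provisioned amount, used or not

def autoscale_cost(max_rus: float, avg_utilization: float) -> float:
    billed_fraction = max(avg_utilization, AUTOSCALE_FLOOR)
    return max_rus * billed_fraction * AUTOSCALE_PREMIUM

# Break-even: 1.0 == 1.5 * u  ->  u = 2/3, i.e. the ~66% mentioned above.
for util in (0.10, 0.40, 0.66, 0.80):
    cheaper = "autoscale" if autoscale_cost(1000, util) < manual_cost(1000) else "manual"
    print(f"avg utilization {util:.0%}: {cheaper} is cheaper")
```

Only at the 80% line does manual win, which matches the point that most real deployments sit well below the break-even.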
What's also nice about both of these options is that there is a maximum value: whatever I set is the maximum. So if I want a limit, these are great, because it can't go above that number. That's as much as it can consume, and therefore that's as much as I can possibly get billed for; it cuts off at that number. If I'm a vendor, then depending on how I'm architecting my application, the idea of a maximum on the container could be really important, because as I plan for billing my customers on different tiers of service, I want to make sure it's not some infinitely scaling thing; there is very much a limit to what I can be billed for. So again, when we think about provisioned throughput and you're looking at your options, autoscale really is the shining star. In nearly all scenarios, it's going to be the best option.
Now, the way I configure this is at the account level. When I go and create a new account, it's at the account level that I set whether it's provisioned or serverless. So my decision for the account is: I pick provisioned, or (I'll use a different color, since I'm about to introduce the idea) serverless. Then, at the container level, if I have configured the account as provisioned, I set manual or autoscale, and I set the number of RUs that I want. So only if I'm doing provisioned do I then, at the container level, go and set manual or autoscale. And we can see this if we jump over to the portal for a second: I'm looking at my Cosmos DB account, and I see I'm configured as provisioned throughput. That idea of using provisioned throughput is something I set at the account level.
Now, for the sake of being thorough: I said you set manual or autoscale at the container level. Technically, I can also set it at the database level. If I wanted to (and this is very optional), I could set the database to be manual or autoscale and give it a certain number of RUs. What would then happen is the child containers under that database would share whatever that number of RUs was. But there's no guarantee of distribution between the child containers in times of contention; I could get a noisy neighbor. And because there's no guarantee of even distribution, it's generally not something that's typically used, except maybe in a dev/test scenario. So again, if we jump over and take a look, I did both. If I go to Data Explorer, one of the things we'll see is that yes, on my database, I did opt to configure it as autoscale. I set a number of 1,000, which means, remember, it could be billed between 100 and 1,000, and it's showing me that at the bottom: it could go as low as 10%, which is 100 RUs, or up to 100%, which is the number I'm specifying. So technically, at this level, any container that is a child of this database will just share this number. For dev/test, maybe sharing that is more cost-effective. Normally, though, we just use the database as a logical grouping. I don't set an RU value on it, and so for the most part you may consider database-level throughput fairly pointless.
Now, let's say I did set a value at the database level. I don't have to set one at the container level, but I can. And if I do set a value at the container level as well, it overrides any number set at the database. It doesn't have to be the same or less; that container is simply no longer part of the database allocation, it has its own. So if I set the database to 1,000 and had multiple containers under it, but then set one of those containers to 1,000 or 2,000 or 5,000, that container is not consuming any of the database's RUs anymore. The remaining containers that don't have their own allocation would then share whatever the database amount was. So again, in my example here, we already see I'm setting the database to 1,000, but on my volcano container I didn't set any value, so it is just inheriting from the database. When I created my fault lines container, I did set a value and gave it its own 1,000. So volcano is just consuming from the database parent, while fault lines has its own, which doesn't come out of the database parent value at all. You get choices in how you want to do this. And as I said, in this scenario, if I set throughput at the database, I technically don't have to set it on a container, it will just share; but if I do set it on a container, that overrides it and the container has its own. This is confusing; it's the worst bit I'm going to talk about, and it's an edge case you're typically not going to hit. The happy path is we don't set values at the database; we just set them on the containers. That is the happy path in all of this.
One thing I can do, remember we talked about the free tier, which is 1,000 RUs: I actually have the ability to apply that free tier to my provisioned account, and what that means is the first 1,000 RUs will be free and it would only bill over that. Now, if you're using the free tier (or any of these) and you want to ensure at the account level that it never bills you, because the account is only meant to stay free, or you just want a cap on the RUs being assigned underneath it, there's an account throughput limit feature, and with that it will never let you assign more RUs than you set at the account level. So if I go and look back at my account, you'll see under settings there's this idea of account throughput. If I select that, you'll notice, firstly, if it was your free-tier account, it's actually going to set this by default: limit the account to the amount included in the free discount. So it will never let you go above 1,000 RUs, and it will never be able to bill you. But I changed it to let me go up to 2,000, which means it could bill me a certain amount of money. However, I'm using autoscale, so its lower level is actually 10%, really 200 RUs, which is still within that nice free 1,000. Unless I have some peak of work, I still won't get billed. But this enables you to set that limit, and I could set no limit if I wanted to. It's a way of guaranteeing my costing. So if I'm using the free tier and I want to make sure I never, ever get billed, then yes, you can use that account throughput limit and set it to 1,000. And I can use that account throughput limit elsewhere as well if I want to, because, hey, maybe I don't want to go above 2,000 or 5,000, whatever it is. I have control of that.
I want to go into a little bit more detail about something that's going on behind the scenes, because it will make more sense later on when we talk about some of the other optimizations we can do. We had this idea of RUs and a certain amount of storage, and then this autoscale. Behind the scenes, all of this compute capability and all of the storage is enabled by physical partitions, and each physical partition supports a certain amount of compute. So for each container, when I'm doing manual or autoscale, it actually maps to n number of physical partitions, and each of those partitions gives us 10,000 request units per second. So every partition can do 10,000 RUs; that's per physical partition. (Again, this is all about physical partitions; there's a logical partition idea as well, which we'll get to.) So: 10,000 RUs and 50 GB of storage per physical partition, and there are actually four replicas of that storage as well. What that means is, if I set 50,000 RUs as the manual amount or the autoscale max, it goes and creates five physical partitions, and those five physical partitions can then support that 50,000 RU-per-second number you set. Even with autoscale, it's not removing those partitions, which is partly why there's a price premium element to this. I have five physical units powering this.
And if you consider how I create a Cosmos DB account and then a container, the only piece of information known to Cosmos DB is the number of RUs; it doesn't know the capacity, because you've not written anything yet. So the number of RUs drives the number of physical partitions, and the number of partitions can scale near infinitely, maybe even infinitely, I don't know. Now, remember, they are also storing data: 50 gigabytes each. The data is spread out over those partitions based on logical partitions. We'll talk about something called a partition key later on; partition key values get hashed to distribute documents over logical partitions, and each physical partition stores any number of logical partitions. If a particular physical partition gets full, it gets split, and the logical partitions are distributed pretty evenly over the two physical partitions that came from the one that had to split. As you increase RUs to a higher number, again, physical partitions get added and there's a certain amount of redistribution. So these are the building blocks, added as RUs or storage necessitate. And as I already said, for autoscale, if I'm setting this to 50,000, I've got five physical partitions; even if I'm running at 10% most of the time, using this tiny sliver of it, they're still there. Again, that's why there's a price premium. Now, every region I add a replica to has that same number of partitions, but the utilization can be different. And again, that's why, with autoscale, if I had a second region that was a lot less busy, you get billed for the utilization per region, per partition, and it's why autoscale is nearly always more cost-efficient. Even if in one region I'm fairly consistent, if my replicas in other regions are less used, autoscale is going to be way, way more efficient.
And we're going to come back to some of this physical structure, because as we talk about other sorts of optimizations, it actually becomes pretty important.
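That partition math can be sketched roughly like this (a simplification, assuming the 10,000 RU/s and 50 GB per physical partition figures mentioned above; the real placement logic is more involved):

```python
import math

# Rough model of how provisioned RUs and storage map to physical partitions.
# Assumption: 10,000 RU/s and 50 GB of storage per physical partition.
RUS_PER_PARTITION = 10_000
GB_PER_PARTITION = 50

def physical_partitions(max_rus: int, storage_gb: float = 0) -> int:
    by_compute = math.ceil(max_rus / RUS_PER_PARTITION)
    by_storage = math.ceil(storage_gb / GB_PER_PARTITION)
    return max(1, by_compute, by_storage)

print(physical_partitions(50_000))      # 5, the example from the video
print(physical_partitions(5_000))       # 1
print(physical_partitions(5_000, 120))  # 3: here storage forces the split
```

Note that with autoscale the count comes from the max RUs you set, which is why the partitions (and the premium) are there even when you run at 10%.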
Now, the next thing we have, and this is fairly new, so it's still getting built out, is the idea of serverless (let me make sure I draw this right). This is set at the account level: when I create the account, I have to say whether it's serverless or provisioned. So an account is provisioned or serverless. Also at the account level I can set things like regions, backup, and other stuff. With serverless, I set no throughput value on the container; I configure nothing. All I do is create the container in a serverless account. Because of that, it's billed a little bit differently. The way it gets billed is you get charged a price per million RUs consumed. So it's working very differently: there is no per-second allocation, and it's not rounded up to the nearest million either; you get billed for the actual number of RUs you use. Every operation costs a certain number of RUs, they just get added up, and hey, you're consuming these units of one million. So if we look at the pricing page again (I don't think I've ever shown a pricing page as often as this), and we change it to serverless, that's now what we see: just a price per million RUs you consume. It's 25 cents for the region I've got selected; obviously there's a certain amount of regional difference, etc. But it's a very, very different billing model than the other ones.
But if you look at this, for it to make sense, I would want a very bursty, I would say almost sporadic, ad hoc pattern of usage. That's when serverless makes sense. For any regular pattern, autoscale will make more sense. Actually, it's a good exercise: if you had a consistent workload and compared the monthly bill for provisioned RU-per-second throughput against serverless, where you just pay for the number of RUs you're consuming, well, I did a bit of back-of-the-napkin math, and serverless was about seven and a half times more expensive. So it does not make sense for any kind of regular workload. Where it shines is where I have those really bursty, really sporadic patterns that autoscale still wouldn't adequately cover in terms of optimizing my cost.
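Here's a minimal sketch of that back-of-the-napkin comparison, assuming the $5.84 per 100 RU/s per month provisioned rate, the 1.5x autoscale premium, and the $0.25 per million RUs serverless rate mentioned earlier (all region-dependent):

```python
# Compare a perfectly steady workload on serverless vs. autoscale billing.
# Assumed rates (region-dependent): $5.84 per 100 RU/s-month provisioned,
# a 1.5x autoscale premium, and $0.25 per million RUs on serverless.
SECONDS_PER_MONTH = 730 * 3600  # Azure bills on a ~730-hour month

def autoscale_monthly_at_full_util(rus_per_second: int) -> float:
    return rus_per_second / 100 * 5.84 * 1.5

def serverless_monthly(rus_per_second: int) -> float:
    total_rus = rus_per_second * SECONDS_PER_MONTH
    return total_rus / 1_000_000 * 0.25

steady = 1000  # consuming 1,000 RU/s every second of the month
ratio = serverless_monthly(steady) / autoscale_monthly_at_full_util(steady)
print(f"serverless is ~{ratio:.1f}x the autoscale cost")  # ~7.5x
```

Under these assumed rates the ratio lands right around the seven-and-a-half-times figure: fine for a workload that's idle most of the month, terrible for a steady one.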
Now, it's serverless, and normally we would say it will scale to infinity and beyond (I just wanted to get a Buzz Lightyear in), but there are limits. Today, the actual number of RUs available is based on the amount of storage. If you think about it, knowing how physical partitions work: as you add storage, you're adding physical partitions. But that is going to change in the future; the promise is it will scale more proactively based on request volume. Today, for example, I would see about 20,000 RUs for a terabyte of storage. Also, something interesting: today, this is not a guarantee of throughput. It's in Cosmos DB's interest to make it available and for you to be able to consume it, but it is not guaranteed. Also, today (and only today, a lot of this is changing), it is one region at time of recording. When multiple regions are supported, you would pay for the use at each region's replica, whatever the number of RUs consumed at that region. So every operation consumes a certain number of RUs, they just get added up, and you get billed for what you consume, divided into those units of a million. One region, and there are scale limits today that are being worked on. But it's for that bursty, sporadic usage; if your workload is not bursty and sporadic, today you're just going to go and use autoscale.

So I guess the net-net of this is: for nearly any workload, autoscale is going to be your best bet. If you have super constant use across all regions, across all partitions, above that 66% threshold, sure, maybe manual is a better option for you, especially with some of the Azure reservation stuff. If it's super, super sporadic, then sure, maybe serverless is an option for you. It's very new, and I think that use is going to grow over time.
So now we understand the SKUs and the options. What can we do to maybe optimize how we use our RUs? How do we think about designing? How can we be efficient with that? One of the biggest things is that when we start planning, there are some things we have to understand. One: I need to know the seasonality of the workload. We talked about that, right? That's basically going to make you decide, for the most part, between autoscale, maybe serverless, and again maybe manual, but honestly, probably not; there are almost no scenarios where manual is the better option. I also need to be able to estimate the number of RUs to pick the right number, unless I'm using serverless. Now, autoscale makes this really nice: it gives me a much better ability to set a number, with a lot of wiggle room for error. If I set it too low, I'll see throttling, and then I can go and adjust it higher. We talked a lot about this, but the next part is to think about, well, how do I optimize the RU use itself? And when we think about optimizing the RU use, I mentioned the words "partition key," and that's huge. So which partition key do I use? And then also, this is not a relational database; there are no joins, there are no foreign keys. So what are the types of query I'm doing? What do I have within my document in terms of attributes? How big are my documents? How do I think about separating what's bounded versus unbounded? So I then think about the data model, and this is how I can drive efficiency in my actual RU use for all the different things I'm doing.
When I think about interacting with my Cosmos DB data, I want to be able to query using a partition key in nearly all scenarios; that's going to optimize the number of RUs when I go and interact with my data. Most databases, relational or otherwise, are very read-heavy. So if I want to be able to query and find data, when I do that query, I want to query based on the partition key I have picked, and I pick the partition key at the container level. So it's super important we understand how we actually intend to interact with our data, so we can optimize the RU use. Let's dive into this. We talked about the idea that we create any number of containers; say I'm going to create a particular container. When I create the container, one of the most important steps is to pick the partition key. I cannot change it once I create the container, so it's super important I get it right. So as part of this, I have a partition key. What we write into our containers are documents. So I have my document, and it's in JSON format; it's a JSON payload. The document must contain the partition key somewhere within it; if it doesn't, it will get rejected. Whatever attribute I'm setting as the partition key, the document has to have it, because what's going to happen is the value of that partition key gets hashed, and the hash is used to distribute documents over logical partitions. Depending on that hash value, the document gets put into a certain logical partition, and that logical partition lives within a certain physical partition.
So, great: we have any number of logical partitions spread across these physical partitions. Maybe I've only got one physical partition; remember, up to 10,000 RUs, it's one. But a physical partition contains n number of logical partitions, and the hashed value of the partition key picks the logical partition. So we end up with lots of logical partitions, distributed over the physical partitions. A logical partition is only ever in one physical partition, and a logical partition has a 20 gigabyte limit. If we need more than that, we're going to have to do some other things; there's something called a hierarchical partition key, and we're going to come back to that.
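Conceptually, the routing works something like this sketch. Note the assumptions: Cosmos DB uses its own internal hashing over the partition key, so the md5 hash, the partition count, and the userId field here are just stand-ins to illustrate the idea:

```python
import hashlib

# Illustration only: route a document to a partition by hashing its
# partition key value. Cosmos DB's real hashing scheme is internal;
# md5 and a fixed partition count are stand-ins.
PHYSICAL_PARTITIONS = 5

def physical_partition_for(partition_key_value: str) -> int:
    digest = hashlib.md5(partition_key_value.encode("utf-8")).hexdigest()
    return int(digest, 16) % PHYSICAL_PARTITIONS

# Hypothetical document; "userId" plays the role of the partition key.
doc = {"id": "123", "userId": "john", "displayName": "John"}

# The document must carry the partition key attribute, or the write is rejected.
assert "userId" in doc
print("userId 'john' routes to partition", physical_partition_for(doc["userId"]))
```

The useful property this models: the same key value always hashes to the same partition, so a query by partition key only ever needs to visit one place.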
Now, most workloads, as we mentioned, are read-heavy. So our goal is: if I am read-heavy and I'm running a query to go and find some set of data, I want the value that query uses to find the data to be the partition key. Because if the query uses the partition key, Cosmos DB knows which logical partition, which means it knows which physical partition it has to go to. The whole goal of this is to minimize fan-out. Cross-partition queries are way more expensive: if I'm querying on some attribute that is not the partition key, it has to go and look across many, many different physical partitions, because the data is spread across many logical partitions, and that's going to cost me more RUs. Imagine I had user profiles for people that want to log in. My partition key is almost certainly going to be user ID, because often we're going to look things up for the user. The user ID would make a really good partition key, because then, hey, when I'm looking stuff up for John, I know I only need to read from a single logical partition, and therefore a single physical partition. I'm optimizing my RU use.
Now, one caveat to that, I would say, is when I think about the partition key, one of the things we want to ensure when picking it is something called high cardinality. I want many, many distinct values for that partition key. Whatever I pick, I want a lot of values, because then I should get a fairly even split across the possible values, and when I hash them I'll get a good distribution over whatever physical partitions I have. With the ability to have lots of logical partitions, I'll get a good, even distribution over my physical partitions. If I pick a poor partition key with very low cardinality, I get a very uneven distribution over my physical partitions. If I picked something that only had three possible values, then I can only distribute over three, and if one of them held most of the data, my distribution would be terrible. So when I think about high cardinality, we're really thinking ideally in the thousands of values for that partition key. So that's really important: make sure you're thinking correctly about how you're going to interact with the data, and then, for that interaction, I need high cardinality to ensure a really good distribution, while my interactions are still based on that value, so I'm avoiding that fan-out, that cross-partition searching.
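To make the cardinality point concrete, here's a small sketch comparing the spread across five hypothetical physical partitions for a low-cardinality key (imagine a three-value region field) versus a high-cardinality one (user IDs); again, the hashing is a stand-in for Cosmos DB's internal scheme:

```python
import hashlib
from collections import Counter

PHYSICAL_PARTITIONS = 5  # illustrative count, not a service limit

def partition_of(value: str) -> int:
    # Stand-in hash; Cosmos DB uses its own internal partition hashing.
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % PHYSICAL_PARTITIONS

def distribution(values) -> Counter:
    return Counter(partition_of(v) for v in values)

low_cardinality = ["east", "west", "north"] * 1000     # 3 distinct values
high_cardinality = [f"user-{i}" for i in range(3000)]  # 3000 distinct values

print("low cardinality :", sorted(distribution(low_cardinality).items()))
print("high cardinality:", sorted(distribution(high_cardinality).items()))
# A 3-value key can never touch more than 3 of the 5 partitions, so some
# partitions sit idle; 3000 distinct values spread fairly evenly over all 5.
```

That idle-partition effect is exactly the uneven distribution described above: you pay for physical partitions your hot key values never use.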
Then I actually think about this idea of
the document size.
So my max size
is 2 megabytes
which is huge.
Um this is a a JSON document. I really
probably shouldn't I want a 2 megabyte
document. Remember there's a certain
amount of cost to these things to read
and then write and do those types of
things. So the general guidance for this
is when I'm thinking about the document
size
where we can keep it between one and 10
kilobytes. That's going to help optimize
the argu when we're reading and doing
various writes to it. Now there's also
always this idea of well what data do I
put in the same document? Which data do
I split between multiple documents?
Remember, multiple documents can have the same partition key. This is NoSQL, not a relational database: every document I write can have different data in it. It's JSON, as long as all the documents for a particular entity share the same partition key. Imagine a blogging site. I could have the blog article as one document, with the article ID as the partition key. Every comment someone makes about that article, I probably would not write into the blog article document; I would have a separate document per comment, with the same partition key, so they sit in the same logical partition. When I want to fetch the blog article to show it, it fetches the main blog document and then all of the comments, because they share the partition key and live in the same logical partition. But sometimes I wouldn't want to get all of the comments. So what I typically think about is whether the related data is a bounded set. If I have other bits of information and it's bounded, say a couple of addresses, so it's two or three, a very small, finite number, then sure, I'll put them in the same document. But if it is not bounded, then certainly put them in different documents; split the docs. So I would have the blog article, then comment one as its own document using the same article ID as its partition key, then comment two, and so on. Think about that when you're designing this, because the nice thing here is it's not relational: every document can have completely different content that makes sense for you.
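The blog example above can be sketched as plain JSON shapes. The field names (`articleId`, `type`) are hypothetical; any shape works as long as the documents for one entity share the partition key value.

```python
# Sketch: an article and its comments as separate documents sharing one
# partition key (the article id). Field names are illustrative only.
import json

article = {
    "id": "article-42",
    "articleId": "article-42",   # partition key
    "type": "article",
    "title": "Optimizing Cosmos DB costs",
    # Bounded data (a handful of tags) can live inside the same document.
    "tags": ["cosmos-db", "azure", "cost"],
}

# Unbounded data (comments keep growing) goes into separate documents,
# one per comment, each carrying the SAME partition key value.
comments = [
    {"id": f"comment-{n}", "articleId": "article-42", "type": "comment",
     "text": f"Comment number {n}"}
    for n in range(1, 4)
]

docs = [article, *comments]

# All documents share one logical partition, so a single-partition query
# on articleId fetches the article and every comment together.
assert all(d["articleId"] == "article-42" for d in docs)

# Keeping each document small (ideally 1-10 KB) keeps the RU charge low.
sizes = [len(json.dumps(d).encode()) for d in docs]
print(max(sizes), "bytes is the largest document")
```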
So we've thought about those things. We've picked a good partition key with high cardinality, and most of our interactions query based on that partition key, minimizing the fan-out. Fantastic. But then suppose there's actually another attribute we have to search on fairly often, which would result in this fan-out behavior. If I'm doing that a lot, maybe it doesn't make sense. So we have another option: something called a global secondary index. And I can actually have n of these; it's not just one secondary index, I could have many.
What this basically does is give each global secondary index its own partition key. So that's partition key 2 for the first secondary index, partition key 3 for the second, and so on. And each one keeps a duplicate set of the data. The logical partitions will be different, because it's a different partition key. For this I'm going to use autoscale, and I can obviously set the RU, so I'll pay for what it's using. But it is another set of partitions, so this duplicate set is going to cost me money; I'm paying for it. What it's actually doing is creating another container and using the change feed to keep it in sync. That also means there's a certain RU cost on writes, because it's doing work to keep this in sync.
Now, I could do this already without the global secondary index: I could go and set up another container myself and use the change feed. But people weren't doing it; they found it too complex. This removes all of that setup and maintenance of creating the container and configuring the change-feed replication, so it's a better option. I can set that RU value to a smaller amount, because maybe this index is used less frequently, but it's still used enough that it makes sense to have it. There's a little bit of trickiness here, though. If you decide to do this, realize what you're weighing: this duplication cost versus the fan-out RU cost.
So the criterion for this is the frequency of the query on key 2. You would look, and I can see this in the diagnostics, at the cost of the queries on this non-partition-key attribute. If it's really expensive and I'm doing it a lot, then okay: what would the cost be to just have a duplicate set of data with that attribute as the partition key? It may actually be cheaper to do it. And maybe there are others you'd add as well; it might be beneficial to have another index with partition key 3. This is where you get into data modeling and understanding how you interact with the data and what you can do to optimize it, because the duplicate may actually be cheaper than the cost of the fan-out queries. You may get a slight latency improvement as well when searching on partition key 2, because now it maps to a known physical partition.
I'm not even going to bother writing that on the board, because by the time latency becomes a major factor, you're probably in so many physical partitions that the fan-out RU cost versus the duplication cost would be driving the decision anyway; the cost optimization will cover it. Funnily enough, I actually gave this as a bit of feedback to the product group: ideally there must be some mathematical threshold where it just makes sense to go and create a global secondary index for partition key 2. So hopefully, in the future, Azure Advisor may start prompting on that. We'll see.
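The break-even reasoning above can be sketched as simple arithmetic. All RU figures here are made-up inputs; in practice you'd read them from your own diagnostics before deciding.

```python
# Sketch: back-of-the-envelope comparison between cross-partition fan-out
# queries and a global secondary index. All RU numbers are illustrative.

def monthly_ru(queries: int, ru_per_fanout: float,
               writes: int, ru_per_gsi_write: float,
               ru_per_indexed_query: float) -> tuple[float, float]:
    """Return (RU/month without a GSI, RU/month with a GSI)."""
    without_gsi = queries * ru_per_fanout
    # With a GSI you pay extra RUs on every write (change-feed sync),
    # but each query becomes a cheap single-partition read.
    with_gsi = writes * ru_per_gsi_write + queries * ru_per_indexed_query
    return without_gsi, with_gsi

without, with_gsi = monthly_ru(
    queries=1_000_000, ru_per_fanout=50.0,    # expensive fan-out query
    writes=2_000_000, ru_per_gsi_write=6.0,   # sync overhead per write
    ru_per_indexed_query=3.0,                 # cheap targeted query
)
print(f"without GSI: {without:,.0f} RU/month, with GSI: {with_gsi:,.0f} RU/month")
```

With these hypothetical numbers the fan-out path costs 50 million RU a month while the duplicated index costs 15 million, so the GSI wins; with rare queries or very write-heavy traffic the arithmetic flips.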
Okay, so that was a general look at read-heavy scenarios, which covers most of the world. We look at the cost, we look at the RU use, we look at how we split the documents, we make sure we've got good partition keys, and we split data that relates to certain entities based on whether it's bounded or unbounded, and on its size. It's all about optimizing our RU use. There were two other things I wanted to really quickly touch on without going into too much detail. I know there's a lot to take in, but the next one I would think about is a storage-heavy scenario.
So remember, it was 20 gigabytes per partition key value, because that's the maximum size of a logical partition. If I need more than that for a partition key value, then we have to use something called a hierarchical partition key. And it's not super complicated: as a child of the primary partition key, I add a second level, another attribute. If I really had to, I could add a third as well; that's as many as I can do. If you needed more than three, there's something wrong; I'd probably use a GUID at that point. If it was really that bad, the key probably just becomes a GUID, which is guaranteed to be very high cardinality, and you're good to go. What then happens is that the partition key becomes the hash of these levels concatenated together, and the data is therefore sharded over logical partitions and, in turn, over physical partitions.
All of these levels together, this hierarchical partition key, now has to stay under 20 GB. When I concatenate them all together into what becomes the actual hash that shards the data, the resulting logical partition still has to be less than 20 GB; the three levels combined have to be less than 20 GB. Now, when I read, I don't actually have to pass all three parts; a prefix still helps filter down. If I only gave it the first part, the primary partition key, it still minimizes the number of logical and physical partitions involved. If I can give two parts, it filters even further. So I don't always have to give all of them, but it's a way to get past that 20 GB limit.
The other scenario is, I think, a lot rarer for the most part, but I could think about write-heavy, and this is becoming more common. Imagine I've got Internet of Things devices, say a car or an elevator, and they just have a device ID. It also comes back to the storage-heavy case: there's so much data that it probably exceeds the 20 GB limit anyway, but maybe there's also a certain performance element to it.
And so here I need the same idea: a second level. Very often what people actually do is add a second-level GUID; I just need some way to increase that cardinality. Now, it's also possible, if I don't actually intend to search or query by the device ID, that my other option is for the partition key to just become a GUID. I don't even have the device ID as a partition key; I don't care. It's just some kind of hot landing zone for all of this data coming in from the IoT devices. I've seen this in a number of different scenarios around vehicles and other things. I don't actually intend to query Cosmos DB by device ID, so in that super write-heavy scenario, maybe the partition key is just a GUID, which gives me massive cardinality. I don't care; the data is going to get extracted and processed by something else afterwards.
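That landing-zone pattern can be sketched in a few lines using the standard `uuid` module. The wrapper function and field names are hypothetical; the point is simply that a fresh GUID per document guarantees enormous cardinality, so writes spread evenly across partitions.

```python
# Sketch: a write-heavy ingestion path where the partition key is just a
# fresh GUID per document, because we never query by device id in Cosmos DB.
import uuid

def ingest_event(payload: dict) -> dict:
    """Wrap an incoming IoT payload in a document whose id and partition
    key are a new GUID, spreading writes evenly across partitions."""
    pk = str(uuid.uuid4())
    return {"id": pk, "partitionKey": pk, **payload}

events = [ingest_event({"deviceId": f"car-{n}", "speedKmh": 80 + n})
          for n in range(1000)]

# Every document gets its own distinct partition key value.
distinct_keys = {e["partitionKey"] for e in events}
print(len(distinct_keys), "distinct partition keys for", len(events), "events")
```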
So I hope that was useful. That's all I wanted to cover here. I know we covered a lot, but I think the key thing is that for nearly all of us, we need to understand how we plan to interact with the data. Autoscale is nearly always going to be the right choice unless, again, you have that super sporadic, bursty workload. Make sure when designing your data model that you pick a partition key with very high cardinality, lots of values, so you can get a good distribution. And if you also very often read using different attributes, then look at the global secondary index option; that could be super useful. So with all that said, good luck, and I will see you in the next video.