Cosmos DB Optimization
Hey everyone, in this video I want to talk about optimizing your Cosmos DB architecture, specifically around optimizing your cost. Now, sometimes when we look at Cosmos DB there is this initial reaction of "this is expensive," but really there are a lot of different dimensions and options you have to understand and configure correctly: the particular type of service, the amount of RUs (which we're going to talk about), your data model, and how you're actually going to interact with the data. So I want to walk through some of the key options so we can make the right decisions.
Now, firstly, there are a number of different SKUs available, so let's walk through those. When I create a Cosmos DB resource, what I'm actually creating is an account. So when I think about my Cosmos DB, the actual resource is an account. This is the ARM resource. Now, the account contains n number of databases, and each database contains n number of containers, into which we put the various documents that make up our data set. We're going to come back to some specifics around this later on, because there are a number of different options we can set at the account level and the container level, and sometimes the database level, and they're really important when I think about optimization and the associated costs.
Now, one of the terms I'm going to use a lot is request units, or RUs. You think about a certain number of request units per second, and you can really think of an RU as the compute unit of Cosmos DB. Different types of interaction cost a certain number of RUs, and there's a really nice Microsoft site we'll look at quickly that demonstrates this. A specific point read of an item costs about one RU, but an insert, an upsert, a delete, a query, well, they're going to cost a certain number of RUs, not just a single RU, and that number is going to depend on the type of interaction, how the data has been distributed, the number of regions, and a lot more. So there are some different things we have to consider when I start to think about the number of RUs we're going to use.
Now, a really important point to understand is that when we set a number of request units for a container, this is the number of request units we can consume per second. We don't multiply that number by 60 and then by 60 again to get an hourly cost; the price is for having that per-second capacity available for the hour. And it really scales up to: if I continue consuming that number of RUs every second for the month, that is my monthly cost. Otherwise, yes, it would seem really, really expensive. So if I actually go and look at cost for a second, what you'll see in the pricing is it's telling me, hey, 100 RUs per second for a single region is $5.84 per month. So every single second of every minute of every hour that month, I can use that number of RUs. If my workload was completely flat and I only ever needed 100 RUs per second, and that's all I configured, that's what it would cost. Now, the thing is, how we assign that number of RUs is going to vary greatly, and that is going to be the key focus for this video. I'm also going to pay for the amount of storage capacity I consume.
Now, if I think about the SKUs that are available to us, well, the first one is we do have a free tier. With the free tier, what I get is a fixed number of RUs: 1,000 RUs per second, and 25 gigabytes of storage. So every single second I can consume 1,000 RUs and I don't pay anything. This is great when I think about development, maybe some testing, maybe some prototyping; it's really good for those scenarios. I can have one of these per subscription. So in every subscription I have, I can have one free-tier Cosmos DB account and consume those 1,000 RUs per second. It's a nice way to go and play and not spend any money.

When we start to move into real production scenarios, then obviously we get into the paid tiers, and where we're going to focus here is the idea of provisioned throughput: I'm going to set a certain number of request units. So I'm going to think about provisioned, and within provisioned there are two different types.

Now, the first one we're going to think about is manual. Sometimes you'll hear it called "provisioned," which is inaccurate, because the other type is also a provisioned amount. This is manual, and this is where I set a fixed number of RUs. What I'm configuring is X amount of RUs per second. I just set that number, and I'm going to pay for it whether I use that amount or not. So over some period of time, this is what I'm paying for: I'm setting that number of RUs, and it's a fixed amount. Even if my actual workload varied in how many RUs I'm using each second, across different partitions, across regions, I'm still paying for that fixed amount of RUs. Because of that, I need a really, really consistent amount of use to make this a good fit. So when I think about using manual provisioned, what I want is a super predictable and constant amount of work for it to make sense. I'm also going to pay for the storage capacity I'm consuming, the amount of data. And if I have additional regions, I would pay for those too, so there's obviously going to be a certain multiplication by however many regions I want this available in. And I have a configurable consistency.
And this is actually a really big deal. Consistency is one of Cosmos DB's superpowers: I can have, if I wanted to, multiple writable instances across many different regions, and it will work out how to make them consistent. Maybe consistency over a session is enough, or maybe I need it absolutely consistent across all of them, which means I'm going to pay a latency penalty when I actually write. But I have the ability to have many, many copies in many, many regions with a configurable consistency. And obviously I'm going to pay for the RUs assigned in each of those regions. Now, what's interesting here is they don't have to be the same: for each region I add, I can assign whatever RUs I want for that particular region. The storage capacity is going to be roughly the same for all of them, and there's going to be some network egress as well for replicating the data. But obviously I'm going to pay where I basically have copies of the data and additional sets of compute to serve and respond to the requests hitting that particular region.
But really, the key point here is that with manual provisioned, I am setting a flat amount of RUs that I pay for regardless of my usage. For that to make sense, the usage really needs to be pretty much up at that line. So I want a super predictable, constant, uniform amount of compute usage for manual provisioned to actually make sense. Now, there are things that can help make it make sense: Azure reservations. If I know I need a certain amount for a really long period of time, we're thinking one or three years, then with the discount I get, I can accept some variance in that usage; it could be a lower amount of use and still make financial sense. If we look at the documentation, I can go and look at reservations, and yes, there's the idea of a bucket of 100 RU/s that I can just keep applying, and I get a 20% or 30% discount. But if I'm willing to commit to much larger amounts, you can get some crazy discounts when you start getting into really high numbers. And obviously, at that point, if I'm committing to super huge amounts, then hey, maybe I'm willing to accept some more idle capacity because I'm still saving a lot of money. But really, what we're thinking about here is a very predictable, very constant amount of work.
Now, the next thing we think about in terms of the different options available. Yes, we had manual, but I'm going to draw this one as, for the most part, the superstar, the one that 99% of the time you're probably going to end up using: it's still provisioned, but this time it is autoscale. The big deal here is that this is the default going forwards, because 99% of the time it's going to be the best option for you, but it is dynamic in nature in terms of the RUs you actually get billed for. What happens is, yes, I still set X amount of RUs, but the actual amount you get billed for depends on the amount you're consuming. Whatever you set is the 100%, the maximum it's allowed to use, but it can go all the way down to 10% of that number of RUs, and my bill, those dollars, is based on what it's actually consuming. Because of this autoscale nature, there's a premium on the number of RUs it's essentially consuming for the various interactions, and the premium is essentially 1.5x. So I'm paying a 50% premium to get this.
And we can see this in the pricing. If I jump back over to the pricing page and this time change it to autoscale provisioned throughput and look at the pricing, it tells us it's 1.5x. So we pay a premium, because of that idea that we only pay for what we use. Now, on top of that, I can still have any number of regions, and the nice thing here is that the amount it bills me is based on independent utilization per partition, per region. It is per-partition (which is going to make more sense later on), per-region utilization. That is, if I'm using 90% in one region, I don't get billed 90% for all of the regions. If one region's at 90%, one region's at 20%, and one's at 40%, each region independently bills me for the utilization in that particular region, at the partition level. So it's a really good option. And as you can see, I'm paying more per RU, but because it only charges you for what you actually use, all the way down to 10% (that's the lowest it will ever charge you), in nearly every single scenario you will end up paying less. And that's why autoscale is the default.
The break-even between these is: if my average utilization across every partition in every region is above 66%, then sure, manual would make more sense. But if you look at the reality of nearly every single deployment, that is never the case. So there is this idea, and I'll even write it down: break-even is 66%. Because of the 1.5x price premium, you can work it out from that. If every single partition in every single region is not above 66% utilization, then autoscale is going to make way more sense.
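That break-even math can be sketched like this (assuming the 1.5x autoscale premium and the 10% billing floor described above, and ignoring regional price differences by using a normalized price of 1.0 per RU/s-month):

```python
# Compare manual vs. autoscale billing for the same max RU/s.
# Normalized price: 1.0 per RU/s-month for manual, 1.5x premium for autoscale.
AUTOSCALE_PREMIUM = 1.5
AUTOSCALE_FLOOR = 0.10  # autoscale never bills below 10% of the max

def manual_cost(max_rus: float) -> float:
    return max_rus  # you pay the full provisioned amount, used or not

def autoscale_cost(max_rus: float, avg_utilization: float) -> float:
    billed_fraction = max(avg_utilization, AUTOSCALE_FLOOR)
    return max_rus * billed_fraction * AUTOSCALE_PREMIUM

# Break-even: 1.0 == 1.5 * u  ->  u = 2/3, i.e. the ~66% mentioned above.
for util in (0.10, 0.40, 0.66, 0.80):
    cheaper = "autoscale" if autoscale_cost(1000, util) < manual_cost(1000) else "manual"
    print(f"avg utilization {util:.0%}: {cheaper} is cheaper")
```

Only at the 80% line does manual win, which matches the point that most real deployments sit well below the break-even.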
What's also nice about both of these options is that there is a maximum value: whatever I set is the maximum. So if I want a limit, these are great, because it can't go above that number. That's as much as it can consume, and therefore that's as much as I can possibly get billed for; it cuts off at that number. If I'm a vendor, then depending on how I'm architecting my application, the idea of a maximum on the container could be really important, because as I plan for billing my customers on different tiers of service, I want to make sure it's not some infinitely scaling thing; there is very much a limit to what I can be billed for. So again, when we think about provisioned throughput and you're looking at your options, autoscale really is the shining star. In nearly all scenarios, it's going to be the best option.
Now, the way I configure this is at the account level. When I go and create a new account, it's at the account level that I set whether it's provisioned or serverless. So my decision for the account is: I pick provisioned, or (I'll use a different color, since I'm about to introduce the idea) serverless. Then, at the container level, if I have configured the account as provisioned, I set manual or autoscale, and I set the number of RUs that I want. So only if I'm doing provisioned do I then, at the container level, go and set manual or autoscale. And we can see this if we jump over to the portal for a second: I'm looking at my Cosmos DB account, and I see I'm configured as provisioned throughput. That idea of using provisioned throughput is something I set at the account level.
Now, for the sake of being thorough: I said you set manual or autoscale at the container level. Technically, I can also set it at the database level. If I wanted to (and this is very optional), I could set the database to be manual or autoscale and give it a certain number of RUs. What would then happen is the child containers under that database would share whatever that number of RUs was. But there's no guarantee of distribution between the child containers in times of contention; I could get a noisy neighbor. And because there's no guarantee of even distribution, it's generally not something that's typically used, except maybe in a dev/test scenario. So again, if we jump over and take a look, I did both. If I go to Data Explorer, one of the things we'll see is that yes, on my database, I did opt to configure it as autoscale. I set a number of 1,000, which means, remember, it could be billed between 100 and 1,000, and it's showing me that at the bottom: it could go as low as 10%, which is 100 RUs, or up to 100%, which is the number I'm specifying. So technically, at this level, any container that is a child of this database will just share this number. For dev/test, maybe sharing that is more cost-effective. Normally, though, we just use the database as a logical grouping. I don't set an RU value on it, and so for the most part you may consider database-level throughput fairly pointless.
Now, let's say I did set a value at the database level. I don't have to set one at the container level, but I can. And if I do set a value at the container level as well, it overrides any number set at the database. It doesn't have to be the same or less; that container is simply no longer part of the database allocation, it has its own. So if I set the database to 1,000 and had multiple containers under it, but then set one of those containers to 1,000 or 2,000 or 5,000, that container is not consuming any of the database's RUs anymore. The remaining containers that don't have their own allocation would then share whatever the database amount was. So again, in my example here, we already see I'm setting the database to 1,000, but on my volcano container I didn't set any value, so it is just inheriting from the database. When I created my fault lines container, I did set a value and gave it its own 1,000. So volcano is just consuming from the database parent, while fault lines has its own, which doesn't come out of the database parent value at all. You get choices in how you want to do this. And as I said, in this scenario, if I set throughput at the database, I technically don't have to set it on a container, it will just share; but if I do set it on a container, that overrides it and the container has its own. This is confusing; it's the worst bit I'm going to talk about, and it's an edge case you're typically not going to hit. The happy path is we don't set values at the database; we just set them on the containers. That is the happy path in all of this.
One thing I can do, remember we talked about the free tier, which is 1,000 RUs: I actually have the ability to apply that free tier to my provisioned account, and what that means is the first 1,000 RUs will be free and it would only bill over that. Now, if you're using the free tier (or any of these) and you want to ensure at the account level that it never bills you, because the account is only meant to stay free, or you just want a cap on the RUs being assigned underneath it, there's an account throughput limit feature, and with that it will never let you assign more RUs than you set at the account level. So if I go and look back at my account, you'll see under settings there's this idea of account throughput. If I select that, you'll notice, firstly, if it was your free-tier account, it's actually going to set this by default: limit the account to the amount included in the free discount. So it will never let you go above 1,000 RUs, and it will never be able to bill you. But I changed it to let me go up to 2,000, which means it could bill me a certain amount of money. However, I'm using autoscale, so its lower level is actually 10%, really 200 RUs, which is still within that nice free 1,000. Unless I have some peak of work, I still won't get billed. But this enables you to set that limit, and I could set no limit if I wanted to. It's a way of guaranteeing my costing. So if I'm using the free tier and I want to make sure I never, ever get billed, then yes, you can use that account throughput limit and set it to 1,000. And I can use that account throughput limit elsewhere as well if I want to, because, hey, maybe I don't want to go above 2,000 or 5,000, whatever it is. I have control of that.
I want to go into a little bit more detail about something that's going on behind the scenes, because it will make more sense later on when we talk about some of the other optimizations we can do. We had this idea of RUs and a certain amount of storage, and then this autoscale. Behind the scenes, all of this compute capability and all of the storage is enabled by physical partitions, and each physical partition supports a certain amount of compute. So for each container, when I'm doing manual or autoscale, it actually maps to n number of physical partitions, and each of those partitions gives us 10,000 request units per second. So every partition can do 10,000 RUs; that's per physical partition. (Again, this is all about physical partitions; there's a logical partition idea as well, which we'll get to.) So: 10,000 RUs and 50 GB of storage per physical partition, and there are actually four replicas of that storage as well. What that means is, if I set 50,000 RUs as the manual amount or the autoscale max, it goes and creates five physical partitions, and those five physical partitions can then support that 50,000 RU-per-second number you set. Even with autoscale, it's not removing those partitions, which is partly why there's a price premium element to this. I have five physical units powering this.
And if you consider how I create a Cosmos DB account and then a container, the only piece of information known to Cosmos DB is the number of RUs; it doesn't know the capacity, because you've not written anything yet. So the number of RUs drives the number of physical partitions, and the number of partitions can scale near infinitely, maybe even infinitely, I don't know. Now, remember, they are also storing data: 50 gigabytes each. The data is spread out over those partitions based on logical partitions. We'll talk about something called a partition key later on; partition key values get hashed to distribute documents over logical partitions, and each physical partition stores any number of logical partitions. If a particular physical partition gets full, it gets split, and the logical partitions are distributed pretty evenly over the two physical partitions that came from the one that had to split. As you increase RUs to a higher number, again, physical partitions get added and there's a certain amount of redistribution. So these are the building blocks, added as RUs or storage necessitate. And as I already said, for autoscale, if I'm setting this to 50,000, I've got five physical partitions; even if I'm running at 10% most of the time, using this tiny sliver of it, they're still there. Again, that's why there's a price premium. Now, every region I add a replica to has that same number of partitions, but the utilization can be different. And again, that's why, with autoscale, if I had a second region that was a lot less busy, you get billed for the utilization per region, per partition, and it's why autoscale is nearly always more cost-efficient. Even if in one region I'm fairly consistent, if my replicas in other regions are less used, autoscale is going to be way, way more efficient.
And we're going to come back to some of this physical structure, because as we talk about other sorts of optimizations, it actually becomes pretty important.
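That partition math can be sketched roughly like this (a simplification, assuming the 10,000 RU/s and 50 GB per physical partition figures mentioned above; the real placement logic is more involved):

```python
import math

# Rough model of how provisioned RUs and storage map to physical partitions.
# Assumption: 10,000 RU/s and 50 GB of storage per physical partition.
RUS_PER_PARTITION = 10_000
GB_PER_PARTITION = 50

def physical_partitions(max_rus: int, storage_gb: float = 0) -> int:
    by_compute = math.ceil(max_rus / RUS_PER_PARTITION)
    by_storage = math.ceil(storage_gb / GB_PER_PARTITION)
    return max(1, by_compute, by_storage)

print(physical_partitions(50_000))      # 5, the example from the video
print(physical_partitions(5_000))       # 1
print(physical_partitions(5_000, 120))  # 3: here storage forces the split
```

Note that with autoscale the count comes from the max RUs you set, which is why the partitions (and the premium) are there even when you run at 10%.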
Now, the next thing we have, and this is fairly new, so it's still getting built out, is the idea of serverless (let me make sure I draw this right). This is set at the account level: when I create the account, I have to say whether it's serverless or provisioned. So an account is provisioned or serverless. Also at the account level I can set things like regions, backup, and other stuff. With serverless, I set no throughput value on the container; I configure nothing. All I do is create the container in a serverless account. Because of that, it's billed a little bit differently. The way it gets billed is you get charged a price per million RUs consumed. So it's working very differently: there is no per-second allocation, and it's not rounded up to the nearest million either; you get billed for the actual number of RUs you use. Every operation costs a certain number of RUs, they just get added up, and hey, you're consuming these units of one million. So if we look at the pricing page again (I don't think I've ever shown a pricing page as often as this), and we change it to serverless, that's now what we see: just a price per million RUs you consume. It's 25 cents for the region I've got selected; obviously there's a certain amount of regional difference, etc. But it's a very, very different billing model than the other ones.
But if you look at this, for it to make sense, I would want a very bursty, I would say almost sporadic, ad hoc pattern of usage. That's when serverless makes sense. For any regular pattern, autoscale will make more sense. Actually, it's a good exercise: if you had a consistent workload and compared the monthly bill for provisioned RU-per-second throughput against serverless, where you just pay for the number of RUs you're consuming, well, I did a bit of back-of-the-napkin math, and serverless was about seven and a half times more expensive. So it does not make sense for any kind of regular workload. Where it shines is where I have those really bursty, really sporadic patterns that autoscale still wouldn't adequately cover in terms of optimizing my cost.
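Here's a minimal sketch of that back-of-the-napkin comparison, assuming the $5.84 per 100 RU/s per month provisioned rate, the 1.5x autoscale premium, and the $0.25 per million RUs serverless rate mentioned earlier (all region-dependent):

```python
# Compare a perfectly steady workload on serverless vs. autoscale billing.
# Assumed rates (region-dependent): $5.84 per 100 RU/s-month provisioned,
# a 1.5x autoscale premium, and $0.25 per million RUs on serverless.
SECONDS_PER_MONTH = 730 * 3600  # Azure bills on a ~730-hour month

def autoscale_monthly_at_full_util(rus_per_second: int) -> float:
    return rus_per_second / 100 * 5.84 * 1.5

def serverless_monthly(rus_per_second: int) -> float:
    total_rus = rus_per_second * SECONDS_PER_MONTH
    return total_rus / 1_000_000 * 0.25

steady = 1000  # consuming 1,000 RU/s every second of the month
ratio = serverless_monthly(steady) / autoscale_monthly_at_full_util(steady)
print(f"serverless is ~{ratio:.1f}x the autoscale cost")  # ~7.5x
```

Under these assumed rates the ratio lands right around the seven-and-a-half-times figure: fine for a workload that's idle most of the month, terrible for a steady one.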
Now, it's serverless, and normally we would say it will scale to infinity and beyond (I just wanted to get a Buzz Lightyear in), but there are limits. Today, the actual number of RUs available is based on the amount of storage. If you think about it, knowing how physical partitions work: as you add storage, you're adding physical partitions. But that is going to change in the future; the promise is it will scale more proactively based on request volume. Today, for example, I would see about 20,000 RUs for a terabyte of storage. Also, something interesting: today, this is not a guarantee of throughput. It's in Cosmos DB's interest to make it available and for you to be able to consume it, but it is not guaranteed. Also, today (and only today, a lot of this is changing), it is one region at time of recording. When multiple regions are supported, you would pay for the use at each region's replica, whatever the number of RUs consumed at that region. So every operation consumes a certain number of RUs, they just get added up, and you get billed for what you consume, divided into those units of a million. One region, and there are scale limits today that are being worked on. But it's for that bursty, sporadic usage; if your workload is not bursty and sporadic, today you're just going to go and use autoscale.

So I guess the net-net of this is: for nearly any workload, autoscale is going to be your best bet. If you have super constant use across all regions, across all partitions, above that 66% threshold, sure, maybe manual is a better option for you, especially with some of the Azure reservation stuff. If it's super, super sporadic, then sure, maybe serverless is an option for you. It's very new, and I think that use is going to grow over time.
So now we understand the SKUs and the options. What can we do to maybe optimize how we use our RUs? How do we think about designing? How can we be efficient with that? One of the biggest things is that when we start planning, there are some things we have to understand. One: I need to know the seasonality of the workload. We talked about that, right? That's basically going to make you decide, for the most part, between autoscale, maybe serverless, and again maybe manual, but honestly, probably not; there are almost no scenarios where manual is the better option. I also need to be able to estimate the number of RUs to pick the right number, unless I'm using serverless. Now, autoscale makes this really nice: it gives me a much better ability to set a number, with a lot of wiggle room for error. If I set it too low, I'll see throttling, and then I can go and adjust it higher. We talked a lot about this, but the next part is to think about, well, how do I optimize the RU use itself? And when we think about optimizing the RU use, I mentioned the words "partition key," and that's huge. So which partition key do I use? And then also, this is not a relational database; there are no joins, there are no foreign keys. So what are the types of query I'm doing? What do I have within my document in terms of attributes? How big are my documents? How do I think about separating what's bounded versus unbounded? So I then think about the data model, and this is how I can drive efficiency in my actual RU use for all the different things I'm doing.
When I think about interacting with my Cosmos DB data, I want to be able to query using a partition key in nearly all scenarios; that's going to optimize the number of RUs when I go and interact with my data. Most databases, relational or otherwise, are very read-heavy. So if I want to be able to query and find data, when I do that query, I want to query based on the partition key I have picked, and I pick the partition key at the container level. So it's super important we understand how we actually intend to interact with our data, so we can optimize the RU use. Let's dive into this. We talked about the idea that we create any number of containers; say I'm going to create a particular container. When I create the container, one of the most important steps is to pick the partition key. I cannot change it once I create the container, so it's super important I get it right. So as part of this, I have a partition key. What we write into our containers are documents. So I have my document, and it's in JSON format; it's a JSON payload. The document must contain the partition key somewhere within it; if it doesn't, it will get rejected. Whatever attribute I'm setting as the partition key, the document has to have it, because what's going to happen is the value of that partition key gets hashed, and the hash is used to distribute documents over logical partitions. Depending on that hash value, the document gets put into a certain logical partition, and that logical partition lives within a certain physical partition.
So, great: we have any number of logical partitions spread across these physical partitions. Maybe I've only got one physical partition; remember, up to 10,000 RUs, it's one. But a physical partition contains n number of logical partitions, and the hashed value of the partition key picks the logical partition. So we end up with lots of logical partitions, distributed over the physical partitions. A logical partition is only ever in one physical partition, and a logical partition has a 20 gigabyte limit. If we need more than that, we're going to have to do some other things; there's something called a hierarchical partition key, and we're going to come back to that.
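Conceptually, the routing works something like this sketch. Note the assumptions: Cosmos DB uses its own internal hashing over the partition key, so the md5 hash, the partition count, and the userId field here are just stand-ins to illustrate the idea:

```python
import hashlib

# Illustration only: route a document to a partition by hashing its
# partition key value. Cosmos DB's real hashing scheme is internal;
# md5 and a fixed partition count are stand-ins.
PHYSICAL_PARTITIONS = 5

def physical_partition_for(partition_key_value: str) -> int:
    digest = hashlib.md5(partition_key_value.encode("utf-8")).hexdigest()
    return int(digest, 16) % PHYSICAL_PARTITIONS

# Hypothetical document; "userId" plays the role of the partition key.
doc = {"id": "123", "userId": "john", "displayName": "John"}

# The document must carry the partition key attribute, or the write is rejected.
assert "userId" in doc
print("userId 'john' routes to partition", physical_partition_for(doc["userId"]))
```

The useful property this models: the same key value always hashes to the same partition, so a query by partition key only ever needs to visit one place.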
Now, most workloads, as we mentioned, are read-heavy. So our goal is: if I am read-heavy and I'm running a query to go and find some set of data, I want the value that query uses to find the data to be the partition key. Because if the query uses the partition key, Cosmos DB knows which logical partition, which means it knows which physical partition it has to go to. The whole goal of this is to minimize fan-out. Cross-partition queries are way more expensive: if I'm querying on some attribute that is not the partition key, it has to go and look across many, many different physical partitions, because the data is spread across many logical partitions, and that's going to cost me more RUs. Imagine I had user profiles for people that want to log in. My partition key is almost certainly going to be user ID, because often we're going to look things up for the user. The user ID would make a really good partition key, because then, hey, when I'm looking stuff up for John, I know I only need to read from a single logical partition, and therefore a single physical partition. I'm optimizing my RU use.
Now, one caveat to that, I would say, is when I think about the partition key, one of the things we want to ensure when picking it is something called high cardinality. I want many, many distinct values for that partition key. Whatever I pick, I want a lot of values, because then I should get a fairly even split across the possible values, and when I hash them I'll get a good distribution over whatever physical partitions I have. With the ability to have lots of logical partitions, I'll get a good, even distribution over my physical partitions. If I pick a poor partition key with very low cardinality, I get a very uneven distribution over my physical partitions. If I picked something that only had three possible values, then I can only distribute over three, and if one of them held most of the data, my distribution would be terrible. So when I think about high cardinality, we're really thinking ideally in the thousands of values for that partition key. So that's really important: make sure you're thinking correctly about how you're going to interact with the data, and then, for that interaction, I need high cardinality to ensure a really good distribution, while my interactions are still based on that value, so I'm avoiding that fan-out, that cross-partition searching.
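To make the cardinality point concrete, here's a small sketch comparing the spread across five hypothetical physical partitions for a low-cardinality key (imagine a three-value region field) versus a high-cardinality one (user IDs); again, the hashing is a stand-in for Cosmos DB's internal scheme:

```python
import hashlib
from collections import Counter

PHYSICAL_PARTITIONS = 5  # illustrative count, not a service limit

def partition_of(value: str) -> int:
    # Stand-in hash; Cosmos DB uses its own internal partition hashing.
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % PHYSICAL_PARTITIONS

def distribution(values) -> Counter:
    return Counter(partition_of(v) for v in values)

low_cardinality = ["east", "west", "north"] * 1000     # 3 distinct values
high_cardinality = [f"user-{i}" for i in range(3000)]  # 3000 distinct values

print("low cardinality :", sorted(distribution(low_cardinality).items()))
print("high cardinality:", sorted(distribution(high_cardinality).items()))
# A 3-value key can never touch more than 3 of the 5 partitions, so some
# partitions sit idle; 3000 distinct values spread fairly evenly over all 5.
```

That idle-partition effect is exactly the uneven distribution described above: you pay for physical partitions your hot key values never use.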
Then I actually think about this idea of
the document size.
So my max size
is 2 megabytes
which is huge.
Um this is a a JSON document. I really
probably shouldn't I want a 2 megabyte
document. Remember there's a certain
amount of cost to these things to read
and then write and do those types of
things. So the general guidance for this
is when I'm thinking about the document
size
where we can keep it between one and 10
kilobytes. That's going to help optimize
the argu when we're reading and doing
various writes to it. Now there's also
always this idea of well what data do I
put in the same document? Which data do
I split between multiple documents?
Remember, multiple documents can have the same partition key. This is NoSQL, not a relational database: every document I write can have different data in it. It's JSON, as long as all the documents for a particular entity share the same partition key. Imagine a blogging site. I could have the blog article as one document, with the article ID as the partition key. Every comment someone makes about that article, I probably would not write into the blog article document; I would have a separate document per comment, with the same partition key, so they sit in the same logical partition. When I want to fetch the blog article to show it, it fetches the main blog document and then all of the comments, because they share the partition key and live in the same logical partition. But sometimes I wouldn't want to get all of the comments. So what I typically think about is whether the related data is a bounded set. If I have other bits of information and it's bounded, say a couple of addresses, so it's two or three, a very small, finite number, then sure, I'll put them in the same document. But if it is not bounded, then certainly put them in different documents; split the docs. So I would have the blog article, then comment one as its own document using the same article ID as its partition key, then comment two, and so on. Think about that when you're designing this, because the nice thing here is it's not relational: every document can have completely different content that makes sense for you.
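The blog example above can be sketched as plain JSON shapes. The field names (`articleId`, `type`) are hypothetical; any shape works as long as the documents for one entity share the partition key value.

```python
# Sketch: an article and its comments as separate documents sharing one
# partition key (the article id). Field names are illustrative only.
import json

article = {
    "id": "article-42",
    "articleId": "article-42",   # partition key
    "type": "article",
    "title": "Optimizing Cosmos DB costs",
    # Bounded data (a handful of tags) can live inside the same document.
    "tags": ["cosmos-db", "azure", "cost"],
}

# Unbounded data (comments keep growing) goes into separate documents,
# one per comment, each carrying the SAME partition key value.
comments = [
    {"id": f"comment-{n}", "articleId": "article-42", "type": "comment",
     "text": f"Comment number {n}"}
    for n in range(1, 4)
]

docs = [article, *comments]

# All documents share one logical partition, so a single-partition query
# on articleId fetches the article and every comment together.
assert all(d["articleId"] == "article-42" for d in docs)

# Keeping each document small (ideally 1-10 KB) keeps the RU charge low.
sizes = [len(json.dumps(d).encode()) for d in docs]
print(max(sizes), "bytes is the largest document")
```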
So we've thought about those things. We've picked a good partition key with high cardinality, and most of our interactions query based on that partition key, minimizing the fan-out. Fantastic. But then suppose there's actually another attribute we have to search on fairly often, which would result in this fan-out behavior. If I'm doing that a lot, maybe it doesn't make sense. So we have another option: something called a global secondary index. And I can actually have n of these; it's not just one secondary index, I could have many.
What this basically does is give each global secondary index its own partition key. So that's partition key 2 for the first secondary index, partition key 3 for the second, and so on. And each one keeps a duplicate set of the data. The logical partitions will be different, because it's a different partition key. For this I'm going to use autoscale, and I can obviously set the RU, so I'll pay for what it's using. But it is another set of partitions, so this duplicate set is going to cost me money; I'm paying for it. What it's actually doing is creating another container and using the change feed to keep it in sync. That also means there's a certain RU cost on writes, because it's doing work to keep this in sync.
Now, I could do this already without the global secondary index: I could go and set up another container myself and use the change feed. But people weren't doing it; they found it too complex. This removes all of that setup and maintenance of creating the container and configuring the change-feed replication, so it's a better option. I can set that RU value to a smaller amount, because maybe this index is used less frequently, but it's still used enough that it makes sense to have it. There's a little bit of trickiness here, though. If you decide to do this, realize what you're weighing: this duplication cost versus the fan-out RU cost.
So the criterion for this is the frequency of the query on key 2. You would look, and I can see this in the diagnostics, at the cost of the queries on this non-partition-key attribute. If it's really expensive and I'm doing it a lot, then okay: what would the cost be to just have a duplicate set of data with that attribute as the partition key? It may actually be cheaper to do it. And maybe there are others you'd add as well; it might be beneficial to have another index with partition key 3. This is where you get into data modeling and understanding how you interact with the data and what you can do to optimize it, because the duplicate may actually be cheaper than the cost of the fan-out queries. You may get a slight latency improvement as well when searching on partition key 2, because now it maps to a known physical partition.
I'm not even going to bother writing that on the board, because by the time latency becomes a major factor, you're probably in so many physical partitions that the fan-out RU cost versus the duplication cost would be driving the decision anyway; the cost optimization will cover it. Funnily enough, I actually gave this as a bit of feedback to the product group: ideally there must be some mathematical threshold where it just makes sense to go and create a global secondary index for partition key 2. So hopefully, in the future, Azure Advisor may start prompting on that. We'll see.
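The break-even reasoning above can be sketched as simple arithmetic. All RU figures here are made-up inputs; in practice you'd read them from your own diagnostics before deciding.

```python
# Sketch: back-of-the-envelope comparison between cross-partition fan-out
# queries and a global secondary index. All RU numbers are illustrative.

def monthly_ru(queries: int, ru_per_fanout: float,
               writes: int, ru_per_gsi_write: float,
               ru_per_indexed_query: float) -> tuple[float, float]:
    """Return (RU/month without a GSI, RU/month with a GSI)."""
    without_gsi = queries * ru_per_fanout
    # With a GSI you pay extra RUs on every write (change-feed sync),
    # but each query becomes a cheap single-partition read.
    with_gsi = writes * ru_per_gsi_write + queries * ru_per_indexed_query
    return without_gsi, with_gsi

without, with_gsi = monthly_ru(
    queries=1_000_000, ru_per_fanout=50.0,    # expensive fan-out query
    writes=2_000_000, ru_per_gsi_write=6.0,   # sync overhead per write
    ru_per_indexed_query=3.0,                 # cheap targeted query
)
print(f"without GSI: {without:,.0f} RU/month, with GSI: {with_gsi:,.0f} RU/month")
```

With these hypothetical numbers the fan-out path costs 50 million RU a month while the duplicated index costs 15 million, so the GSI wins; with rare queries or very write-heavy traffic the arithmetic flips.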
Okay, so that was a general look at read-heavy scenarios, which covers most of the world. We look at the cost, we look at the RU use, we look at how we split the documents, we make sure we've got good partition keys, and we split data that relates to certain entities based on whether it's bounded or unbounded, and on its size. It's all about optimizing our RU use. There were two other things I wanted to really quickly touch on without going into too much detail. I know there's a lot to take in, but the next one I would think about is a storage-heavy scenario.
So remember, it was 20 gigabytes per partition key value, because that's the maximum size of a logical partition. If I need more than that for a partition key value, then we have to use something called a hierarchical partition key. And it's not super complicated: as a child of the primary partition key, I add a second level, another attribute. If I really had to, I could add a third as well; that's as many as I can do. If you needed more than three, there's something wrong; I'd probably use a GUID at that point. If it was really that bad, the key probably just becomes a GUID, which is guaranteed to be very high cardinality, and you're good to go. What then happens is that the partition key becomes the hash of these levels concatenated together, and the data is therefore sharded over logical partitions and, in turn, over physical partitions.
All of these levels together, this hierarchical partition key, now has to stay under 20 GB. When I concatenate them all together into what becomes the actual hash that shards the data, the resulting logical partition still has to be less than 20 GB; the three levels combined have to be less than 20 GB. Now, when I read, I don't actually have to pass all three parts; a prefix still helps filter down. If I only gave it the first part, the primary partition key, it still minimizes the number of logical and physical partitions involved. If I can give two parts, it filters even further. So I don't always have to give all of them, but it's a way to get past that 20 GB limit.
The other scenario is, I think, a lot rarer for the most part, but I could think about write-heavy, and this is becoming more common. Imagine I've got Internet of Things devices, say a car or an elevator, and they just have a device ID. It also comes back to the storage-heavy case: there's so much data that it probably exceeds the 20 GB limit anyway, but maybe there's also a certain performance element to it.
And so here I need the same idea: a second level. Very often what people actually do is add a second-level GUID; I just need some way to increase that cardinality. Now, it's also possible, if I don't actually intend to search or query by the device ID, that my other option is for the partition key to just become a GUID. I don't even have the device ID as a partition key; I don't care. It's just some kind of hot landing zone for all of this data coming in from the IoT devices. I've seen this in a number of different scenarios around vehicles and other things. I don't actually intend to query Cosmos DB by device ID, so in that super write-heavy scenario, maybe the partition key is just a GUID, which gives me massive cardinality. I don't care; the data is going to get extracted and processed by something else afterwards.
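That landing-zone pattern can be sketched in a few lines using the standard `uuid` module. The wrapper function and field names are hypothetical; the point is simply that a fresh GUID per document guarantees enormous cardinality, so writes spread evenly across partitions.

```python
# Sketch: a write-heavy ingestion path where the partition key is just a
# fresh GUID per document, because we never query by device id in Cosmos DB.
import uuid

def ingest_event(payload: dict) -> dict:
    """Wrap an incoming IoT payload in a document whose id and partition
    key are a new GUID, spreading writes evenly across partitions."""
    pk = str(uuid.uuid4())
    return {"id": pk, "partitionKey": pk, **payload}

events = [ingest_event({"deviceId": f"car-{n}", "speedKmh": 80 + n})
          for n in range(1000)]

# Every document gets its own distinct partition key value.
distinct_keys = {e["partitionKey"] for e in events}
print(len(distinct_keys), "distinct partition keys for", len(events), "events")
```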
So I hope that was useful. That's all I wanted to cover here. I know we covered a lot, but I think the key thing is that for nearly all of us, we need to understand how we plan to interact with the data. Autoscale is nearly always going to be the right choice unless, again, you have that super sporadic, bursty workload. Make sure when designing your data model that you pick a partition key with very high cardinality, lots of values, so you can get a good distribution. And if you also very often read using different attributes, then look at the global secondary index option; that could be super useful. So with all that said, good luck, and I will see you in the next video.