Hello! I've heard that my face is blocked by my laptop, so I'm going to try to orient myself correctly; I think I'm too short for this setup. My name is Liz, and I'm a software engineer on the switching team at Cisco Meraki. Prior to Meraki I worked at startups ranging from Series A to IPO. What I'm about to confess might rattle the good people of Colorado, but I used to be a mountain person. After living in California for the past few years, I have been fully converted: I'm a beach person now. It's so dry here that I have ChapStick in my pocket right now; I feel like you could snap me in half if you wanted to, I'm dried to a crisp.

I know Colorado has some lovely mountains, so I apologize for the ocean themes in this presentation, but hopefully we can move past that and get to the heart of what I want to talk to you about: time series data.
Time series data is a topic that Cisco Meraki deals with quite often, and it can get really intricate and hairy. I've found that the gotchas can be difficult to uncover at first, so hopefully this presentation helps anyone who's currently finding themselves daunted by the concept. By the end of the talk, you'll hopefully have an understanding of how time series data differs from the typical relational data you might be used to dealing with. I'll walk through how to select a tool for managing time series data, how to organize time series data, how to query time series data, how to aggregate and compress time series data, and finally how to translate your design into API constraints.

Before jumping in, you might be asking yourself: what even is time series data? Time series data is essentially just a collection of observations recorded over consistent intervals of time; it's distinct from other types of data because of this ordering by time. This graph, lifted from our dashboard at Cisco Meraki, shows a usage rate in bits per second for a device over time. We record this data every five minutes by polling the device, we store it in a time series database, and we display it to the user in a graph on our dashboard.
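To make that concrete, here's a minimal sketch of what such a table could look like in plain SQL; the table and column names are hypothetical, not Meraki's actual schema:

```sql
-- Hypothetical table of five-minute usage samples (not Meraki's real schema).
CREATE TABLE device_usage (
  recorded_at  TIMESTAMPTZ NOT NULL,   -- when the observation was taken
  device_id    BIGINT      NOT NULL,   -- which device was polled
  bits_per_sec DOUBLE PRECISION        -- the observed usage rate
);

-- Each poll appends one observation; rows are never updated in place.
INSERT INTO device_usage VALUES ('2024-01-01 00:00:00+00', 42, 1.2e6);
INSERT INTO device_usage VALUES ('2024-01-01 00:05:00+00', 42, 1.4e6);
```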
Time series data has become pretty ubiquitous as we collect more and more data to model our surroundings and predict trends. Devices like smartwatches have introduced amateur athletes to way more data than they've ever had access to. Our friend Liz, inspired by apps like Strava, wants to track her surfing as well. Liz has just taken her first surf lesson, and she's keen on learning how her surfing will improve over time. She's decided to record this data in a time series database and to access it via an API endpoint. But where does she start?
In surfing, it's important to select a surfboard that's suited to a particular swell. If the waves are steep and powerful, it might be worth breaking out the shortboard for better maneuverability; for smaller days, a longboard can be a lot of fun. I've recently been told that it's absurd that my partner and I have nine surfboards between the two of us, but it's important to have a lot of options; conditions really do vary. The same is true of data and databases: it's important to select a tool that's appropriate for the type of data you plan to deal with. Time series data often comes in large quantities, and efficient access is highly important, so a database that can accommodate these concerns is crucial.
When selecting a tool for managing time series data, you have four options that nicely mirror the options surfers face when deciding which board to surf. As a surfer, you can surf a board you already have; this is applicable for folks who already have a dedicated time series database in their tech stack. You can use an old board but add a new set of fins; this is analogous to using a database extension for, say, Postgres, adding onto a tool that you already have. You can buy a new board; this is similar to adopting a new database technology dedicated to time series data. Or you can break out the foam and fiberglass and shape your own board, which is just like designing and implementing your own time series database.
It's quite possible that you already have, as part of your tech stack, a database that works well for your time series use case. In that case, you just have to ensure that you're using your database tool correctly, with the proper techniques. Later in the talk I'll cover techniques for the proper storage and querying of time series data, because after all, if you don't know how to surf, even the nicest board in the world is of no use.
Or maybe your team already uses a more generalized database like Postgres, and the best solution in this case would be to add a Postgres extension. This is similar to buying a new set of fins for the surfboard you already own: fins are easily swapped out without changing the surfing experience too much. In this case, the old board is Postgres, and the new set of fins is an extension you can use with Postgres. There are several options for time series extensions you can use with Postgres; two of the most notable are pg_timeseries and TimescaleDB. These extensions provide a user experience around creating, managing, and querying time series data that doesn't come out of the box with vanilla Postgres. There are many benefits to extensions: using an extension is lower lift and cheaper, and you're already used to surfing the board, so a Postgres extension will reduce the learning curve and speed up development while avoiding the performance hit you'd otherwise take with vanilla Postgres. A further benefit of using a Postgres extension is that it will be far easier to join relational data with time series data as needed.
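To give a feel for what that extension experience looks like, here's a rough sketch that promotes the hypothetical usage table from earlier into a TimescaleDB hypertable; `create_hypertable` and `time_bucket` are real TimescaleDB functions, while the table and column names are mine:

```sql
CREATE EXTENSION IF NOT EXISTS timescaledb;

-- Turn the plain Postgres table into a time-partitioned hypertable.
SELECT create_hypertable('device_usage', 'recorded_at');

-- Range-of-time aggregation then comes essentially out of the box.
SELECT time_bucket('1 hour', recorded_at) AS hour,
       device_id,
       avg(bits_per_sec) AS avg_rate
FROM device_usage
WHERE recorded_at > now() - INTERVAL '1 day'
GROUP BY hour, device_id
ORDER BY hour;
```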
Now that we've talked so much about Postgres extensions, you might be asking yourself: what's wrong with vanilla Postgres? Why not just store my time series data in Postgres without bothering with an extension? The first reason is that these Postgres extensions come with built-in methods specifically designed for use with time series data; you can, for example, query data for a range of time without writing your own method to do so. More importantly, there are specific performance differences between vanilla Postgres and an extension like TimescaleDB. The TimescaleDB documentation references an experiment in which Postgres and TimescaleDB are tasked with ingesting a one-billion-row database. TimescaleDB loads this massive database in one-fifteenth the total time of Postgres and sees throughput of more than 20 times that of Postgres. Because of its heavy use of time-space partitioning, TimescaleDB achieves a higher ingest rate than Postgres when dealing with large quantities of data, quantities that are quite common when dealing with time series data. Querying can be even faster: given a query that specifies time ordering on a 100-million-row table, TimescaleDB achieves a query latency 396 times faster than Postgres. Later in this presentation, when we get into techniques such as aggregation, you'll see even more reasons for favoring a time series extension or database over vanilla Postgres; specifically, you'll see the benefits of aggregation for data retention.
Sometimes your existing tools don't cut it and you need to invest in something entirely new; this is analogous to buying an entirely new surfboard. If you're looking for a dedicated time series database, it may be worth looking into solutions such as ClickHouse. ClickHouse pitches itself as a fast, open-source, analytical database designed around time series data. Managing time series data is all about optimization, and each tool is optimized for a different ideal use case. It's essential to evaluate the conditions before selecting a surfboard, and in the same way, it's important to consider what type of data you'll be working with and what you'll be doing with it. The consensus among some users of both tools seems to be that TimescaleDB has a great time series story and an average data warehousing story, whereas ClickHouse has a great data warehousing story and an average time series story. Look into benchmark analyses, investigate the features of each database and extension, and read up on the documentation for the tool you're considering; the more you understand about the inner workings of your proposed tool, the better you'll understand how it will work with your use case. My best advice is to really get in the weeds. As a surfer, if you don't know the purpose of rocker or tail shape when selecting a board, it's going to be really difficult to make an informed decision. The same goes for understanding and selecting a solution for time series management.
And sometimes no available database seems suited to your highly specific needs. If your use case is really particular and you're feeling especially industrious, you might just want to break out the foam and fiberglass and shape your own board. Meraki found itself in this situation in 2008. Our core product is a cloud-managed platform that allows users to configure and monitor their networking devices, and the monitoring data is more often than not time series data: we track metrics such as the packet count received by a switch, the temperature of a device, or the device's memory usage, and trends over time can give insight into the health of a system. In 2008 there were far fewer time series solutions available, so the team at Meraki developed their own. We call it LittleTable. LittleTable is a relational database optimized for time series data.
LittleTable was in fact developed specifically for spinning disks: in 2008, storage on solid-state drives was far more expensive than storage on spinning disks, so the hardware was a significant consideration. Because of this, data is clustered for contiguous disk access in order to improve performance; later in this presentation, we'll see how this impacts the way one might design a table when using LittleTable. Fun fact: as of 2017, when the white paper was written, Meraki stored 320 terabytes of data across several hundred LittleTable servers system-wide, and I'm sure the quantity is even higher now. Though not actually a SQL database, LittleTable includes a SQL interface for querying, which has improved developer adoption by making the tool easy to use. LittleTable exists for internal use only at Meraki, but the team wrote an excellent white paper that very effectively describes the challenges and design considerations, which can be super useful for anyone trying to gain a better understanding of the intricacies of time series data. I've linked it in the slide; you can also just look up "LittleTable white paper Cisco Meraki." I can't recommend it enough; it's a great read.
All right, we now have our board picked out: we understand the conditions we're surfing in, and we've landed on the board that works best. However, the work is far from over. You can have the perfect board and still struggle to actually surf if you don't have the proper technique. Technique is also incredibly important when dealing with time series data, regardless of which database tool you choose to use. In order to optimize performance, it's crucial that we follow some tried-and-true patterns for organizing and querying data. The time series techniques I'll cover in this talk are: data arranged by time, the composite key, querying by index, and aggregation and compression.
The identifying characteristic of a time series database is that it organizes data by time for efficient access; otherwise, it wouldn't be a time series database. Both ClickHouse and TimescaleDB will automatically generate an index on the timestamp column, which allows for the most performant access when retrieving data for a range of time. LittleTable actually clusters data on the disk by timestamp, never interleaving newer data with older timestamps.
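For contrast, in vanilla Postgres you'd have to set up that timestamp index yourself; a one-line sketch against the hypothetical usage table:

```sql
-- What the specialized tools generate automatically: an index on the
-- timestamp column, so range-of-time scans don't walk the whole table.
CREATE INDEX device_usage_time_idx ON device_usage (recorded_at DESC);
```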
Because of this unique data structure, some databases enforce restrictions. Arranging data by time allows for highly efficient reads, but writing can be quite inefficient, so the designers of LittleTable decided to constrain writes to be append-only. Since we're collecting data over a range of time, it doesn't make much sense to spot-fill historic data anyway; according to the LittleTable white paper, there is no need to update rows, as each row represents a measurement taken at a specific point in time.
When visualizing a time series database, it's important to understand that there are effectively two identifiers for a given piece of data. The first, mentioned on the previous slide, is the timestamp. The second is the identifier, which in almost every case is comprised of multiple fields, making it a composite key. Each time series database refers to this concept using slightly different terminology: the LittleTable documentation calls it a hierarchically delineated key, the ClickHouse documentation calls it a compound primary key, and TimescaleDB calls it a partition key. In order to understand the implications this composite key has on structuring data, I'm going to drill into LittleTable's hierarchically delineated key. In LittleTable, this key determines how the data is actually arranged on disk, in addition to being grouped by time. This hierarchical organization enables efficient queries, since a query on the key will always correspond to a contiguous region of data on the disk. It's crucial, then, to only query based on ordered components of this key. In order to determine the components of this key and how they're ordered, it's super important to understand how the data is going to be accessed: your queries will only be performant if you're accessing a contiguous block of data, so you have to understand what the most common queries are going to be and design around those.
It's probably best to visualize a real-world example; this way we can actually see the data arranged by these two axes, time and composite key, and we can also see what a hierarchically delineated key in LittleTable really is. Here's an example lifted directly from the LittleTable white paper. As you can see, the data is organized along two axes: on the x-axis we have the timestamps, and on the y-axis we have the elements of the composite key. You'll see that along the y-axis, all the records for a single network are grouped together, and within that network, all the records for a single device are grouped together. This composite key can contain as many fields as you want, arranging data many layers deep; in this example, though, we simply have two fields included in the composite key, grouping the data by two layers.
As we saw in the previous example, the most important takeaway for a hierarchically delineated key is that its components are organized with an increasing degree of specificity. The example from Cisco Meraki included two components, network and device; since a network has many devices, this example is purely hierarchical. However, just because the ordering in this key typically corresponds to a real-world hierarchy, it doesn't necessarily have to: you can select whatever ordering you want for the components of this key, and that ordering depends only on how the data is accessed. In our case, Liz's surfing application is designed to be surfer-specific. While we want to store data for multiple surfers, it doesn't make much sense to query data across surfers, since each surfer is interested only in their individual progress. This means that we can first prefix our primary key with the surfer, so that all the data for a single surfer is collocated in the table. Then we can follow the hierarchical pattern with a region and then a break; the region might be Los Angeles and the break might be Malibu First Point. The region and break are very similar in concept to the example from Cisco Meraki, since a region contains many breaks.
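LittleTable itself isn't public, so here's the idea expressed as a sketch in generic SQL; the names are hypothetical, and the point is purely the ordering of the key, from least to most specific, with the timestamp alongside:

```sql
-- Hypothetical base table for Liz's app. The primary key lists the composite
-- key's components in order of increasing specificity, then the timestamp.
CREATE TABLE surf_rides (
  surfer_id   BIGINT      NOT NULL,
  region      TEXT        NOT NULL,  -- e.g. 'los-angeles'
  surf_break  TEXT        NOT NULL,  -- e.g. 'malibu-first-point'
  recorded_at TIMESTAMPTZ NOT NULL,
  distance_m  DOUBLE PRECISION,      -- how far she rode the wave
  duration_s  DOUBLE PRECISION,      -- how long she was on it
  speed_mps   DOUBLE PRECISION,      -- how fast the ride was
  PRIMARY KEY (surfer_id, region, surf_break, recorded_at)
);
```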
Now that we have a key that's optimized for querying, we need to actually write our queries in the most optimal way. This data is highly structured, and the way it's structured depends entirely on how we plan to query it. Hopefully you remember this graphic from a couple of slides ago; once again, I think it's easiest to understand this query by visualizing it. As a refresher, data is arranged by time across the x-axis and by composite key across the y-axis. When querying time series data, it's essential to include the timestamp in your WHERE clause; it really doesn't make sense to query this data without including a timestamp or a range of timestamps. Additionally, you'll want to include part or all of the elements of the composite key. Because of the way data is arranged in LittleTable, you only ever need to include a prefix of the composite key when querying. This means that you could query all the data for a single surfer over a range of time, or all the data for a surfer in a specific region over a range of time, or you can drill down with the entire composite key and request data for a surfer, region, and break.
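Against the hypothetical surf_rides table from before, those three query shapes look like this; each one extends the key prefix by one more component:

```sql
-- All of one surfer's data over a range of time (prefix: surfer).
SELECT * FROM surf_rides
WHERE surfer_id = 42
  AND recorded_at BETWEEN '2024-01-01' AND '2024-02-01';

-- Narrowed to a region (prefix: surfer, region).
SELECT * FROM surf_rides
WHERE surfer_id = 42
  AND region = 'los-angeles'
  AND recorded_at BETWEEN '2024-01-01' AND '2024-02-01';

-- Drilled all the way down (full key: surfer, region, break).
SELECT * FROM surf_rides
WHERE surfer_id = 42
  AND region = 'los-angeles'
  AND surf_break = 'malibu-first-point'
  AND recorded_at BETWEEN '2024-01-01' AND '2024-02-01';
```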
ClickHouse is a little bit different. The data in ClickHouse is not arranged across two dimensions; instead, the timestamp is basically the final component of the composite key. Because of this, it doesn't make much sense to include just the surfer and timestamp in the query, because you'd be skipping the middle section of the primary key, as you can see in the diagram. Consider the example I've shown: we have a contiguous stretch of data for Liz, which is broken into regions, each of which is broken into breaks, each of which contains all of the timestamped records for, say, the last month. It doesn't make much sense to query the data for Liz over the past two weeks, because that data is not contiguous: for each location, you'd have to grab just a section, skipping over the data points that don't fall within the requested time span. The only performant query for ClickHouse would be to include all the components of the composite key; you must specify the surfer, region, break, and a range of time. So it would be performant to query Liz's data from Malibu in LA over the past two weeks.
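In ClickHouse, that layout falls out of the table's ORDER BY clause; here's a sketch with the same hypothetical columns, where the timestamp really is the last component of the sorting key:

```sql
-- ClickHouse sketch: because the sorting key ends with the timestamp, rows
-- are only contiguous once surfer, region, and break are all pinned down.
CREATE TABLE surf_rides (
  surfer_id   UInt64,
  region      String,
  surf_break  String,
  recorded_at DateTime,
  distance_m  Float64,
  duration_s  Float64,
  speed_mps   Float64
) ENGINE = MergeTree
ORDER BY (surfer_id, region, surf_break, recorded_at);
```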
It's important to drill down and understand how the data is arranged in your time series database of choice. By understanding the structure, you can visualize what a contiguous chunk of data looks like, and you can ensure that your query is making use of the way the data is structured. It's worth noting that once you execute that performant query, as long as the quantity of data is reasonable and can be loaded into memory, you can always write a subsequent query that narrows things down to exactly the data you want. It's important, though, to make sure that the first query is performant, or else we're losing the benefit we get from a dedicated time series database.
Cool. At this point we know how to store our data and how to query it, and now we can start to look at the maintenance side of things. Here's the thing: Liz surfs a lot. She plans to surf for years to come, and although we would love to keep all of Liz's surfing data in perpetuity, we simply don't have unlimited storage. When dealing with time series data, you have to balance two major concerns: you don't have unlimited storage space to keep raw data forever, but you also want to provide the user with as much data as possible. In order to solve for the first concern, the fact that we don't have infinite storage, we need to take a look at data retention. Every time series database that I've seen includes some sort of policy for data retention; often this comes in the form of a TTL, otherwise known as a time-to-live. The time-to-live dictates how old the data in the table is allowed to be: after a certain point, data of a certain age is simply dropped from the table.
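Both of the tools we've discussed expose retention directly; two small examples against our hypothetical tables (`MODIFY TTL` is real ClickHouse syntax, and `add_retention_policy` is a real TimescaleDB function):

```sql
-- ClickHouse: rows older than six months are dropped automatically.
ALTER TABLE surf_rides
  MODIFY TTL recorded_at + INTERVAL 6 MONTH;

-- TimescaleDB: a background job drops chunks past the retention window.
SELECT add_retention_policy('device_usage', INTERVAL '6 months');
```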
Now we also need to address the desire to show as much data as possible. In order to do so, we need to extend the TTL without sacrificing storage. There are a few ways of going about this, notably compression and aggregation. Compression is the method of choice for the Postgres extension TimescaleDB.
When you add data to your database, it's in the form of uncompressed rows; Timescale uses a built-in job scheduler to convert this data into the form of compressed columns. You can see in the first example on the slide what six data points might look like compressed into a single data point; this preserves all the original data while restructuring it into a format that requires less memory to store. I noted earlier in the presentation that the makeup of the composite key should be determined entirely by how the data is going to be queried. The same is true of compression: the way the data is segmented depends entirely on how it's going to be queried. If you're most often accessing data by, say, device ID, you can configure the scheduler to compress data by device ID; the second example on this slide shows data segmented by device ID.
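In TimescaleDB, that segmenting choice is literally a table setting; here's a sketch against the hypothetical usage table (`timescaledb.compress`, `compress_segmentby`, and `add_compression_policy` are real TimescaleDB features):

```sql
-- Segment compressed data by device: rows sharing a device_id are packed
-- into columnar segments, so per-device queries stay fast after compression.
ALTER TABLE device_usage SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'device_id'
);

-- A background job compresses chunks once they're a week old.
SELECT add_compression_policy('device_usage', INTERVAL '7 days');
```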
ClickHouse also uses compression to improve database performance; it may seem obvious, but less data on disk means less I/O and faster queries and inserts. When speaking about compression in a time series context, it's important to take a couple of steps back and talk about one of the major differences between most time series databases and a more generalized database like Postgres. This difference lies in the structure of the database, which should hopefully come as no surprise, since we've already spoken at length about the importance of database structure when it comes to handling time series data.

Postgres is what we call a row-based database. A row-based database organizes data by record, keeping all of the data associated with a record next to each other in memory. Row-based databases are well suited to transactional workloads, where entire records need to be retrieved, updated, or inserted quickly and efficiently. With a row-based database, writing can be quite effective, but reading from large quantities of data has its shortcomings, especially when querying by a field like timestamp. And because data is grouped by record, compressing data by attribute is also quite inefficient; consider the example from the last slide, where the CPU usage and disk I/O for many devices were grouped together into a single record. ClickHouse, like many time series databases, is actually a column-based database. In a column-based database, each data block stores the values of a single column for multiple rows. This is ideal because compression algorithms exploit contiguous patterns in data, and if the data is sorted by columns in a particular order, this can lead to incredibly efficient compression. Column-based databases are often the preferred choice for analytics and data warehousing applications: their benefits include faster data aggregation, higher compression speeds, and less use of disk space. The drawback is that data modification is slower, but as we've discussed previously, modifying time series data is often not an intended use case.
In addition to compression, there's a second approach to ensuring that we can preserve data for an extended period of time without increasing our storage costs. This approach is called aggregation, and it's the methodology of choice for LittleTable. When we're speaking about aggregation, there are two concepts: the base table and the aggregate table. The base table is where we insert the raw metrics we're recording; the aggregate tables store a summary or average of that raw data. So first we need to decide what raw metrics we want to store in our base table. If you recall, we already decided on a primary key that contains a surfer, region, and break; each record will represent a single wave the surfer caught at that break at a specific time. Then we need to decide what aggregated metrics we want to record in the aggregate tables. At the most basic level, what Liz might want to know is: how good was that wave, how long was she on it, and how far did she ride? The aggregate table will have to contain a summary of the data in the base table, so it might be helpful to know the total distance summed across all waves, the total duration summed across all waves, the maximum speed for a single wave, and the number of waves caught over that time period.
Now that we've decided what data to store in these aggregate tables, we'll have to decide what intervals of data make sense; these will determine which aggregate tables we want to create. This will also help us decide on our TTLs, constraining how much data we're actually storing. Since Liz tends to surf at most once a day, it makes sense to aggregate data up to the day; that way, we can preserve data for each surfing session for a TTL of six months. From there, we can also aggregate data up to the week and the month, so that it's easier for Liz to track seasonal and annual trends. This leaves us with a base table with a TTL of one month, a one-day aggregate table with a TTL of six months, a one-week aggregate table with a TTL of one year, and a one-month aggregate table with a TTL of five years.
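Here's a sketch of how the one-day rollup might be populated from the base table; in a real system this would run on a schedule (or be something like a TimescaleDB continuous aggregate), and the table names are hypothetical:

```sql
-- Roll yesterday's raw waves up into one row per surfer/region/break/day.
INSERT INTO surf_rides_daily
  (surfer_id, region, surf_break, day,
   total_distance_m, total_duration_s, max_speed_mps, wave_count)
SELECT surfer_id, region, surf_break,
       date_trunc('day', recorded_at) AS day,
       sum(distance_m), sum(duration_s), max(speed_mps), count(*)
FROM surf_rides
WHERE recorded_at >= date_trunc('day', now() - INTERVAL '1 day')
  AND recorded_at <  date_trunc('day', now())
GROUP BY surfer_id, region, surf_break, day;
```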
The hardest part is over. Now we have our data stored, aggregated, and easily accessible, and we want to design an API endpoint that Liz can use to easily query her surf data. The decisions we've made when it comes to querying and aggregation will determine exactly how this API endpoint will be used. The next step is defining an API contract which can be clearly documented for the end user, validated, and enforced. A crucial element to document for the end user is the set of allowable query params; the components and ordering of the composite key determine which query params are required and which are optional. As always, a time span is necessary for a user querying a time series database. And assuming that we're using LittleTable as our underlying storage option, we only need a prefix of the primary key, so the surfer is the only required component of the key; beyond that, you can optionally specify a region and a break. It's important, though, to document and enforce that a user must also provide a region if they want to provide a break. Recall from earlier that you can't skip fields in the middle of the primary key: you must provide a full prefix of the primary key, which in this case is surfer, region, and break, in that order.
Now that we have the user's request, we need to determine which aggregate table we'll be querying from. This requires an understanding of some terminology: at Meraki we discuss time series data in terms of a time span and an interval, so I'll quickly explain what we mean by each of those terms in this context. The time span describes the full period of time over which we want data; since our longest TTL in the database is five years, we can't query data for a time span that extends further than five years into the past. The interval corresponds to the grain at which the data is aggregated; the only options here are one day, one week, and one month. As noted before, each aggregation interval will be stored in its own aggregate table: we'll have a one-day table, a one-week table, and a one-month table. In designing this API endpoint, we'll assume that the user wants the most data possible for the time span requested. This means that the TTL will determine which aggregate table we'll query from: we'll want to query the aggregate table with the smallest interval whose TTL is still greater than the time span requested. So, for example, if the user requests a time span less than or equal to six months, we can return daily surf data; for a time span between six months and one year, we'll return weekly data; and for any time span between one year and five years, we'll return monthly surf data. And we'll validate that the user is not allowed to query the API endpoint with a time span greater than five years, since all data is dropped after that point.
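That selection rule can be written down almost verbatim; here's a sketch of the lookup, with hypothetical aggregate table names:

```sql
-- Map the requested time span to the finest-grained aggregate table whose
-- TTL still covers it; anything past five years is rejected outright.
SELECT CASE
  WHEN requested_span <= INTERVAL '6 months' THEN 'surf_rides_daily'
  WHEN requested_span <= INTERVAL '1 year'   THEN 'surf_rides_weekly'
  WHEN requested_span <= INTERVAL '5 years'  THEN 'surf_rides_monthly'
  ELSE NULL  -- reject: nothing is retained past five years
END AS table_to_query
FROM (SELECT INTERVAL '9 months' AS requested_span) AS request;
```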
Now I'd like to quickly show what a visualization might look like for time series data, now that we have the concepts of a time span and an interval in mind. On this slide is another screen grab from Cisco Meraki's dashboard application. Here you can see that at the top of the page there's a drop-down option for showing data for the past two hours, day, week, or month; these are the selectable time spans. We currently have the one-month time span selected, and in the usage graph below, you can see that the data is broken out day by day: in this case, for a one-month time span, we're showing data over a one-day interval. This pattern is especially useful when you want to detect trends, since trends can be found when looking at data down to the hour or over the span of an entire month.
Earlier in this talk I explained the shortcomings of Postgres when it comes to time series data; one of these shortcomings is the lack of specialized time series tooling in vanilla Postgres. Because tools like ClickHouse and TimescaleDB are so specialized for time series data, you might even be able to skip some of the steps I've listed in this section by leveraging some of the tools and integrations they offer. ClickHouse, for instance, officially integrates with quite a few visualization tools, like Grafana and Tableau; this makes quick data visualization really easy to set up. And just this year, TimescaleDB announced a project called Timescale Analytics. The initiative is not complete and they're still receiving developer input, if you're interested in commenting; what they're hoping to do is create a one-stop shop for time series analytics in Postgres. In the Timescale Analytics announcement, TimescaleDB listed a few sketching algorithms that they hope to build into this extension: t-digest, HyperLogLog, and Count-Min. T-digest is a data structure that estimates a percentile point without having to store and order all the data points in a set; HyperLogLog is a probabilistic data structure that estimates the cardinality of a set; and Count-Min is a probabilistic data structure that serves as a frequency table of events in a stream of data. Thanks to TimescaleDB's aggregation, these sketches have very low query latency. There are so many features I haven't listed, and these products are receiving a lot of support, so I'm sure the list will grow. It'll be really cool to witness the evolution of these time series tools.
Sweet! Liz now has easily accessible data on her surfing performance, broken down by break, over time. Fast-forward several years: Liz has been surfing for quite some time now, and she's learned some important lessons. For example, she took a look at the monthly data for her favorite break, and she realized that she catches far fewer waves there in the winter than she does in the summer, and the waves she does catch are way smaller and peter out quickly. It turns out that this surf spot only catches south swells, and south swells are way more common in the summertime. Liz had no idea that swell direction was seasonal, and it had never occurred to her to check. Now she knows where to surf in each season, and she's been able to update her surf routine accordingly. She's been catching way more waves, and she's been having a lot more fun. Looks like Liz is on her way to getting pitted.
Thanks so much for listening. Again, I work for a remarkable company called Cisco Meraki, with some of the brightest Rails developers I've ever met. I still can't believe I get to work at the intersection of web development and computer networking; it's a fascinating space to be in, with really compelling problems to solve. If that sounds interesting to you, feel free to reach out to me afterwards, and of course, I'm always down to talk time series data. Have a great rest of your conference!