Hi, my name is Liz. I'm a software engineering manager on the switching team at Cisco Meraki. Prior to Meraki I worked in startups ranging from Series A to IPO. I lived in Ohio for 24 years, so I'm no stranger to the five-hour road trip to see a show in Chicago. I truly think Chicago is one of the most perfect cities in the world. I have a friend who calls the Great Lakes the North Coast of the US, and I love that description; I think that's so fun. Believe it or not, people (not me) do surf in Lake Michigan during storms, often in the winter. It's very gnarly, hence why I've never done it. So hopefully the surf theme of this presentation is fitting enough for the nautical town of Chicago.

But this talk is primarily about time series data. Time series is a topic Cisco Meraki deals with quite often, and it can get really intricate and hairy. I found that the gist can be quite difficult to uncover at first, so hopefully this presentation helps anyone who's currently finding themselves daunted by the concept.
By the end of this talk you'll hopefully have an understanding of how time series data might differ from the typical sort of relational data you might be used to dealing with. I'll walk through how to select a tool for managing time series data, how to organize time series data, how to query time series data, how to aggregate and compress time series data, and finally how to translate your design to API constraints.

Before jumping in, you might be asking yourself: what even is time series data? Time series data is essentially just a collection of observations recorded over consistent intervals of time. Time series data is distinct from other types of data because of this ordering by time. This graph, lifted from our dashboard at Cisco Meraki, shows a usage rate in bits per second for a device over time. We record this data every 5 minutes by polling the device, store it in a time series database, and display it to the user in a graph on our dashboard.
Time series data has become pretty ubiquitous as we collect more and more data to model our surroundings and predict trends. Devices like smartwatches have introduced amateur athletes to way more data than they've ever had access to. Our friend Liz, inspired by apps like Strava, wants to track her surfing as well. Liz has just taken her first surf lesson and she's keen on learning how her surfing will improve over time. She's decided to record this data in a time series database and to access it via an API endpoint. But where does she start?
In surfing, it's important to select a surfboard that's suited for a particular swell. If the waves are steep and powerful, it might be worth breaking out the shortboard for better maneuverability; for smaller days, a longboard can be a lot of fun. I've recently been told that it's absurd that my partner and I have nine surfboards between the two of us, but it's important to have a lot of options; conditions really do vary. The same is true of data and databases: it's important to select a tool that's appropriate for the type of data you plan to deal with. Time series data often comes in large quantities and efficient access is highly important, so a database that can accommodate these concerns is crucial.

When selecting a tool for managing time series data, you have four options that nicely mirror the options a surfer faces when deciding what board to surf. As a surfer, you can surf a board you already have; this is applicable for folks who already have a dedicated time series database in their tech stack. As a surfer, you can use an old board but add a new set of fins; this is analogous to using a database extension, for say Postgres, as an add-on to a tool you already have. As a surfer, you can buy a new board; this is similar to adopting a new database technology dedicated to time series data. Or you can break out the foam and fiberglass and shape your own board; this is just like designing and implementing your own time series database.

It's quite possible that you already have, as part of your tech stack, a database that works well for your time series use case. In that case, you just have to ensure that you're using your database tool correctly, with the proper techniques. Later in the talk I'll cover techniques for the proper storage and querying of time series data, because after all, if you don't know how to surf, even the nicest board in the world is of no use.
Or maybe your team already uses a more generalized database like Postgres, and the best solution in this case might be to add a Postgres extension. This is similar to buying a new set of fins for the surfboard you already own: fins are easily swapped out without changing the surfing experience too much. In this case the old board is Postgres and the new set of fins is an extension you can use with Postgres. There are several options for time series extensions you can use with Postgres; two of the most notable options are pg_timeseries and TimescaleDB. These extensions provide a user experience around creating, managing, and querying time series data which doesn't come out of the box with vanilla Postgres.
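To make that concrete, here's a minimal sketch of what getting started with TimescaleDB looks like. The table and column names are hypothetical, but `CREATE EXTENSION` and `create_hypertable` are the real entry points.

```sql
-- Enable the extension, then promote an ordinary table to a hypertable
-- that TimescaleDB transparently partitions into chunks by time.
-- (Table and column names are made up for illustration.)
CREATE EXTENSION IF NOT EXISTS timescaledb;

CREATE TABLE device_usage (
    time         TIMESTAMPTZ NOT NULL,
    device_id    INTEGER     NOT NULL,
    bits_per_sec BIGINT      NOT NULL
);

SELECT create_hypertable('device_usage', 'time');
```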
There are many benefits to extensions. Using an extension is lower lift and cheaper; plus, you're already used to surfing the board. Using a Postgres extension will reduce the learning curve and augment the speed of development while avoiding the performance hit you'd otherwise take with vanilla Postgres. A further benefit to a Postgres extension is that it will be far easier to join relational data with time series data as needed.

Now that we've talked so much about Postgres extensions, you might be asking yourself: what's wrong with vanilla Postgres? Why not just store my time series data in Postgres without bothering with an extension? The first reason is that these Postgres extensions come with built-in methods designed specifically for use with time series data. Without writing your own method to do so, you can, for example, query a range of data over time.
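As an illustration, here's the flavor of range query these extensions make easy. `time_bucket` is a real TimescaleDB function; the table and column names continue the hypothetical example from earlier.

```sql
-- Hourly average usage over the last day, using TimescaleDB's time_bucket.
SELECT time_bucket('1 hour', time) AS bucket,
       avg(bits_per_sec)           AS avg_bps
FROM device_usage
WHERE time >= now() - INTERVAL '1 day'
GROUP BY bucket
ORDER BY bucket;
```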
00:05:30.240
are significant performance differences
00:05:32.039
between vanilla postest and an extension
00:05:33.960
like time scale DB the time scale DB
00:05:36.440
documentation references an experiment
00:05:38.240
in which postgress and time scale DB
00:05:40.080
were tasked with ingesting a 1 billion
00:05:42.240
row database time scale DB loads this
00:05:45.199
massive database in 115th the total time
00:05:47.720
of postgress and sees throughput of more
00:05:49.639
than 20 times that of postgress because
00:05:52.080
of its heavy utilization of time space
00:05:53.840
partitioning time scale DB achieves a
00:05:56.080
higher ingest rate than postgress when
00:05:57.919
dealing with large quantities of data
00:05:59.639
quantities that are quite common when
00:06:01.360
dealing with time series data querying
00:06:04.240
can be even faster given a query that
00:06:06.199
specifies time ordering with 100 million
00:06:08.240
row table time scale DB achieves a query
00:06:10.800
latency that is 396 times faster than
00:06:14.319
postgress later in this presentation
00:06:16.440
when we get into the techniques such as
00:06:17.960
aggregation you'll see even more reasons
00:06:20.080
for favoring a Time series extension or
00:06:22.080
database over vanilla postgress
00:06:24.400
specifically you'll see the benefits of
00:06:25.919
aggregation for data
00:06:28.319
retention and sometimes your existing
00:06:30.560
tools don't cut it and you need to
00:06:32.120
invest in something entirely new this is
00:06:34.280
analogous to buying an entirely new
00:06:35.759
surfboard if you're looking for a
00:06:37.560
dedicated time series database it may be
00:06:39.360
worth looking into Solutions such as
00:06:40.919
click house click house pitches itself
00:06:43.280
as a fast open- Source analytical
00:06:45.080
database designed around time series
00:06:46.919
data managing time series data is all
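For a taste of what that looks like, here's a hedged sketch of ClickHouse DDL. The table is hypothetical, but the MergeTree engine and its ORDER BY sorting key are the real mechanics.

```sql
-- A hypothetical ClickHouse table: the MergeTree engine sorts rows on disk
-- by the ORDER BY key, which is what makes range scans cheap.
CREATE TABLE device_temperature
(
    device_id UInt32,
    ts        DateTime,
    temp_c    Float32
)
ENGINE = MergeTree
ORDER BY (device_id, ts);
```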
Managing time series data is all about optimization, and each tool is optimized for a different ideal use case. It's essential to evaluate your conditions before selecting a surfboard; in the same way, it's important to consider what type of data you'll be working with and what you're going to be doing with it. Consensus among some users of both tools seems to be that TimescaleDB has a great time series story and an average data warehousing story, whereas people say that ClickHouse has a great data warehousing story and an average time series story. Look into benchmark analyses for yourself, investigate the features of each database and extension, and read up on documentation for the tool you're considering. The more you understand about the inner workings of your proposed tool, the better you'll understand how it will work with your use case. My best advice is to really get in the weeds. As a surfer, if you don't know the purpose of rocker or tail shape when selecting a board, it's going to be really difficult to make an informed decision. The same goes for understanding and selecting a solution for time series management.

And sometimes no available database seems suited to your highly specific needs. If your use case is really particular and you're feeling especially industrious, you might just want to break out the foam and fiberglass and shape your own board.
Meraki found itself in this situation in 2008. Our core product is a cloud-managed platform that allows users to configure and monitor their networking devices, and the monitoring data is more often than not time series data. We track metrics such as packet count received by a switch, the temperature of a device, or the device's memory usage; trends over time can give insight into the health of a system. In 2008 there were far fewer time series solutions available, so a team at Meraki developed their own. We call it LittleTable. LittleTable is a relational database optimized for time series data.

LittleTable was in fact developed specifically for spinning disks: in 2008, storage on solid-state drives was far more expensive than storage on spinning disks, so the hardware was a significant consideration. Because of this, data is clustered for continuous disk access in order to improve performance. Later in this presentation we'll see how this impacts the way one might design a table when using LittleTable. Fun fact: as of 2017, when the white paper was written, Meraki stored 320 terabytes of data across several hundred LittleTable servers system-wide; now I'm sure the quantity is even higher. Though not actually a SQL database, LittleTable includes a SQL interface for querying, which has improved developer adoption by making this tool easy to use. LittleTable exists only for internal use at Meraki, but the team wrote an excellent white paper that very effectively describes the challenges and design considerations, which can be super useful for anyone trying to gain a better understanding of the intricacies of time series data. I've linked it in this slide and I can't recommend it enough.
All righty! We now have our board picked out: we understand the conditions we're surfing in, and we've landed on a board that works best in those conditions. However, the work is far from over. You can have the perfect board and still struggle to actually surf if you don't have the proper technique. Technique is also incredibly important when dealing with time series data, regardless of which database tool you've chosen to use. In order to optimize performance, it's crucial that we follow some tried-and-true patterns for organizing and querying data. The time series techniques I'll cover in this talk are: data arranged by time, composite keys, querying by index, and aggregation and compression.
The identifying characteristic of a time series database is that it organizes data by time for efficient access; otherwise it wouldn't be a time series database. Both ClickHouse and TimescaleDB will automatically generate an index on the timestamp column, which allows for the most performant access when retrieving data for a range of time. LittleTable actually clusters data on the disk by timestamp, never interleaving data with older timestamps.

Because of the unique data structure, some databases enforce restrictions. Arranging data by time allows for highly efficient reads, but writing can be quite inefficient. The designers of LittleTable decided to constrain writes to be append-only. Since we're collecting data over time, it doesn't make much sense to spot-fill historic data anyway; according to the LittleTable white paper, there's no need to update rows, as each row represents a measurement taken at a specific point in time.
When visualizing a time series database, it's important to understand that there are effectively two identifiers for a given piece of data. The first, mentioned in the previous slide, is the timestamp. The second piece of information is the identifier, and in almost every case this is composed of multiple fields, making it a composite key. Each time series database refers to this concept using slightly different terminology: LittleTable documentation refers to this as a hierarchically delineated key, whereas ClickHouse documentation refers to this as a compound primary key, and TimescaleDB refers to this as a partition key.

In order to understand the implication this composite key has on structuring data, I'm going to drill into LittleTable's hierarchically delineated key. In LittleTable, this key determines how the data is actually arranged on disk, in addition to being grouped by time. This hierarchical organization enables efficient queries, since a key prefix will always correspond to a contiguous region of data on the disk. It's crucial, then, to only query based on ordered components of this key. In order to determine the components of this key and how they're ordered, it's super important to understand how this data is going to be accessed. Your queries will only be performant if you're accessing a contiguous block of data, so you have to understand what the most common queries are going to be and design around those.
It's probably best to visualize a real-world example; that way we can actually see the data arranged by these two axes, time and composite key. In this example we can also visualize what a hierarchically delineated key in LittleTable really is. Here's an example lifted directly from the LittleTable white paper. As you can see, the data is organized along two axes: on the x-axis we have the timestamps, and on the y-axis we have the elements of the composite key. You'll see that along the y-axis the records for a single network are grouped together, and within that network all the records for a single device are grouped together. This composite key can contain as many fields as you want, thus arranging data many layers deep. In this example, though, we simply have two fields included in the composite key, grouping the data by two layers.
As we saw in the previous example, the most important takeaway for a hierarchically delineated key is that its components are organized with an increasing degree of specificity. The example from Cisco Meraki included two components, network and device; since a network has many devices, this example is purely hierarchical. However, just because the ordering in this key typically corresponds to a real-world hierarchy, it doesn't necessarily have to. You can select whatever ordering you want for the components of this key, and that ordering depends only on how the data is accessed.

In our case, Liz's surfing application is designed to be surfer-specific. While we want to store data for multiple surfers, it doesn't make much sense to query data across many surfers, since each surfer is interested only in their individual progress. This means that we can prefix our primary key with the surfer, so that all the data for a single surfer is collocated in the table. Then we can follow the hierarchical pattern with region and then break. The region might be, say, Los Angeles, and the break might be Malibu First Point. The region and break are very similar in concept to the example from Cisco Meraki, since a region contains many breaks.
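LittleTable itself isn't public, so as a stand-in, here's roughly how that key ordering might be expressed with the TimescaleDB setup from earlier. The schema is hypothetical, but a composite index is the standard mechanism for keeping one surfer's rows adjacent.

```sql
-- Hypothetical schema for Liz's app. The composite index orders rows by
-- surfer, then region, then break, then time, mirroring the key hierarchy.
CREATE TABLE waves (
    surfer     TEXT        NOT NULL,
    region     TEXT        NOT NULL,
    break_name TEXT        NOT NULL,
    time       TIMESTAMPTZ NOT NULL,
    distance_m DOUBLE PRECISION,
    duration_s DOUBLE PRECISION
);

SELECT create_hypertable('waves', 'time');

CREATE INDEX waves_key_idx
    ON waves (surfer, region, break_name, time DESC);
```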
Now that we have a key that's optimized for querying, we need to actually write our queries in the most optimal way. This data is highly structured, and the way it's structured depends entirely on how we plan to query it. Hopefully you remember this graphic from a couple slides ago; this time I've modified it to reflect the composite key we've decided on for Liz's surfing application. As a refresher, data is arranged by time across the x-axis and composite key across the y-axis. Now, for Liz's surfing application, the composite key includes surfer, region, and break, in that order. In this diagram you'll only see surfer and region, and that's because of the composite nature of the key: for the first two hours of last Monday, you're welcome to query for just the surfer Liz in the region LA and you will still find a contiguous stretch of data. Now imagine that you want to drill further and take a closer look at this green slice here. You'll see that the stretch of data is further broken down into break: Malibu exists in its own subdivided, contiguous stretch of data. Hopefully this helps you visualize what your query will actually turn into.

You'll certainly want to include a timestamp in your WHERE clause; after all, this is time series data. Additionally, you'll want to include part or all of the elements of that composite key. Since the data is structured hierarchically, you only ever need to include a prefix of the composite key when querying. We saw in the first example how effective a query can be for Liz in LA for the first two hours of last Monday: since we've left off the break and only queried for the surfer and region, we've only queried with a prefix of the composite key. Drilling down even further, we can also query for Liz in LA in Malibu over the first two hours of last Monday. Or, zooming out, a query for just Liz's surfing data in that time span would also be performant.
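Here's what those three prefix queries might look like against the hypothetical schema above; the literal values are made up.

```sql
-- Prefix of length one: just the surfer, plus the time range.
SELECT * FROM waves
WHERE surfer = 'liz'
  AND time >= '2024-06-03 06:00' AND time < '2024-06-03 08:00';

-- Prefix of length two: surfer and region.
SELECT * FROM waves
WHERE surfer = 'liz' AND region = 'los-angeles'
  AND time >= '2024-06-03 06:00' AND time < '2024-06-03 08:00';

-- Full key: surfer, region, and break.
SELECT * FROM waves
WHERE surfer = 'liz' AND region = 'los-angeles'
  AND break_name = 'malibu'
  AND time >= '2024-06-03 06:00' AND time < '2024-06-03 08:00';
```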
ClickHouse is a little bit different. The data in ClickHouse is not arranged across two dimensions; instead, the timestamp is basically the final component of the composite key. Because of this, it doesn't make sense to include just the surfer and timestamp in the query, because you're skipping the middle section of the primary key. Consider the example I've shown: we have a contiguous stretch of data for Liz, which is broken into region, which is then broken into break, which contains all of the timestamped records for, say, the last month. It doesn't make much sense to query all the data for Liz over the past two weeks, because the data here is not contiguous; for each location, you'll have to grab just a section, skipping over the data points that don't fall within the requested time span. The only performant query for ClickHouse would be to include all the components of the primary key: you must specify the surfer, region, break, and a range of time. So it would be performant to query Liz's data for Malibu in LA over the past two weeks.
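As a sketch, here's how that might look in ClickHouse. The table is hypothetical, but note that the timestamp sits last in the MergeTree sorting key, and the query pins every earlier component.

```sql
-- Hypothetical ClickHouse version of the surf table: the sorting key ends
-- with the timestamp, after surfer, region, and break.
CREATE TABLE surf_waves
(
    surfer     String,
    region     String,
    break_name String,
    ts         DateTime,
    distance_m Float64,
    duration_s Float64
)
ENGINE = MergeTree
ORDER BY (surfer, region, break_name, ts);

-- A performant query specifies every key component, then bounds time.
SELECT ts, distance_m, duration_s
FROM surf_waves
WHERE surfer = 'liz'
  AND region = 'los-angeles'
  AND break_name = 'malibu'
  AND ts >= now() - INTERVAL 2 WEEK;
```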
It's important to drill down and understand how the data is arranged in your time series database of choice. By understanding the structure, you can visualize what a contiguous chunk of data looks like, and you can ensure that your query is making use of the way the data is structured.

Cool. At this point we know how to store our data and how to query it; we can now start to look at the maintenance side of things. Here's the thing: Liz surfs a lot. She plans to surf for years to come, and although we'd love to keep all of Liz's surfing data in perpetuity, we simply don't have unlimited storage space. When dealing with time series data, you have to balance two major concerns: you don't have unlimited storage space to keep raw data forever, but you also want to provide the user with as much data as possible. In order to solve for the first concern, the fact that we don't have unlimited storage, we need to take a look at data retention. Every time series database that I've seen includes some sort of policy for data retention. Often this comes in the form of a TTL, otherwise known as a time-to-live. The time-to-live dictates how old the data in the table is allowed to be; after a certain point, data of a certain age is simply dropped from the table.
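Both of the tools we've discussed expose this directly. As hedged examples against the hypothetical tables from earlier (the six-month window is just illustrative):

```sql
-- TimescaleDB: a background job drops chunks once they're older than the
-- retention window.
SELECT add_retention_policy('waves', INTERVAL '6 months');

-- ClickHouse: a table-level TTL expires rows six months after their
-- timestamp.
ALTER TABLE surf_waves MODIFY TTL ts + INTERVAL 6 MONTH;
```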
Now we also need to address the desire to show as much data as possible. In order to do so, we need to extend the TTL without sacrificing storage. There are a few ways of going about this, notably compression and aggregation.

Compression is the method of choice for the Postgres extension TimescaleDB. When you add data to your database, it's in the form of uncompressed rows; TimescaleDB uses a built-in job scheduler to convert this data into the form of compressed columns. Consider the uncompressed surf data on the left: each data point contains a single wave surfed by either of our two surfers, Liz and Brandon. The top-right table shows this data compressed into a single data point. This preserves all of the original data while restructuring it into a format that requires less memory to store; according to TimescaleDB's documentation, this can compress the size of data by as much as 90%. Additionally, you'll see that the data within this compression is ordered by timestamp. Ordering by timestamp is highly recommended in the documentation, but it's worth noting that you do have to specifically configure your compression to order by timestamp. You may also be wondering how this data gets queried: upon querying compressed data, TimescaleDB selectively uncompresses each column in order to provide the data requested. And when compressing data, TimescaleDB stores it in column order rather than row order; we'll see more about column-based and row-based storage in the next slide.

However, we can certainly be more intelligent about our compression. If there's one thing I want to harp on, it's that your database design should be driven by how you plan to query. In our case, we only ever plan to query data for a single surfer. As a result, we might want to include segmentation in our compression algorithm as well. In the bottom-right table you can see the same data segmented by surfer ID. This can certainly improve query performance.
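Configuring that in TimescaleDB might look like the following sketch; `compress_segmentby`, `compress_orderby`, and `add_compression_policy` are the real knobs, applied here to our hypothetical table.

```sql
-- Compress by column, segmented by surfer and ordered by time, so one
-- surfer's data stays together inside each compressed chunk.
ALTER TABLE waves SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'surfer',
    timescaledb.compress_orderby   = 'time DESC'
);

-- A background job compresses chunks once they're more than a week old
-- (the interval is illustrative).
SELECT add_compression_policy('waves', INTERVAL '7 days');
```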
ClickHouse also uses compression to improve database performance. It may seem obvious, but less data on the disk means less I/O and faster queries and inserts. When speaking about compression in a time series context, it's important to take a couple of steps backward and talk about one of the major differences between most time series databases and a more generalized database like Postgres. This difference lies in the structure of the database, which should hopefully come as no surprise, since we've already spoken at length about the importance of database structure when it comes to handling time series data.

Postgres is what we call a row-based database. A row-based database organizes data by record, keeping all of the data associated with a record next to each other in memory. Row-based databases are well suited for transactional workloads where entire records need to be retrieved, updated, or inserted quickly and efficiently. With a row-based database, writing can be quite efficient, but reading from large quantities of data has its shortcomings, especially when querying by a field like timestamp. And because data is grouped by record, compressing data by attribute is also quite inefficient.

ClickHouse, like many time series databases, is actually a column-based database. In a column-based database, each data block stores values of a single column for multiple rows. This is ideal because compression algorithms exploit continuous patterns of data; if this data is sorted by columns in a particular order, this can lead to incredibly efficient compression. Column-based databases are often the preferred choice for analytics and data warehousing applications. The benefits of column-based databases include faster data aggregation, high compression speeds, and less use of disk space. The drawback is that data modification is slower, but as we've discussed previously, modifying time series data is often not an intended use case.
In addition to compression, there's a second approach to ensuring that we can preserve data for an extended period of time without increasing our storage costs. This approach is called aggregation, and it's the methodology of choice for LittleTable. When we're speaking about aggregation, there are really two concepts: the base table and the aggregate table. The base table is where we insert the raw metrics we're recording; the aggregate tables then store a summary or average of that raw data.

So first we need to decide what raw metrics we want to store in our base table. If you recall, we already decided on a primary key that contains surfer, region, and break. In addition, we can record metrics for each wave Liz has caught. At the most basic level, what Liz wants to know is: how good was that wave? The basic stats we can record are distance and duration, so we'll include those in our base table. Then we need to know what aggregated metrics we want to record in the aggregate tables. The aggregate table will have to contain a summary of the data in the base table. It might be helpful to know the total distance summed across all waves, the total duration summed across all waves, the maximum speed for a single wave, and the number of waves caught in that time period.

Now that we've decided what data to store in these aggregate tables, we'll have to decide what intervals of data make sense. These will determine which aggregate tables we want to create, and will also help us decide on our TTLs, constraining how much data we're actually keeping. Since Liz tends to surf at most once a day, it makes sense to aggregate data up to the day; that way we can preserve data for each surfing session for a TTL of, say, six months. From there we can also aggregate up to the week and the month, so that it's easier for Liz to track seasonal and annual trends. This leaves us with a base table with a TTL of, say, one month; a one-day aggregate table with a TTL of six months; a one-week aggregate table with a TTL of one year; and a one-month aggregate table with a TTL of five years.
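LittleTable's aggregation machinery is internal to Meraki, so as an illustration, here's how the one-day rollup might look as a TimescaleDB continuous aggregate over our hypothetical `waves` table; the speed column simply derives from distance and duration.

```sql
-- A one-day aggregate, maintained automatically as new waves arrive.
CREATE MATERIALIZED VIEW waves_daily
WITH (timescaledb.continuous) AS
SELECT surfer,
       region,
       break_name,
       time_bucket('1 day', time) AS day,
       sum(distance_m)            AS total_distance_m,
       sum(duration_s)            AS total_duration_s,
       max(distance_m / NULLIF(duration_s, 0)) AS max_speed_m_per_s,
       count(*)                   AS waves_caught
FROM waves
GROUP BY surfer, region, break_name, day;

-- Daily rollups are kept for six months, per the retention plan above.
SELECT add_retention_policy('waves_daily', INTERVAL '6 months');
```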
Woo! The hardest part's over. Now that we have our data stored, aggregated, and easily accessible, we want to design an API endpoint that Liz can use to easily query her surf data. The decisions we've made when it comes to querying and aggregation will determine exactly how this API endpoint will be used. The next step is defining an API contract, which can be clearly documented for the end user, validated, and enforced.

A crucial element to document for the end user is the set of allowable query params. The components and ordering of the composite key determine which query params are required and which are optional. As always, a time span is necessary for a user querying time series data. And assuming that we're using LittleTable as our underlying storage option, we only need a prefix of the primary key, so the surfer is the only required component of the key. Beyond that, you can optionally specify a region and a break. It's important, though, to document and enforce that a user must also provide a region if they want to provide a break: recall earlier that we noted you can't skip fields in the middle of the primary key, so you must provide a full prefix of the primary key, which in this case is surfer, region, and break, in that order.
Now that we have the user's request, we need to determine which aggregate table we'll be querying from. This requires an understanding of some terminology. At Meraki we discuss time series data in terms of a time span and an interval, so I'll quickly explain what we mean by each of those terms in this context. The time span describes the full period of time over which we want the data. Since our longest TTL in the database is five years, we can't query data for a time span that extends further than five years into the past. The interval corresponds to the grain at which the data is aggregated; the only options here are one day, one week, and one month. As noted before, each aggregation interval will be stored in its own aggregate table: we'll have a one-day table, a one-week table, and a one-month table.

In designing this API endpoint, we'll assume that the user wants the most data possible for the time span requested. This means that the TTL will determine which aggregate table we'll query from: we'll want to query data from the aggregate table with the smallest interval whose TTL is still greater than the time span requested. So, for example, if the user requests a time span less than or equal to six months, we can return daily surf data; for a time span between six months and one year we'll return weekly surf data; and for any time span between one year and five years we'll return monthly surf data.
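Expressed as a sketch (a hypothetical SQL helper; the weekly and monthly view names are assumed to follow the daily one above), the routing logic is just a cascade of interval comparisons.

```sql
-- Map a requested time span to the aggregate table the endpoint should hit.
CREATE FUNCTION pick_aggregate_table(span INTERVAL) RETURNS TEXT AS $$
    SELECT CASE
        WHEN span <= INTERVAL '6 months' THEN 'waves_daily'
        WHEN span <= INTERVAL '1 year'   THEN 'waves_weekly'
        WHEN span <= INTERVAL '5 years'  THEN 'waves_monthly'
        ELSE NULL  -- reject: beyond the longest TTL
    END;
$$ LANGUAGE SQL;
```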
And we'll validate that the user is not allowed to query the API endpoint with a time span greater than five years, since all data is dropped after that point.

Now I'd like to quickly show what a visualization might look like for time series data, now that we have in mind the concepts of a time span and interval. On this slide is another screen grab from Cisco Meraki's dashboard application. Here you can see that at the top of the page there's a drop-down option for showing data over the past two hours, day, week, or month; those are the selectable time spans. We currently have the one-month time span selected, and in the usage graph below you can see that the data is broken out day by day: for a one-month time span, we're showing data over a one-day interval. This pattern is especially useful when you want to detect trends, since trends can be found when looking at data down to the hour or over the span of an entire month.
Earlier in the talk I explained the shortcomings of Postgres when it comes to time series data; one of these shortcomings is the lack of specialized time series tooling in vanilla Postgres. Because tools like ClickHouse and TimescaleDB are so specialized for time series data, you might even be able to skip some of the steps I've listed in this "getting out there" section by leveraging some of the tools and integrations offered. ClickHouse, for instance, officially integrates with quite a few visualization tools like Grafana and Tableau, which makes quick data visualization really easy to set up. And just this year, TimescaleDB announced a project called Timescale Analytics. This initiative is not complete, and they're still receiving developer input, if you're interested in commenting. What they're hoping to do is create a one-stop shop for time series analytics in Postgres. In the Timescale Analytics announcement, TimescaleDB listed a few sketching algorithms that they hope to build into this extension, which would provide data structures to estimate metrics such as percentile points and cardinality; due to TimescaleDB's aggregation, these sketches should have very low query latency. There are so many features I haven't listed, and these products are receiving a lot of support, so I'm sure the list will grow. It'll be really cool to witness the evolution of these time series tools.
Sweet! Liz now has easily accessible data on her surfing performance, broken down by break over time. Fast forward several years from now: Liz has been surfing for quite some time and she's learned some important lessons. For example, she took a look at the monthly data for her favorite break, and she realized that she catches far fewer waves there in the winter than she does in the summer, and the waves she does catch are way smaller and peter out quickly. It turns out that this surf spot only catches south swells, and south swells are way more common in the summertime. Liz had no idea that swell direction was seasonal; it had never occurred to her to even check. Now she knows where to surf in each season, and she's been able to update her surf routine accordingly. She's been catching way more waves and she's been having a lot more fun. Looks like Liz is on her way to getting pitted.

Thanks so much for listening! Again, I work for a remarkable company called Cisco Meraki, with some of the brightest Rails developers I've ever met. I still can't believe I get to work at the intersection of web development and computer networking; it's a fascinating space to be in, with really compelling problems to solve. If that sounds interesting to you, definitely swing by our booth or chat with me later. And of course, I'm always down to talk time series data. Have a great rest of your conference!