Hi, my name is Liz. I'm a software engineering manager on the switching team at Cisco Meraki. Prior to Meraki I worked in startups ranging from Series A to IPO. I lived in Ohio for 24 years, so I'm no stranger to the five-hour road trip to see a show in Chicago. I truly think Chicago is one of the most perfect cities in the world. I have a friend who calls the Great Lakes the North Coast of the US, and I love that description; I think that's so fun. Believe it or not, people (not me) do surf in Lake Michigan during storms, often in the winter. It's very gnarly, hence why I've never done it. So hopefully the surf theme of this presentation is fitting enough for the nautical town of Chicago.

But this talk is primarily about time series data. Time series is a topic Cisco Meraki deals with quite often, and it can get really intricate and hairy. I found that the gist can be quite difficult to uncover at first, so hopefully this presentation helps anyone who's currently finding themselves daunted by the concept.
By the end of this talk you'll hopefully have an understanding of how time series data might differ from the typical sort of relational data you might be used to dealing with. I'll walk through how to select a tool for managing time series data, how to organize time series data, how to query time series data, how to aggregate and compress time series data, and finally how to translate your design to API constraints.

Before jumping in, you might be asking yourself: what even is time series data? Time series data is essentially just a collection of observations recorded over consistent intervals of time. Time series data is distinct from other types of data because of this ordering by time. This graph, lifted from our dashboard at Cisco Meraki, shows a usage rate in bits per second for a device over time. We record this data every 5 minutes by polling the device, store it in a time series database, and display it to the user in a graph on our dashboard.
Time series data has become pretty ubiquitous as we collect more and more data to model our surroundings and predict trends. Devices like smartwatches have introduced amateur athletes to way more data than they've ever had access to. Our friend Liz, inspired by apps like Strava, wants to track her surfing as well. Liz has just taken her first surf lesson and she's keen on learning how her surfing will improve over time. She's decided to record this data in a time series database and to access it via an API endpoint. But where does she start?
In surfing, it's important to select a surfboard that's suited for a particular swell. If the waves are steep and powerful, it might be worth breaking out the shortboard for better maneuverability; for smaller days, a longboard can be a lot of fun. I've recently been told that it's absurd that my partner and I have nine surfboards between the two of us, but it's important to have a lot of options; conditions really do vary. The same is true of data and databases: it's important to select a tool that's appropriate for the type of data you plan to deal with. Time series data often comes in large quantities and efficient access is highly important, so a database that can accommodate these concerns is crucial.

When selecting a tool for managing time series data, you have four options that nicely mirror the options a surfer faces when deciding what board to surf. As a surfer, you can surf a board you already have; this is applicable for folks who already have a dedicated time series database in their tech stack. As a surfer, you can use an old board but add a new set of fins; this is analogous to using a database extension, for say Postgres, as an add-on to a tool you already have. As a surfer, you can buy a new board; this is similar to adopting a new database technology dedicated to time series data. Or you can break out the foam and fiberglass and shape your own board; this is just like designing and implementing your own time series database.

It's quite possible that you already have, as part of your tech stack, a database that works well for your time series use case. In that case, you just have to ensure that you're using your database tool correctly, with the proper techniques. Later in the talk I'll cover techniques for the proper storage and querying of time series data, because after all, if you don't know how to surf, even the nicest board in the world is of no use.
Or maybe your team already uses a more generalized database like Postgres, and the best solution in this case might be to add a Postgres extension. This is similar to buying a new set of fins for the surfboard you already own: fins are easily swapped out without changing the surfing experience too much. In this case the old board is Postgres and the new set of fins is an extension you can use with Postgres. There are several options for time series extensions you can use with Postgres; two of the most notable options are pg_timeseries and TimescaleDB. These extensions provide a user experience around creating, managing, and querying time series data which doesn't come out of the box with vanilla Postgres.
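To make that concrete, here's a minimal sketch of what getting started with TimescaleDB looks like. The table and column names are hypothetical, but `CREATE EXTENSION` and `create_hypertable` are the real entry points.

```sql
-- Enable the extension, then promote an ordinary table to a hypertable
-- that TimescaleDB transparently partitions into chunks by time.
-- (Table and column names are made up for illustration.)
CREATE EXTENSION IF NOT EXISTS timescaledb;

CREATE TABLE device_usage (
    time         TIMESTAMPTZ NOT NULL,
    device_id    INTEGER     NOT NULL,
    bits_per_sec BIGINT      NOT NULL
);

SELECT create_hypertable('device_usage', 'time');
```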
There are many benefits to extensions. Using an extension is lower lift and cheaper; plus, you're already used to surfing the board. Using a Postgres extension will reduce the learning curve and augment the speed of development while avoiding the performance hit you'd otherwise take with vanilla Postgres. A further benefit to a Postgres extension is that it will be far easier to join relational data with time series data as needed.

Now that we've talked so much about Postgres extensions, you might be asking yourself: what's wrong with vanilla Postgres? Why not just store my time series data in Postgres without bothering with an extension? The first reason is that these Postgres extensions come with built-in methods designed specifically for use with time series data. Without writing your own method to do so, you can, for example, query a range of data over time.
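As an illustration, here's the flavor of range query these extensions make easy. `time_bucket` is a real TimescaleDB function; the table and column names continue the hypothetical example from earlier.

```sql
-- Hourly average usage over the last day, using TimescaleDB's time_bucket.
SELECT time_bucket('1 hour', time) AS bucket,
       avg(bits_per_sec)           AS avg_bps
FROM device_usage
WHERE time >= now() - INTERVAL '1 day'
GROUP BY bucket
ORDER BY bucket;
```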
00:05:30.240
are significant performance differences
00:05:32.039
between vanilla postest and an extension
00:05:33.960
like time scale DB the time scale DB
00:05:36.440
documentation references an experiment
00:05:38.240
in which postgress and time scale DB
00:05:40.080
were tasked with ingesting a 1 billion
00:05:42.240
row database time scale DB loads this
00:05:45.199
massive database in 115th the total time
00:05:47.720
of postgress and sees throughput of more
00:05:49.639
than 20 times that of postgress because
00:05:52.080
of its heavy utilization of time space
00:05:53.840
partitioning time scale DB achieves a
00:05:56.080
higher ingest rate than postgress when
00:05:57.919
dealing with large quantities of data
00:05:59.639
quantities that are quite common when
00:06:01.360
dealing with time series data querying
00:06:04.240
can be even faster given a query that
00:06:06.199
specifies time ordering with 100 million
00:06:08.240
row table time scale DB achieves a query
00:06:10.800
latency that is 396 times faster than
00:06:14.319
postgress later in this presentation
00:06:16.440
when we get into the techniques such as
00:06:17.960
aggregation you'll see even more reasons
00:06:20.080
for favoring a Time series extension or
00:06:22.080
database over vanilla postgress
00:06:24.400
specifically you'll see the benefits of
00:06:25.919
aggregation for data
00:06:28.319
retention and sometimes your existing
00:06:30.560
tools don't cut it and you need to
00:06:32.120
invest in something entirely new this is
00:06:34.280
analogous to buying an entirely new
00:06:35.759
surfboard if you're looking for a
00:06:37.560
dedicated time series database it may be
00:06:39.360
worth looking into Solutions such as
00:06:40.919
click house click house pitches itself
00:06:43.280
as a fast open- Source analytical
00:06:45.080
database designed around time series
00:06:46.919
data managing time series data is all
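For a taste of what that looks like, here's a hedged sketch of ClickHouse DDL. The table is hypothetical, but the MergeTree engine and its ORDER BY sorting key are the real mechanics.

```sql
-- A hypothetical ClickHouse table: the MergeTree engine sorts rows on disk
-- by the ORDER BY key, which is what makes range scans cheap.
CREATE TABLE device_temperature
(
    device_id UInt32,
    ts        DateTime,
    temp_c    Float32
)
ENGINE = MergeTree
ORDER BY (device_id, ts);
```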
Managing time series data is all about optimization, and each tool is optimized for a different ideal use case. It's essential to evaluate your conditions before selecting a surfboard; in the same way, it's important to consider what type of data you'll be working with and what you're going to be doing with it. Consensus among some users of both tools seems to be that TimescaleDB has a great time series story and an average data warehousing story, whereas people say that ClickHouse has a great data warehousing story and an average time series story. Look into benchmark analyses for yourself, investigate the features of each database and extension, and read up on documentation for the tool you're considering. The more you understand about the inner workings of your proposed tool, the better you'll understand how it will work with your use case. My best advice is to really get in the weeds. As a surfer, if you don't know the purpose of rocker or tail shape when selecting a board, it's going to be really difficult to make an informed decision. The same goes for understanding and selecting a solution for time series management.

And sometimes no available database seems suited to your highly specific needs. If your use case is really particular and you're feeling especially industrious, you might just want to break out the foam and fiberglass and shape your own board.
Meraki found itself in this situation in 2008. Our core product is a cloud-managed platform that allows users to configure and monitor their networking devices, and the monitoring data is more often than not time series data. We track metrics such as packet count received by a switch, the temperature of a device, or the device's memory usage; trends over time can give insight into the health of a system. In 2008 there were far fewer time series solutions available, so a team at Meraki developed their own. We call it LittleTable. LittleTable is a relational database optimized for time series data.

LittleTable was in fact developed specifically for spinning disks: in 2008, storage on solid-state drives was far more expensive than storage on spinning disks, so the hardware was a significant consideration. Because of this, data is clustered for continuous disk access in order to improve performance. Later in this presentation we'll see how this impacts the way one might design a table when using LittleTable. Fun fact: as of 2017, when the white paper was written, Meraki stored 320 terabytes of data across several hundred LittleTable servers system-wide; now I'm sure the quantity is even higher. Though not actually a SQL database, LittleTable includes a SQL interface for querying, which has improved developer adoption by making this tool easy to use. LittleTable exists only for internal use at Meraki, but the team wrote an excellent white paper that very effectively describes the challenges and design considerations, which can be super useful for anyone trying to gain a better understanding of the intricacies of time series data. I've linked it in this slide and I can't recommend it enough.
All righty! We now have our board picked out: we understand the conditions we're surfing in, and we've landed on a board that works best in those conditions. However, the work is far from over. You can have the perfect board and still struggle to actually surf if you don't have the proper technique. Technique is also incredibly important when dealing with time series data, regardless of which database tool you've chosen to use. In order to optimize performance, it's crucial that we follow some tried-and-true patterns for organizing and querying data. The time series techniques I'll cover in this talk are: data arranged by time, composite keys, querying by index, and aggregation and compression.
The identifying characteristic of a time series database is that it organizes data by time for efficient access; otherwise it wouldn't be a time series database. Both ClickHouse and TimescaleDB will automatically generate an index on the timestamp column, which allows for the most performant access when retrieving data for a range of time. LittleTable actually clusters data on the disk by timestamp, never interleaving data with older timestamps.

Because of the unique data structure, some databases enforce restrictions. Arranging data by time allows for highly efficient reads, but writing can be quite inefficient. The designers of LittleTable decided to constrain writes to be append-only. Since we're collecting data over time, it doesn't make much sense to spot-fill historic data anyway; according to the LittleTable white paper, there's no need to update rows, as each row represents a measurement taken at a specific point in time.
When visualizing a time series database, it's important to understand that there are effectively two identifiers for a given piece of data. The first, mentioned in the previous slide, is the timestamp. The second piece of information is the identifier, and in almost every case this is composed of multiple fields, making it a composite key. Each time series database refers to this concept using slightly different terminology: LittleTable documentation refers to this as a hierarchically delineated key, whereas ClickHouse documentation refers to this as a compound primary key, and TimescaleDB refers to this as a partition key.

In order to understand the implication this composite key has on structuring data, I'm going to drill into LittleTable's hierarchically delineated key. In LittleTable, this key determines how the data is actually arranged on disk, in addition to being grouped by time. This hierarchical organization enables efficient queries, since a key prefix will always correspond to a contiguous region of data on the disk. It's crucial, then, to only query based on ordered components of this key. In order to determine the components of this key and how they're ordered, it's super important to understand how this data is going to be accessed. Your queries will only be performant if you're accessing a contiguous block of data, so you have to understand what the most common queries are going to be and design around those.
It's probably best to visualize a real-world example; that way we can actually see the data arranged by these two axes, time and composite key. In this example we can also visualize what a hierarchically delineated key in LittleTable really is. Here's an example lifted directly from the LittleTable white paper. As you can see, the data is organized along two axes: on the x-axis we have the timestamps, and on the y-axis we have the elements of the composite key. You'll see that along the y-axis the records for a single network are grouped together, and within that network all the records for a single device are grouped together. This composite key can contain as many fields as you want, thus arranging data many layers deep. In this example, though, we simply have two fields included in the composite key, grouping the data by two layers.
As we saw in the previous example, the most important takeaway for a hierarchically delineated key is that its components are organized with an increasing degree of specificity. The example from Cisco Meraki included two components, network and device; since a network has many devices, this example is purely hierarchical. However, just because the ordering in this key typically corresponds to a real-world hierarchy, it doesn't necessarily have to. You can select whatever ordering you want for the components of this key, and that ordering depends only on how the data is accessed.

In our case, Liz's surfing application is designed to be surfer-specific. While we want to store data for multiple surfers, it doesn't make much sense to query data across many surfers, since each surfer is interested only in their individual progress. This means that we can prefix our primary key with the surfer, so that all the data for a single surfer is collocated in the table. Then we can follow the hierarchical pattern with region and then break. The region might be, say, Los Angeles, and the break might be Malibu First Point. The region and break are very similar in concept to the example from Cisco Meraki, since a region contains many breaks.
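LittleTable itself isn't public, so as a stand-in, here's roughly how that key ordering might be expressed with the TimescaleDB setup from earlier. The schema is hypothetical, but a composite index is the standard mechanism for keeping one surfer's rows adjacent.

```sql
-- Hypothetical schema for Liz's app. The composite index orders rows by
-- surfer, then region, then break, then time, mirroring the key hierarchy.
CREATE TABLE waves (
    surfer     TEXT        NOT NULL,
    region     TEXT        NOT NULL,
    break_name TEXT        NOT NULL,
    time       TIMESTAMPTZ NOT NULL,
    distance_m DOUBLE PRECISION,
    duration_s DOUBLE PRECISION
);

SELECT create_hypertable('waves', 'time');

CREATE INDEX waves_key_idx
    ON waves (surfer, region, break_name, time DESC);
```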
Now that we have a key that's optimized for querying, we need to actually write our queries in the most optimal way. This data is highly structured, and the way it's structured depends entirely on how we plan to query it. Hopefully you remember this graphic from a couple slides ago; this time I've modified it to reflect the composite key we've decided on for Liz's surfing application. As a refresher, data is arranged by time across the x-axis and composite key across the y-axis. Now, for Liz's surfing application, the composite key includes surfer, region, and break, in that order. In this diagram you'll only see surfer and region, and that's because of the composite nature of the key: for the first two hours of last Monday, you're welcome to query for just the surfer Liz in the region LA and you will still find a contiguous stretch of data. Now imagine that you want to drill further and take a closer look at this green slice here. You'll see that the stretch of data is further broken down into break: Malibu exists in its own subdivided, contiguous stretch of data. Hopefully this helps you visualize what your query will actually turn into.

You'll certainly want to include a timestamp in your WHERE clause; after all, this is time series data. Additionally, you'll want to include part or all of the elements of that composite key. Since the data is structured hierarchically, you only ever need to include a prefix of the composite key when querying. We saw in the first example how effective a query can be for Liz in LA for the first two hours of last Monday: since we've left off the break and only queried for the surfer and region, we've only queried with a prefix of the composite key. Drilling down even further, we can also query for Liz in LA in Malibu over the first two hours of last Monday. Or, zooming out, a query for just Liz's surfing data in that time span would also be performant.
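Here's what those three prefix queries might look like against the hypothetical schema above; the literal values are made up.

```sql
-- Prefix of length one: just the surfer, plus the time range.
SELECT * FROM waves
WHERE surfer = 'liz'
  AND time >= '2024-06-03 06:00' AND time < '2024-06-03 08:00';

-- Prefix of length two: surfer and region.
SELECT * FROM waves
WHERE surfer = 'liz' AND region = 'los-angeles'
  AND time >= '2024-06-03 06:00' AND time < '2024-06-03 08:00';

-- Full key: surfer, region, and break.
SELECT * FROM waves
WHERE surfer = 'liz' AND region = 'los-angeles'
  AND break_name = 'malibu'
  AND time >= '2024-06-03 06:00' AND time < '2024-06-03 08:00';
```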
ClickHouse is a little bit different. The data in ClickHouse is not arranged across two dimensions; instead, the timestamp is basically the final component of the composite key. Because of this, it doesn't make sense to include just the surfer and timestamp in the query, because you're skipping the middle section of the primary key. Consider the example I've shown: we have a contiguous stretch of data for Liz, which is broken into region, which is then broken into break, which contains all of the timestamped records for, say, the last month. It doesn't make much sense to query all the data for Liz over the past two weeks, because the data here is not contiguous; for each location, you'll have to grab just a section, skipping over the data points that don't fall within the requested time span. The only performant query for ClickHouse would be to include all the components of the primary key: you must specify the surfer, region, break, and a range of time. So it would be performant to query Liz's data for Malibu in LA over the past two weeks.
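As a sketch, here's how that might look in ClickHouse. The table is hypothetical, but note that the timestamp sits last in the MergeTree sorting key, and the query pins every earlier component.

```sql
-- Hypothetical ClickHouse version of the surf table: the sorting key ends
-- with the timestamp, after surfer, region, and break.
CREATE TABLE surf_waves
(
    surfer     String,
    region     String,
    break_name String,
    ts         DateTime,
    distance_m Float64,
    duration_s Float64
)
ENGINE = MergeTree
ORDER BY (surfer, region, break_name, ts);

-- A performant query specifies every key component, then bounds time.
SELECT ts, distance_m, duration_s
FROM surf_waves
WHERE surfer = 'liz'
  AND region = 'los-angeles'
  AND break_name = 'malibu'
  AND ts >= now() - INTERVAL 2 WEEK;
```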
It's important to drill down and understand how the data is arranged in your time series database of choice. By understanding the structure, you can visualize what a contiguous chunk of data looks like, and you can ensure that your query is making use of the way the data is structured.

Cool. At this point we know how to store our data and how to query it; we can now start to look at the maintenance side of things. Here's the thing: Liz surfs a lot. She plans to surf for years to come, and although we'd love to keep all of Liz's surfing data in perpetuity, we simply don't have unlimited storage space. When dealing with time series data, you have to balance two major concerns: you don't have unlimited storage space to keep raw data forever, but you also want to provide the user with as much data as possible. In order to solve for the first concern, the fact that we don't have unlimited storage, we need to take a look at data retention. Every time series database that I've seen includes some sort of policy for data retention. Often this comes in the form of a TTL, otherwise known as a time-to-live. The time-to-live dictates how old the data in the table is allowed to be; after a certain point, data of a certain age is simply dropped from the table.
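Both of the tools we've discussed expose this directly. As hedged examples against the hypothetical tables from earlier (the six-month window is just illustrative):

```sql
-- TimescaleDB: a background job drops chunks once they're older than the
-- retention window.
SELECT add_retention_policy('waves', INTERVAL '6 months');

-- ClickHouse: a table-level TTL expires rows six months after their
-- timestamp.
ALTER TABLE surf_waves MODIFY TTL ts + INTERVAL 6 MONTH;
```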
Now we also need to address the desire to show as much data as possible. In order to do so, we need to extend the TTL without sacrificing storage. There are a few ways of going about this, notably compression and aggregation.

Compression is the method of choice for the Postgres extension TimescaleDB. When you add data to your database, it's in the form of uncompressed rows; TimescaleDB uses a built-in job scheduler to convert this data into the form of compressed columns. Consider the uncompressed surf data on the left: each data point contains a single wave surfed by either of our two surfers, Liz and Brandon. The top-right table shows this data compressed into a single data point. This preserves all of the original data while restructuring it into a format that requires less memory to store; according to TimescaleDB's documentation, this can compress the size of data by as much as 90%. Additionally, you'll see that the data within this compression is ordered by timestamp. Ordering by timestamp is highly recommended in the documentation, but it's worth noting that you do have to specifically configure your compression to order by timestamp. You may also be wondering how this data gets queried: upon querying compressed data, TimescaleDB selectively uncompresses each column in order to provide the data requested. And when compressing data, TimescaleDB stores it in column order rather than row order; we'll see more about column-based and row-based storage in the next slide.

However, we can certainly be more intelligent about our compression. If there's one thing I want to harp on, it's that your database design should be driven by how you plan to query. In our case, we only ever plan to query data for a single surfer. As a result, we might want to include segmentation in our compression algorithm as well. In the bottom-right table you can see the same data segmented by surfer ID. This can certainly improve query performance.
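Configuring that in TimescaleDB might look like the following sketch; `compress_segmentby`, `compress_orderby`, and `add_compression_policy` are the real knobs, applied here to our hypothetical table.

```sql
-- Compress by column, segmented by surfer and ordered by time, so one
-- surfer's data stays together inside each compressed chunk.
ALTER TABLE waves SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'surfer',
    timescaledb.compress_orderby   = 'time DESC'
);

-- A background job compresses chunks once they're more than a week old
-- (the interval is illustrative).
SELECT add_compression_policy('waves', INTERVAL '7 days');
```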
ClickHouse also uses compression to improve database performance. It may seem obvious, but less data on the disk means less I/O and faster queries and inserts. When speaking about compression in a time series context, it's important to take a couple of steps backward and talk about one of the major differences between most time series databases and a more generalized database like Postgres. This difference lies in the structure of the database, which should hopefully come as no surprise, since we've already spoken at length about the importance of database structure when it comes to handling time series data.

Postgres is what we call a row-based database. A row-based database organizes data by record, keeping all of the data associated with a record next to each other in memory. Row-based databases are well suited for transactional workloads where entire records need to be retrieved, updated, or inserted quickly and efficiently. With a row-based database, writing can be quite efficient, but reading from large quantities of data has its shortcomings, especially when querying by a field like timestamp. And because data is grouped by record, compressing data by attribute is also quite inefficient.

ClickHouse, like many time series databases, is actually a column-based database. In a column-based database, each data block stores values of a single column for multiple rows. This is ideal because compression algorithms exploit continuous patterns of data; if this data is sorted by columns in a particular order, this can lead to incredibly efficient compression. Column-based databases are often the preferred choice for analytics and data warehousing applications. The benefits of column-based databases include faster data aggregation, high compression speeds, and less use of disk space. The drawback is that data modification is slower, but as we've discussed previously, modifying time series data is often not an intended use case.
In addition to compression, there's a second approach to ensuring that we can preserve data for an extended period of time without increasing our storage costs. This approach is called aggregation, and it's the methodology of choice for LittleTable. When we're speaking about aggregation, there are really two concepts: the base table and the aggregate table. The base table is where we insert the raw metrics we're recording; the aggregate tables then store a summary or average of that raw data.

So first we need to decide what raw metrics we want to store in our base table. If you recall, we already decided on a primary key that contains surfer, region, and break. In addition, we can record metrics for each wave Liz has caught. At the most basic level, what Liz wants to know is: how good was that wave? The basic stats we can record are distance and duration, so we'll include those in our base table. Then we need to know what aggregated metrics we want to record in the aggregate tables. The aggregate table will have to contain a summary of the data in the base table. It might be helpful to know the total distance summed across all waves, the total duration summed across all waves, the maximum speed for a single wave, and the number of waves caught in that time period.

Now that we've decided what data to store in these aggregate tables, we'll have to decide what intervals of data make sense. These will determine which aggregate tables we want to create, and will also help us decide on our TTLs, constraining how much data we're actually keeping. Since Liz tends to surf at most once a day, it makes sense to aggregate data up to the day; that way we can preserve data for each surfing session for a TTL of, say, six months. From there we can also aggregate up to the week and the month, so that it's easier for Liz to track seasonal and annual trends. This leaves us with a base table with a TTL of, say, one month; a one-day aggregate table with a TTL of six months; a one-week aggregate table with a TTL of one year; and a one-month aggregate table with a TTL of five years.
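LittleTable's aggregation machinery is internal to Meraki, so as an illustration, here's how the one-day rollup might look as a TimescaleDB continuous aggregate over our hypothetical `waves` table; the speed column simply derives from distance and duration.

```sql
-- A one-day aggregate, maintained automatically as new waves arrive.
CREATE MATERIALIZED VIEW waves_daily
WITH (timescaledb.continuous) AS
SELECT surfer,
       region,
       break_name,
       time_bucket('1 day', time) AS day,
       sum(distance_m)            AS total_distance_m,
       sum(duration_s)            AS total_duration_s,
       max(distance_m / NULLIF(duration_s, 0)) AS max_speed_m_per_s,
       count(*)                   AS waves_caught
FROM waves
GROUP BY surfer, region, break_name, day;

-- Daily rollups are kept for six months, per the retention plan above.
SELECT add_retention_policy('waves_daily', INTERVAL '6 months');
```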
Woo! The hardest part's over. Now that we have our data stored, aggregated, and easily accessible, we want to design an API endpoint that Liz can use to easily query her surf data. The decisions we've made when it comes to querying and aggregation will determine exactly how this API endpoint will be used. The next step is defining an API contract, which can be clearly documented for the end user, validated, and enforced.

A crucial element to document for the end user is the set of allowable query params. The components and ordering of the composite key determine which query params are required and which are optional. As always, a time span is necessary for a user querying time series data. And assuming that we're using LittleTable as our underlying storage option, we only need a prefix of the primary key, so the surfer is the only required component of the key. Beyond that, you can optionally specify a region and a break. It's important, though, to document and enforce that a user must also provide a region if they want to provide a break: recall earlier that we noted you can't skip fields in the middle of the primary key, so you must provide a full prefix of the primary key, which in this case is surfer, region, and break, in that order.
Now that we have the user's request, we need to determine which aggregate table we'll be querying from. This requires an understanding of some terminology. At Meraki we discuss time series data in terms of a time span and an interval, so I'll quickly explain what we mean by each of those terms in this context. The time span describes the full period of time over which we want the data. Since our longest TTL in the database is five years, we can't query data for a time span that extends further than five years into the past. The interval corresponds to the grain at which the data is aggregated; the only options here are one day, one week, and one month. As noted before, each aggregation interval will be stored in its own aggregate table: we'll have a one-day table, a one-week table, and a one-month table.

In designing this API endpoint, we'll assume that the user wants the most data possible for the time span requested. This means that the TTL will determine which aggregate table we'll query from: we'll want to query data from the aggregate table with the smallest interval whose TTL is still greater than the time span requested. So, for example, if the user requests a time span less than or equal to six months, we can return daily surf data; for a time span between six months and one year we'll return weekly surf data; and for any time span between one year and five years we'll return monthly surf data.
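Expressed as a sketch (a hypothetical SQL helper; the weekly and monthly view names are assumed to follow the daily one above), the routing logic is just a cascade of interval comparisons.

```sql
-- Map a requested time span to the aggregate table the endpoint should hit.
CREATE FUNCTION pick_aggregate_table(span INTERVAL) RETURNS TEXT AS $$
    SELECT CASE
        WHEN span <= INTERVAL '6 months' THEN 'waves_daily'
        WHEN span <= INTERVAL '1 year'   THEN 'waves_weekly'
        WHEN span <= INTERVAL '5 years'  THEN 'waves_monthly'
        ELSE NULL  -- reject: beyond the longest TTL
    END;
$$ LANGUAGE SQL;
```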
And we'll validate that the user is not allowed to query the API endpoint with a time span greater than five years, since all data is dropped after that point.

Now I'd like to quickly show what a visualization might look like for time series data, now that we have in mind the concepts of a time span and interval. On this slide is another screen grab from Cisco Meraki's dashboard application. Here you can see that at the top of the page there's a drop-down option for showing data over the past two hours, day, week, or month; those are the selectable time spans. We currently have the one-month time span selected, and in the usage graph below you can see that the data is broken out day by day: for a one-month time span, we're showing data over a one-day interval. This pattern is especially useful when you want to detect trends, since trends can be found when looking at data down to the hour or over the span of an entire month.
Earlier in the talk I explained the shortcomings of Postgres when it comes to time series data; one of these shortcomings is the lack of specialized time series tooling in vanilla Postgres. Because tools like ClickHouse and TimescaleDB are so specialized for time series data, you might even be able to skip some of the steps I've listed in this "getting out there" section by leveraging some of the tools and integrations offered. ClickHouse, for instance, officially integrates with quite a few visualization tools like Grafana and Tableau, which makes quick data visualization really easy to set up. And just this year, TimescaleDB announced a project called Timescale Analytics. This initiative is not complete, and they're still receiving developer input, if you're interested in commenting. What they're hoping to do is create a one-stop shop for time series analytics in Postgres. In the Timescale Analytics announcement, TimescaleDB listed a few sketching algorithms that they hope to build into this extension, which would provide data structures to estimate metrics such as percentile points and cardinality; due to TimescaleDB's aggregation, these sketches should have very low query latency. There are so many features I haven't listed, and these products are receiving a lot of support, so I'm sure the list will grow. It'll be really cool to witness the evolution of these time series tools.
Sweet! Liz now has easily accessible data on her surfing performance, broken down by break over time. Fast forward several years from now: Liz has been surfing for quite some time and she's learned some important lessons. For example, she took a look at the monthly data for her favorite break, and she realized that she catches far fewer waves there in the winter than she does in the summer, and the waves she does catch are way smaller and peter out quickly. It turns out that this surf spot only catches south swells, and south swells are way more common in the summertime. Liz had no idea that swell direction was seasonal; it had never occurred to her to even check. Now she knows where to surf in each season, and she's been able to update her surf routine accordingly. She's been catching way more waves and she's been having a lot more fun. Looks like Liz is on her way to getting pitted.

Thanks so much for listening! Again, I work for a remarkable company called Cisco Meraki, with some of the brightest Rails developers I've ever met. I still can't believe I get to work at the intersection of web development and computer networking; it's a fascinating space to be in, with really compelling problems to solve. If that sounds interesting to you, definitely swing by our booth or chat with me later. And of course, I'm always down to talk time series data. Have a great rest of your conference!