Supporter Talk by Cisco: Catching Waves with Time-Series Data

Summarized using AI

Liz Heym • November 13, 2024 • Chicago, IL • Talk

In her talk titled "Catching Waves with Time-Series Data" at RubyConf 2024, Liz Heym, a software engineering manager at Cisco, explores the complexities and best practices of managing time-series data, drawing a creative analogy to surfing.

Main Topics Covered:
- Definition of Time-Series Data: Heym defines time-series data as observations recorded over consistent time intervals, emphasizing the ordering by time that distinguishes it from relational data.
- Challenges in Time-Series Data Management: Time-series data can be voluminous and requires efficient storage and querying techniques.
- Choosing the Right Database:
  - There are various options for managing time-series data, each suited to different scenarios, akin to selecting the right surfboard for the wave conditions.
  - Options include adapting an existing database with a time-series extension, such as pg_timeseries or TimescaleDB, or adopting a dedicated database like ClickHouse.
- Data Structure and Querying:
  - Emphasis is placed on structuring data by time and utilizing composite keys for efficient querying.
  - Examples illustrate how data is organized hierarchically to allow for performant retrieval.
- Data Maintenance Techniques:
  - The importance of data retention policies, compression, and aggregation is discussed; these are strategies for managing storage costs and optimizing performance.
  - Heym presents a practical approach using a base table for raw metrics and aggregate tables for summarized data over various time spans.
- API Design for Time-Series Data:
  - The talk concludes with insights on designing an API endpoint for accessing time-series data, underscoring the need for clear documentation and validation of query parameters.

Illustrative Case Study:
- The presentation follows Liz, an aspiring surfer, who records her surfing activities as time-series data, allowing her to track her progress over time. This example showcases practical applications of time-series data management concepts discussed in the talk.

Conclusion:
Heym invites attendees to simplify their approach to time-series data management by understanding the unique characteristics of time-series databases and employing effective structuring and querying techniques. Ultimately, the goal is to enhance user experience by providing intuitive access to data and meaningful insights over time.

Supporter Talk by Cisco: Catching Waves with Time-Series Data
Liz Heym • November 13, 2024 • Chicago, IL • Talk

Time-series data is remarkably common, with applications ranging from IoT to finance. Effectively storing, reading, and presenting this time-series data can be as finicky as catching the perfect wave.

In order to understand the best practices of time-series data, we'll follow a surfer's journey as she attempts to record every wave she's ever caught. We'll discover how to structure the time-series data, query it for performant access, aggregate data over timespans, and present the data via an API endpoint. Surf's up!

RubyConf 2024

Hi, my name is Liz. I'm a software engineering manager on the switching team at Cisco Meraki. Prior to Meraki I worked in startups ranging from Series A to IPO. I lived in Ohio for 24 years, so I'm no stranger to the five-hour road trip to see a show in Chicago. I truly think Chicago is one of the most perfect cities in the world. I have a friend who calls the Great Lakes the "North Coast" of the US, and I love that description; I think it's so fun. Believe it or not, people (not me) do surf in Lake Michigan during storms, often in the winter. It's very gnarly, hence why I've never done it. So hopefully the surf theme of this presentation is fitting enough for the nautical town of Chicago. This talk is primarily about time-series data. Time series is a topic Cisco Meraki deals with quite often, and it can get really intricate and hairy. I found that the gist can be quite difficult to uncover at first, so hopefully this presentation helps anyone who's currently finding themselves daunted by the concept.

By the end of this talk, you'll hopefully have an understanding of how time-series data might differ from the typical sort of relational data you might be used to dealing with. I'll walk through how to select a tool for managing time-series data, how to organize time-series data, how to query time-series data, how to aggregate and compress time-series data, and finally how to translate your design to API constraints.

Before jumping in, you might be asking yourself: what even is time-series data? Time-series data is essentially just a collection of observations recorded over consistent intervals of time. Time-series data is distinct from other types of data because of this ordering by time. This graph, lifted from our dashboard at Cisco Meraki, shows a usage rate in bits per second for a device over time. We record this data every five minutes by polling the device, we store it in a time-series database, and we display it to the user in a graph on our dashboard.

Time-series data has become pretty ubiquitous as we collect more and more data to model our surroundings and predict trends. Devices like smartwatches have introduced amateur athletes to way more data than they've ever had access to. Our friend Liz, inspired by apps like Strava, wants to track her surfing as well. Liz has just taken her first surf lesson, and she's keen on learning how her surfing will improve over time. She's decided to record this data in a time-series database and to access it via an API endpoint. But where does she start?

In surfing, it's important to select a surfboard that's suited for a particular swell. If the waves are steep and powerful, it might be worth breaking out the shortboard for better maneuverability; for smaller days, a longboard can be a lot of fun. I've recently been told that it's absurd that my partner and I have nine surfboards between the two of us, but it's important to have a lot of options; conditions really do vary. The same is true of data and databases. It's important to select a tool that's appropriate for the type of data you plan to deal with. Time-series data often comes in large quantities, and efficient access is highly important, so a database that can accommodate these concerns is crucial. When selecting a tool for managing time-series data, you have four options that nicely mirror the options a surfer faces when deciding what board to surf.

As a surfer, you can surf a board you already have; this is applicable for folks who already have a dedicated time-series database in their tech stack. You can use an old board but add a new set of fins; this is analogous to using a database extension, say for Postgres, as an add-on to a tool you already have. You can buy a new board; this is similar to adopting a new database technology dedicated to time-series data. Or you can break out the foam and fiberglass and shape your own board; this is just like designing and implementing your own time-series database.

It's quite possible that you already have, as part of your tech stack, a database that works well for your time-series use case. In that case, you just have to ensure that you're using your database tool correctly, with the proper techniques. Later in the talk I'll cover techniques for the proper storage and querying of time-series data, because after all, if you don't know how to surf, even the nicest board in the world is of no use.

Or maybe your team already uses a more generalized database like Postgres, and the best solution in this case might be to add a Postgres extension. This is similar to buying a new set of fins for the surfboard you already own: fins are easily swapped out without changing the surfing experience too much. In this case, the old board is Postgres and the new set of fins is an extension you can use with Postgres. There are several options for time-series extensions you can use with Postgres; two of the most notable are pg_timeseries and TimescaleDB. These extensions provide a user experience around creating, managing, and querying time-series data which doesn't come out of the box with vanilla Postgres. There are many benefits to extensions. Using an extension is lower lift and cheaper, plus you're already used to surfing the board: a Postgres extension will reduce the learning curve and augment the speed of development while avoiding the performance hit you'd otherwise take with vanilla Postgres. A further benefit to a Postgres extension is that it will be far easier to join relational data with time-series data as needed.

Now that we've talked so much about Postgres extensions, you might be asking yourself: what's wrong with vanilla Postgres? Why not just store my time-series data in Postgres without bothering with an extension? The first reason is that these Postgres extensions come with built-in methods designed specifically for use with time-series data; you can, for example, query a range of data over time without writing your own method to do so. More importantly, there are significant performance differences between vanilla Postgres and an extension like TimescaleDB. The TimescaleDB documentation references an experiment in which Postgres and TimescaleDB were tasked with ingesting a one-billion-row database. TimescaleDB loads this massive database in one-fifteenth the total time of Postgres and sees throughput of more than 20 times that of Postgres. Because of its heavy utilization of time-space partitioning, TimescaleDB achieves a higher ingest rate than Postgres when dealing with large quantities of data, quantities that are quite common when dealing with time-series data. Querying can be even faster: given a query that specifies time ordering, with a 100-million-row table, TimescaleDB achieves a query latency that is 396 times faster than Postgres. Later in this presentation, when we get into techniques such as aggregation, you'll see even more reasons for favoring a time-series extension or database over vanilla Postgres; specifically, you'll see the benefits of aggregation for data retention.

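As a concrete illustration of those built-in methods, here is a minimal sketch (my own, not code from the talk; the device_usage table and its columns are hypothetical) using TimescaleDB's create_hypertable and time_bucket functions to store and range-query a metric:

```sql
-- Hypothetical 5-minute usage metric stored in a TimescaleDB hypertable.
CREATE TABLE device_usage (
  device_id BIGINT      NOT NULL,
  ts        TIMESTAMPTZ NOT NULL,
  bps       BIGINT      NOT NULL
);
SELECT create_hypertable('device_usage', 'ts');

-- Query a range of data over time, averaged into hourly buckets.
SELECT time_bucket('1 hour', ts) AS hour, avg(bps)
FROM device_usage
WHERE device_id = 42
  AND ts > now() - INTERVAL '1 day'
GROUP BY hour
ORDER BY hour;
```
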
Sometimes your existing tools don't cut it and you need to invest in something entirely new. This is analogous to buying an entirely new surfboard. If you're looking for a dedicated time-series database, it may be worth looking into solutions such as ClickHouse. ClickHouse pitches itself as a fast, open-source, analytical database designed around time-series data. Managing time-series data is all about optimization, and each tool is optimized for a different ideal use case. It's essential to evaluate your conditions before selecting a surfboard; in the same way, it's important to consider what type of data you'll be working with and what you're going to be doing with it. Consensus among some users of both tools seems to be that TimescaleDB has a great time-series story and an average data-warehousing story, whereas people say that ClickHouse has a great data-warehousing story and an average time-series story. Look into benchmark analyses for yourself, investigate the features of each database and extension, and read up on documentation for the tool you're considering. The more you understand about the inner workings of your proposed tool, the better you'll understand how it will work with your use case. My best advice is to really get in the weeds: as a surfer, if you don't know the purpose of rocker or tail shape when selecting a board, it's going to be really difficult to make an informed decision. The same goes for understanding and selecting a solution for time-series management.

And sometimes no available database seems suited to your highly specific needs. If your use case is really particular and you're feeling especially industrious, you might just want to break out the foam and fiberglass and shape your own board. Meraki found itself in this situation in 2008. Our core product is a cloud-managed platform that allows users to configure and monitor their networking devices, and the monitoring data is, more often than not, time-series data: we track metrics such as packet count received by a switch, the temperature of a device, or the device's memory usage. Trends over time can give insight into the health of a system. In 2008 there were far fewer time-series solutions available, so a team at Meraki developed their own. We call it LittleTable. LittleTable is a relational database optimized for time-series data. LittleTable was in fact developed specifically for spinning disks; in 2008, storage on solid-state drives was far more expensive than storage on spinning disks, so the hardware was a significant consideration. Because of this, data is clustered for continuous disk access in order to improve performance. Later in this presentation we'll see how this impacts the way one might design a table when using LittleTable. Fun fact: as of 2017, when the white paper was written, Meraki stored 320 terabytes of data across several hundred LittleTable servers system-wide; now, I'm sure the quantity is even higher. Though not actually a SQL database, LittleTable includes a SQL interface for querying, which has improved developer adoption by making the tool easy to use. LittleTable exists only for internal use at Meraki, but the team wrote an excellent white paper that very effectively describes the challenges and design considerations, which can be super useful for anyone trying to gain a better understanding of the intricacies of time-series data. I've linked it in this slide, and I can't recommend it enough.

All righty, we now have our board picked out: we understand the conditions we're surfing in, and we've landed on a board that works best in those conditions. However, the work is far from over. You can have the perfect board and still struggle to actually surf if you don't have the proper techniques. Technique is also incredibly important when dealing with time-series data, regardless of which database tool you've chosen to use. In order to optimize performance, it's crucial that we follow some tried-and-true patterns for organizing and querying data. The time-series techniques I'll cover in this talk are: arranging data by time, the composite key, querying by index, and aggregation and compression.

The identifying characteristic of a time-series database is that it organizes data by time for efficient access; otherwise it wouldn't be a time-series database. Both ClickHouse and TimescaleDB will automatically generate an index on the timestamp column, which allows for the most performant access when retrieving data for a range of time. LittleTable actually clusters data on the disk by timestamp, never interleaving data with older timestamps. Because of the unique data structure, some databases enforce restrictions. Arranging data by time allows for highly efficient reads, but writing can be quite inefficient, so the designers of LittleTable decided to constrain writes to be append-only. Since we're collecting data over time, it doesn't make much sense to spot-fill historic data anyway; according to the LittleTable white paper, there's no need to update rows, as each row represents a measurement taken at a specific point in time.

When visualizing a time-series database, it's important to understand that there are effectively two identifiers for a given piece of data. The first, mentioned in the previous slide, is the timestamp. The second piece of information is the identifier, which in almost every case is comprised of multiple fields, making it a composite key. Each time-series database refers to this concept using slightly different terminology: the LittleTable documentation refers to a hierarchically delineated key, the ClickHouse documentation refers to a compound primary key, and TimescaleDB refers to a partition key. In order to understand the implication this composite key has on structuring data, I'm going to drill into LittleTable's hierarchically delineated key. In LittleTable, this key determines how the data is actually arranged on disk, in addition to being grouped by time. This hierarchical organization enables efficient queries, since a query will always correspond to a contiguous region of data on the disk. It's crucial, then, to only query based on ordered components of this key. In order to determine the components of this key and how they're ordered, it's super important to understand how the data is going to be accessed. Your queries will only be performant if you're accessing a contiguous block of data, so you have to understand what the most common queries are going to be and design around those.

It's probably best to visualize a real-world example; that way we can actually see the data arranged by these two axes, time and composite key. In this example we can also visualize what a hierarchically delineated key in LittleTable really is. Here's an example lifted directly from the LittleTable white paper. As you can see, the data is organized along two axes: on the x-axis we have the timestamps, and on the y-axis we have the elements of the composite key. You'll see that along the y-axis the records for a single network are grouped together, and within that network, all the records for a single device are grouped together. This composite key can contain as many fields as you want, thus arranging data many layers deep; in this example, though, we simply have two fields included in the composite key, grouping the data by two layers.

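As a sketch of what that two-level key might look like as a schema (hypothetical names; the white paper doesn't present its tables as SQL DDL like this):

```sql
-- The white paper's two-level hierarchy: rows for a network are grouped
-- together, and within a network, rows for a device. The packets column
-- is a hypothetical example metric.
CREATE TABLE device_stats (
  network_id BIGINT      NOT NULL,
  device_id  BIGINT      NOT NULL,
  ts         TIMESTAMPTZ NOT NULL,
  packets    BIGINT      NOT NULL,
  PRIMARY KEY (network_id, device_id, ts)
);
```
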
As we saw in the previous example, the most important takeaway for a hierarchically delineated key is that its components are organized with an increasing degree of specificity. The example from Cisco Meraki included two components, network and device, and since a network has many devices, this example is purely hierarchical. However, just because the ordering in this key typically corresponds to a real-world hierarchy, it doesn't necessarily have to: you can select whatever ordering you want for the components of this key, and that ordering depends only on how the data is accessed. In our case, Liz's surfing application is designed to be surfer-specific. While we want to store data for multiple surfers, it doesn't make much sense to query data across many surfers, since each surfer is interested only in their individual progress. This means that we can prefix our primary key with the surfer, so that all the data for a single surfer is collocated in the table. Then we can follow the hierarchical pattern with region and then break: the region might be, say, Los Angeles, and the break might be Malibu First Point. The region and break are very similar in concept to the example from Cisco Meraki, since a region contains many breaks.

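In SQL terms, Liz's key ordering might look something like the following sketch (all table and column names are hypothetical; the talk itself shows no schema):

```sql
-- Surfer first, then region, then break, then timestamp: all of one
-- surfer's data is collocated, subdivided by region and then break.
CREATE TABLE waves (
  surfer_id  BIGINT      NOT NULL,
  region     TEXT        NOT NULL,  -- e.g. 'Los Angeles'
  surf_break TEXT        NOT NULL,  -- e.g. 'Malibu First Point'
  ts         TIMESTAMPTZ NOT NULL,
  PRIMARY KEY (surfer_id, region, surf_break, ts)
);
```
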
Now that we have a key that's optimized for querying, we need to actually write our queries in the most optimal way. This data is highly structured, and the way it's structured depends entirely on how we plan to query it. Hopefully you remember this graphic from a couple of slides ago; this time I've modified it to reflect the composite key we've decided on for Liz's surfing application. As a refresher, data is arranged by time across the x-axis and by composite key across the y-axis. Now, for Liz's surfing application, the composite key includes surfer, region, and break, in that order. In this diagram you'll only see surfer and region, and that's because of the composite nature of the key: for the first two hours of last Monday, you're welcome to query for just the surfer Liz in the region LA, and you will still find a contiguous stretch of data. Now imagine that you want to drill further and take a closer look at this green slice here. You'll see that the stretch of data is further broken down by break: Malibu exists in its own subdivided, contiguous stretch of data. Hopefully this helps you visualize what your query will actually turn into. You'll certainly want to include a timestamp in your WHERE clause; after all, this is time-series data. Additionally, you'll want to include part or all of the elements of that composite key. Since the data is structured hierarchically, you only ever need to include a prefix of the composite key when querying. We saw in the first example how effective a query can be for Liz in LA for the first two hours of last Monday: since we've left off the break and only queried for the surfer and region, we've only queried with a prefix of the composite key. Drilling down even further, we can also query for Liz in LA in Malibu over the first two hours of last Monday. Or, zooming out, a query for just Liz's surfing data in that time span would also be performant.

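Against the hypothetical schema sketched earlier, those prefix queries might look like this:

```sql
-- Performant: the WHERE clause uses a prefix of the composite key
-- (surfer + region) plus a time range.
SELECT * FROM waves
WHERE surfer_id = 1                  -- Liz
  AND region = 'Los Angeles'
  AND ts >= TIMESTAMPTZ '2024-11-04 06:00'
  AND ts <  TIMESTAMPTZ '2024-11-04 08:00';

-- Also performant: drill all the way down to the break.
SELECT * FROM waves
WHERE surfer_id = 1
  AND region = 'Los Angeles'
  AND surf_break = 'Malibu First Point'
  AND ts >= TIMESTAMPTZ '2024-11-04 06:00'
  AND ts <  TIMESTAMPTZ '2024-11-04 08:00';
```
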
ClickHouse is a little bit different: the data in ClickHouse is not arranged across two dimensions. Instead, the timestamp is basically the final component of the composite key. Because of this, it doesn't make sense to include just the surfer and timestamp in the query, because you're skipping the middle section of the primary key. Consider the example I've shown: we have a contiguous stretch of data for Liz, which is broken into regions, which are then broken into breaks, each of which contains all of the timestamped records for, say, the last month. It doesn't make much sense to query all the data for Liz over the past two weeks, because that data is not contiguous: for each location, you'd have to grab just a section, skipping over the data points that don't fall within the requested time span. The only performant query for ClickHouse would be to include all the components of the primary key: you must specify the surfer, region, break, and a range of time. So it would be performant to query Liz's data for Malibu in LA over the past two weeks. It's important to drill down and understand how the data is arranged in your time-series database of choice. By understanding the structure, you can visualize what a contiguous chunk of data looks like, and you can ensure that your query is making use of the way the data is structured.

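A ClickHouse-flavored sketch of the same table makes that ordering explicit (again, hypothetical names; MergeTree's ORDER BY clause is the real mechanism that sorts data on disk):

```sql
-- The timestamp is the final component of the sorting key, so
-- performant queries must specify the full prefix before the time range.
CREATE TABLE waves (
  surfer_id  UInt64,
  region     String,
  surf_break String,
  ts         DateTime
)
ENGINE = MergeTree
ORDER BY (surfer_id, region, surf_break, ts);
```
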
Cool. At this point we know how to store our data and how to query it; we can now start to look at the maintenance side of things. Here's the thing: Liz surfs a lot, and she plans to surf for years to come. Although we'd love to keep all of Liz's surfing data in perpetuity, we simply don't have unlimited storage space. When dealing with time-series data, you have to balance two major concerns: you don't have unlimited storage space to keep raw data forever, but you also want to provide the user with as much data as possible. In order to solve for the first concern, the fact that we don't have unlimited storage, we need to take a look at data retention. Every time-series database that I've seen includes some sort of policy for data retention. Often this comes in the form of a TTL, otherwise known as a time to live. The time to live dictates how old the data in the table is allowed to be: after a certain point, data of a certain age is simply dropped from the table.

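As a sketch of what such a policy looks like in practice (hypothetical table name; both calls shown are real APIs in their respective tools):

```sql
-- ClickHouse: drop rows six months after their timestamp.
ALTER TABLE waves MODIFY TTL ts + INTERVAL 6 MONTH;

-- TimescaleDB: a background job that drops chunks older than six months.
SELECT add_retention_policy('waves', INTERVAL '6 months');
```
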
Now we also need to address the desire to show as much data as possible. In order to do so, we need to extend the TTL without sacrificing storage. There are a few ways of going about this, notably compression and aggregation. Compression is the method of choice for the Postgres extension TimescaleDB. When you add data to your database, it's in the form of uncompressed rows; TimescaleDB uses a built-in job scheduler to convert this data into the form of compressed columns. Consider the uncompressed surf data on the left: each data point contains a single wave surfed by either of our two surfers, Liz and Brandon. The top-right table shows this data compressed into a single data point. This preserves all of the original data while restructuring it into a format that requires less memory to store; according to TimescaleDB's documentation, this can compress the size of data by as much as 90%. Additionally, you'll see that the data within this compression is ordered by timestamp. Ordering by timestamp is highly recommended in the documentation, but it's worth noting you do have to specifically configure your compression to order by timestamp. You may also be wondering how this data gets queried: upon querying compressed data, TimescaleDB selectively uncompresses each column in order to provide the data requested. And when compressing data, TimescaleDB stores it in column order rather than row order; we'll see more about column-based and row-based storage in the next slide. However, we can certainly be more intelligent about our compression. If there's one thing I want to harp on, it's that your database design should be driven by how you plan to query. In our case, we only ever plan to query data for a single surfer; as a result, we might want to include segmentation in our compression as well. In the bottom-right table you can see the same data segmented by surfer ID. This can certainly improve query performance.

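Here is a sketch of that configuration using TimescaleDB's real compression settings (the waves table and surfer_id column are still my hypothetical names):

```sql
-- Order compressed batches by time and segment them by surfer,
-- since we only ever query one surfer at a time.
ALTER TABLE waves SET (
  timescaledb.compress,
  timescaledb.compress_orderby   = 'ts DESC',
  timescaledb.compress_segmentby = 'surfer_id'
);

-- Compress chunks once they are a week old.
SELECT add_compression_policy('waves', INTERVAL '7 days');
```
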
ClickHouse also uses compression to improve database performance. It may seem obvious, but less data on the disk means less I/O and faster queries and inserts. When speaking about compression in a time-series context, it's important to take a couple of steps backward and talk about one of the major differences between most time-series databases and a more generalized database like Postgres. This difference lies in the structure of the database, which should hopefully come as no surprise, since we've already spoken at length about the importance of database structure when it comes to handling time-series data. Postgres is what we call a row-based database. A row-based database organizes data by record, keeping all of the data associated with a record next to each other in memory. Row-based databases are well suited for transactional workloads, where entire records need to be retrieved, updated, or inserted quickly and efficiently. With a row-based database, writing can be quite effective, but reading from large quantities of data has its shortcomings, especially when querying by a field like timestamp. And because data is grouped by record, compressing data by attribute is also quite inefficient. ClickHouse, like many time-series databases, is actually a column-based database. In a column-based database, each data block stores values of a single column for multiple rows. This is ideal because compression algorithms exploit continuous patterns of data; if the data is sorted by columns in a particular order, this can lead to incredibly efficient compression. Column-based databases are often the preferred choice for analytics and data-warehousing applications. The benefits of column-based databases include faster data aggregation, high compression speeds, and less use of disk space. The drawback is that data modification is slower, but as we've discussed previously, modifying time-series data is often not an intended use case.

In addition to compression, there's a second approach to ensuring that we can preserve data for an extended period of time without increasing our storage costs. This approach is called aggregation, and it's the methodology of choice for LittleTable. When we're speaking about aggregation, there are really two concepts: the base table and the aggregate table. The base table is where we insert the raw metrics we're recording; the aggregate tables then store a summary or average of that raw data. So first we need to decide what raw metrics we want to store in our base table. If you recall, we already decided on a primary key that contains a surfer, region, and break. In addition, we can record metrics for each wave Liz has caught. At the most basic level, what Liz wants to know is: how good was that wave? The basic stats we can record are distance and duration, so we'll include those in our base table. Then we need to know what aggregated metrics we want to record in the aggregate tables. The aggregate table will have to contain a summary of the data in the base table. It might be helpful to know the total distance summed across all waves, the total duration summed across all waves, the maximum speed for a single wave, and the number of waves caught in that time period.

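One way to realize this is sketched below with a TimescaleDB continuous aggregate, assuming the hypothetical waves table from earlier is a hypertable extended with distance_m and duration_s columns (also hypothetical):

```sql
-- Daily summary per surfer/region/break: totals, max speed per wave,
-- and the number of waves caught that day.
CREATE MATERIALIZED VIEW waves_daily
WITH (timescaledb.continuous) AS
SELECT surfer_id,
       region,
       surf_break,
       time_bucket('1 day', ts)     AS day,
       sum(distance_m)              AS total_distance_m,
       sum(duration_s)              AS total_duration_s,
       max(distance_m / duration_s) AS max_speed_mps,
       count(*)                     AS wave_count
FROM waves
GROUP BY surfer_id, region, surf_break, day;
```
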
Now that we've decided what data to store in these aggregate tables, we'll have to decide what intervals of data make sense. These will determine which aggregate tables we want to create, and they'll also help us decide on our TTLs, constraining how much data we're actually storing. Since Liz tends to surf at most once a day, it makes sense to aggregate data up to the day; that way we can preserve data for each surfing session for a TTL of, say, six months. From there, we can also aggregate up to the week and the month so that it's easier for Liz to track seasonal and annual trends. This leaves us with a base table with a TTL of, say, one month; a one-day aggregate table with a TTL of six months; a one-week aggregate table with a TTL of one year; and a one-month aggregate table with a TTL of five years.

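Sketched as TimescaleDB-style retention policies, with hypothetical table names and assuming the weekly and monthly aggregates are defined like the daily one (the talk doesn't prescribe a specific mechanism):

```sql
-- Raw data for one month; progressively coarser summaries kept longer.
SELECT add_retention_policy('waves',         INTERVAL '1 month');
SELECT add_retention_policy('waves_daily',   INTERVAL '6 months');
SELECT add_retention_policy('waves_weekly',  INTERVAL '1 year');
SELECT add_retention_policy('waves_monthly', INTERVAL '5 years');
```
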
Woo, the hardest part's over. Now that we have our data stored, aggregated, and easily accessible, we want to design an API endpoint that Liz can use to easily query her surf data. The decisions we've made when it comes to querying and aggregation will determine exactly how this API endpoint will be used. The next step is defining an API contract which can be clearly documented for the end user, validated, and enforced. A crucial element to document for the end user is the set of allowable query params. The components and ordering of the composite key determine which query params are required and which are optional. As always, a time span is necessary for a user querying time-series data. And assuming that we're using LittleTable as our underlying storage option, we only need a prefix of the primary key, so the surfer is the only required component of the key; beyond that, you can optionally specify a region and a break. It's important, though, to document and enforce that a user must also provide a region if they want to provide a break. Recall earlier that we noted you can't skip fields in the middle of the primary key, so you must provide a full prefix of the primary key, which in this case is surfer, region, and break, in that order.

Now that we have the user's request, we need to determine which aggregate table we'll be querying from. This requires an understanding of some terminology. At Meraki we discuss time-series data in terms of a time span and an interval, so I'll quickly explain what we mean by each of those terms in this context. The time span describes the full period of time over which we want the data; since our longest TTL in the database is five years, we can't query data for a time span that extends further than five years into the past. The interval corresponds to the grain at which the data is aggregated; the only options here are one day, one week, and one month. As noted before, each aggregation interval will be stored in its own aggregate table: we'll have a one-day table, a one-week table, and a one-month table. In designing this API endpoint, we'll assume that the user wants the most data possible for the time span requested. This means that the TTL will determine which aggregate table we'll query from: we'll want to query data from the aggregate table with the smallest interval whose TTL is still greater than the time span requested. So, for example, if the user requests a time span less than or equal to six months, we can return daily surf data; for a time span between six months and one year, we'll return weekly surf data; and for any time span between one year and five years, we'll return monthly surf data. And we'll validate that the user is not allowed to query the API endpoint with a time span greater than five years, since all data is dropped after that point.

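That selection rule can be sketched as a SQL CASE expression (the requested span is hard-coded here for illustration, and the table names are the hypothetical ones from earlier):

```sql
-- Pick the finest-grained aggregate table whose TTL covers the request.
WITH request AS (SELECT INTERVAL '2 years' AS span)
SELECT CASE
         WHEN span <= INTERVAL '6 months' THEN 'waves_daily'
         WHEN span <= INTERVAL '1 year'   THEN 'waves_weekly'
         WHEN span <= INTERVAL '5 years'  THEN 'waves_monthly'
         -- NULL: reject the request; all data is dropped after 5 years
       END AS source_table
FROM request;
```
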
Now I'd like to quickly show what a visualization might look like for time-series data, now that we have the concepts of a time span and an interval in mind. On this slide is another screen grab from Cisco Meraki's dashboard application. Here you can see that at the top of the page there's a drop-down option for showing data over the past two hours, day, week, or month; those are the selectable time spans. We currently have the one-month time span selected, and in the usage graph below you can see that the data is broken out day by day: for a one-month time span, we're showing data over a one-day interval. This pattern is especially useful when you want to detect trends, since trends can be found when looking at data down to the hour or over the span of an entire month.

Earlier in the talk I explained the shortcomings of Postgres when it comes to time-series data; one of these shortcomings is the lack of specialized time-series tooling in vanilla Postgres. Because tools like ClickHouse and TimescaleDB are so specialized for time-series data, you might even be able to skip some of the steps I've listed in this "getting out there" section by leveraging some of the tools and integrations they offer. ClickHouse, for instance, officially integrates with quite a few visualization tools, like Grafana and Tableau, which makes quick data visualization really easy to set up. And just this year, TimescaleDB announced a project called Timescale Analytics. This initiative is not complete, and they're still receiving developer input if you're interested in commenting; what they're hoping to do is create a one-stop shop for time-series analytics in Postgres. In the Timescale Analytics announcement, TimescaleDB listed a few sketching algorithms that they hope to build into this extension, which would provide data structures to estimate metrics such as percentile points and cardinality. Due to TimescaleDB's aggregation, these sketches should have very low query latency. There are so many features I haven't listed, and these products are receiving a lot of support, so I'm sure the list will grow. It'll be really cool to witness the evolution of these time-series tools.

Sweet! Liz now has easily accessible data on her surfing performance, broken down by break, over time. Fast forward several years: Liz has been surfing for quite some time now, and she's learned some important lessons. For example, she took a look at the monthly data for her favorite break and realized that she catches far fewer waves there in the winter than she does in the summer, and the waves she does catch are way smaller and peter out quickly. It turns out that this surf spot only catches south swells, and south swells are way more common in the summertime. Liz had no idea that swell direction was seasonal; it had never occurred to her to even check. Now she knows where to surf in each season, and she's been able to update her surf routine accordingly. She's been catching way more waves, and she's been having a lot more fun. Looks like Liz is on her way to getting pitted.

Thanks so much for listening. Again, I work for a remarkable company called Cisco Meraki, with some of the brightest Rails developers I've ever met. I still can't believe I get to work at the intersection of web development and computer networking; it's a fascinating space to be in, with really compelling problems to solve. If that sounds interesting to you, definitely swing by our booth or chat with me later, and of course, I'm always down to talk time-series data. Have a great rest of your conference!