Liz Heym

Summarized using AI

Catching Waves with Time-Series Data

Liz Heym • October 07, 2024 • Boulder, CO

In the presentation "Catching Waves with Time-Series Data" by Liz Heym, the intricacies of managing time-series data are explored, drawing parallels between surfing and data storage, access, and representation.

Key Points Discussed:

- Understanding Time-Series Data:

- Time-series data comprises observations recorded at consistent time intervals, differing from traditional relational data.

- An example includes usage rates for devices tracked every five minutes at Cisco Moroi.

  • Selecting Appropriate Tools:

    • Just as surfers choose boards based on conditions, database selection depends on data quantity and access needs.
    • Options for managing time-series data include:
    • Utilizing existing databases with proper techniques.
    • Adding extensions to existing databases (e.g., Postgres with PG TimeSeries, TimescaleDB).
    • Adopting entirely new database solutions (e.g., ClickHouse).
  • Organizing and Querying Data:

    • Stress on the importance of structuring data by timestamp to enable efficient queries.
    • Employing composite keys for access optimization, where data is grouped for speed.
    • Performance varies significantly between different database technologies based on data arrangement and access methods.
  • Data Retention and Compression:

    • Challenges include managing storage and providing extensive data access.
    • Aggregation and compression are key strategies, utilizing time-to-live (TTL) policies for data retention.
    • TimescaleDB's compression converts raw data into a compact format for reduced memory usage.
  • Designing API Endpoints:

    • Structure of the API should reflect the database's organization, providing valid querying formats for users.
    • Important to establish documentation for request structures, acceptable time spans, and accessible aggregate data intervals corresponding to TTLs.

Significant Examples & Anecdotes:

- The journey of Liz, a fictional surfer tracking her performance over time, highlights how proper data management leads to insightful trends.

- The creation of "Little Table,” a proprietary relational database optimized for time-series data used internally at Cisco Moroi.

Conclusions and Takeaways:

- Understanding specific time-series data management techniques is crucial for performance and usability.

- Choosing the right tools and optimizing data storage, organization, and access is akin to selecting the right surfboard for the waves, allowing users to derive meaningful insights from vast data sets.

- The evolving landscape of time-series databases offers solutions that enhance performance and facilitate deeper data analysis.

Liz's surfing data story illustrates the practical application of these concepts, paving the way for more effective data strategies in varied use cases.

Catching Waves with Time-Series Data
Liz Heym • October 07, 2024 • Boulder, CO

Time-series data is remarkably common, with applications ranging from IoT to finance. Effectively storing, reading, and presenting this time-series data can be as finicky as catching the perfect wave.

In order to understand the best-practices of time-series data, we’ll follow a surfer’s journey as she attempts to record every wave she’s ever caught. We’ll discover how to structure the time-series data, query for performant access, aggregate data over timespans, and present the data via an API endpoint. Surf’s up!

Rocky Mountain Ruby 2024

00:00:14.240 hello I've heard that my face is blocked
00:00:17.400 by my laptop so I'm going to try to
00:00:19.640 orient myself
00:00:21.880 correctly I think I'm too short for this
00:00:25.400 setup um hello my name is Liz I'm a
00:00:28.519 software engineer on the switching team
00:00:30.199 at Cisco moroi um prior to moroi I
00:00:32.880 worked in startups ranging from series a
00:00:34.920 to IPO um what I'm about to confess
00:00:38.399 might rattle the good people of Colorado
00:00:41.039 but I used to be a mountain
00:00:46.199 person after living in California for
00:00:48.480 the past few years I have been fully
00:00:50.199 converted and I'm a beach person
00:00:52.879 now
00:00:56.320 yeah it's so dry here I have ChapStick
00:01:00.879 in my pocket right now I feel like you
00:01:02.719 could snap me in half if you wanted I'm
00:01:04.879 like dried to a
00:01:07.400 crisp I know Colorado has some lovely
00:01:10.080 mountains so I apologize for the ocean
00:01:11.680 themes in this presentation but
00:01:13.560 hopefully we can move past it and get to
00:01:15.320 the heart of what I want to talk to you
00:01:16.640 about which is time series data time
00:01:19.600 series data is a topic that Cisco moroi
00:01:21.640 deals with quite often and it can get
00:01:23.600 really intricate and hairy I found that
00:01:25.880 the gotches can be difficult to uncover
00:01:27.640 at first so hopefully this presentation
00:01:29.560 helps um anyone who's currently finding
00:01:31.840 themselves daunted by this
00:01:33.640 concept by the end of the talk you'll
00:01:35.880 hopefully have an understanding of how
00:01:37.320 time series data might differ from the
00:01:39.040 typical sort of relational data that you
00:01:40.920 might be used to dealing with I'll walk
00:01:43.200 through how to select a tool for
00:01:44.680 managing time series data how to
00:01:46.640 organize time series data how to query
00:01:48.920 time series data how to Aggregate and
00:01:51.200 compress time series data and finally
00:01:53.479 how to translate Your Design to API
00:01:56.640 constraints before jumping in you might
00:01:58.880 be asking yourself what even is time
00:02:01.280 series data time series data is
00:02:03.399 essentially just a collection of
00:02:04.799 observations recorded over consistent
00:02:06.640 intervals of Time Time series data is
00:02:09.039 distinct from other types of data
00:02:10.679 because of this ordering by time this
00:02:13.200 graph lifted from our dashboard at Cisco
00:02:15.040 moroi shows a usage rate in bits per
00:02:17.080 second for a device over time we record
00:02:19.519 this data every five minutes by pulling
00:02:21.239 the device we store it in a Time series
00:02:23.400 database and we display it to the user
00:02:25.319 in a graph on our
00:02:27.280 dashboard time series data has become
00:02:29.440 pretty aitous as we collect more and
00:02:31.480 more data to model our surroundings and
00:02:33.440 predict Trends devices like SmartWatches
00:02:36.160 have introduced amateur athletes to way
00:02:38.040 more data than they've ever had access
00:02:40.440 to our friend Liz inspired by apps like
00:02:43.200 straa wants to track her surfing as well
00:02:46.480 Liz has just taken her first surf lesson
00:02:48.720 and she's keen on learning how her
00:02:50.000 surfing will improve over time she's
00:02:52.200 decided to record this data in a Time
00:02:53.959 series database and to access it via an
00:02:56.080 API endpoint but where does she start
00:03:00.200 in surfing it's important to select a
00:03:01.800 surfboard that's suited for a particular
00:03:03.560 swell if the waves are Steep and
00:03:05.599 Powerful it might be worth breaking out
00:03:07.159 the shortboard for better
00:03:08.480 maneuverability for smaller days a
00:03:10.360 longboard can be a lot of fun I've
00:03:12.480 recently been told that it's absurd that
00:03:14.720 my partner and I have nine surfboards
00:03:16.560 between the two of
00:03:18.599 us but it's important to have a lot of
00:03:20.760 options conditions really do vary the
00:03:23.640 same is true of data and databases it's
00:03:26.280 important to select a tool that's
00:03:27.680 appropriate for the type of data you
00:03:29.159 plan to deal with with time series data
00:03:31.360 often comes in large quantities and
00:03:33.000 efficient access is highly important so
00:03:35.360 a database that can accommodate these
00:03:36.760 concerns is crucial When selecting a
00:03:39.280 tool for managing time series data you
00:03:41.480 have four options that nicely mirror the
00:03:43.480 options of surfer faces when deciding
00:03:45.400 which board to Surf as a surfer you can
00:03:47.879 surf a board you already have this is
00:03:50.239 applicable for folks who already have a
00:03:51.760 dedicated time series database in their
00:03:53.319 text tack as a surfer you can use an old
00:03:55.920 board but add a new set of fins this is
00:03:58.319 analogous to using a database extension
00:04:00.480 for say postgress adding onto a tool
00:04:02.840 that you already have as a surfer you
00:04:05.079 can buy a new board this is similar to
00:04:06.840 adopting a new database technology
00:04:08.480 dedicated to time series data or you can
00:04:11.400 break out the foam and fiberglass and
00:04:12.879 shape your own board this is just like
00:04:15.040 designing and implementing your own time
00:04:16.720 series
00:04:18.440 database it's quite possible that you
00:04:20.359 already have as part of your Tex stack a
00:04:22.000 database that works well for your time
00:04:23.400 series use case in that case you just
00:04:26.000 have to ensure that you're using your
00:04:27.440 database tool correctly with the proper
00:04:29.440 Techni techniques later in the talk I'll
00:04:31.560 cover techniques for the proper storage
00:04:33.360 and querying of Time series data because
00:04:36.400 after all if you don't know how to surf
00:04:38.320 even the nicest board in the world is of
00:04:40.160 no
00:04:41.720 use or maybe your team already uses a
00:04:44.520 more generalized database like postgress
00:04:46.800 and the best solution in this case would
00:04:48.320 be to add a postgress
00:04:50.880 extension this is similar to buying a
00:04:52.840 new set of fins for the surfboard you
00:04:54.320 already own fins are easily swapped out
00:04:56.840 without changing the surfing experience
00:04:58.520 too much and this case the old board is
00:05:01.080 postgress and the new set of fins is an
00:05:02.880 extension you can use with postgress
00:05:05.199 there are several options for time
00:05:06.520 series extensions you can use with
00:05:08.000 postgress two of the most notable
00:05:09.960 options are PG time series and time
00:05:11.720 scale DB these extensions provide a user
00:05:14.080 experience around creating managing and
00:05:15.919 querying time series data which doesn't
00:05:18.199 come out of the box with vanilla
00:05:19.960 postgress there are many benefits to
00:05:21.919 extensions using extension is lower lift
00:05:24.400 and cheaper plus you're already used to
00:05:26.479 surfing the board using a postr
00:05:28.440 extension will reduce learning curve and
00:05:30.160 augment the speed of development while
00:05:31.960 avoiding the performance hit you'd
00:05:33.120 otherwise take with vanilla
00:05:34.800 postgress a further benefit to using
00:05:36.880 postgress extension is that it will be
00:05:38.600 far easier to join relational data with
00:05:40.360 time series data as
00:05:42.840 needed now that we've talked so much
00:05:44.720 about postgress extensions you might be
00:05:46.360 asking yourself what's wrong with
00:05:47.800 vanilla postgress why not just store my
00:05:50.360 time series data in postgress without
00:05:52.039 bothering with an extension the first
00:05:54.199 reason is that these postgress
00:05:55.400 extensions come with built-in methods
00:05:57.039 specifically designed for use with time
00:05:58.800 series data without writing your own
00:06:01.039 method to do so you can for example
00:06:03.160 query data for a range of time more
00:06:06.319 importantly there are specific
00:06:07.680 performance differences between vanilla
00:06:09.400 postgress and an extension like time
00:06:11.039 scale
00:06:12.000 DB the time scale DB documentation
00:06:14.840 references an experiment in which
00:06:16.360 postgress and time scale DB are tasked
00:06:18.240 with ingesting a 1 billion row database
00:06:21.160 time scale DB loads this massive
00:06:22.880 database in 115th the total time of
00:06:25.199 postgress and sees through pit of more
00:06:27.199 than 20 times that of postgress because
00:06:29.319 of it heavy utilization of time space
00:06:31.479 partitioning time scale DB achieves a
00:06:33.880 higher ingest rate than post graphs from
00:06:35.479 De dealing with large quantities of data
00:06:37.800 quantities that are quite common when
00:06:39.199 dealing with time series data querying
00:06:41.800 can be even faster given a query that
00:06:43.960 specifies time ordering with a 100
00:06:46.120 million row table time scale DB achieves
00:06:48.800 a query latency that is 396 times faster
00:06:51.880 than
00:06:52.560 postgress later in this presentation
00:06:54.680 when we get into techniques such as
00:06:56.240 aggregation you'll see even more reasons
00:06:58.440 for favoring a Time series extension or
00:07:00.759 database over vanilla postgress
00:07:03.240 specifically you'll see the benefits of
00:07:04.800 a aggregation for data
00:07:08.199 retention sometimes your existing tools
00:07:10.639 don't cut it and you need to invest in
00:07:12.319 something entirely new this is analogous
00:07:14.800 to buying an entirely new surfboard if
00:07:17.319 you're looking for a dedicated time
00:07:18.800 series database it may be worth looking
00:07:20.639 into Solutions such as click housee
00:07:22.879 click house pches itself as a fast
00:07:24.800 open-source analytical database designed
00:07:27.000 around time series data managing time
00:07:29.800 series data is all about optimization
00:07:31.879 and each tool is optimized for a
00:07:33.879 different ideal use case it's essential
00:07:36.479 to evaluate your conditions before
00:07:38.160 selecting a surfboard and in the same
00:07:40.599 way it's important to consider what type
00:07:42.280 of data you'll be working with and what
00:07:43.960 you'll be doing with it consensus among
00:07:46.800 some users of both tools seems to be
00:07:48.960 that time scale DB has a great time
00:07:51.120 series story and an average data
00:07:52.680 warehousing story whereas click house
00:07:54.879 has a great data warehousing story and
00:07:56.440 an average time series Story look into
00:07:58.800 benchmark analys IES investigate the
00:08:00.639 features of each database and extension
00:08:02.400 and read up on the documentation for the
00:08:04.080 tool you're considering the more you
00:08:06.319 understand about the inner workings of
00:08:07.520 your proposed tool the better you'll
00:08:09.560 understand how it will work with your
00:08:10.759 use case my best advice is to really get
00:08:13.400 in the weeds as a surfer if you don't
00:08:15.280 know the purpose of rocker or tail shape
00:08:17.759 When selecting a board it's going to be
00:08:19.759 really difficult to make an informed
00:08:21.560 decision the same goes for understanding
00:08:23.919 and selecting a solution for time series
00:08:27.319 management and sometimes no available
00:08:29.919 database seems suited to your highly
00:08:31.960 specific needs if your use case is
00:08:34.440 really particular and you're feeling
00:08:36.279 especially industrious you might just
00:08:38.159 want to break out the foam and
00:08:39.360 fiberglass and shape your own board
00:08:41.800 moroi found itself in this situation in
00:08:44.120 2008 our core product is a cloud-managed
00:08:47.120 platform that allows users to configure
00:08:49.240 and monitor their networking devices the
00:08:51.720 monitoring data is more often than not
00:08:53.760 time series data we track track metrics
00:08:56.320 such as packet count received bya switch
00:08:58.880 the temperature of a device or the
00:09:00.640 device's memory usage Trends over time
00:09:03.040 can give insight into the health of a
00:09:04.680 system in 2008 there were far fewer time
00:09:07.560 series Solutions available so team at
00:09:09.640 moroi developed their own we call it
00:09:11.720 little
00:09:12.839 table little table is a relational
00:09:15.040 database optimized for time series data
00:09:17.600 Little T was in fact developed
00:09:19.880 specifically for spinning discs in 2008
00:09:22.920 storage on solid state drives was far
00:09:24.720 more expensive than storage on spinning
00:09:26.360 discs so the hardware was a significant
00:09:28.680 consideration because of this data is
00:09:30.800 clustered for continuous disk access in
00:09:33.079 order to improve performance later in
00:09:35.560 this presentation we'll see how this
00:09:37.240 impacts the way one might design a table
00:09:39.240 when using little table fun fact as of
00:09:42.200 2017 when the white paper was written
00:09:44.040 maroi stored 320 terabytes of data
00:09:47.040 across several hundred little table
00:09:48.519 servers systemwide now I'm sure the
00:09:51.120 quantity is even higher though not
00:09:53.480 actually a SQL database little table
00:09:55.040 includes a SQL interface for querying
00:09:56.839 which has improved developer adoption by
00:09:58.880 making this tool easy to use little
00:10:01.480 table exists for internal use only at
00:10:03.480 moroi but the team wrote an excellent
00:10:06.120 white paper that very effectively
00:10:07.600 describes the challenges and design
00:10:09.320 considerations which can be super useful
00:10:11.399 for anyone trying to gain a better
00:10:13.120 understanding of the intricacies of Time
00:10:14.519 series data I've linked it in the slide
00:10:16.800 you can also just look up little table
00:10:19.000 white paper Cisco moroi um and I can't
00:10:22.079 recommend it enough it's a great
00:10:24.680 read all right we now have our board
00:10:27.240 picked out we understand the conditions
00:10:28.920 we're surf in in and we've landed on a
00:10:30.440 board that works best however the work
00:10:33.200 is far from over you can have the
00:10:35.000 perfect board and still struggle to
00:10:36.639 actually surf if you don't have the
00:10:37.920 proper technique technique is also
00:10:40.399 incredibly important when dealing with
00:10:41.959 time series data regardless of which
00:10:43.959 database tool you choose to use in order
00:10:46.560 to optimize performance it's crucial
00:10:48.279 that we follow some tried andrue
00:10:49.680 patterns for organizing and querying
00:10:51.680 data the time series techniques I'll
00:10:53.800 cover in this talk are data arranged by
00:10:56.000 time composite Key quering by index and
00:10:59.399 aggregation and
00:11:01.360 compression the identifying
00:11:03.000 characteristic for a Time series
00:11:04.320 database is that it organizes data by
00:11:06.440 time for efficient access otherwise it
00:11:09.200 wouldn't be a Time series
00:11:10.839 database both click housee and time
00:11:12.839 scale DB will automatically generate an
00:11:14.839 index on the timestamp column this
00:11:17.040 allows for the most performant access
00:11:18.560 when retrieving data for a range of time
00:11:20.920 little table actually clusters data on
00:11:22.720 the disk by timestamp never interleaving
00:11:24.720 data with older
00:11:26.120 timestamps because of the unique data
00:11:28.040 structure some databases Force
00:11:29.680 restrictions arranging data by time
00:11:31.839 allows for highly efficient reads but
00:11:34.000 writing can be quite inefficient the
00:11:35.800 designers of little table decided to
00:11:37.480 constrain rights to be append only since
00:11:40.240 we're collecting data over a range of
00:11:41.839 time it doesn't make much sense to spot
00:11:43.760 fill historic data anyway according to
00:11:46.120 the little table white paper there is no
00:11:48.000 need to update rows as each row
00:11:49.959 represents a measurement taken at a
00:11:51.760 specific point in
00:11:53.760 time when visualizing a Time series
00:11:56.040 database it's important to understand
00:11:57.600 that there are effectively two identif
00:11:59.360 iers for a given piece of data the first
00:12:01.959 mentioned in the previous slide is the
00:12:03.399 time stamp the second piece of
00:12:05.160 information is the identifier in almost
00:12:07.560 every case this is comprised of multiple
00:12:09.480 Fields making it a composite
00:12:11.959 key each time series database refers to
00:12:14.560 this concept using slightly different
00:12:16.199 terminology little table documentation
00:12:18.360 refers to this as a hierarchically
00:12:20.160 delineated key click housee
00:12:22.199 documentation refers to this as a
00:12:23.639 compound primary key and time scale DB
00:12:25.839 refers to this as a partition key in
00:12:28.639 order to understand the implication this
00:12:30.680 composite key has on structuring data
00:12:32.839 I'm going to drill into little table's
00:12:34.240 hierarchically delineated key in little
00:12:36.839 table this key determines how the data
00:12:38.639 is actually arranged on disk in addition
00:12:40.920 to being grouped by time this
00:12:42.880 hierarchical organization enables
00:12:44.680 efficient queries since it will always
00:12:46.639 correspond to a contiguous region of
00:12:48.279 data on the disk it's crucial then to
00:12:50.880 only query based on ordered components
00:12:52.800 of this key in order to determine the
00:12:55.199 component components of this key and how
00:12:57.279 they're ordered it's super important to
00:12:59.519 understand how this data is going to be
00:13:01.199 accessed your queries will only be
00:13:03.240 performant if you're accessing a
00:13:04.639 continuous block of data so you have to
00:13:07.199 understand what the most common queries
00:13:08.680 are going to be and design around
00:13:11.240 those it's probably best to visualize a
00:13:13.440 real world example this way we can
00:13:15.480 actually see the data arranged by these
00:13:17.079 two axes time and composite key in this
00:13:20.120 example we can also visualize what a
00:13:22.120 hierarchically delineated key in little
00:13:23.920 table really is here's an example lifted
00:13:26.760 directly from the little table white
00:13:28.199 paper as you can see the data is
00:13:30.360 organized along two axes on the x axis
00:13:33.600 we have the time stamps and on the y
00:13:35.600 axis we have the elements of the
00:13:37.120 composite key you'll see that along the
00:13:39.920 the y- axis all the records for a single
00:13:41.680 Network are grouped together and within
00:13:43.880 that Network all the records for a
00:13:45.360 single device are grouped together this
00:13:47.760 composite key can contain as many fields
00:13:49.760 as you want thus arranging data many
00:13:51.800 layers deep in this example though we
00:13:54.240 simply have two Fields included in the
00:13:56.040 composite key grouping the data by two
00:13:57.959 layers
00:14:02.480 as we saw in the previous example the
00:14:04.000 most important takeaway for a
00:14:05.199 hierarchically delineated key is that
00:14:06.880 its components are organized with an
00:14:08.360 increasing degree of specificity the
00:14:10.959 example from Cisco moroi included two
00:14:12.959 components Network and device since the
00:14:15.519 network has many devices this example is
00:14:18.040 purely
00:14:19.040 hierarchical however just because the
00:14:20.880 ordering in this key typically
00:14:22.440 corresponds to a real world hierarchy it
00:14:24.880 doesn't necessarily have to you can
00:14:27.040 select whatever ordering you want for
00:14:28.759 the components of this key and that
00:14:30.680 ordering depends only on how this data
00:14:32.600 is accessed in our case Liz's surfing
00:14:35.480 application is designed to be Surfer
00:14:37.680 specific while we want to store data for
00:14:39.959 multiple Surfers it doesn't make much
00:14:42.040 sense to query data across Surfers since
00:14:44.720 each Surfer is interested only in their
00:14:46.600 individual progress this means that we
00:14:49.120 can first prefix our primary key with
00:14:50.839 the surfer so that all the data for a
00:14:52.600 single Surfer is collocated in the table
00:14:55.720 then we can follow the hierarchical
00:14:57.160 pattern with a region and then a break
00:14:59.160 the region might be Los Angeles and the
00:15:01.360 break might be Malibu first point the
00:15:04.240 region and break are very similar in
00:15:05.680 concept to the example from Cisco maroi
00:15:07.639 since a region contains many
00:15:10.480 braks now that we have a key that's
00:15:12.360 optimized for querying we need to
00:15:13.959 actually write our queries in the most
00:15:15.720 optimal way this data is highly
00:15:17.639 structured and the way it's structured
00:15:19.160 depends entirely on how we plan to query
00:15:21.160 it hopefully you remember this graphic
00:15:23.399 from a couple slides ago once again I
00:15:25.320 think it's easiest to understand this
00:15:26.800 query by visualizing it as a refresher
00:15:29.440 data is arranged by time across the xais
00:15:32.000 and composite key across the y AIS when
00:15:34.880 quering time series data it's essential
00:15:36.720 to include the timestamp in your wear
00:15:38.720 Clause it really doesn't make sense to
00:15:41.000 include to query this data without
00:15:42.680 including a timestamp or a range of
00:15:45.399 timestamps additionally you'll want to
00:15:47.480 include part or all of the elements of
00:15:49.639 the composite key because of the way
00:15:51.959 data is arranged in little table you
00:15:53.639 only ever need to include a prefix of
00:15:55.160 the composite key when querying this
00:15:57.279 means that you could query all the data
00:15:58.720 for a single Surfer over a range of time
00:16:01.160 or you can query all the data for a
00:16:02.759 surfer in a specific region over a range
00:16:04.880 of time or you can drill down with the
00:16:06.839 entire composite key and request data
00:16:08.720 for a surfer region and
00:16:11.199 break click house is a little bit
00:16:13.199 different the data in Click house is not
00:16:15.040 arranged across two Dimensions instead
00:16:17.279 the Tim stamp is basically the final
00:16:18.920 component of the composite key because
00:16:21.360 of this it doesn't make much sense to
00:16:23.040 include just the surfer and Tim stamp in
00:16:25.040 the query because you're skipping the
00:16:26.880 middle section of the primary key as you
00:16:28.920 you can see in the
00:16:30.079 diagram consider the example I've shown
00:16:32.600 we have a contiguous stretch of data for
00:16:34.319 Liz which is broken into region which is
00:16:36.399 then broken into break which contains
00:16:38.279 all of the timestamp data timestamp
00:16:40.880 records for say the last month it
00:16:43.399 doesn't make much sense to query the
00:16:45.000 data for Liz over the past two weeks
00:16:47.319 because the data here is not contiguous
00:16:49.319 for each location you'd have to grab
00:16:50.959 just a section skipping over the data
00:16:52.920 points that don't fall within the
00:16:54.240 requested time span the only performance
00:16:57.120 query for click house would be to
00:16:58.319 include all the components of the
00:16:59.839 composite key you must specify the
00:17:02.399 surfer region break and a range of time
00:17:05.360 so it would be performant to query Liz's
00:17:07.240 data from Malibu in La over the past two
00:17:10.160 weeks it's important to drill down and
00:17:12.679 understand how the data is arranged in
00:17:14.199 your time series database of Choice by
00:17:16.959 understanding the structure you can
00:17:18.160 visualize what a contiguous chunk of
00:17:19.799 data looks like and you can ensure that
00:17:21.760 your query is making use of the way the
00:17:23.400 data is structured it's worth noting
00:17:26.079 that once you execute that performant
00:17:27.559 query as long as that quantity of data
00:17:29.559 is reasonable and can be loaded into
00:17:31.160 memory you can always write a subsequent
00:17:33.160 query that gets the data you want it's
00:17:35.520 important though to make sure that the
00:17:37.000 first query is performant or else we're
00:17:39.080 losing the benefit we get from a
00:17:40.360 dedicated time series
00:17:42.679 database cool at this point we know how
00:17:45.280 to store our data how to query it and
00:17:47.880 now we can start to look at the
00:17:48.960 maintenance side of things here's the
00:17:51.160 thing Liz surfs a
00:17:54.000 lot she plans to serve for years to come
00:17:57.240 and although we would love to keep keep
00:17:59.000 all of Liz's surfing data in perpetuity
00:18:00.880 we simply don't have unlimited storage
00:18:03.559 when dealing with time series data you
00:18:05.159 have to balance two major concerns you
00:18:07.400 don't have unlimited storage space to
00:18:08.960 keep raw dat data forever but you also
00:18:11.480 want to provide the user with as much
00:18:12.880 data as
00:18:14.200 possible in order to solve for the first
00:18:16.400 concern the fact that we don't have
00:18:18.320 infinite storage we need to take a look
00:18:20.480 at data retention every time series
00:18:23.159 database that I've seen includes some
00:18:24.840 sort of policy for data retention often
00:18:27.320 this comes in the form of a TT L
00:18:29.440 otherwise known as a time to live the
00:18:31.760 time to live dictates how old the data
00:18:33.840 in the table is allowed to be after a
00:18:36.280 certain point data of a certain age is
00:18:38.320 simply dropped from the
00:18:39.880 table now we also need to address the
00:18:42.320 desire to show as much data as possible
00:18:45.000 in order to do so we need to extend the
00:18:46.960 TTL without sacrificing storage there
00:18:49.760 are few ways of going about this notably
00:18:52.120 compression and
00:18:54.000 aggregation compression is the method of
00:18:56.000 choice for the postgress extension time
00:18:57.919 scale DB
00:19:00.080 when you add data to your database it's
00:19:01.880 in the form of uncompressed rows time
00:19:04.400 scale uses a built-in job scheduler to
00:19:06.320 convert this data to the form of
00:19:07.799 compressed columns you can see in the
00:19:10.159 first example on the slide what six data
00:19:12.120 points might look like compressed into a
00:19:13.960 single data point this preserves all the
00:19:16.440 original data while restructuring it
00:19:18.280 into a format that requires less memory
00:19:20.240 to store I noted earlier in the
00:19:22.559 presentation that the makeup of the
00:19:24.039 composite key should be determined
00:19:25.520 entirely how by how the data is going to
00:19:27.960 be queried
00:19:29.240 the same is true of compression the way
00:19:31.200 the data is segmented depends entirely
00:19:33.240 on how it's going to be queried if
00:19:35.480 you're most often accessing data by say
00:19:37.880 device ID you can configure the
00:19:39.760 scheduler to compress data by device ID
00:19:42.840 the second example in this slide shows
00:19:44.280 an example of data segmented by device
00:19:48.080 ID click housee also uses compression to
00:19:50.760 improve database performance it may seem
00:19:53.039 obvious but less data on disk means less
00:19:55.159 IO and faster queries and inserts when
00:19:58.080 speaking about impression in a Time
00:19:59.559 series context it's important to take a
00:20:01.440 couple of steps backwards and talk about
00:20:03.320 one of the major differences between
00:20:05.480 most time series databases and a more
00:20:07.440 generalized database like postgress this
00:20:10.240 difference lies in the structure of the
00:20:11.640 database which should hopefully come as
00:20:13.400 no surprise since we've already spoken
00:20:15.760 at length about the importance of
00:20:17.159 database structure when it comes to
00:20:18.640 handling time series
00:20:20.520 data postgress is what we call a row
00:20:22.840 based database a row based database
00:20:25.159 organizes data by record keeping all of
00:20:27.600 the data associated with a record next
00:20:29.679 to each other in memory row based
00:20:32.039 databases are well suited for
00:20:33.400 transactional workloads where entire
00:20:35.360 records need to be retrieved updated or
00:20:37.679 inserted quickly and efficiently with a
00:20:40.080 row based database writing can be quite
00:20:41.799 effective but reading from large
00:20:43.240 quantities of data has its shortcomings
00:20:45.600 especially when querying by a field like
00:20:47.799 timestamp and because data is grouped by
00:20:50.200 record compressing data by attribute is
00:20:52.440 also quite inefficient consider the
00:20:54.640 example from the last slide where the
00:20:56.200 CPU usage and disio for many devices
00:20:58.400 were were group together into a single
00:21:00.520 record click house like many time series
00:21:03.000 databases is actually a column based
00:21:04.960 database in a column based database each
00:21:07.440 data block stores values of a single
00:21:09.480 column for multiple rows this is ideal
00:21:12.400 because compression algorithms exploit
00:21:15.400 contiguous patterns of data if this data
00:21:17.919 is sorted by columns in a particular
00:21:19.400 order this can lead to incredibly
00:21:20.960 efficient compression column based
00:21:23.400 databases are often the preferred choice
00:21:25.159 for analytic and data warehousing
00:21:27.080 applications the benefit of column based
00:21:29.360 databases include faster data
00:21:30.880 aggregation higher compression speeds
00:21:32.960 and less use of dis space the drawback
00:21:35.559 is that the data modification is slower
00:21:37.799 but as we've discussed previously
00:21:39.600 modifying time series data is often not
00:21:41.679 an intended use
00:21:43.760 case in addition to compression there's
00:21:45.960 a second approach to ensuring that we
00:21:47.679 can preserve data for an extended period
00:21:49.480 of time without increasing our storage
00:21:51.679 costs this approach is called
00:21:53.520 aggregation and it's the methodology of
00:21:55.400 choice for little table when we're
00:21:57.360 speaking about aggregation there are two
00:21:59.120 concepts the base table and the
00:22:01.080 aggregate table the base table is where
00:22:03.240 we insert the raw metrics we're
00:22:04.640 recording the aggregate tables store a
00:22:07.039 summary or average of that raw data so
00:22:10.279 first we need to decide what raw metrics
00:22:12.720 we want to store in our base table if
00:22:15.520 you recall we already decided on a
00:22:17.080 primary key that contains a surfer
00:22:18.720 region and break each record will
00:22:20.720 represent a single wave the surfer clot
00:22:23.000 at that break at a specific time then we
00:22:26.039 need to decide what aggregated metrics
00:22:27.799 we want to record in the aggregate
00:22:29.159 tables at the most basic level what Liz
00:22:31.679 might want to know is how good was that
00:22:33.320 wave and how long was she on it and how
00:22:35.440 far did she ride the aggregate table
00:22:37.520 will have to contain a summary of the
00:22:39.360 data in the base table it might be
00:22:41.200 helpful to know the total distance
00:22:43.159 summed across all waves the total
00:22:45.080 duration summed across all waves the
00:22:47.240 maximum total speed for a single wave
00:22:49.640 and the number of waves caught over that
00:22:51.080 time
00:22:52.880 period now that we've decided what data
00:22:54.919 to store in these aggregate tables we'll
00:22:56.720 have to decide what intervals of data
00:22:58.720 sense these will determine which
00:23:00.720 aggregate tables we want to create this
00:23:02.799 will also help us decide on our ttls
00:23:05.000 constraining how much data we're
00:23:06.400 actually storing since Li tends to surf
00:23:09.600 at most once a day it makes sense to
00:23:11.640 aggregate data up to the day that way we
00:23:14.080 can preserve data for each surfing
00:23:15.880 session for a TTL of six months from
00:23:18.840 there we can also aggregate data up to
00:23:20.640 the week and the month so that it's
00:23:22.120 easier for Liz to track seasonal and
00:23:24.080 annual Trends this leaves us with a base
00:23:26.880 table with a TTL of 1 month mon a one
00:23:29.440 day aggregate table with a TTL of 6
00:23:31.559 months a one we aggregate table with a
00:23:33.799 TTL of one year and a one- month
00:23:35.880 aggregate table with a TTL of 5
00:23:38.960 years the hardest part is over now we
00:23:42.039 have our data stored aggregated and
00:23:43.960 easily accessible and we want to design
00:23:46.320 an API endpoint that Liz can use to
00:23:48.440 easily query her surf data the decisions
00:23:51.039 we've made when it comes to querying and
00:23:52.760 aggregation will determine exactly how
00:23:54.840 this API endpoint will be used the next
00:23:57.679 step is defining an API contract which
00:23:59.760 can be clearly documented for the end
00:24:01.440 user validated and
00:24:03.760 enforced a crucial element to document
00:24:05.960 for the end user is a set of allowable
00:24:07.720 query progams the components and
00:24:10.000 ordering of the composite key determine
00:24:12.039 which query progams are required and
00:24:14.240 which are optional as always a time span
00:24:17.320 is necessary for a user querying a Time
00:24:19.400 series
00:24:20.480 database and assuming that we're using
00:24:22.559 little table as our underlying storage
00:24:24.480 option we only need a prefix of the
00:24:26.399 primary key so the surfer is the only
00:24:28.640 required component of the key beyond
00:24:31.039 that you can optionally specify a region
00:24:32.840 in a break it's important though to
00:24:34.960 document and enforce that a user must
00:24:36.760 also provide a region if they want to
00:24:38.200 provide a break recall earlier that we
00:24:40.399 noted that you can't skip fields in the
00:24:42.360 middle of the primary key you must
00:24:44.520 provide the full prefix of the primary
00:24:46.120 key which in this case is Surfer region
00:24:48.360 and break in that
00:24:51.919 order now that we have the user's
00:24:53.840 request we need to determine which
00:24:55.320 aggregate table we'll be querying from
00:24:57.480 this requires an understanding of some
00:24:59.120 terminology at Moro we discuss time
00:25:01.520 series data in terms of a time span and
00:25:03.679 an interval so I'll quickly explain what
00:25:05.720 we mean by each of those terms in this
00:25:07.520 context the time span describes the full
00:25:10.120 period of time over which we want data
00:25:12.600 since our longest TTL in the database is
00:25:14.360 5 years we can't can't query data for a
00:25:16.480 time span that extends further than five
00:25:18.240 years in the past the interval
00:25:20.559 corresponds to the grain at which the
00:25:22.120 data is aggregated the only options here
00:25:24.960 are one day one week and one month as
00:25:27.679 noted before each aggregation interval
00:25:29.840 will be stored in its own aggregate
00:25:31.640 table we'll have a one day table a onee
00:25:34.360 table and a one month table in designing
00:25:37.520 this API endpoint we'll assume that the
00:25:39.600 user wants the most data possible for
00:25:41.520 the time span requested this means that
00:25:43.880 the TTL will determine which aggregate
00:25:45.720 table we'll query from we'll want to
00:25:47.760 query from the aggregate table with the
00:25:49.360 smallest interval whose TTL is still
00:25:51.679 greater than the time span requested so
00:25:54.760 for example if the user requests a time
00:25:56.600 span less than or equal to 6 months we
00:25:58.480 can return Daily Surf data for a time
00:26:00.760 span between 6 months and one year we'll
00:26:02.640 return weekly data and for any time span
00:26:05.159 between one year and 5 years will return
00:26:07.120 monthly surf data and we'll validate
00:26:09.480 that the user is not allowed to query
00:26:11.080 the API endpoint with a time span
00:26:12.640 greater than five years since all data
00:26:14.760 is dropped after that
00:26:16.840 point now I'd like to quickly show what
00:26:18.960 a visualization might look like for time
00:26:20.960 series data now that we have in Con in
00:26:23.279 in mind the concepts of a time span and
00:26:25.679 an interval on this slide is another
00:26:27.919 screen grabb from Cisco maro's dashboard
00:26:30.399 application here you can see that the
00:26:32.440 top of the page there's a drop down
00:26:34.440 option for showing data for the past two
00:26:36.720 hours day week or month these are the
00:26:39.760 selectable time spans we currently have
00:26:42.200 the one- month time span selected and in
00:26:44.799 the usage graph below you can see that
00:26:46.320 the data is broken out day by day in
00:26:48.880 this case for a one- month time span
00:26:50.919 we're showing data over a one day
00:26:53.000 interval this pattern is especially
00:26:55.039 useful when you want to detect Trends
00:26:57.080 since Trends can be found when looking
00:26:58.520 at data down to the hour or over the
00:27:00.919 span of an entire
00:27:03.720 month earlier in this talk I explained
00:27:06.320 the shortcomings of postgress when it
00:27:07.720 comes to time series data one of these
00:27:09.880 shortcomings is the lack of specialized
00:27:11.679 time series tooling and vanilla postc
00:27:13.360 grass because tools like click housee
00:27:15.399 and time scale DB are so specialized for
00:27:17.600 time series data you might even be able
00:27:19.760 to skip some of the steps I've listed in
00:27:21.679 this getting out their section by
00:27:23.600 leveraging some of the tools and
00:27:24.960 Integrations offered click house for
00:27:27.360 instance officially integrates with
00:27:28.960 quite a few visualization tools like
00:27:30.679 grafana and Tableau this makes quick
00:27:33.120 data visualization really easy to set up
00:27:36.000 and just this year a time scale DB
00:27:37.760 announced a project called time scale
00:27:39.640 analytics this initiative is not
00:27:41.480 complete and they're still receiving
00:27:42.720 developer input if you're interested in
00:27:44.679 commenting what they're hoping to do is
00:27:46.640 create a One-Stop shop for time series
00:27:48.760 analytics in
00:27:50.080 postgress in the time scale analytics
00:27:52.279 announcement time scale DB listed a few
00:27:54.360 sketching algori algorithms that they
00:27:56.240 hope to build into this extension T
00:27:58.720 digest hyper log log and count Min T
00:28:01.480 digest is a data structure that will
00:28:02.960 estimate a percentile Point without
00:28:04.720 having to store and order all the data
00:28:06.480 points in a set hyper log log is a
00:28:08.679 probabilistic data structure that
00:28:10.240 estimates the cardinality of a set and
00:28:12.559 cman is a probabilistic data structure
00:28:14.799 that serves as a frequency table of
00:28:16.519 events in a stream of data due to time
00:28:19.320 scale DBS aggregation these sketches
00:28:21.679 have very low query
00:28:23.480 latency there are so many features I
00:28:25.440 haven't listed and these products are
00:28:27.240 receiving a lot of support so I'm sure
00:28:28.720 the list will grow it'll be really cool
00:28:30.600 to witness the evolution of these time
00:28:32.240 series
00:28:34.200 tools sweet Liz now has easily
00:28:37.399 accessible data on her surfing
00:28:39.000 performance broken down by break over
00:28:41.240 time fast forward several years from now
00:28:43.640 Liz has been surfing for quite some time
00:28:45.640 and she's learned some important lessons
00:28:48.000 for example she took a look at the
00:28:49.679 monthly data for her favorite break and
00:28:51.480 she realizes that she catches far fewer
00:28:53.880 waves there in the winter than she does
00:28:55.640 in the summer and the waves she does
00:28:57.679 catch are way smaller and Peter out
00:28:59.600 quickly it turns out that this surf spot
00:29:02.320 only catches South swells and South
00:29:04.200 swells are way more common in the
00:29:05.600 summertime Liz had no idea that swell
00:29:08.039 Direction was seasonal and it had never
00:29:09.919 occurred to her to check now she knows
00:29:12.200 where to surf in each season and she's
00:29:13.799 been able to update her surf routine
00:29:15.200 accordingly she's been catching way more
00:29:17.360 waves and she's been having a lot more
00:29:19.200 fun looks like Liz is on her way to
00:29:21.440 getting
00:29:24.120 pitted thanks so much for listening
00:29:26.440 again I work for a remarkable company
00:29:28.240 called Cisco moroi with some of the
00:29:29.720 brightest rails developers I've ever met
00:29:32.000 I still can't believe I get to work at
00:29:33.399 the intersection of web development and
00:29:35.240 computer networking it's a fascinating
00:29:37.480 space to be in with really compelling
00:29:39.120 problems to solve if that sounds
00:29:40.840 interesting to you feel free to reach
00:29:42.159 out to me afterwards and of course I'm
00:29:44.399 always down to talk time series data
00:29:46.399 have a great rest of your conference
Explore all talks recorded at Rocky Mountain Ruby 2024
+18