What is your TimeScale? : An Introduction to TimeScaleDB

Summarized using AI


Michelle Yuen • January 30, 2024 • online

In the video titled "What is your TimeScale? : An Introduction to TimeScaleDB," Michelle Yuen discusses the rising necessity for time-series databases given the growing volume of data across applications such as stock tracking, environmental monitoring, and COVID-19 statistics. TimescaleDB, an extension of PostgreSQL, is presented as a high-performance solution designed specifically for managing and analyzing time-series data. The talk highlights the following key points:

- Introduction to TimescaleDB: a PostgreSQL extension that optimizes performance for time-series analytics, bringing the full power of SQL to bear on time-related data challenges.
- Definition of time-series data: time-series data differs from regular data in having time as its primary axis; examples include metrics such as weather conditions or battery life that represent change over time.
- Hypertables: a fundamental feature, the hypertable is a table partitioned by time. This structure lets large datasets be queried more efficiently by routing SQL commands to smaller, manageable chunks.
- Continuous aggregates: background workers built into the database that group data on a regular schedule, making it significantly faster to aggregate time-series data. Continuous aggregates build upon materialized views (snapshots of queries stored for later use), but unlike standard materialized views they refresh automatically, staying up to date without manual intervention.
- Technical example: Michelle provides a practical demonstration using a replica database with over 750 million records, illustrating how queries benefit from hypertables and continuous aggregates. For instance, the execution time for classification counts improved from 3.6 seconds against a normal table to just 43 milliseconds with a continuous aggregate.
- Integration with Rails: Michelle walks through bringing these features into Rails applications, including enabling TimescaleDB, creating hypertables, and utilizing continuous aggregates in queries for efficient data retrieval.

In conclusion, the video emphasizes the importance and advantages of using TimescaleDB for handling time-series data effectively. Features such as hypertables and continuous aggregates significantly enhance data management and querying efficiency, making TimescaleDB a powerful tool for developers managing large datasets.


People create, capture and consume more data than ever before. From recording price changes of a single stock, to tracking environmental values such as average high and low temps for consecutive days at a location, to calculating total number of COVID-19 hospitalizations per day, to application monitoring: the need for a time-series database has increased. In this talk, we introduce some of the features of one such time-series database: TimeScaleDB. We will briefly introduce Postgres Materialized Views, explore other data coolness, and go through how we can implement TimeScale features into our Rails apps.

https://www.wnb-rb.dev/meetups/2024/01/30

WNB.rb Meetup

00:00:00.160 Hello everyone! Today I thought it would be super hip to talk about TimescaleDB. This is a Postgres extension that is geared towards time-series data. I'll talk about what Timescale is, what materialized views are, and the many cool features my team and I have gained with Timescale's managed service. I'll preface this with: I'm by no means an expert. These slides are based on my own learnings and the Timescale documentation. Fortunately, Timescale is very well documented, and I invite you all to check those docs out. With that, let's get this party started.
00:00:42.360 So, about me: hi, my name is Michelle Yuen, and I am a software developer for the Adler Planetarium in Chicago. You're probably thinking that sounds pretty awesome, and you are absolutely right: it is out of this world. Pun intended. More specifically, within the Adler I work on maintaining our citizen science platform, Zooniverse.org.
00:01:14.040 of the way let's get down on business um
00:01:17.320 to start off what is time scale per the
00:01:21.680 docs it's a high performance uh
00:01:24.400 postgress extended for time series data
00:01:27.119 in analytics so what does that mean
00:01:30.799 basically it's a database designed to
00:01:34.040 harness the full power of SQL as well as
00:01:36.799 our existing knowledge of SQL in order
00:01:39.320 to deal with problems that people tend
00:01:41.320 to have when dealing with time series
00:01:43.280 data Stores um some issues for example
00:01:46.079 are like dealing with large data volumes
00:01:48.520 and quering that large data store for
00:01:56.640 analytics
00:01:58.280 So what exactly is time-series data, and how does it differ from regular data? When we talk about time-series data, we're talking about metrics or event monitoring. Said another way, time-series data is data which represents how a system, process, or behavior changes over time. For example: weather conditions measured every hour, or even the battery life on your smartphone (I currently have 78% battery). Those are some examples of time-series data, but what makes it different from regular data? The biggest difference is that with time-series data, time is our primary axis.
00:02:53.239 So now that we have a general idea of what Timescale is, let's talk about what we get with it. For time's sake, I'll mostly focus on the features that I've used at this point in time. One key feature is the hypertable. Hypertables are Postgres tables with special features that make it easy to handle time-series data. Technically speaking, a hypertable is a table partitioned by time, consisting of smaller chunk tables. Timescale does this so that whenever you query the hypertable, Postgres farms out the query to the smaller chunk tables, very much like a map-reduce solution. So you can think of a hypertable as one big table full of tables of all the same structure; these mini tables, which we call chunk tables, are chunked by time in our case.
00:04:01.560 To really reinforce the idea of what a hypertable is, we'll look at a very abbreviated example. Imagine we have a table where the columns are time and value. In a normal table setting, it looks just like normal rows and normal columns, all in one table. In a hypertable setting, our hypertable consists of chunked normal tables, all with the same columns (time and value in our case), chunked by day or whatever time interval we choose. So in this example we can see three events on January 2nd, 2021: one at midnight, one at 06:00, and one at 23:00. In a hypertable setting, these three rows live in the chunk table for January 2nd, 2021. The same goes for the events on January 3rd, 2021, and for the events on January 4th, 2021.
00:05:03.400 What's nice is that Timescale automatically chunks our hypertable into time buckets, so we don't have to think about it. This is all taken care of by magic we don't really have to touch, all living under the hood of the extension. Another nice thing is that when querying from hypertables, our query syntax looks the same as if we were querying a normal table. In the end, under all this beauty, it's just Postgres (feel free to quote me on that), and that's the awesomeness of it. To get a bit technical on why hypertables aid us in speeding up query execution time: when we send an incoming SQL query, the Postgres planner analyzes the query and routes it to the proper chunk table, based upon our query's conditions.
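As a side note, those chunks are real tables you can list. Here's a minimal sketch, assuming the classification_events hypertable from the Rails example later in this talk, run from a Rails console using TimescaleDB's show_chunks function:

    # show_chunks() is a TimescaleDB function that lists the chunk tables
    # backing a hypertable (the table name here is an assumption).
    ActiveRecord::Base.connection.select_values(
      "SELECT show_chunks('classification_events')"
    )
    # => ["_timescaledb_internal._hyper_1_1_chunk", ...] (names will vary)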
00:06:03.360 So, cool: we get hypertables that speed up our queries and help manage our data, in an environment of hundreds of millions of data entries where users expect our query response times to be fast. Dare I say, hyper fast. This feels necessary. So what else do we get with Timescale? We get to utilize continuous aggregates. These are background workers built into the database that can group data on a schedule.
00:06:36.560 Just to give some perspective around this: time-series data usually grows very quickly, and that means that aggregating data into useful summaries can become very slow. The reason being, every time we run these aggregation queries, the database needs to scan the entire table and recalculate, every time. With continuous aggregates, aggregating data becomes lightning fast, and it does this by utilizing and improving upon Postgres materialized views. How does it improve materialized views, you may ask? Very good question. But first, I think it's best to explain what materialized views are, since that is what continuous aggregates are based on; then we'll go into an in-depth explanation and example of continuous aggregates and their improvements.

00:07:32.879 So, materialized views are snapshots of a query that store the results in a special Postgres table that we can read from later on. As an example, imagine we have an application with accounts and orders. Here we have account A with order one and account B with order two. We have an inventory table with the items in stock, in this case shirts and pants, with their item IDs and how much we have in stock. And we have an orders table that tells us how much of each item was ordered. Say we wanted to know the total of each item that was sold: we'd have to roll up, or sum, the quantity of each item that was sold. In this case, shirts have item ID 30; order one has a quantity of two shirts, and order two has a quantity of two shirts, for a total of four shirts. We do the same thing for pants: in this case we only have one order of pants, with a quantity of three, for a total of three pants. Instead of having to run this query every time, we can read from our materialized view, which saves us from doing this complex query again.
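To make that concrete, here's a minimal sketch of what such a roll-up view might look like, wrapped in a Rails migration. The table and view names (inventory_items, orders, item_sales_totals) are hypothetical stand-ins for the slide's example:

    class CreateItemSalesTotals < ActiveRecord::Migration[7.0]
      def up
        execute <<~SQL
          -- Snapshot the roll-up: total quantity sold per item.
          CREATE MATERIALIZED VIEW item_sales_totals AS
          SELECT i.item_id, i.name, SUM(o.quantity) AS total_sold
          FROM inventory_items i
          JOIN orders o ON o.item_id = i.item_id
          GROUP BY i.item_id, i.name;
        SQL
      end

      def down
        execute "DROP MATERIALIZED VIEW item_sales_totals;"
      end
    end

Reading from item_sales_totals skips the join-and-sum entirely; the catch, as we're about to see, is that a plain materialized view only reflects the data as of the last REFRESH MATERIALIZED VIEW.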
00:09:19.680 So how are continuous aggregates better? One thing about vanilla Postgres materialized views is that they don't update regularly; we'd have to manually refresh them as required. With continuous aggregates, we get the functionality of materialized views, but we also get the ability to automatically refresh those views, without us having to do a refresh manually and without losing any historic aggregated data.
00:09:51.399 Okay, so hopefully we have a general sense of what materialized views are. Now we're going to go in depth and view how continuous aggregates utilize materialized data to speed up queries. When we create a continuous aggregate, Timescale takes the raw data from an underlying hypertable, aggregates it over a set time interval, and then materializes that aggregation. As an example of what we might want to aggregate: say we have a table full of events and want to count how many events came in each day, or each month, or each quarter, etc. Instead of querying the whole table to grab this count, we could run a query every 24 hours to aggregate the previous day's event counts, and store that pre-computed value somewhere, in some sort of view, for us to grab easily and quickly whenever we need it. And that's essentially what continuous aggregates are: they're optimizations of queries, pre-computing the results and allowing us to utilize the aggregated results to serve our queries.
00:11:11.440 Okay, let's see this in play. I actually ran a proof of concept using a replica of Zooniverse's database, where I created a hypertable full of what we call classifications (you can think of them as events or occurrences, just data entries), and a continuous aggregate that aggregates the classification count per specific workflow per day from the hypertable I created, to do some comparisons. To give you some perspective: within Zooniverse we have a little bit more than 750 million classifications, and counting. That's a pretty big table.
00:12:01.000 Say we wanted a count of classifications in one particular workflow, the workflow with ID 4403. When we do this in a normal table setting, after a thousand runs we get an average execution time of about 3.6 seconds, which is 3600 milliseconds. If we grab the count of classifications in a hypertable setting, where these classification events are chunked by day, this is what our query would look like: SELECT count(*) FROM classifications WHERE workflow_id = 4403. Note that this looks like the same query as if we were querying the normal table, but in this case our classifications table is actually a hypertable. When we run this query, we better our execution time, with an average of 293 milliseconds. If we create a continuous aggregate of our classifications hypertable that aggregates classification count per workflow per day, and then sum those classification counts to get the same results as these queries, we better our execution time to 43 milliseconds on average, which feels like a really great optimization.
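For reference, the three variants compared above might look roughly like this through ActiveRecord; the model and view names (Classification, DailyWorkflowClassificationCount) are hypothetical:

    # Normal table and hypertable: the same query either way. With a
    # hypertable, the planner routes it to the relevant chunks under the hood.
    Classification.where(workflow_id: 4403).count

    # Continuous aggregate: sum the pre-computed per-workflow, per-day counts
    # instead of scanning hundreds of millions of raw rows.
    DailyWorkflowClassificationCount.where(workflow_id: 4403).sum(:count)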
00:13:30.639 All this amazing stuff is just some of the features of TimescaleDB. Okay, now that we got to see what we gain with Timescale, let's go over a small example of how to implement this in a new Rails app. I'll show you the final result of what we'll be building in our example. In our case, our users are interested in how many classifications are done in our project each day, or each week, month, or year; they can pick which time bucket they want through this dropdown. Our default view will be bucketed by day.
00:14:20.639 First, after we get our Rails app set up, we enable Timescale in our app, like so: enable the extension timescaledb.
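In migration form, that step might look like this sketch (enable_extension is standard Rails; it assumes the TimescaleDB package is installed on the database server):

    class EnableTimescaledbExtension < ActiveRecord::Migration[7.0]
      def change
        # Runs CREATE EXTENSION IF NOT EXISTS "timescaledb" under the hood.
        enable_extension "timescaledb"
      end
    end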
00:14:34.839 Then we are going to create a hypertable. In a migration, we create our table first: classification_events, which has the columns classification_id; event_time, which is the time the classification was created (you can think of it as a created_at); and user_id, which is the ID of the user who created the classification. We utilize a composite primary key because, if we recall, with Timescale time is our primary axis for time-series data, so we want to index by time; and for our particular case we also want to keep indexing by classification_id as well. Once our table is created, we're going to turn our classification_events table into a hypertable, with our time axis being the event_time column. We do this by utilizing the create_hypertable function, which is a Timescale function; it takes in the table name and which column is our primary time axis.
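Putting that together, the migration might look something like this sketch; the exact column types and Rails version are assumptions, while create_hypertable (with its table-name and time-column arguments) is the TimescaleDB function just described:

    class CreateClassificationEvents < ActiveRecord::Migration[7.0]
      def up
        # id: false so we can declare the composite primary key ourselves.
        create_table :classification_events, id: false do |t|
          t.bigint   :classification_id, null: false
          t.datetime :event_time, null: false # when the classification was created
          t.bigint   :user_id
        end

        # Composite primary key: time is the primary axis, and we also want
        # to keep indexing by classification_id.
        execute <<~SQL
          ALTER TABLE classification_events
            ADD PRIMARY KEY (classification_id, event_time);
        SQL

        # Turn the table into a hypertable chunked on event_time; the talk
        # describes day-sized chunks, hence the chunk_time_interval.
        execute <<~SQL
          SELECT create_hypertable('classification_events', 'event_time',
                                   chunk_time_interval => INTERVAL '1 day');
        SQL
      end

      def down
        drop_table :classification_events
      end
    end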
00:15:46.720 This next one's a big one, but we'll go through it piece by piece. We're going to create our continuous aggregate. In this case, we'll execute a SQL query to create a materialized view called daily_classification_count, bucketing each event_time by day and saving each entry as a total count of classifications for that day. We also want to create a refresh policy that triggers the continuous aggregate to refresh the materialized view, doing any recalculation, given a specified look-back time period and a scheduled interval. Here, our refresh policy runs every hour (the scheduled interval) and looks back at data from the past day, up to the current hour, to create or update records within our materialized view.
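Here's a sketch of both pieces in one migration. CREATE MATERIALIZED VIEW ... WITH (timescaledb.continuous) and add_continuous_aggregate_policy are TimescaleDB's API; the offsets mirror the "refresh hourly, looking back over the past day up to the current hour" policy just described. One wrinkle: continuous aggregates can't be created inside a transaction, hence disable_ddl_transaction!:

    class CreateDailyClassificationCount < ActiveRecord::Migration[7.0]
      # Continuous aggregate DDL refuses to run inside a transaction block.
      disable_ddl_transaction!

      def up
        execute <<~SQL
          CREATE MATERIALIZED VIEW daily_classification_count
          WITH (timescaledb.continuous) AS
          SELECT time_bucket('1 day', event_time) AS bucket,
                 COUNT(*) AS count
          FROM classification_events
          GROUP BY bucket
          WITH NO DATA;
        SQL

        execute <<~SQL
          SELECT add_continuous_aggregate_policy('daily_classification_count',
            start_offset      => INTERVAL '1 day',  -- look back over the past day
            end_offset        => INTERVAL '1 hour', -- up to the current hour
            schedule_interval => INTERVAL '1 hour'  -- run every hour
          );
        SQL
      end

      def down
        execute "DROP MATERIALIZED VIEW daily_classification_count;"
      end
    end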
00:16:56.040 We'll also create a model for our materialized view / continuous aggregate (we called it DailyClassificationCount in our case) to make it easier for us to interact with and query from. It's read-only, since it's a materialized view.
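A minimal model sketch, with the class and view names assumed from the talk:

    class DailyClassificationCount < ApplicationRecord
      # Rails would otherwise infer the pluralized "daily_classification_counts".
      self.table_name = "daily_classification_count"

      # The view is kept fresh by the refresh policy, not by the app,
      # so treat records as read-only.
      def readonly?
        true
      end
    end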
00:17:14.679 Then in our controller, when we query bucketed by a specific time period (day, week, month, or year in our case), we run a select query on our continuous aggregate view, bucketing by the requested time period and then summing the daily counts if needed.
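A sketch of that controller query; the parameter handling is an assumption, and date_trunc is plain Postgres, re-bucketing the view's daily rows into the requested period:

    class ClassificationCountsController < ApplicationController
      PERIODS = %w[day week month year].freeze

      def index
        # Whitelist the period so the interpolation below stays safe.
        period = PERIODS.include?(params[:period]) ? params[:period] : "day"

        # Group the per-day buckets into the requested period and sum the
        # pre-computed daily counts; returns { period_start => total, ... }.
        @counts = DailyClassificationCount
          .group("date_trunc('#{period}', bucket)")
          .sum(:count)
      end
    end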
00:17:41.720 Yeah, that's pretty much all I had. Thank you so much for listening! Here are my social media handles; let me know if you have any questions, and thank you all so much.