Hello everyone. Today I thought it would be super hip to talk about TimescaleDB. This is a Postgres extension geared towards time-series data. I'll talk about what Timescale is, what materialized views are, and the many cool features my team and I have gained with Timescale's managed service. I'll preface this by saying I'm by no means an expert; these slides are based on my own learnings and the Timescale documentation. Fortunately, Timescale is very well documented, and I invite you all to check those docs out. With that, let's get this party started.
So, about me: hi, my name is Michelle Yuan. I'm a software developer at the Adler Planetarium in Chicago. You're probably thinking that sounds pretty awesome, and you are absolutely right, it is out of this world, pun intended. More specifically, within the Adler I work on maintaining our citizen science platform, Zooniverse.org.

Now that introductions are out of the way, let's get down to business. To start off, what is Timescale? Per the docs, it's a high-performance Postgres extension for time-series data and analytics. So what does that mean? Basically, it's a database designed to harness the full power of SQL, as well as our existing knowledge of SQL, to deal with the problems people tend to have with time-series data stores. Some example issues are dealing with large data volumes and querying that large data store for analytics.
So what exactly is time-series data, and how does it differ from regular data? When we talk about time-series data, we're talking about metrics or event monitoring. Said another way, time-series data is data that represents how a system, process, or even behavior changes over time. For example, weather conditions measured every hour, or even the battery life on your smartphone (I currently have 78% battery). Those are some examples of time-series data, but what makes it different from regular data? The biggest difference is that with time-series data, time is our primary axis.

So now that we have a general idea of what Timescale is, let's talk about what we get with it. For time's sake, I'll mostly focus on the features that I've used at this point in time.
One key feature is the hypertable. Hypertables are Postgres tables with special features that make it easy to handle time-series data. Technically speaking, a hypertable is a table partitioned by time, consisting of smaller chunk tables. Timescale does this so that whenever you query the hypertable, Postgres farms the query out to the smaller chunk tables, very much like a map-reduce solution. So you can think of a hypertable as one big table made up of smaller tables that all share the same structure; these mini tables, which we call chunk tables, are chunked by time in our case.

To really reinforce the idea of what a hypertable is, we'll look at a very abbreviated example. Imagine we have a table where the columns are time and value. In a normal table setting, it looks like normal rows and normal columns, all in one table. In a hypertable setting, our hypertable consists of chunked normal tables, all with the same columns (time and value in our case), chunked by day or whatever time interval we choose.

So in this example we can see three events on January 2nd, 2021: one at midnight, one at 06:00, and one at 23:00. In a hypertable setting, these three rows live in the chunk table for January 2nd, 2021. Same thing for the events on January 3rd, 2021, and for the events on January 4th, 2021.

What's nice is that Timescale automatically chunks our hypertable into time buckets, so we don't have to think about it; this is all taken care of by magic we don't really have to touch, living under the hood of the extension. Another nice thing is that when querying from hypertables, our query syntax looks the same as if we were querying a normal table. In the end, under all this beauty, it's just Postgres (feel free to quote me on that), and that's the awesomeness of it. To get a bit technical on why hypertables aid us in speeding up query execution time: when we send an incoming SQL query, the Postgres planner analyzes the query and routes it to the proper chunk tables based on our query's conditions.
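To make that concrete, here is a minimal SQL sketch of the abbreviated example above. The table and column names (events, time, value) are just the hypothetical ones from the slide, and the explicit one-day chunk interval is my own assumption:

```sql
-- An ordinary Postgres table with time and value columns.
CREATE TABLE events (
  time  TIMESTAMPTZ NOT NULL,
  value DOUBLE PRECISION
);

-- Convert it into a hypertable partitioned on the time column,
-- chunked into one-day intervals.
SELECT create_hypertable('events', 'time', chunk_time_interval => INTERVAL '1 day');

-- Queries look exactly like queries against a normal table; the planner
-- only visits the chunks that could contain matching rows.
SELECT count(*)
FROM events
WHERE time >= '2021-01-02' AND time < '2021-01-03';
```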
So, cool: we get hypertables that speed up our queries and help manage our data in an environment of hundreds of millions of data entries, where users expect our query response times to be fast. Dare I say, hyper fast. That pun felt necessary. So what else do we get with Timescale?
We get to utilize continuous aggregates. These are background workers built into the database that can group data on a schedule. Just to give some perspective: time-series data usually grows very quickly, and that means aggregating data into useful summaries can become very slow. The reason is that every time we run these aggregation queries, the database needs to scan the entire table and recalculate everything. With continuous aggregates, aggregating data becomes lightning fast, and it does this by utilizing and improving upon Postgres materialized views.

How does it improve on materialized views, you may ask? Very good question. But first, I think it's best to explain what materialized views are, since that is what continuous aggregates are based on. Then we'll go into an in-depth explanation and example of continuous aggregates and their improvements.
So, materialized views are snapshots of a query that store the results in a special Postgres table that we can read from later on.

As an example, pretend we have an application with accounts and orders (I hope you can read this okay). Here we have account A with order one and account B with order two. We have an inventory table with the items in stock, shirts and pants in this case, with their item IDs and how much we have in stock, and we have an orders table that tells us how much of each item was ordered.

Say we wanted to know the total of each item that was sold. We'd have to roll up (sum) the quantity of shirts that were sold. In this case shirts have item ID 30; order one has a quantity of two shirts, and order two has a quantity of two shirts, for a total of four shirts. We'll do the same thing for pants; in this case we only have one order of pants, with a quantity of three, for a total of three pants. Instead of having to run this query every time, we can read our materialized view, which saves us from running this complex query again.
So how are continuous aggregates better? One thing about vanilla Postgres materialized views is that they don't update regularly; we'd have to manually refresh them as required. With continuous aggregates we get the functionality of materialized views, but we also get the ability to automatically refresh those views, without us having to do a refresh manually and without losing any historical aggregated data.
Okay, so hopefully we have a general sense of what materialized views are. Now we're going to go in depth and see how continuous aggregates utilize materialized data to speed up queries. When we create a continuous aggregate, Timescale takes the raw data from an underlying hypertable, aggregates it over a set time interval, and then materializes that aggregation.

As an example of what we might want to aggregate: say we have a table full of events and want to count how many events came in each day, or each month, or each quarter, etc. Instead of querying the whole table to grab this count, we could run a query every 24 hours to aggregate the previous day's event counts and store that pre-computed value somewhere, in some sort of view, for us to grab easily and quickly whenever we need it. That's essentially what continuous aggregates are: they're optimizations of queries that pre-compute the results and let us use those aggregated results to serve our queries.
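Sticking with the events-per-day example, a continuous aggregate might be sketched roughly like this; the table and view names (events, events_per_day) are hypothetical:

```sql
-- Pre-compute the count of events per day from the underlying hypertable.
CREATE MATERIALIZED VIEW events_per_day
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', time) AS day,
       count(*)                   AS event_count
FROM events
GROUP BY day;

-- Reads now hit the pre-computed results instead of scanning the raw events.
SELECT day, event_count FROM events_per_day ORDER BY day;
```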
Okay, let's see this in play. I actually ran a proof of concept using a replica of Zooniverse's database, where I created a hypertable full of what we call classifications (you can think of them as events or occurrences, just data entries) and a continuous aggregate that aggregates the classification count per specific workflow per day from the hypertable I created, to do some comparisons. To give you some perspective, within Zooniverse we have a little more than 750 million classifications and counting. That's a pretty big table.
Bringing this back: say we wanted a count of classifications in a particular workflow, the workflow with ID 4403. When we do this in a normal table setting, after a thousand runs we get an average execution time of about 3.6 seconds, which is 3,600 milliseconds. If we grab the count of classifications in a hypertable setting, where these classification events are chunked by day, this is what our query would look like: SELECT count(*) FROM classifications WHERE workflow_id = 4403. Note that this looks like the same query as if we were querying the normal table, but in this case our classifications table is actually a hypertable. When we run this query, we better our execution time to an average of 293 milliseconds. If we create a continuous aggregate of our classifications hypertable that aggregates the classification count per workflow per day, and then sum those classification counts to get the same results as these queries, we better our execution time to 43 milliseconds on average, which feels like a really great optimization.
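For reference, the variants I compared look roughly like this; the continuous aggregate view and column names below are my own shorthand for what was on the slide, not the actual names from the proof of concept:

```sql
-- 1. Normal table, and 2. hypertable: the query text is identical.
SELECT count(*) FROM classifications WHERE workflow_id = 4403;

-- 3. Continuous aggregate: sum the pre-computed per-workflow-per-day counts.
SELECT SUM(classification_count)
FROM daily_workflow_classification_count
WHERE workflow_id = 4403;
```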
All of this amazing stuff is just some of the features of TimescaleDB.
Okay, now that we've seen what we gain with Timescale, let's go over a small example of how to implement this in a new Rails app. I'll show you the final result of what we'll be building in our example. In our case, our users are interested in how many classifications are done in our project each day, or each week, month, or year; they can pick which time bucket they want through this dropdown. Our default view will be bucketed by day.
First, after we get our Rails app set up, we enable Timescale in our app like so, by enabling the timescaledb extension.
Then we're going to create a hypertable. In a migration, we create our table first, classification_events, which has the columns classification_id, event_time (the time the classification was created, which you can think of as a created_at), and user_id (the ID of the user who created the classification). We utilize a composite primary key because, if we recall, with Timescale time is our primary axis for time-series data, so we want to index by time; and for our particular case we also want to keep indexing by classification_id as well.

Once our table is created, we're going to turn our classification_events table into a hypertable, with our time axis being the event_time column. We do this by utilizing the create_hypertable function, which is a Timescale function that takes in the table name and the column that is our primary time axis.
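A sketch of that migration, assuming plain `execute` calls rather than a Timescale-specific gem, with the composite primary key added explicitly so it works across Rails versions:

```ruby
class CreateClassificationEvents < ActiveRecord::Migration[7.0]
  def up
    create_table :classification_events, id: false do |t|
      t.bigint   :classification_id, null: false
      t.datetime :event_time,        null: false # when the classification was created
      t.bigint   :user_id                        # who created the classification
    end

    # Composite primary key: time is our primary axis, and we also keep
    # indexing by classification_id for our use case.
    execute <<~SQL
      ALTER TABLE classification_events
        ADD PRIMARY KEY (classification_id, event_time);
    SQL

    # Turn the table into a hypertable partitioned on event_time.
    execute "SELECT create_hypertable('classification_events', 'event_time');"
  end

  def down
    drop_table :classification_events
  end
end
```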
This one's a big one, but we'll go through it piece by piece. We're going to create our continuous aggregate. In this case, we'll execute a SQL query to create a materialized view called daily_classification_count, bucketing each event_time by day and saving each entry as a total count of classifications for that day.

We also want to create a refresh policy that triggers that continuous aggregate to refresh the materialized view and do any recalculation, given a specified look-back time period and a scheduled interval. Here, our refresh policy runs every hour (the scheduled interval) and looks back at data from the past day up to the current hour to create or update records within our materialized view.
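Here's a sketch of what that migration could look like. The class name is a placeholder, and `disable_ddl_transaction!` is there because a continuous aggregate can't be created inside the transaction Rails normally wraps migrations in; the intervals match the policy described above:

```ruby
class CreateDailyClassificationCount < ActiveRecord::Migration[7.0]
  # CREATE MATERIALIZED VIEW ... WITH (timescaledb.continuous) cannot run
  # inside a transaction block.
  disable_ddl_transaction!

  def up
    # Bucket event_time by day and store the total classification count per day.
    execute <<~SQL
      CREATE MATERIALIZED VIEW daily_classification_count
      WITH (timescaledb.continuous) AS
      SELECT time_bucket('1 day', event_time) AS day,
             count(*) AS classification_count
      FROM classification_events
      GROUP BY day;
    SQL

    # Refresh every hour, recalculating data from the past day up to the
    # current hour.
    execute <<~SQL
      SELECT add_continuous_aggregate_policy('daily_classification_count',
        start_offset      => INTERVAL '1 day',
        end_offset        => INTERVAL '1 hour',
        schedule_interval => INTERVAL '1 hour');
    SQL
  end

  def down
    execute "DROP MATERIALIZED VIEW IF EXISTS daily_classification_count;"
  end
end
```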
We'll create a model for our materialized view / continuous aggregate, which we called DailyClassificationCount in our case, to make it easier for us to interact with and query from. It's a read-only table, since it's a materialized view.
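A minimal sketch of that model:

```ruby
# app/models/daily_classification_count.rb
class DailyClassificationCount < ApplicationRecord
  self.table_name = 'daily_classification_count'

  # The view is materialized by TimescaleDB; the app never writes to it.
  def readonly?
    true
  end
end
```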
Then, in our controller, when we query bucketed by a specific time period (day, week, month, or year in our case), we'll run a select query on our continuous aggregate view, bucketing by that specific time period and then summing the daily counts if needed.
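A sketch of what that controller query might look like. The controller name, params, JSON shape, and the use of Postgres's date_trunc to re-bucket the daily rows are all my own assumptions for illustration, not necessarily what the real app does:

```ruby
class ClassificationCountsController < ApplicationController
  PERIODS = %w[day week month year].freeze

  def index
    # Default to bucketing by day; only whitelisted periods reach the SQL.
    period = PERIODS.include?(params[:period]) ? params[:period] : 'day'

    # Re-bucket the per-day rows into the requested period and sum the counts.
    counts = DailyClassificationCount
      .select("date_trunc('#{period}', day) AS bucket, SUM(classification_count) AS total")
      .group("date_trunc('#{period}', day)")
      .order('bucket')

    render json: counts.map { |row| { period: row.bucket, count: row.total } }
  end
end
```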
Yeah, that's pretty much all I had. Thank you so much for listening. Here's where you can find me on social media; let me know if you have any questions, and thank you all so much.