RubyConf AU 2017
Simple and Awesome Database Tricks

http://www.rubyconf.org.au

This talk discusses a collection of things that have been helpful through the years. Some are supported by ActiveRecord, and some fall outside the scope of what ActiveRecord provides. You'll learn things like awesome datatypes, proactive and reactive performance testing, time-series data, backup and restore strategies, PostGIS, scripting with Postgres, and more!

RubyConf AU 2017

00:00:08.519 Good morning! I'm Barrett Clark. I've been working with Ruby for about a decade now, and as you can tell, I'm from the United States. I currently work at The Container Store, a retailer in North America that specializes in organization and storage solutions. We sell boxes and help our customers organize their closets. We've also started playing with Docker containers.
00:00:24.519 Today, I'm going to share a dozen tips and tricks to help you get the most out of your database and your application. Some of these tips are supported by ActiveRecord, while others fall outside of what ActiveRecord provides. We're going to cover topics like awesome data types, proactive and reactive performance testing, time-series data, backup and restore strategies, PostGIS, scripting with Postgres, and more!
00:00:50.280 So, let's get started. For the first trick, we're going to explore how Postgres executes a query. We have a sample query here that filters by a couple of fields. The '10K' table is commonly used by Postgres developers for developing their features and running benchmarks. It contains 10,000 records with a variety of fields.
00:01:08.680 Here's one of the records, showing unique values in fields like 'unique_one' and 'unique_two.' With 10,000 records, there are 10,000 unique values and 100 possible field values, ranging from 0 to 99. To get a query plan in Postgres, you simply prepend 'EXPLAIN ANALYZE' to your query. When you execute it, you receive output instead of the actual query results, which provides insights about the query being analyzed.
00:01:43.720 The default output format is text, but you can also get it in JSON or XML if needed. Looking at the output, we can see that a sequence scan will occur on the '10K' table, as no indexes exist. For small tables, a table scan may be acceptable, but for larger tables with frequent queries, we should consider adding indexes.
00:02:20.879 It’s a good rule of thumb to index foreign keys, as they are often used for lookups and filtering. After adding some indexes and rerunning the query, we notice that we have transitioned to a bitmap heap scan and a bitmap index scan. Essentially, a heap scan is a smarter table scan that incorporates index consideration, thus increasing efficiency.
00:02:51.920 While we can improve execution times with indexes, keep in mind that query plans can get lengthy. For instance, I analyzed a query related to flight data that took about 50 seconds to run, which is much longer than my target of 500 milliseconds. I could visualize the query plan using tools like visual query plan tools. Analyzing the output helps identify hotspots, such as in this case where filtering by 'United Airlines' appears problematic.
00:03:44.360 From here, I could either explore if the field is indexed or make further considerations like partitioning the data or creating a materialized view. There are several ways to visualize query plans, but keep in mind that different environments yield different results. Factors like data, hardware, and configuration all affect how the query engine executes a query.
00:04:53.760 Now, sometimes we need to observe the database’s performance in real time. This is where PG Stat Statements come into play. By enabling an extension with the syntax 'CREATE EXTENSION,' we can capture statistics on executed queries. This assumes that the extension is already installed and permutations may be necessary depending on the operating system.
00:05:52.879 Once activated, this view logs statistics such as total minutes executed, the number of executions, and average execution time. However, be cautious, as it can introduce performance overhead. It’s not advisable to keep it active in a production environment all the time. Instead, enable it when needed, keeping in mind that it stores the last 5,000 statistics.
00:06:42.640 Let’s dive a bit deeper into how we can access the database using Ruby. You don’t necessarily need to use ActiveRecord or Rails for database access. For example, here's some plain Ruby code that runs the PG Stat Statements query. It involves loading a configuration file and specifying how to connect to the database.
00:07:03.760 You can expect a return of raw data as a PG result array. Remember, ActiveRecord simplifies working with PG result objects, but plain Ruby allows for direct interaction. This example also illustrates how to use Bundler in a Ruby script by including the required gems in a Gemfile.
00:07:51.920 Now, let’s shift our focus onto some handy data types that can enhance your application’s functionality. You can leverage non-relational structures within a relational database. Let’s consider how you manage data from APIs—many create local models mirroring those external resources.
00:08:30.600 It’s often useful to store the entire payload to ensure that no vital information is lost. Storing data in formats like JSON or JSON binary is beneficial for flexibility and indexing. JSONB, in particular, allows more efficient indexing and retrieval of specific elements within the payload.
00:09:11.120 Using PostgreSQL's jsonb features allows querying at more granular levels, including indexing JSON fields to improve performance. Additionally, PostgreSQL offers range data types, like date range and timestamp range, which consolidate start and end dates into a single field, thus optimizing data management.
00:10:00.000 By utilizing the range types, we can efficiently perform calculations like determining flight durations and enhancing the clarity of our data insights. For instance, I can compare flight durations with actual elapsed time recorded in the data.
00:10:39.920 The way these range data types operate might seem unusual, but the special operators available can facilitate quick queries. For example, we can check all flights in the air at a specific time using these range types, significantly simplifying data analysis.
00:11:25.920 Another valuable data type in postgresql is UUID, which supports storing identifiers that are unique across databases. To utilize UUIDs in Rails, it’s critical to enable the corresponding extension and configure migrations accordingly. This gives the added benefit of ensuring data integrity by preventing invalid values from being stored.
00:12:39.920 Bear in mind, while UUIDs prevent overflow in primary keys, they can also complicate queries due to their non-sequential nature. Adjusting ActiveRecord's default sorting to utilize fields like 'created_at' helps overcome issues related to the ordering of UUIDs.
00:13:28.240 Next up, let’s explore creating your own data, which is a pretty exciting trick. You can generate a series of timestamps or numbers to fill in gaps in your time series data, which can be particularly helpful when plotting data points.
00:14:19.920 Using a Common Table Expression (CTE) allows for composing complex queries by using simpler subqueries. For instance, generating timestamps for every hour in 1999 can be combined with data about flight statuses to produce a comprehensive view of when a given airplane was in the air.
00:15:13.520 By joining the generated timestamps with flight data, we can visualize flight activity alongside hourly timelines, even reflecting periods when no flights were logged. This kind of operation can be particularly useful for creating insightful visualizations.
00:16:08.240 Furthermore, using functions like COALESCE enables us to handle missing data. Imagine counting occurrences in a column where we only have even numbers. By generating a set of numbers while filtering out the even integers, we can cleverly use COALESCE to fill in gaps with zeroes where no data exists.
00:17:07.040 Now, let’s talk about database views, which are stored queries that ActiveRecord can interpret as tables. Views can be useful when you have frequent queries that consist of complex joins or subqueries.
00:18:01.160 Introducing views into your database can streamline your data retrieval process and provide a clean way to handle complex queries without cluttering your application’s general logic.
00:18:54.960 Using tools like the Scenic gem, you can efficiently manage views in Rails. Scenic allows you to create models and migrations for views and materialized views, promoting a cleaner separation of concerns.
00:19:50.480 This not only simplifies future edits to the underlying SQL but also enhances maintainability as you version those SQL scripts separately from your Rails application’s codebase.
00:20:31.760 Additionally, we can create functions in PostgreSQL, making it easy to encapsulate reusable logic for complex queries. This might include calculations such as mortgage payments that would benefit from being defined and maintained directly in the database.
00:21:19.920 PostgreSQL allows for functional programming with SQL, letting you write functions to shrink complex calculations into manageable, straightforward queries. Migrations in Rails can help keep those functions in sync as you develop your application.
00:22:53.920 Median calculations, despite being absent in PostgreSQL by default, can be done through custom functions written in SQL. It's essential to access such resources when performing statistical analysis.
00:23:40.080 To expand your database's capabilities utilizing schemas effectively means you can organize database objects by grouping related components logically and safely within your database.
00:24:32.960 Utilizing schemas to manage ETL processes allows for cleaner data transformation as well, making clear distinctions between original data and processed data.
00:25:28.160 Backup and restore strategies can also benefit from schemas, providing a simpler way to handle merges or report generation without affecting your main operations.
00:26:25.760 Extensions, such as PostGIS, further enhance PostgreSQL functionality for geospatial data analysis. This extension allows you to leverage GIS capabilities directly within your database.
00:27:07.920 By using the active record PostGIS adapter, developers can easily integrate geospatial queries into their Rails applications, allowing for powerful location-based data manipulation.
00:28:08.160 In working with geospatial data, consider using functions to calculate distances correctly. For instance, comparing airports involves understanding latitude and longitude appropriately to avoid confusion in data.
00:29:16.560 The calculations and queries allow for engaging, interactive locations to be incorporated in your application, whether for travel planning, shipping logistics, or other geographical analyses.
00:30:04.920 Finally, explore using foreign data wrappers, allowing access to external databases within your PostgreSQL instance. This feature enables easier integration of disparate data sources.
00:31:02.080 While useful, coupling different database systems can come with challenges regarding compatibility and management, so it's typically advisable to keep your Rails application managing those interactions.
00:31:46.960 As a bonus tip, don’t forget you can personalize your psql command line experience with a configuration file. It allows customization of the command line structure and visuals for an improved user interface while working.
00:32:31.960 Finally, I've authored a book on data visualization, which includes sections covering maps and geospatial analytics. If you’re interested, I have some copies to give away.
00:33:28.040 Feel free to reach out with any questions, as I enjoy discussing data. Thank you all for having me here; it’s my first time in Australia, and I’ve really enjoyed it!