Talks

Postgres, the Best Tool You're Already Using

Postgres, the Best Tool You're Already Using

by Adam Sanderson

The video titled "Postgres, the Best Tool You're Already Using" features a talk by Adam Sanderson at Rails Conf 2013. The presentation explores how Rails developers can effectively leverage PostgreSQL (Postgres) to enhance their applications as they evolve from version 1 to version 2 and beyond. Sanderson presents practical solutions to common issues faced by developers, particularly those arising in social network applications, using examples relevant to a fictitious social network for hedgehog enthusiasts.

Key Points Discussed:

  • Tagging:

    Sanderson introduces how to implement tagging in Postgres using text arrays for hedgehog records. He explains the use of the contains and overlaps operators to allow searching for hedgehogs tagged with specific attributes, demonstrating how to retrieve all hedgehogs with certain tags using ActiveRecord scopes.

  • Hierarchies:

    The challenge of modeling hierarchical data is addressed by discussing the use of materialized paths. This technique allows developers to represent parent-child relationships among hedgehog clubs. Sanderson explains how to query for the depth of a club and how to find children and parents within the tree structure using SQL array operations.

  • Custom Data Handling:

    Sanderson highlights Postgres' hstore as a way of dealing with sparse data. He discusses setting up an hstore column for arbitrary metadata, illustrating how to query this hash-like structure efficiently using proved capabilities of ActiveRecord.

  • Full Text Search:

    The video covers implementing full-text search capabilities in applications. Sanderson explains the use of ts_vector to preprocess text for searching and ts_query to handle search input. He emphasizes the usefulness of stemming and normalization in improving search results and demonstrates how to execute complex search queries with multiple terms using operators available in Postgres.

Conclusion and Takeaways:

By the end of the talk, attendees are equipped with practical skills to use Postgres' advanced features effectively. The talk emphasizes the ability to model complex relationships, manage sparse data, and implement fast search functions, all critical for enhancing a digital product stack. Sanderson encourages developers to embrace these features to build robust applications capable of meeting evolving demands, ultimately enabling them to participate confidently in project planning and discussions.

The combination of these techniques empowers Rails developers to take full advantage of Postgres capabilities, creating a more versatile and powerful application framework for their projects.

00:00:16.320 Hello everyone, this talk is about Postgres, the best tool that you're already using.
00:00:21.439 I'm Adam Sanderson, a full stack engineer at Liquid Planner for over five years. Just a heads-up for anyone who was at the first keynote, I flew in from Kansas.
00:00:32.960 I know many of you did too, as you're upgrading from Rails 2 to 3. With that in mind, I'll ensure that everything I discuss is applicable, although some parts may be more complex if you're not on the latest version of Rails.
00:00:45.360 However, this is all practical information that you can start using immediately. You can find me on GitHub under the username Adam Sanderson, and as soon as this is over and I find a Wi-Fi connection, I'll upload the slides there.
00:00:57.520 I'm also occasionally on Twitter and blog at monkeyandcrow.com. Liquid Planner, the company I work for, started in 2007 with Rails 1.x, and we have been using Postgres from the beginning.
00:01:11.680 During this talk, I will share some of the insights we've gained, the techniques we're exploring, and various things you can do with Postgres right now. We will touch on topics including tagging, hierarchies and modeling tree structures in Postgres, handling sparse data, and implementing full-text search.
00:01:37.920 For each topic, we'll first cover the SQL before discussing Active Record. If you don't want to see SQL, feel free to leave; I won’t be offended. We'll focus on Postgres 9 and use Ruby 1.9 syntax, which is more concise for presentation slides.
00:02:07.200 While I’ll primarily refer to Active Record 4, everything I discuss is fundamentally SQL. Therefore, you can still apply these concepts in older versions of Rails, though you'll need to put in a bit more work, perhaps with some gems.
00:02:33.920 A bit of backstory: you may not know this, but you've just finished building a fantastic new social network for hedgehog lovers worldwide. It’s called "Head with Me," and you can check it out—it’s wonderful! Things have been going great; you have a few users—about three—and now they are requesting new features like tagging.
00:03:06.159 Let’s say they want to tag their hedgehogs and find others with specific tags. How can we go about doing that? For this example, we'll have a table for hedgehogs, where each hedgehog has an ID, a name, an age, and interestingly, a set of tags stored as a text array.
00:03:18.959 I promised you Active Record. An Active Record for a hedgehog would look pretty much as you’d expect, but in Active Record 4, they added native support for text arrays or arrays in general. You can create an array in Postgres simply by using the array true option. It's somewhat of an odd syntax, but it works.
00:03:54.720 A little heads-up: if you're going to do this, opt for a text array instead of a string array in Active Record because Postgres assumes your array literals are of type text, not varchar, which is what you get with `t.string`.
00:04:10.720 If you use a string array, you'll encounter conflicts when Postgres complains that the operator you need for overlapping does not exist, as you would be trying to operate between varchars and text arrays. So, save yourself some time and just use `t.text`.
00:04:40.479 So what can we do with these arrays once we've created them? Two useful operators are the contains operator and the overlaps operator. The contains operator will return true if array A contains all elements of array B. Meanwhile, the overlap operator returns true if arrays A and B share any elements.
00:05:10.960 How will we use this to implement tagging? It's fairly straightforward: if we want to find all hedgehogs in our social network that are either spiny or prickly, we need to examine the tags and see which ones overlap with the array containing 'spiny' and 'prickly'.
00:05:52.240 This might yield results such as our hedgehog Marty being categorized as spiny, prickly, and cute, while Quillby is cuddly, prickly, and hungry. Notice we don't require all tags to match due to the overlap operator, as any shared element yields true.
00:06:15.840 Alternatively, if we want hedgehogs that are both spiny and prickly rather than either/or, we can utilize the contains operator to locate all hedgehogs whose tags contain both 'spiny' and 'prickly.' The results will now be a subset of our original query, yielding only Marty and Thomas.
00:06:58.240 This will show that both hedgehogs had arrays containing all required tags; it doesn’t matter in which order they appear during the matching as we’re solely concerned with set operations.
00:07:12.480 How can you effectively apply this in Active Record? Quite simply, you can include this as a normal condition, such as requesting all hedgehogs where the tags contain a specific array, and Active Record will replace a reference like 'spiny' and 'prickly' with a question mark.
00:07:52.640 But this is Active Record; shouldn't we aim for something more impressive? By creating scopes, we can teach Active Record more about our domain and how to handle tagging. We can encapsulate our set operations to find hedgehogs that have any tags or that must have all specified tags.
00:08:37.200 You’ll see that we have similar conditions; we're allowing parameters as the scope based on tags of interest, and Active Record is set to interpolate them seamlessly. You all attended the Active Record session earlier, so you have some understanding of what goes on behind the scenes, right?
00:09:21.440 To utilize this, we can easily execute a command like hedgehog.any_tags('spiny', 'large'). These are scopes we have taught Active Record to perform new tricks with, which allows us to continuously chain queries together to create more complex queries, such as finding all old spiny large hedgehogs.
00:09:53.680 This functionality enables various features to work together integratively.
00:10:04.560 Another aspect to consider is modeling hierarchies. Modeling hierarchies in SQL isn't easy, and I won't argue otherwise, but there are effective methods for handling such data. Apparently, there are thousands of hedgehog leagues, divisions, societies, and clubs that exist, all organized in strict hierarchies.
00:10:48.640 Users want to model these hierarchies, and if we take the North American League with its Western Division as a case example, we observe clubs such as the Cascadia Hog Friends and California Hedge Society. We need to support operations like determining the depth of a club, identifying all children of the North American League, or finding the parents of the Cascadia Hog Friends.
00:11:38.240 To achieve this, we utilize a materialized path in SQL and revisit the concept of SQL arrays. Our club model will comprise an ID, a name, and a path, which is an array of integers representing IDs of all parent records.
00:12:00.479 By using a materialized path, we can trace the route it takes to navigate through the hierarchy. For example, the North American League, sitting at the top with an ID (1) will have a path of [1]; the Eastern Division, which is a child, will hold the same ID indicating its parent lineage.
00:12:56.720 To compute a record's depth, you simply calculate the length of its path. In this case, Postgres has a useful function, array length, which can return the length of an array to define the depth—just be aware that we’re focusing on one-dimensional arrays here.
00:13:30.800 When would you want to apply this? If you wanted to display a summarized view of the top two tiers of hedgehog clubs. For this purpose, you can select club name, path, and depth while filtering the results to only include clubs with a depth of two or less.
00:14:03.680 You will retrieve results like the North American League, Eastern Division, Western Division, and the South American League—essentially the clubs with paths that extend only two levels deep.
00:14:50.800 Look now at how to find children of a specific club using the path we've established. To locate the children in our materialized path, you can query for all records where a specific record's ID exists in their respective paths.
00:15:32.879 If we know California Hedge Society has an ID of seven, we can query for records whose paths overlap the array containing seven, thus returning the California Hedge Society, Real Hogs of the OC, and Hipster Hogs as output.
00:16:14.079 If you take away one vital piece of information from this, it is that owning hedgehogs in California is apparently illegal — apparently a fact noted on Wikipedia.
00:16:53.440 What about finding parents?
00:16:57.440 As previously mentioned, the materialized path itself comprises the IDs of the record's parents. If we know California Hedge Society's path is [1, 3, 7], we compare their IDs against each record.
00:17:42.160 We represent this as an array in our where clause; thus, we can look for records overlapping with California Hedge Society's path—this will yield results like the North American League, Western Division, and the California Hedge Society itself.
00:18:32.640 This method is an effective way to model tree or hierarchical data in SQL. For Active Record users, arrays are exposed as standard Ruby arrays in Active Record 4.
00:19:19.600 To find how many parents or children a certain record has, you can define instance methods for children and parents respectively, passing in the relevant ID or path.
00:20:03.600 These methods allow you to query hierarchies just like any other Active Record method. You could opt to chain queries to limit the results, like retrieving only five parents.
00:20:36.320 Additionally, you could request all children of a particular club but only those containing hedgehogs, showcasing the flexibility of combining these features.
00:21:05.440 Now let's discuss handling custom or sparse data. If users want to record arbitrary information about their hedgehogs, such as shoe size, favorite food, or eye color, we don't necessarily need a database column for everything.
00:21:46.960 Postgres provides an excellent solution with hstore, which adds a column type functioning effectively as a hash. This makes it a useful alternative to Active Record's serialize method since you can actually query it.
00:22:26.080 A point to remember is that hstore isn’t installed by default in Postgres. Upon adding hstore, you will need to enable the hstore extension in Postgres by executing an arbitrary SQL command in a migration.
00:22:58.560 A heads-up: while Active Record 4 supports hstore, it does not include extensions in the default schema format. Active Record's default schema format only accommodates those using Active Record helpers.
00:23:34.880 It follows that if you utilize this in your tests, the tables won’t manifest in your test database, likely causing issues. To sidestep these complications, switching your schema format to SQL will ensure changes get captured accurately.
00:24:14.799 Now, if we proceed with defining a custom column as hstore and use Active Record to define it, it can act as a normal column type. For example, you would use `t.hstore` while specifying needed options.
00:24:57.680 Setting defaults and null constraints is crucial, such as defining an empty hstore as the default to avoid null entries, which could lead to errors when attempting to assign keys.
00:25:40.880 In the case of a non-empty hstore, it resembles a Ruby hash, appearing as key-value pairs with a simple representation. You can assign favorite foods and weights in this format, ensuring to quote key-value pairs, which can look similar to SQL injection attacks—so exercise caution with string concatenation.
00:26:35.039 Active Record handles these issues to a degree, but be cautious while directly executing potentially dangerous user inputs in your queries.
00:27:19.840 Moving on to operators, two key operators for querying an hstore are defined and accessor. The defined operator examines whether a record contains a specific key, while the accessor retrieves values from the hstore.
00:27:59.840 For instance, a query might find all favorite foods among hedgehogs in our network by selecting names alongside a pluck function to extract their favorite food from the custom field. This allows for results showing non-standard data.
00:28:43.740 Moving back to Active Record, we can encapsulate querying hstores into scoped methods for ease. By defining scopes such as ‘has key’ and ‘has value’, we teach Active Record to efficiently translate our queries.
00:29:39.840 This ensures that once established, you can quickly ask things like retrieving hedgehogs based on their custom characteristics. By employing Active Record, you minimize complexity while maximizing potential functionalities.
00:30:16.120 Now let’s discuss updating hstores. These are inherently treated as hashes in Ruby, so you can seamlessly set or replace key-value pairs in the custom field of a hedgehog.
00:31:00.000 Although be mindful that hstores are stored as strings; you’ll get back strings rather than integers from your hstore, so you need to handle that accordingly.
00:31:39.920 Next, let’s touch on full-text search capabilities. Your users want to search for content within their comments; they want an efficient method to find out what’s been discussed about hedgehogs.
00:32:16.240 We will illustrate this with comments each having an ID, associated hedgehog ID, and body. There are two primary types in Postgres' full-text search: TS vector, which represents pre-processed text for search, and TS query, which signifies search terms.
00:32:43.680 To create these, we define functions `to_tsvector` and `to_tsquery`, both also requiring a search configuration like English—this ensures normalization is performed.
00:33:09.600 This normalization process involves removing common stop words and stemming words, which reduces the impact of minor grammar variations, enabling broader and more accurate search functionalities.
00:33:29.440 The essential operator for searches is the `@@` operator, which checks if the TS vector contains search terms. You can also combine operators to execute comprehensive queries.
00:33:59.440 For example, if you wanted to find comments mentioning enjoying something, you'd prepare a comment's body using `to_tsvector` with the English configuration, querying with the term 'enjoying.'
00:34:36.640 This would return a range of results that mention variations like 'enjoy' or 'enjoys,' demonstrating the positive effect of stemming during the normalization process.
00:35:05.840 You can perform wildcard searches, which, while only supporting wildcards at the end of a term due to indexing structures, enhances user flexibility in searching.
00:35:43.920 To illustrate this, you could search for all comments related to oil and those starting with 'quill' while combining these two queries. You would still have to sanitize your user input before querying.
00:36:21.840 When wrapping this in scopes in Active Record, following similar logic from SQL will keep things organized. Through all this, the efficiency of full-text search dramatically enhances user experience.
00:36:59.680 Though a full-text search index isn't necessary, it significantly speeds up queries. To implement this, create indexes on the `to_tsvector` English call, ensuring consistent function use throughout your code.
00:37:28.760 In closing, remember with Postgres you can effectively use arrays for modeling tagging and hierarchy, utilize hstore for custom or sparse data, and leverage full-text search for enhanced data retrieval.
00:37:56.120 With that, I hope you enjoy the happy hour! Any questions?
00:38:21.040 You.