Summarized using AI

MongoDB Rules

by Kyle Banker

In the talk "MongoDB Rules" presented by Kyle Banker at MountainWest RubyConf 2010, the speaker introduces MongoDB, an open-source, document-based NoSQL database. Banker's focus is on providing attendees with practical rules for effectively using MongoDB, blending introductory knowledge with actionable insights.

Key Points Discussed:

- Understanding MongoDB: Participants learn what MongoDB is and its operational advantages, particularly in terms of performance and scalability. Banker emphasizes not just the technical aspects but also the document-oriented nature of MongoDB that simplifies data handling.

- Rules to Use MongoDB Effectively:

- Rule 1: Get to know the real MongoDB by utilizing its JavaScript shell, which allows users to interact directly with MongoDB without relying on object mappers that may obscure its functionality.

- Rule 2: Prefer using ObjectIDs due to their efficiency and built-in timestamp feature.

- Rule 3: Leverage rich documents, focusing on how they allow for a more holistic representation of data compared to relational databases.

- Rule 4: Use array keys effectively to optimize queries and document structure.

- Rule 5: Implement atomic operators to enhance efficiency during updates and modifications.

- Rule 6: Pay attention to indexing to ensure queries run efficiently without causing significant delays.

- Examples: Banker discusses different applications of MongoDB, including use cases for SourceForge and GitHub, showcasing how they effectively utilize MongoDB's strengths. He also highlights the differences between MongoDB and CouchDB, emphasizing MongoDB's advantages in ad-hoc querying and atomic updates.

- Additional Takeaways: The talk concludes with recommendations regarding performance optimizations, the importance of keeping indexes in RAM, and considerations for using GridFS for managing binary data. Banker addresses various questions from the audience at the end, further clarifying MongoDB's capabilities and flexibility.

Overall, the presentation provides attendees with a thorough understanding of MongoDB's features and practical guidelines to maximize its utility in software development projects.

00:00:14.920 All right, we'll go ahead and get started. The name of my talk is 'MongoDB Rules', a title with kind of a nice double meaning.
00:00:21.560 MongoDB is a database; it's an open source database written in C++ and document-based. I'll talk a lot about it.
00:00:29.000 Real quick, how many people have heard of MongoDB? Can you just raise your hands? Okay, cool.
00:00:34.520 How many people have used MongoDB? All right. I hope to speak to the middle during this talk.
00:00:39.640 It's not going to be entirely introductory. I want to give a set of rules for using it well.
00:00:44.680 Hopefully, in the process, if you haven't heard about MongoDB, you'll get a sense of what it's like.
00:00:50.640 Before I get into the meat of my talk, there's some really important business I want to take care of.
00:00:56.320 Okay, the important business: my name is Kyle Banker, and I work for a company called 10gen.
00:01:01.760 10gen is the company that sponsors MongoDB's development. You can find me at kyoten on Twitter.
00:01:08.320 We're at mongodb.org, where you can find very good documentation and download binaries.
00:01:14.119 This means you don't have to compile it, and you can get it running pretty quickly.
00:01:19.920 If you ever have a question about MongoDB, the Google Group is a great place to find answers.
00:01:25.240 Typically, you can get an answer in about 20 seconds, so check that out.
00:01:31.520 Now, I want to ask one more question: how many people here are using some type of NoSQL database?
00:01:39.119 Okay, so maybe like 30 to 40 percent of you.
00:01:44.759 Of those using a NoSQL database, raise your hand if you are using it for some kind of operational reason.
00:01:52.920 In other words, you need extra performance or scalability. So maybe half of you.
00:01:58.000 When the discussion comes to NoSQL databases, most of us are concerned about performance and scalability.
00:02:04.200 I think MongoDB addresses those concerns.
00:02:09.679 However, there's another side to it. You might not have a scaling problem, but there may still be a reason to use a NoSQL database like MongoDB.
00:02:16.000 It makes things easier to understand, and that was my initial reason for being attracted to it.
00:02:22.120 You'll see why that might be if you haven’t tried it already.
00:02:27.560 Rule number one is to get to know the real MongoDB.
00:02:32.840 What does that mean? There are some pretty good object mappers out there.
00:02:38.599 They give you a very Active Record-like interface, but they do hide some of the ease of actually working with the database.
00:02:45.400 When you're working with a relational database, your object-relational mapper is absolutely necessary.
00:02:51.360 In the case of MongoDB, it's not that difficult because we deal with something called a document.
00:02:57.840 A document is something in JSON, so when I say get to know the real DB, I mean to start playing with it.
00:03:03.680 We have a shell, and our shell isn't SQL; it's JavaScript.
00:03:09.840 I was able to build this little 'Try MongoDB' site in the spirit of 'Try Ruby'.
00:03:16.120 You can go to try.mongodb.org and actually go through a little tutorial and play with a real MongoDB instance in a JavaScript shell.
00:03:22.120 This gives you a sense of how the thing works.
00:03:27.159 When you're done with that, you can download the actual shell and do a lot of the same things.
00:03:33.560 The other thing about the real MongoDB is using the Ruby driver.
00:03:38.599 It's a very natural way of working with your data.
00:03:43.840 We build a connection and choose a database, much as you would with a relational database.
00:03:49.799 Here's an example: we're requiring Ruby gems and requiring the driver. We build a connection.
00:03:56.920 We choose a database and choose a collection, in this case, we're working with a 'page views' collection.
00:04:02.439 Here's a sample document, and it's just a hash, right? It's a Ruby hash pointing to a couple of different types.
00:04:08.920 We've got a date type in there. We have some strings, an integer, and we just pass that to our collection object and call save.
00:04:14.599 Now what happens when you call save on an object like that?
00:04:21.359 The driver is responsible for creating the primary key. Everyone sees that '_id' field—it's an ObjectID.
00:04:27.520 I'll explain what that consists of. You can use anything as an _id, but the MongoDB ObjectID works kind of like this.
00:04:33.240 The first four bytes are a timestamp, the next three bytes are a machine ID, the following two are a process ID, and the last three are a counter.
00:04:40.800 It's a 12-byte ID. One of the nice things about using MongoDB's ObjectIDs is that they include a timestamp.
00:04:46.680 If you do a query and sort by object ID, you basically get things sorted by created at.
00:04:52.120 You can also extract the creation time from the object ID, so there's value in actually using that.
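The byte layout described above can be sketched in plain Ruby. This is only an illustration of the layout from the talk, not the driver's implementation: `build_object_id` and `creation_time` are hypothetical helper names, and the machine bytes here are random rather than derived from the host.

```ruby
require 'securerandom'

# Hypothetical helper (not the driver's code): packs the four fields
# described in the talk into a 12-byte binary string.
def build_object_id(time: Time.now, machine: SecureRandom.random_bytes(3),
                    pid: Process.pid, counter: 0)
  [time.to_i].pack('N') +                          # 4 bytes: seconds since epoch, big-endian
    machine.byteslice(0, 3) +                      # 3 bytes: machine identifier
    [pid & 0xFFFF].pack('n') +                     # 2 bytes: process ID (low 16 bits)
    [counter & 0xFFFFFF].pack('N').byteslice(1, 3) # 3 bytes: counter (low 24 bits)
end

# Reads the creation time back out of the first four bytes.
def creation_time(id_bytes)
  Time.at(id_bytes.byteslice(0, 4).unpack1('N')).utc
end

id = build_object_id(counter: 1)
puts id.unpack1('H*')  # hex form of the 12 bytes
puts creation_time(id) # the time embedded in the ID
```

Because the timestamp comes first and is big-endian, comparing two such IDs byte by byte orders them by creation second, which is why sorting by ID approximates sorting by creation time.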
00:04:57.240 Additionally, the driver serializes the Ruby hash into a binary dictionary form called BSON.
00:05:02.479 You can read all about it at bsonspec.org; it's not difficult to understand.
00:05:08.600 All drivers, regardless of which one you use, serialize to BSON.
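To give a feel for what serializing to BSON involves, here is a minimal sketch of an encoder written for this transcript. It handles only two types (UTF-8 strings and 32-bit integers) and follows the layout published at bsonspec.org; `bson_element` and `bson_document` are hypothetical names, and a real driver covers many more types.

```ruby
# Encodes one key/value pair. Only two BSON types are handled in this sketch:
# UTF-8 string (type 0x02) and 32-bit integer (type 0x10).
def bson_element(key, value)
  name = key.to_s.b + "\x00".b # element name as a NUL-terminated cstring
  case value
  when String
    bytes = value.encode('UTF-8').b
    # int32 length (counting the trailing NUL), the bytes, then the NUL
    "\x02".b + name + [bytes.bytesize + 1].pack('l<') + bytes + "\x00".b
  when Integer
    unless value.between?(-2**31, 2**31 - 1)
      raise ArgumentError, 'this sketch only encodes 32-bit integers'
    end
    "\x10".b + name + [value].pack('l<')
  else
    raise ArgumentError, "unsupported type: #{value.class}"
  end
end

# A document is: int32 total size, the elements, and a closing NUL byte.
def bson_document(hash)
  body = hash.map { |key, value| bson_element(key, value) }.join.b
  [body.bytesize + 5].pack('l<') + body + "\x00".b
end

p bson_document('hello' => 'world').bytesize # 22 bytes for this small document
```

The length prefixes are what let a reader skip over elements without parsing them, one of the reasons a binary form is used instead of JSON text.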
00:05:15.800 So when we call save, we add an _id, create that _id if you haven't done so yourself, serialize to BSON, and send it along the socket.
00:05:22.199 Since we already have the ID, we don't have to wait for any response from the database.
00:05:27.639 The philosophy is that you can always add machines and add clients.
00:05:32.720 If you're going to pound the database, you shouldn't have to wait for a response.
00:05:39.160 So the drivers do a little more work. Rule number two is to use ObjectIDs.
00:05:45.400 Many people in the Ruby community decide to use a string version instead.
00:05:52.600 ObjectID is a proper BSON type. It's a bit more efficient because it's 12 bytes versus 16.
00:05:58.240 It has that timestamp built into it, and it's the standard.
00:06:03.840 If you're working with ObjectIDs, that's just one gotcha to keep in mind.
00:06:10.240 The next thing is to use rich documents, which is one of the really important aspects.
00:06:16.639 MongoDB allows you to represent rich data structures within a single document.
00:06:23.160 I have a document here representing a cart or an order.
00:06:28.400 In the interest of talking about the advantages of rich documents, I want to show an anti-pattern.
00:06:34.039 This diagram shows the relational schema for Magento, and the schemas of other large e-commerce applications look similar.
00:06:40.280 You notice many small tables. If you look at the different entities, you'll see lots of tables.
00:06:46.960 What are these tables doing? A lot of you know exactly what they’re doing.
00:06:52.440 You're simulating a flexible schema with these tables.
00:06:58.919 Each of these tables has a different field representing a different type.
00:07:04.199 If I need to add a dynamic integer attribute to a product, I use that specific table.
00:07:09.960 If I want to add a dynamic string attribute, I need another specific table.
00:07:16.960 The important question becomes: what is the join?
00:07:23.560 You can't just open a relational database console and see what a product looks like.
00:07:30.039 You can't reason about this style of data.
00:07:35.520 I think it's easier to reason about document-style data.
00:07:41.400 If I can represent all those relations in a document and modify that document as needed, it's a win.
00:07:47.000 I can easily query, for example, a product in MongoDB and get a complete representation.
00:07:54.800 MongoDB is human-oriented.
00:08:00.480 We can conceive of the data with a document model and look at objects as holistic entities.
00:08:06.400 This is a huge advantage of a document-oriented database, and a good reason to use it.
00:08:12.120 Going back to our order, we see a couple of line items.
00:08:18.680 For example, I have a 'lumberjack laptop case' I bought online.
00:08:25.079 Initially, I thought it looked kind of cool, but upon receiving it, I realized it looks like a lumberjack's laptop case.
00:08:30.880 That’s my lumberjack laptop case.
00:08:36.800 You can see I track the SKU and the list price in the same document.
00:08:42.600 I can also store the shipping address in the same document and different types of totals at the bottom.
00:08:48.480 The line items point to an array of other objects.
00:08:54.320 This is represented economically in this object.
00:09:00.240 There’s value when you can query or manipulate these items easily.
00:09:06.480 MongoDB has dynamic queries, similar to a relational database.
00:09:12.120 I don’t have to define everything upfront; I can perform simple queries.
00:09:18.680 For example, I’m looking for something by user ID.
00:09:24.200 When I create an index, notice that it's on an inner object.
00:09:30.640 I'm creating it on the SKU inside the line items, so the index key uses dot notation to reach into the line items object: line_items.sku.
00:09:36.680 When I do that query, I can utilize an index.
00:09:42.880 The last thing I want to illustrate: the most expensive product purchased in the last week.
00:09:48.320 I'm creating a date range and passing those as a document.
00:09:54.600 My query is a document, and so is my data.
00:10:00.480 I'm saying where createdAt is greater than last week, less than today, sorted by total, and limited to one.
00:10:07.120 It can use an index, making it an extremely efficient query.
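The "query is a document" idea can be sketched without a database. The field names (`created_at`, `total`) and the sample orders are assumptions, and `matches?` is a tiny in-memory stand-in for the server's query matching that supports only `$gt`, `$lt`, and equality.

```ruby
# Hypothetical sample data; field names are assumptions, not from the slides.
now = Time.utc(2010, 3, 12)
last_week = now - 7 * 86_400

orders = [
  { 'user_id' => 1, 'created_at' => now - 2 * 86_400,  'total' => 4_500 },
  { 'user_id' => 2, 'created_at' => now - 3 * 86_400,  'total' => 12_000 },
  { 'user_id' => 3, 'created_at' => now - 30 * 86_400, 'total' => 99_000 }
]

# The query itself is just a hash: created_at between last week and now.
query = { 'created_at' => { '$gt' => last_week, '$lt' => now } }

# In-memory stand-in for query matching (only $gt, $lt, and equality).
def matches?(doc, query)
  query.all? do |field, condition|
    value = doc[field]
    if condition.is_a?(Hash)
      condition.all? do |operator, bound|
        case operator
        when '$gt' then value > bound
        when '$lt' then value < bound
        else raise ArgumentError, "unsupported operator: #{operator}"
        end
      end
    else
      value == condition
    end
  end
end

# Equivalent of "sort by total descending, limit one".
def most_expensive(orders, query)
  orders.select { |order| matches?(order, query) }
        .max_by { |order| order['total'] }
end

p most_expensive(orders, query)['total'] # prints 12000
```

The 99,000 order is skipped because it falls outside the date range, so the largest total within the last week wins.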
00:10:14.760 These are some other query operators. We can pass in arrays.
00:10:20.720 There are many special query operators that let you do much of what you can do with a relational database.
00:10:27.440 Rule number four is that array keys rule.
00:10:33.280 Let me give you an example of how array keys simplify tiny relationships.
00:10:39.440 Tags are a good example: say I have a social news site.
00:10:45.760 A link has a title, a URL, and an array of tags.
00:10:51.120 For example, 'Tech', 'startups', and 'time waster'.
00:10:57.440 Now I can create an index on the array field.
00:11:02.960 When I do this query, it's efficient, and I'm not wasting any time.
00:11:09.520 The query is straightforward without special iterating over documents.
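One way to picture why an index on an array field makes the tag query cheap: the index holds one entry per array element, so looking up a tag is a single lookup rather than a scan of every document. The sketch below is an in-memory stand-in with hypothetical helper names and made-up links; in MongoDB the query document would simply be `{ 'tags' => 'tech' }`.

```ruby
# Hypothetical sample links; titles and URLs are made up.
links = [
  { '_id' => 1, 'title' => 'A', 'url' => 'http://example.com/a', 'tags' => %w[tech startups] },
  { '_id' => 2, 'title' => 'B', 'url' => 'http://example.com/b', 'tags' => ['time waster'] },
  { '_id' => 3, 'title' => 'C', 'url' => 'http://example.com/c', 'tags' => ['tech', 'time waster'] }
]

# Stand-in for an index on an array field: one index entry per array element,
# each pointing back at the documents that contain it.
def build_tag_index(docs)
  index = Hash.new { |hash, tag| hash[tag] = [] }
  docs.each do |doc|
    doc['tags'].each { |tag| index[tag] << doc['_id'] }
  end
  index
end

# Looking up a tag never touches the documents themselves.
def find_by_tag(index, tag)
  index.fetch(tag, [])
end

index = build_tag_index(links)
p find_by_tag(index, 'tech') # prints [1, 3]
```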
00:11:15.760 In a relational database, I might have a table with userID, tagID, and a piece of text.
00:11:21.960 In contrast, we can eliminate that level of normalization and put it in the document.
00:11:27.360 The second aspect of the array keys rule is representing more complex things like many-to-many relationships.
00:11:33.280 I can represent products and categories, for example.
00:11:39.120 Here, the category is just a title, and the product is a record.
00:11:44.760 The category IDs can be an array of ObjectIDs.
00:11:51.320 You can find all products in a category with a simple query.
00:11:57.840 You can perform the same process in the reverse; no join table required.
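The many-to-many pattern can be sketched with plain hashes. The IDs, names, and titles below are made up, and the two helpers are in-memory stand-ins for what would be the query documents `{ 'category_ids' => category_id }` (products in a category) and `{ '_id' => { '$in' => product['category_ids'] } }` (categories of a product).

```ruby
# Hypothetical sample data; simple strings stand in for ObjectIDs.
categories = [
  { '_id' => 'c1', 'title' => 'Laptop cases' },
  { '_id' => 'c2', 'title' => 'Outdoors' }
]

products = [
  { '_id' => 'p1', 'name' => 'Lumberjack laptop case', 'category_ids' => %w[c1 c2] },
  { '_id' => 'p2', 'name' => 'Plain laptop sleeve',    'category_ids' => %w[c1] }
]

# Products whose category_ids array contains the given category.
def products_in(products, category_id)
  products.select { |product| product['category_ids'].include?(category_id) }
end

# The reverse direction: categories referenced by one product.
def categories_of(categories, product)
  categories.select { |category| product['category_ids'].include?(category['_id']) }
end

p products_in(products, 'c2').map { |product| product['name'] }
p categories_of(categories, products.first).map { |category| category['title'] }
```

Both directions read the same `category_ids` array, which is why no separate join table is needed.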
00:12:03.040 Rule number five is to use atomic operators.
00:12:09.080 Many of you know that Redis is famous for atomic operators; MongoDB has them too.
00:12:15.440 They allow you to modify a single key in place, which can be efficient.
00:12:22.080 For example, say you're building a site like Hacker News and need to implement upvote functionality.
00:12:28.320 Here's how the query looks; we find a post with a given ID.
00:12:34.400 The update includes an atomic operation. We add the voter ID while incrementing votes.
00:12:40.160 We don't have to pull down the document and modify it; all can be executed in one operation.
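A sketch of what the upvote's selector and update documents might look like, with an in-memory stand-in for the server applying them in one step. The field names (`voters`, `votes`) are assumptions, and the `$ne` guard that stops a user from voting twice is an addition for illustration rather than something stated in the talk. `$push`, `$inc`, and `$ne` are real MongoDB operators; `apply_upvote` is a hypothetical helper.

```ruby
# Builds the two documents sent to the database: which post to touch,
# and how to change it.
def upvote_documents(post_id, voter_id)
  selector = { '_id' => post_id, 'voters' => { '$ne' => voter_id } } # assumed guard
  update   = { '$push' => { 'voters' => voter_id }, '$inc' => { 'votes' => 1 } }
  [selector, update]
end

# In-memory stand-in for the server: checks the selector and applies both
# modifications together, without the client fetching the document first.
def apply_upvote(post, selector, update)
  voter = selector['voters']['$ne']
  return false unless post['_id'] == selector['_id'] && !post['voters'].include?(voter)

  update['$push'].each { |field, value| post[field] << value }
  update['$inc'].each  { |field, amount| post[field] += amount }
  true
end

post = { '_id' => 42, 'title' => 'MongoDB Rules', 'voters' => [], 'votes' => 0 }
selector, update = upvote_documents(42, 'user-1')
apply_upvote(post, selector, update)
p post['votes'] # prints 1
```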
00:12:46.800 Another example is reserving concert seats, where we use a special command called 'find and modify'.
00:12:52.960 We look at the seats collection and search for a particular concert seat without an expiry date.
00:12:59.440 The atomic update operation sets an expiry date in the next 15 minutes.
00:13:05.840 If the update succeeds, I get the document back, ensuring consistency.
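The seat reservation flow can be modeled in memory to show why doing the find and the modify as one step matters. `SeatStore` is a hypothetical class, the `expires_at` field name is an assumption, and a `Mutex` stands in for the atomicity the database provides.

```ruby
# In-memory model of reserving a seat with a single find-and-modify step.
class SeatStore
  HOLD_SECONDS = 15 * 60 # hold the seat for 15 minutes

  def initialize(seats)
    @seats = seats
    @lock = Mutex.new
  end

  # Finds the seat only if it has no expiry, sets one, and returns the
  # updated document. Returns nil if someone else already holds the seat.
  def reserve(seat_id, now: Time.now)
    @lock.synchronize do
      seat = @seats.find { |s| s['_id'] == seat_id && s['expires_at'].nil? }
      next nil unless seat

      seat['expires_at'] = now + HOLD_SECONDS
      seat
    end
  end
end

store = SeatStore.new([{ '_id' => 'A1', 'expires_at' => nil }])
p store.reserve('A1').nil? # prints false: the first caller gets the seat
p store.reserve('A1').nil? # prints true: the seat is already held
```

Because the check and the write happen under one lock, two callers can never both see the seat as free.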
00:13:12.240 While we don't have full transactions, atomic modifiers are versatile.
00:13:17.680 Some update operators include incrementing values, setting values, or interacting with arrays.
00:13:24.000 Let's talk briefly about map-reduce.
00:13:30.840 People have heard of map-reduce; think of it as functional programming for aggregation.
00:13:37.760 It’s particularly useful for large computations.
00:13:44.160 I’ll provide an example: finding the total of orders from a given zip code.
00:13:51.040 We emit the zip code as our key in the map function.
00:13:57.760 The values we emit are the order totals, and our reduce function sums them.
00:14:04.080 Map-reduce is generally for ETL rather than live applications.
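The zip code example can be sketched in plain Ruby to show the shape of the computation. In MongoDB itself the map and reduce functions are written in JavaScript and run on the server; the helper names and the `zip` and `total` field names here are assumptions.

```ruby
# Map step: emit (zip code, order total) for each order.
def map_order(order)
  [[order['zip'], order['total']]]
end

# Reduce step: sum all totals emitted for one zip code.
def reduce_totals(_zip, totals)
  totals.sum
end

# Tiny in-memory driver: group emitted values by key, then reduce each group.
def map_reduce(orders)
  emitted = Hash.new { |hash, key| hash[key] = [] }
  orders.each do |order|
    map_order(order).each { |key, value| emitted[key] << value }
  end
  emitted.map { |key, values| [key, reduce_totals(key, values)] }.to_h
end

orders = [
  { 'zip' => '84101', 'total' => 4_500 },
  { 'zip' => '10012', 'total' => 1_200 },
  { 'zip' => '84101', 'total' => 500 }
]
p map_reduce(orders)
```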
00:14:10.480 Rule number six is indexes. In MongoDB, a collection supports up to 40 indexes.
00:14:16.120 Indexes behave similarly to those in relational databases.
00:14:21.720 Knowledge about relational databases is useful for building indexes in MongoDB.
00:14:27.240 I've witnessed instances where large databases freeze when creating indexes, so be smart about it.
00:14:33.760 It's crucial to ensure that your queries match your indexes.
00:14:39.440 Be cautious about defining keys and indexing them.
00:14:45.480 GridFS is a specification for storing binary data in MongoDB.
00:14:50.840 You can store large files in MongoDB, and it works on the driver level.
00:14:56.840 For example, if you have a picture of a lumberjack, you can use GridFS to store it.
00:15:03.120 You define a grid object, open a file, and perform operations using the GridFS API.
00:15:10.720 Everything in MongoDB works on the collection level.
00:15:17.960 We have a files collection for metadata and a chunks collection for the data.
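The split between a files collection and a chunks collection can be sketched as follows. The field names `length`, `chunkSize`, `files_id`, `n`, and `data` follow the GridFS specification; the helper names are hypothetical, and the 4-byte chunk size is deliberately tiny so the result is easy to inspect (real chunks are far larger).

```ruby
CHUNK_SIZE = 4 # tiny on purpose; an assumption for this sketch only

# Splits binary data into one metadata document and a list of chunk documents.
def gridfs_documents(file_id, filename, data, chunk_size: CHUNK_SIZE)
  bytes = data.b
  file_doc = {
    '_id' => file_id, 'filename' => filename,
    'length' => bytes.bytesize, 'chunkSize' => chunk_size
  }
  chunk_docs = bytes.bytes.each_slice(chunk_size).each_with_index.map do |slice, n|
    { 'files_id' => file_id, 'n' => n, 'data' => slice.pack('C*') }
  end
  [file_doc, chunk_docs]
end

# Reading a file back: order the chunks by n and concatenate their data.
def reassemble(chunk_docs)
  chunk_docs.sort_by { |chunk| chunk['n'] }.map { |chunk| chunk['data'] }.join
end

file_doc, chunk_docs = gridfs_documents(1, 'lumberjack.jpg', 'not really a jpeg')
p file_doc['length']
p chunk_docs.length
p reassemble(chunk_docs) == 'not really a jpeg'.b
```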
00:15:24.320 Organized this way, the data is stored efficiently and can be read quickly.
00:15:31.840 A bit about MongoDB's speed and durability trade-off.
00:15:38.320 One technique used is memory-mapped files, which map files onto virtual memory.
00:15:45.520 Writing information to disk feels like writing to memory.
00:15:52.240 The kernel manages what data resides in memory versus on disk.
00:15:58.640 MongoDB enforces an fsync to disk every minute.
00:16:05.119 If you shut down the master node, some corruption could occur.
00:16:11.440 We suggest replicating, using a master-slave setup, to improve durability.
00:16:18.120 You can adjust the interval if you want to fsync, say, every 30 seconds.
00:16:25.640 The MongoDB philosophy is that the speed gained from memory-mapped files is more important.
00:16:31.920 Durability is better achieved with multiple nodes.
00:16:38.320 Here are some performance notes: the drivers do extensive work.
00:16:45.840 Drivers generate ObjectIDs and serialize to BSON.
00:16:52.000 The Ruby driver can be slower due to the nature of Ruby.
00:16:58.640 When benchmarking Ruby, don’t benchmark on a single node.
00:17:05.440 Make sure to benchmark across multiple processes.
00:17:12.320 If you're pumping data into MongoDB from a single Ruby process, you may not see the throughput the database can deliver.
00:17:19.560 Using other languages, like Clojure on the JVM, shows better core performance.
00:17:25.920 Additionally, embedded documents provide both a mental and computational win.
00:17:32.840 We avoid joins and can store related data in a single embedded document.
00:17:39.760 Queries should always use indexes.
00:17:46.800 Keep your working set and indexes in RAM, just like with a relational database.
00:17:52.560 We've seen issues when databases won't scale due to insufficient memory.
00:17:59.200 Auto-sharding becomes necessary if you can't keep everything in RAM.
00:18:06.960 Data is chunked and routed through a mongos process, which distributes updates to the shards.
00:18:12.960 All shards are autonomous and don't communicate, allowing for scalability.
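The routing idea can be sketched as a lookup from a shard-key value to the shard that owns its chunk. The chunk table, shard names, and key ranges below are hypothetical; this illustrates range-based routing in general, not MongoDB's actual implementation.

```ruby
# Hypothetical chunk table: each chunk covers a half-open range [min, max)
# of shard-key values and lives on exactly one shard.
CHUNKS = [
  { 'min' => 0,     'max' => 1_000,           'shard' => 'shard_a' },
  { 'min' => 1_000, 'max' => 5_000,           'shard' => 'shard_b' },
  { 'min' => 5_000, 'max' => Float::INFINITY, 'shard' => 'shard_a' }
].freeze

# The router's job: find the chunk containing the key and return its shard.
def route(chunks, key)
  chunk = chunks.find { |c| key >= c['min'] && key < c['max'] }
  chunk && chunk['shard']
end

p route(CHUNKS, 42)    # prints "shard_a"
p route(CHUNKS, 1_000) # prints "shard_b"
```

Since every operation is sent straight to the shard that owns the key, the shards never need to talk to each other.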
00:18:19.360 Not every setup requires sharding, but it's effective.
00:18:26.640 Two production cases: SourceForge uses MongoDB for project pages.
00:18:33.360 They store all project information in a single document for efficiency.
00:18:39.920 Currently, GitHub utilizes MongoDB for backend analytics.
00:18:46.200 Another example: the Harmony app simplified its schema by switching from MySQL to MongoDB.
00:18:52.520 There's a diversity of other production deployments, and I welcome questions.
00:18:59.920 Are you familiar with Resque? Do you think you could build that in MongoDB?
00:19:06.640 Now that you have the 'find and modify' command, you can easily set the status.
00:19:13.920 I think someone is working on it.
00:19:21.360 Can you describe what an embedded document is and talk more about it?
00:19:27.680 Embedding means having a document containing other documents.
00:19:34.400 In a simple blog, for example, you could store all comments inside the post object.
00:19:41.920 You can represent any structure with a Ruby hash in MongoDB.
00:19:48.760 There are many differences between CouchDB and MongoDB.
00:19:55.080 CouchDB uses HTTP; MongoDB uses a binary protocol over a socket.
00:20:01.760 CouchDB has a multi-master replication scheme; MongoDB scales via auto-sharding.
00:20:07.760 In CouchDB, you must build indexes using map-reduce; MongoDB allows more ad-hoc querying.
00:20:14.120 Additionally, MongoDB supports atomic updates, while CouchDB does not.
00:20:20.560 When starting with Ruby, many of us wrote Java in Ruby.
00:20:26.960 What are some high-level designs to break our SQL background?
00:20:33.200 I have documents on data modeling available on our website.
00:20:39.440 Trivial relationships should be contained in a single document.
00:20:46.480 In more complicated relationships, consider your use cases.
00:20:53.680 There's nothing wrong with normalizing in MongoDB; it is designed for that as well.
00:20:59.760 People worry about needing short key names; in practice, it hasn't been an issue.
00:21:06.640 Character set encoding: everything needs to be in UTF-8.
00:21:11.920 Currently, we don’t support various other character sets.
00:21:18.000 There are many binary serialization standards, so why BSON rather than one of the others?
00:21:24.240 BSON is preferred since it maps nicely to known data structures.
00:21:31.320 What is the potential for accessing GridFS directly from web servers?
00:21:36.840 People have built tools for that; it’s completely possible.
00:21:43.920 One of my coworkers is developing an Nginx module to interact with GridFS.
00:21:50.240 The roadmap includes 2D Geo-indexing and sharding.
00:21:56.960 It’s critical for us to support 100% automatic failover for sharding.
00:22:02.960 Currently, managing it manually adds complexity.
00:22:09.680 There’s the concept of a replica set; I can tell you more about it later.
00:22:15.680 If you go back to the previous diagram, shards could appear as single points of failure.
00:22:22.640 The mongos process is lightweight and lives alongside application servers.
00:22:29.280 Your application could have multiple mongos processes, which reduces failure risk.
00:22:36.400 As for understanding how a query is executed, we have an 'explain' command that shows the path a query takes.
00:22:42.720 If needed, I can provide further clarification on performance optimization.
00:22:49.680 Thank you very much for your time.