00:00:14.920
All right, we'll go ahead and get started. The name of my talk is 'MongoDB Rules', which has kind of a nice double ring to the title.
00:00:21.560
MongoDB is a database; it's an open source database written in C++ and document-based. I'll talk a lot about it.
00:00:29.000
Real quick, how many people have heard of MongoDB? Can you just raise your hands? Okay, cool.
00:00:34.520
How many people have used MongoDB? All right. I hope to speak to the middle during this talk.
00:00:39.640
It's not going to be entirely introductory. I want to give a set of rules for using it well.
00:00:44.680
Hopefully, in the process, if you haven't heard about MongoDB, you'll get a sense of what it's like.
00:00:50.640
Before I get into the meat of my talk, there's some really important business I want to take care of.
00:00:56.320
Okay, the important business: my name is Kyle Banker, and I work for a company called 10gen.
00:01:01.760
10gen is the company that sponsors MongoDB's development. You can find me at kyoten on Twitter.
00:01:08.320
We're at mongodb.org, where you can find very good documentation and download binaries.
00:01:14.119
This means you don't have to compile it, and you can get it running pretty quickly.
00:01:19.920
If you ever have a question about MongoDB, the Google Group is a great place to find answers.
00:01:25.240
Typically, you can get an answer in about 20 seconds, so check that out.
00:01:31.520
Now, I want to ask one more question: how many people here are using some type of NoSQL database?
00:01:39.119
Okay, so maybe like 30 to 40 percent of you.
00:01:44.759
Of those using a NoSQL database, raise your hand if you are using it for some kind of operational reason.
00:01:52.920
In other words, you need extra performance or scalability. So maybe half of you.
00:01:58.000
When the discussion comes to NoSQL databases, most of us are concerned about performance and scalability.
00:02:04.200
I think MongoDB addresses those concerns.
00:02:09.679
However, there's another side to it. You might not have a scaling problem, but there may still be a reason to use a NoSQL database like MongoDB.
00:02:16.000
It makes things easier to understand, and that was my initial reason for being attracted to it.
00:02:22.120
You'll see why that might be if you haven’t tried it already.
00:02:27.560
Rule number one is to get to know the real MongoDB.
00:02:32.840
What does that mean? There are some pretty good object mappers out there.
00:02:38.599
They give you a very Active Record-like interface, but they do hide some of the ease of actually working with the database.
00:02:45.400
When you're working with a relational database, your object-relational mapper is absolutely necessary.
00:02:51.360
In the case of MongoDB, it's not that difficult because we deal with something called a document.
00:02:57.840
A document is essentially JSON, so when I say get to know the real MongoDB, I mean start playing with it directly.
00:03:03.680
We have a shell, and our shell isn't SQL; it's JavaScript.
00:03:09.840
I was able to build this little 'Try MongoDB' site in the spirit of 'Try Ruby'.
00:03:16.120
You can go to try.mongodb.org, go through a little tutorial, and play with a real MongoDB instance in a JavaScript shell.
00:03:22.120
This gives you a sense of how the thing works.
00:03:27.159
When you're done with that, you can download the actual shell and do a lot of the same things.
00:03:33.560
The other thing about the real MongoDB is using the Ruby driver.
00:03:38.599
It's a very natural way of working with your data.
00:03:43.840
We build a connection and choose a database, much as you would with a relational database.
00:03:49.799
Here's an example: we're requiring RubyGems and requiring the driver. We build a connection.
00:03:56.920
We choose a database and choose a collection; in this case, we're working with a 'page views' collection.
00:04:02.439
Here's a sample document, and it's just a hash, right? It's a Ruby hash pointing to a couple of different types.
00:04:08.920
We've got a date type in there. We have some strings, an integer, and we just pass that to our collection object and call save.
00:04:14.599
Now, what happens when you call save on an object like that?
00:04:21.359
The driver is responsible for creating the primary key. Everyone sees that '_id' field—it's an ObjectID.
00:04:27.520
I'll explain what that consists of. You can use anything as an _id, but the MongoDB ObjectID works kind of like this.
00:04:33.240
The first four bytes are a timestamp, the next three bytes are a machine ID, the following two are a process ID, and the last three are a counter.
00:04:40.800
It's a 12-byte ID. One of the nice things about using MongoDB's ObjectIDs is that they include a timestamp.
00:04:46.680
If you do a query and sort by ObjectID, you basically get things sorted by creation time.
00:04:52.120
You can also extract the creation time from the object ID, so there's value in actually using that.
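A minimal sketch of that 12-byte layout, in plain Ruby with no driver or server required. The method names and the sample ID are made up for illustration; a real driver exposes its own ObjectID class, which this does not try to reproduce.

```ruby
# Split a 24-character hex ObjectID into the four fields described in the talk.
# parse_object_id and object_id_created_at are hypothetical helper names.
def parse_object_id(hex)
  raise ArgumentError, 'expected 24 hex characters' unless hex.match?(/\A\h{24}\z/)

  {
    timestamp: hex[0, 8].to_i(16),   # bytes 0-3: seconds since the Unix epoch
    machine:   hex[8, 6].to_i(16),   # bytes 4-6: machine ID
    pid:       hex[14, 4].to_i(16),  # bytes 7-8: process ID
    counter:   hex[18, 6].to_i(16)   # bytes 9-11: counter
  }
end

# Recover the creation time from the leading timestamp bytes.
def object_id_created_at(hex)
  Time.at(parse_object_id(hex)[:timestamp]).utc
end

SAMPLE_OBJECT_ID = '4c291856238d3b19b2000001' # invented sample value

puts parse_object_id(SAMPLE_OBJECT_ID).inspect
puts object_id_created_at(SAMPLE_OBJECT_ID)
```

Because the timestamp occupies the leading bytes, sorting IDs generated this way also orders them roughly by creation time.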
00:04:57.240
Additionally, the driver serializes the Ruby hash into a binary dictionary form called BSON.
00:05:02.479
You can read all about it at bsonspec.org; it's not difficult to understand.
00:05:08.600
All drivers, regardless of which one you use, serialize to BSON.
00:05:15.800
So when we call save, we create an _id if you haven't supplied one yourself, serialize to BSON, and send it along the socket.
00:05:22.199
Since we already have the ID, we don't have to wait for any response from the database.
00:05:27.639
The philosophy is that you can always add machines and add clients.
00:05:32.720
If you're going to pound the database, you shouldn't have to wait for a response.
00:05:39.160
So the drivers do a little more work. Rule number two is to use ObjectIDs.
00:05:45.400
Many people in the Ruby community decide to use a string version instead.
00:05:52.600
ObjectID is a proper BSON type. It's a bit more efficient because it's 12 bytes versus 16.
00:05:58.240
It has that timestamp built into it, and it's the standard.
00:06:03.840
If you're working with ObjectIDs, mixing them up with their string versions is the one gotcha to keep in mind.
00:06:10.240
Rule number three is to use rich documents, which is one of the really important aspects.
00:06:16.639
MongoDB allows you to represent rich data structures within a single document.
00:06:23.160
I have a document here representing a cart or an order.
00:06:28.400
In the interest of talking about the advantages of rich documents, I want to show an anti-pattern.
00:06:34.039
This diagram shows the relational schema for Magento, and it looks similar for other large e-commerce applications.
00:06:40.280
You notice many small tables. If you look at the different entities, you'll see lots of tables.
00:06:46.960
What are these tables doing? A lot of you know exactly what they’re doing.
00:06:52.440
You're simulating a flexible schema with these tables.
00:06:58.919
Each of these tables has a different field representing a different type.
00:07:04.199
If I need to add a dynamic integer attribute to a product, I use that specific table.
00:07:09.960
If I want to add a dynamic string attribute, I need another specific table.
00:07:16.960
The important question becomes: what is the join?
00:07:23.560
You can't just open a relational database console and see what a product looks like.
00:07:30.039
You can't reason about this style of data.
00:07:35.520
I think it's easier to reason about document-style data.
00:07:41.400
If I can represent all those relations in a document and modify that document as needed, it's a win.
00:07:47.000
I can easily query, for example, a product in MongoDB and get a complete representation.
00:07:54.800
MongoDB is human-oriented.
00:08:00.480
We can conceive of the data with a document model and look at objects as holistic entities.
00:08:06.400
This is a huge advantage of a document-oriented database, and a good reason to use it.
00:08:12.120
Going back to our order, we see a couple of line items.
00:08:18.680
For example, I have a 'lumberjack laptop case' I bought online.
00:08:25.079
Initially, I thought it looked kind of cool, but upon receiving it, I realized it looks like a lumberjack's laptop case.
00:08:30.880
That’s my lumberjack laptop case.
00:08:36.800
You can see I track the SKU and the list price in the same document.
00:08:42.600
I can also store the shipping address in the same document and different types of totals at the bottom.
00:08:48.480
The line items point to an array of other objects.
00:08:54.320
This is represented economically in this object.
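A rough idea of what such an order document could look like as a Ruby hash. Every field name and value here is an assumption for illustration (the slide itself is not in the transcript), and prices are stored in cents to keep the arithmetic exact.

```ruby
# Hypothetical order document: line items, shipping address, and totals
# all live in one hash. None of these field names come from the talk.
ORDER = {
  'user_id' => 42,
  'line_items' => [
    { 'sku' => 'lumberjack-case', 'name' => 'Lumberjack Laptop Case',
      'list_price' => 4999, 'quantity' => 1 },
    { 'sku' => 'usb-cable', 'name' => 'USB Cable',
      'list_price' => 999, 'quantity' => 2 }
  ],
  'shipping_address' => { 'street' => '1 Main St', 'city' => 'Anytown', 'zip' => '00000' },
  'subtotal' => 6997
}.freeze

# The embedded array can be walked like any other Ruby structure.
def line_item_total(order)
  order['line_items'].sum { |item| item['list_price'] * item['quantity'] }
end

puts line_item_total(ORDER)
```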
00:09:00.240
There’s value when you can query or manipulate these items easily.
00:09:06.480
MongoDB has dynamic queries, similar to a relational database.
00:09:12.120
I don’t have to define everything upfront; I can perform simple queries.
00:09:18.680
For example, I’m looking for something by user ID.
00:09:24.200
When I create an index, notice that it's on an inner object.
00:09:30.640
I'm creating it on the SKU inside the line items, so the index key is written with dot notation, as in line_items.sku.
00:09:36.680
When I do that query, I can utilize an index.
00:09:42.880
The last thing I want to illustrate: the most expensive product purchased in the last week.
00:09:48.320
I'm creating a date range and passing those as a document.
00:09:54.600
My query is a document, and so is my data.
00:10:00.480
I'm saying where createdAt is greater than last week, less than today, sorted by total, and limited to one.
00:10:07.120
It can use an index, making it an extremely efficient query.
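To make the "query is a document" idea concrete, here is a sketch that builds the date-range selector as a plain hash and evaluates it against an in-memory array. The matcher is a toy stand-in for the server that understands only `$gt` and `$lt`; the helper names and the `created_at` and `total` field names are assumptions.

```ruby
SECONDS_PER_WEEK = 7 * 24 * 60 * 60

# The selector is just a hash: created_at greater than last week, less than now.
def last_week_selector(now)
  { 'created_at' => { '$gt' => now - SECONDS_PER_WEEK, '$lt' => now } }
end

# Toy evaluator supporting only $gt and $lt, for illustration.
def selector_matches?(doc, selector)
  selector.all? do |field, conditions|
    conditions.all? do |op, bound|
      case op
      when '$gt' then doc[field] > bound
      when '$lt' then doc[field] < bound
      else raise ArgumentError, "unsupported operator #{op}"
      end
    end
  end
end

# Equivalent of: sort by total descending, limit 1.
def most_expensive(orders, selector)
  orders.select { |order| selector_matches?(order, selector) }
        .max_by { |order| order['total'] }
end

DEMO_NOW = Time.utc(2010, 6, 28)
DEMO_ORDERS = [
  { 'total' => 100, 'created_at' => DEMO_NOW - 86_400 },
  { 'total' => 500, 'created_at' => DEMO_NOW - 2 * 86_400 },
  { 'total' => 900, 'created_at' => DEMO_NOW - 10 * 86_400 } # older than a week
].freeze

puts most_expensive(DEMO_ORDERS, last_week_selector(DEMO_NOW)).inspect
```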
00:10:14.760
These are some other query operators. We can pass in arrays.
00:10:20.720
There are many special query operators that let you do much of what you can do in a relational database.
00:10:27.440
Rule number four is that array keys rule.
00:10:33.280
Array keys simplify tiny relationships; let me give you an example.
00:10:39.440
Tags are a good example: say I have a social news site.
00:10:45.760
A link has a title, a URL, and an array of tags.
00:10:51.120
For example, 'Tech', 'startups', and 'time waster'.
00:10:57.440
Now I can create an index on the array field.
00:11:02.960
When I do this query, it's efficient, and I'm not wasting any time.
00:11:09.520
The query is straightforward without special iterating over documents.
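One way to picture why an index on an array field makes the tag query cheap: the index holds one entry per array element, each pointing back at its document. The following is a toy in-memory model of that idea, not MongoDB's actual index structure; the sample links and helper names are invented.

```ruby
LINKS = [
  { '_id' => 1, 'title' => 'A', 'url' => 'http://example.com/a', 'tags' => %w[tech startups] },
  { '_id' => 2, 'title' => 'B', 'url' => 'http://example.com/b', 'tags' => ['time waster'] },
  { '_id' => 3, 'title' => 'C', 'url' => 'http://example.com/c', 'tags' => ['tech', 'time waster'] }
].freeze

# Build a toy index: tag => list of document ids, one entry per array element.
def build_tag_index(links)
  index = Hash.new { |hash, tag| hash[tag] = [] }
  links.each do |link|
    link['tags'].each { |tag| index[tag] << link['_id'] }
  end
  index
end

# Look up by tag through the index instead of scanning every tags array.
def find_by_tag(links, index, tag)
  ids = index.fetch(tag, [])
  links.select { |link| ids.include?(link['_id']) }
end

tag_index = build_tag_index(LINKS)
puts find_by_tag(LINKS, tag_index, 'tech').map { |link| link['title'] }.inspect
```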
00:11:15.760
In a relational database, I might have a table with userID, tagID, and a piece of text.
00:11:21.960
In contrast, we can eliminate that level of normalization and put it in the document.
00:11:27.360
The second aspect of the array key rule is representing more complex things like many-to-many relationships.
00:11:33.280
I can represent products and categories, for example.
00:11:39.120
Here, the category is just a title, and the product is a record.
00:11:44.760
The category IDs can be an array of ObjectIDs.
00:11:51.320
You can find all products in a category with a simple query.
00:11:57.840
You can perform the same process in reverse; no join table required.
00:12:03.040
Rule number five is to use atomic operators.
00:12:09.080
Many of you know that Redis is famous for atomic operators; MongoDB has them too.
00:12:15.440
They allow you to modify a single key in place, which can be efficient.
00:12:22.080
For example, if you're building a site like Hacker News, you need to implement upvote functionality.
00:12:28.320
Here's how the query looks; we find a post with a given ID.
00:12:34.400
The update includes an atomic operation. We add the voter ID while incrementing votes.
00:12:40.160
We don't have to pull down the document and modify it; all can be executed in one operation.
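A sketch of what the selector and update documents for such an upvote might look like as Ruby hashes. `$push` and `$inc` are MongoDB update operators; the `$ne` guard against double voting, the field names, and the helper names are assumptions added here. The second method only mimics the effect in memory so the example runs without a server.

```ruby
# Build the selector and the update document for one upvote.
def upvote_documents(post_id, voter_id)
  selector = { '_id' => post_id, 'voters' => { '$ne' => voter_id } } # skip if already voted (assumed guard)
  update   = { '$push' => { 'voters' => voter_id }, '$inc' => { 'votes' => 1 } }
  [selector, update]
end

# In-memory imitation of what the server would do with those documents.
def apply_upvote(post, voter_id)
  return post if post['voters'].include?(voter_id) # selector would not match

  post.merge('voters' => post['voters'] + [voter_id], 'votes' => post['votes'] + 1)
end

post = { '_id' => 1, 'voters' => [7], 'votes' => 1 }
puts upvote_documents(1, 9).inspect
puts apply_upvote(post, 9).inspect
```

Both changes travel in a single update document, which is what lets the server apply them to the post in one step.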
00:12:46.800
Another example is reserving concert seats; here we use 'find and modify', which is a special command.
00:12:52.960
We look at the seats collection and search for a particular concert seat without an expiry date.
00:12:59.440
The atomic update operation sets an expiry date 15 minutes in the future.
00:13:05.840
If the update succeeds, I get the document back, ensuring consistency.
00:13:12.240
While we don't have full transactions, atomic modifiers are versatile.
00:13:17.680
Some update operators include incrementing values, setting values, or interacting with arrays.
00:13:24.000
Let's talk briefly about map-reduce.
00:13:30.840
People have heard of map-reduce; think of it as functional programming for aggregation.
00:13:37.760
It’s particularly useful for large computations.
00:13:44.160
I’ll provide an example: finding the total of orders from a given zip code.
00:13:51.040
We emit the zip code as our key in the map function.
00:13:57.760
The emitted values are the order totals, and our reduce function sums them.
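The shape of that computation can be imitated in plain Ruby. In MongoDB the map and reduce functions are written in JavaScript and run on the server; this sketch only mirrors the data flow, with invented sample orders and assumed field names.

```ruby
SAMPLE_ORDERS = [
  { 'zip' => '10012', 'total' => 100 },
  { 'zip' => '94107', 'total' => 75 },
  { 'zip' => '10012', 'total' => 50 }
].freeze

# Map step: emit (zip code, order total) for each order.
def map_order(order)
  [[order['zip'], order['total']]]
end

# Reduce step: sum all totals emitted for one key.
def reduce_totals(_key, values)
  values.sum
end

# Run map over every document, group the emitted pairs by key, then reduce.
def map_reduce(docs)
  emitted = docs.flat_map { |doc| map_order(doc) }
  emitted.group_by(&:first)
         .map { |key, pairs| [key, reduce_totals(key, pairs.map(&:last))] }
         .to_h
end

puts map_reduce(SAMPLE_ORDERS).inspect
```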
00:14:04.080
Map-reduce is generally for ETL rather than live applications.
00:14:10.480
Rule number six is indexes. In MongoDB, a collection supports up to 40 indexes.
00:14:16.120
Indexes behave similarly to those in relational databases.
00:14:21.720
Knowledge about relational databases is useful for building indexes in MongoDB.
00:14:27.240
I've witnessed instances where large databases freeze when creating indexes, so be smart about it.
00:14:33.760
It's crucial to ensure that your queries match your indexes.
00:14:39.440
Be cautious about defining keys and indexing them.
00:14:45.480
GridFS is a specification for storing binary data in MongoDB.
00:14:50.840
You can store large files in MongoDB, and it works on the driver level.
00:14:56.840
For example, if you have a picture of a lumberjack, you can use GridFS to store it.
00:15:03.120
You define a grid object, open a file, and perform operations using the GridFS API.
00:15:10.720
Everything in MongoDB works on the collection level.
00:15:17.960
We have a files collection for metadata and a chunks collection for the data.
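A rough sketch of that split between metadata and chunks, in plain Ruby. The tiny chunk size is chosen only so the example is easy to follow, and the helper names are hypothetical; a real driver handles all of this for you.

```ruby
DEMO_CHUNK_SIZE = 4 # bytes; deliberately tiny for demonstration

# Produce one metadata document plus an ordered list of chunk documents.
def split_into_grid(filename, data, chunk_size = DEMO_CHUNK_SIZE)
  file_doc = {
    '_id' => filename, # a real driver would generate an id here
    'filename' => filename,
    'length' => data.bytesize,
    'chunkSize' => chunk_size
  }
  chunks = data.bytes.each_slice(chunk_size).each_with_index.map do |slice, n|
    { 'files_id' => file_doc['_id'], 'n' => n, 'data' => slice.pack('C*') }
  end
  [file_doc, chunks]
end

# Reading the file back means fetching chunks in order of n and concatenating.
def reassemble(chunks)
  chunks.sort_by { |chunk| chunk['n'] }.map { |chunk| chunk['data'] }.join
end

file_doc, chunks = split_into_grid('lumberjack.jpg', 'lumberjack')
puts file_doc.inspect
puts chunks.length
puts reassemble(chunks)
```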
00:15:24.320
Organized this way, the data is stored efficiently and access can be fast.
00:15:31.840
A bit about MongoDB's speed and durability trade-off.
00:15:38.320
One technique used is memory-mapped files, which map files onto virtual memory.
00:15:45.520
Writing information to disk feels like writing to memory.
00:15:52.240
The kernel manages what data resides in memory versus on disk.
00:15:58.640
MongoDB enforces an fsync to disk every minute.
00:16:05.119
If the master node goes down uncleanly, some corruption could occur.
00:16:11.440
We suggest replicating, using a master-slave setup, to improve durability.
00:16:18.120
You can adjust the fsync interval if you want, say, an fsync every 30 seconds.
00:16:25.640
The MongoDB view is that the speed gained from memory-mapped files is more critical.
00:16:31.920
Durability is better achieved with multiple nodes.
00:16:38.320
Here are some performance notes: the drivers do extensive work.
00:16:45.840
Drivers generate ObjectIDs and serialize to BSON.
00:16:52.000
The Ruby driver can be slower due to the nature of Ruby.
00:16:58.640
When benchmarking Ruby, don’t benchmark on a single node.
00:17:05.440
Make sure to benchmark across multiple processes.
00:17:12.320
If you're pumping data into MongoDB via a single Ruby process, you may not see the throughput the database is capable of.
00:17:19.560
Using other languages, like Clojure on the JVM, shows better core performance.
00:17:25.920
Additionally, embedded documents provide both a mental and computational win.
00:17:32.840
We avoid joins and can store related data in a single embedded document.
00:17:39.760
Queries should always use indexes.
00:17:46.800
Keep your working set and indexes in RAM, just like with a relational database.
00:17:52.560
We've seen issues when databases won't scale due to insufficient memory.
00:17:59.200
Auto-sharding becomes necessary if you can't keep everything in RAM.
00:18:06.960
Data is chunked and routed through a mongos process, which distributes updates across the shards.
00:18:12.960
All shards are autonomous and don't communicate, allowing for scalability.
00:18:19.360
Not every setup requires sharding, but it's effective.
00:18:26.640
Two production cases: SourceForge uses MongoDB for project pages.
00:18:33.360
They store all project information in a single document for efficiency.
00:18:39.920
Currently, GitHub utilizes MongoDB for backend analytics.
00:18:46.200
Another example: the Harmony app simplified its schema by switching from MySQL to MongoDB.
00:18:52.520
There's a diversity of other production deployments, and I welcome questions.
00:18:59.920
Are you familiar with Resque? Do you think you could build that in MongoDB?
00:19:06.640
Now that you have the 'find and modify' command, you can easily set the status.
00:19:13.920
I think someone is working on it.
00:19:21.360
Can you describe what an embedded document is and talk more about it?
00:19:27.680
Embedding means having a document containing other documents.
00:19:34.400
In a simple blog, for example, you could store all comments inside the post object.
00:19:41.920
You can represent any structure with a Ruby hash in MongoDB.
00:19:48.760
There are many differences between CouchDB and MongoDB.
00:19:55.080
CouchDB uses HTTP; MongoDB uses a binary protocol over a socket.
00:20:01.760
CouchDB has a multi-master replication scheme; MongoDB scales via auto-sharding.
00:20:07.760
In CouchDB, you must build indexes using map-reduce; MongoDB allows more ad-hoc querying.
00:20:14.120
Additionally, MongoDB supports atomic updates, while CouchDB does not.
00:20:20.560
When starting with Ruby, many of us wrote Java in Ruby.
00:20:26.960
What are some high-level design guidelines for breaking out of our SQL background?
00:20:33.200
I have documents on data modeling available on our website.
00:20:39.440
Trivial relationships should be contained in a single document.
00:20:46.480
In more complicated relationships, consider your use cases.
00:20:53.680
There's nothing wrong with normalizing in MongoDB; it is designed for that.
00:20:59.760
People worry about needing short key names; in practice, it hasn't been an issue.
00:21:06.640
Character set encoding: everything needs to be in UTF-8.
00:21:11.920
Currently, we don’t support various other character sets.
00:21:18.000
There are many binary serialization standards besides BSON.
00:21:24.240
BSON is preferred since it maps nicely to known data structures.
00:21:31.320
What is the potential for accessing GridFS directly from web servers?
00:21:36.840
People have built tools for that; it’s completely possible.
00:21:43.920
One of my coworkers is developing an Nginx module to interact with GridFS.
00:21:50.240
The roadmap includes 2D Geo-indexing and sharding.
00:21:56.960
It’s critical for us to support 100% automatic failover for sharding.
00:22:02.960
Currently, managing it manually adds complexity.
00:22:09.680
There’s the concept of a replica set; I can tell you more about it later.
00:22:15.680
If you go back to the previous diagram, shards could appear as single points of failure.
00:22:22.640
The mongos process is lightweight and lives alongside application servers.
00:22:29.280
Your application could have multiple mongos processes, which reduces failure risk.
00:22:36.400
For understanding how a query is executed, we have an 'explain' command that shows the path the query takes.
00:22:42.720
If needed, I can provide further clarification on performance optimization.
00:22:49.680
Thank you very much for your time.