Rails Pacific 2014

Better Rails by Knowing Better Database

by Ding Ding Ye

In the presentation titled "Better Rails by Knowing Better Database," delivered by Ding Ding Ye at Rails Pacific 2014, the speaker emphasizes the importance of understanding databases to enhance the effectiveness of Rails applications. The discussion highlights the interconnectedness of Rails and database systems, showcasing how a better grasp of database principles can lead to more efficient, reliable, and scalable applications.

Key points discussed include:

- Database Fundamentals: The speaker explains basic concepts of relational databases, including tables, relationships, and queries. Emphasizing how Rails interacts with databases, he encourages developers to have a solid foundation in these principles.

- Optimizing Queries: Ye discusses strategies for writing better queries to improve performance. This includes insights on avoiding N+1 query problems and leveraging eager loading to reduce database hits (a short sketch follows this list).

- Leveraging Indexes: The importance of database indexing is highlighted, explaining how it can speed up data retrieval and improve overall application responsiveness. The speaker provides examples demonstrating scenarios where indexing can significantly optimize performance.

- Understanding ActiveRecord: The presentation dives into the ActiveRecord library, detailing its role in abstracting database interactions in Rails applications. The speaker shares tips to leverage ActiveRecord effectively, including common pitfalls to avoid.

- Real-World Case Studies: Ye shares anecdotes from his experiences working with various applications, showcasing how a deeper understanding of database structures has led to improved application performance, illustrating points with concrete examples.
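
To make the eager-loading point above concrete, here is a minimal sketch with hypothetical Post and Author models (these are not examples from the talk):

```ruby
# N+1: one query for the posts, then one query per post for its author.
Post.limit(10).each { |post| puts post.author.name }

# Eager loading: two queries in total, regardless of the number of posts.
Post.includes(:author).limit(10).each { |post| puts post.author.name }
```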

The conclusion drives home the message that mastering database concepts is not just beneficial, but essential for Rails developers who aim to build robust applications. By understanding the underlying data management, developers can write cleaner code and avoid many common performance issues that arise from poor database utilization.

00:00:14.240 Hello! Are you hungry? I have good news for
00:00:20.840 all of you: after my session you can have lunch. First, thanks to the organizers
00:00:28.000 of Rails Pacific for inviting me to the conference to share my experience. As I said in
00:00:33.360 yesterday's panel discussion on becoming a senior developer, without solid and broad knowledge we can't become senior
00:00:39.840 developers. This year Rails turned 10 years old and is still one of the best
00:00:46.320 frameworks in the world, and the core team keeps adding best practices into the framework. However, these
00:00:54.160 days we hear more and more voices saying that Rails is not for beginners anymore. That's both true and not true.
00:01:00.559 It's true because there is so much background knowledge besides Rails that we
00:01:05.560 should know. It's not true because Rails does a good job of hiding those details. Either way, if we want to become senior
00:01:13.360 engineers or architects, we must learn that knowledge, such as operating system concepts,
00:01:20.280 database system concepts, and compiler fundamentals. Today my topic is
00:01:26.479 databases with Rails. First, let me introduce myself. I'm Ding Ding Ye from mainland China; you
00:01:33.720 can find me on Twitter, Facebook, and Weibo. I'm the founder of a
00:01:40.399 simple and productive collaboration tool for product teams. I'm the organizer of Ruby
00:01:45.439 China; we will hold the sixth RubyConf China on November 1st and
00:01:51.240 2nd, so you're all welcome to join us. I'm also running a podcast
00:01:57.000 called Teahour, targeting engineers and startups in China. Okay, now let's start my
00:02:04.240 topic. We all know that real-world applications are complex, especially enterprise applications. Our goal is to
00:02:11.440 handle a high load with a low response time. This is the three-tier client/server
00:02:16.920 architecture most used in the real world: the front-end web server handles the requests from the browser or
00:02:23.959 client and then passes them to the application server; the application server accepts those
00:02:29.519 requests and interacts with the database server for data exchange. There have been many studies
00:02:36.400 testing where the performance bottleneck resides among the components just discussed. Based on their findings,
00:02:43.000 the following is the statistic of the performance bottleneck probability in each component. We can see that
00:02:50.599 44 percent of performance bottlenecks appear in the database server, 32
00:02:58.400 percent in the application server, 14 percent in the web server, and 10 percent on the
00:03:05.360 network. This result shows that almost 76 percent of performance bottlenecks
00:03:12.159 appear in the application and database servers. Identifying the source of a performance bottleneck is important for optimizing
00:03:20.000 application performance, and as we can see, database performance is the most critical for
00:03:26.920 application performance, as it is the main contributor to performance bottlenecks. There are many causes of poor performance in databases,
00:03:34.239 such as inefficient SQL statements, a wrong or improper
00:03:40.480 data model design, ineffective database configurations, missing or inefficient
00:03:48.959 indexes, a small or inefficient data cache or query cache, the wrong concurrency
00:03:54.879 model, and possible deadlocks. So let's learn these concepts to better
00:04:01.239 avoid those problems. Every application starts with data modeling. This is the job of the
00:04:07.439 database designer: turning user requirements into the application's data models. Let's see a simple
00:04:15.120 example. Everyone here should be very familiar with this code: it's a migration file creating a table called
00:04:22.160 users with the attributes uid, name, email, confirmed, basic_info, and account_id,
00:04:28.520 and we define several constraints on the schema, such as: name should not be null, and confirmed should be false by default.
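
A minimal sketch of the migration being described; the exact column types and defaults are assumptions reconstructed from the talk:

```ruby
class CreateUsers < ActiveRecord::Migration
  def change
    create_table :users do |t|
      t.string  :uid,       null: false      # should be unique across the table
      t.string  :name,      null: false      # the NOT NULL constraint
      t.string  :email
      t.boolean :confirmed, default: false   # false by default
      t.text    :basic_info
      t.integer :account_id                  # references the accounts table
      t.timestamps
    end
  end
end
```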
00:04:35.400 These are called integrity constraints in databases. Integrity constraints are a list of
00:04:40.880 predicates we add to ensure that changes made to the database do not result
00:04:46.479 in a loss of data consistency, protecting the database from accidental damage.
00:04:52.840 Let's turn back to our example. It will create a primary key called id for the
00:04:58.000 users table, and we know that the id field is an auto-increment field which
00:05:04.240 always has a value. This is called the primary key constraint, also known as the entity constraint. The second kind of
00:05:12.000 constraint is the not-null constraint, because we mark that the name of the user should not be null. The third kind of constraint is
00:05:19.199 the unique constraint, saying that the uid field should be unique
00:05:26.479 across the table. The fourth one is the referential constraint,
00:05:31.800 ensuring that the account_id of a user must also appear in the accounts
00:05:39.039 table. So the database itself supports these constraints, but Active Record claims that this
00:05:45.160 intelligence belongs in your application models, not in the database. That's why we use validations and
00:05:52.479 dependent options to enforce data integrity. The benefit of doing it
00:05:57.560 at the application level is that we can easily customize the error message when a constraint is violated. The not-null
00:06:05.680 constraint maps to validates presence, the unique constraint to validates uniqueness, and the referential
00:06:12.560 constraint to validates_associated.
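
A hedged sketch of those application-level counterparts:

```ruby
class User < ActiveRecord::Base
  belongs_to :account

  validates :name, presence: true    # mirrors the NOT NULL constraint
  validates :uid,  uniqueness: true  # mirrors the UNIQUE constraint
  validates_associated :account      # mirrors the referential constraint
end
```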
00:06:18.319 Now let's talk about the referential constraint a bit more. Before we create a user record, we want to make sure the referenced account
00:06:25.680 record exists. After the record is created, what happens if we delete the account? Without
00:06:32.199 any control, the user record will be left in a dirty state, because it can't find its
00:06:39.240 account anymore, which violates the referential constraint, and this will cause bugs in our
00:06:47.280 application. That's why we use the dependent option to control what happens to
00:06:53.000 the associated objects when the account is destroyed. There are several available options, as follows. :destroy
00:07:01.000 and :delete_all mean that all the associated users will also be destroyed or
00:07:07.599 deleted; this is ON DELETE CASCADE in the database table
00:07:13.120 schema. :nullify means the account_id
00:07:21.160 of the user will be set to NULL, and note that callbacks are not executed.
00:07:28.720 :restrict_with_exception and :restrict_with_error mean that an error or exception
00:07:33.759 will be raised if there are any associated users, so the account cannot be
00:07:38.919 destroyed; that's ON DELETE RESTRICT in the database's referential integrity.
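
In code, the choices look roughly like this (a sketch; pick one option per association):

```ruby
class Account < ActiveRecord::Base
  # :destroy runs callbacks on each user; :delete_all issues one SQL DELETE
  # (the ON DELETE CASCADE analogue); :nullify sets users.account_id to NULL
  # without running callbacks; :restrict_with_exception and
  # :restrict_with_error refuse to destroy the account while users still
  # reference it (the ON DELETE RESTRICT analogue).
  has_many :users, dependent: :destroy
end
```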
00:07:45.759 However, it's still possible to violate the referential constraint if we use a statement such as account.delete(id),
00:07:53.360 because no callbacks will be invoked: it generates the SQL statement directly,
00:08:01.199 and no callbacks will be executed, including the dependent actions. So
00:08:06.520 if we need real foreign key support, we should use the
00:08:12.000 foreigner gem; equivalent support is built into Rails 4.2 by default.
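
A sketch of declaring the constraint at the database level, where even account.delete(id) cannot orphan user rows; the foreigner gem provides this same migration method on earlier Rails versions:

```ruby
# In a migration (built into Rails 4.2, or via the foreigner gem):
add_foreign_key :users, :accounts
```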
00:08:17.560 Now let's talk about another topic: views. Not the view of MVC, but
00:08:24.000 the database view. Views are not used a lot in Rails apps, but they can be very useful
00:08:29.879 in the proper situations. Basically, a view is just a virtual relation,
00:08:35.440 a stored query. Let's take an example. Say a user can
00:08:40.839 belong to many projects, and might be a member of a project or an administrator of the project. Now we
00:08:48.640 have the requirement to get the list of members a user can
00:08:54.279 manage. So I define a method called admin_users in the User model, implemented like
00:09:00.399 this. We can see that it's a join statement with a long WHERE clause, including an IN clause and a subquery.
00:09:09.600 It's too complex, and the
00:09:15.600 code smell is not good. So how should we refactor it? By creating a view, since that will be
00:09:23.720 much easier. We can create a view directly and refactor the admin_users method.
00:09:32.000 This is the view definition; it's similar to the SQL generated by the admin_users method we just saw. There are
00:09:39.079 several points I want to mention. First is CREATE OR REPLACE VIEW user_project_memberships;
00:09:44.160 that's the view name. Then we have the SELECT: we only select the
00:09:52.000 fields we need, which are the user id, the admin user id, and the project id. And we have the inner
00:09:58.880 join: it's a subquery, but we alias it as a temporary table called admin_memberships.
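
A hedged reconstruction of that view definition follows; the view name, table names, and join condition are assumptions based on the audio, not the original slides:

```ruby
class CreateUserProjectMembershipsView < ActiveRecord::Migration
  def up
    execute <<-SQL
      CREATE OR REPLACE VIEW user_project_memberships AS
      SELECT m.user_id,
             admin_memberships.user_id AS admin_user_id,
             m.project_id
      FROM memberships m
      INNER JOIN (
        SELECT user_id, project_id FROM memberships WHERE admin = true
      ) AS admin_memberships
        ON admin_memberships.project_id = m.project_id
    SQL
  end

  def down
    execute "DROP VIEW user_project_memberships"
  end
end
```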
00:10:05.560 When we have this view, we can refactor
00:10:11.000 our code. Because
00:10:17.320 Active Record treats views just like tables, we can define a UserProjectMembership class which inherits from ActiveRecord::Base,
00:10:24.480 and it will use the view directly; Active Record queries the
00:10:31.760 view no differently than a table. Then we can add a has_many association
00:10:39.200 which directly fetches the admin users.
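
Sketched in code; the association names here are my guesses, not the slides':

```ruby
# Active Record reads from the view exactly as it would from a table.
class UserProjectMembership < ActiveRecord::Base
  belongs_to :member, class_name: "User", foreign_key: :user_id
end

class User < ActiveRecord::Base
  has_many :managed_memberships, class_name:  "UserProjectMembership",
                                 foreign_key: :admin_user_id
  has_many :managed_members, through: :managed_memberships, source: :member
end

# user.managed_members now issues a single SELECT against the view.
```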
00:10:45.600 These are the benefits the view brings us: it gives us a simple and customized interface, and what's more, the indexes on the original tables can still be used, which
00:10:52.920 is really good. However, be aware that a view is
00:10:58.040 not a real table: its query is run every time the view is referenced, and most of the time a view is not
00:11:05.399 insertable, updatable, or deletable, so it's best suited for
00:11:11.720 SELECTs. You may be concerned that the performance is not very good, because the query in the view definition runs every
00:11:19.120 time the view is referenced in a query. In the SQL standard there is another
00:11:26.399 kind of view called a materialized view, which executes the query and stores
00:11:31.800 the result into the view when the command is issued, so we can fetch
00:11:37.279 the results directly without an additional query; it's
00:11:42.360 prefetched. We use CREATE MATERIALIZED VIEW to create a materialized
00:11:47.720 view. However, as the query is executed once, when the command is issued,
00:11:54.560 the result will not be automatically updated unless we issue a
00:12:01.120 REFRESH MATERIALIZED VIEW statement.
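
A minimal PostgreSQL sketch; the view name and the statistics query are hypothetical:

```ruby
# In a migration or console:
ActiveRecord::Base.connection.execute(<<-SQL)
  CREATE MATERIALIZED VIEW project_stats AS
  SELECT project_id, count(*) AS members_count
  FROM memberships
  GROUP BY project_id
SQL

# The stored result goes stale until it is explicitly refreshed:
ActiveRecord::Base.connection.execute("REFRESH MATERIALIZED VIEW project_stats")
```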
00:12:07.560 So a materialized view is best suited for relations that rarely change, such as reporting data.
00:12:14.079 For example, you can store project statistics in a materialized view to
00:12:19.560 cache the results instead of recomputing them every time
00:12:25.839 they are queried. That's the use case for materialized views, and the benefits that views can bring
00:12:34.920 us. Now let's move to another topic:
00:12:40.000 indexes. Recall the performance bottleneck sources: nearly 10 percent
00:12:46.839 of database performance problems are caused by missing, ineffective, or
00:12:52.360 inefficient indexes. The index is the primary mechanism for getting improved performance
00:12:59.839 out of a database. So, given a table, do you know which indexes should be created?
00:13:07.160 Are there any rules for creating indexes? What are the trade-offs? Let's first see
00:13:13.199 how an index works inside the database. Given a users table
00:13:20.000 with the following records, what if we want to find the record whose id is 4?
00:13:25.480 Without any index, the operation will scan the records one by one, comparing each
00:13:30.959 id to 4 until it finds the record. The first: is it 4? No. The second:
00:13:39.040 is it 4? No. The third: is it 4? No. The fourth: yes!
00:13:44.320 We've found the record. That's what we call a full table scan. Imagine when we have millions or
00:13:51.440 tens of millions of records: this will be very inefficient. That's why we need an index
00:13:57.680 to speed it up. With the index on the id field, which exists by
00:14:03.720 default for the primary key field, we can directly locate the position of the
00:14:09.680 record and then fetch it directly. It consists of two steps: first we get the
00:14:18.240 pointer to the record from the index, and then we fetch the exact record
00:14:25.759 using the pointer.
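
You can watch the planner make this choice from the Rails console; the plan shown in the comment is the typical PostgreSQL shape, not a captured log:

```ruby
User.where(id: 4).explain
# EXPLAIN for: SELECT "users".* FROM "users" WHERE "users"."id" = 4
#   Index Scan using users_pkey on users ...
# Without a usable index, the plan would show a "Seq Scan on users" instead.
```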
00:14:32.360 Now the question is how to get the pointer to that record quickly from the index. There are several algorithms we can use;
00:14:37.560 this is called the index type in the database. Each index type uses a
00:14:43.600 different algorithm and is best suited for different types of queries. The B+ tree
00:14:50.560 is the most widely used, in almost all databases. Hash is another type,
00:14:56.480 supported by PostgreSQL and MySQL. Besides those, there are still other types,
00:15:02.759 such as GiST, SP-GiST, and GIN. Today I will talk about B+ trees
00:15:09.279 only. A B+ tree is a kind of balanced tree. This image is a degree-4 B+
00:15:16.959 tree. From the image we can see that, for a node, the keys in the left subtree
00:15:23.199 are less than the key in the
00:15:29.560 node, and the keys in the right subtree are larger or equal. Here I don't want to talk
00:15:36.000 about how this tree is constructed; let's see how we can use this tree to search
00:15:41.120 for records, to get the pointer to a record. Suppose we want to find the record whose id is 4. We traverse
00:15:49.199 the path from the root node to a leaf node. First we look for 4:
00:15:56.360 comparing to 5, it's smaller, so we go to the left
00:16:01.800 side; comparing to 2, we go to the right side. Then we find that we
00:16:08.160 have reached the leaf nodes; we just bring this whole block into memory and then check the records one
00:16:16.720 by one: 2, then 3, then 4. Okay, we've found the record. Then we
00:16:24.040 take the pointer of the record and use it to fetch the record directly.
00:16:29.240 The complexity for this degree-4 tree with eight records is about log2 N, where N is
00:16:35.360 how many records we have: we have eight here, so it takes three steps
00:16:40.800 to get to the leaf nodes. We can choose another degree as we wish: instead of four,
00:16:48.240 we can use 16, 100, or 1,000; it depends on the indexed field,
00:16:56.639 on the type of the field you index. The
00:17:02.040 complexity is logarithmic and the depth of the tree stays small, so even
00:17:11.959 with millions or even billions of records the depth is not
00:17:17.880 very high, and we can search in a few steps, which is very efficient.
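
As a rough worked example with my own numbers (not the speaker's): a tree with branching factor $d$ holding $N$ records has depth of roughly $\log_d N$, so

```latex
\text{depth} \approx \log_d N, \qquad
d = 100,\; N = 10^{8} \;\Rightarrow\; \log_{100} 10^{8} = 4
```

meaning even a hundred million records are reachable in about four node visits.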
00:17:24.559 From this example we can see that the B+ tree is very useful for queries like
00:17:31.280 equality, and for queries with comparisons, such as less-than. For less-than, it just
00:17:38.799 finds the value 4 and then traverses backward. For
00:17:45.880 BETWEEN, it finds the lower value in the tree and then iterates the disk blocks one by
00:17:52.000 one until it reaches the upper value. Likewise, for
00:17:57.480 greater-than, it finds the value and then walks the leaf nodes one by
00:18:03.400 one. From this image and the explanation, we can see that the B+ tree is
00:18:11.120 very suitable for fields whose values vary widely. If a field contains only a
00:18:17.760 few values, such as a boolean or an enum field, the index will be rather useless,
00:18:24.640 because there are many duplicates, and a lookup will cost almost as
00:18:31.120 much as a full table scan. In the implementation of some databases, when the
00:18:37.080 indexed field's values are distributed with lots of duplicates, the database
00:18:43.280 will automatically fall back to a full table scan instead of the
00:18:48.760 index. So, knowing all that, should
00:18:54.760 we create as many indexes as possible? Before answering this question, we
00:19:00.000 should know that there are several downsides of indexes to be aware of. First, each index we create costs
00:19:08.240 additional storage space. However, as storage is
00:19:13.840 very cheap nowadays, this is a marginal downside you shouldn't worry about. Second, it takes additional
00:19:20.240 time to update the index when a record is created, so it slows down
00:19:25.440 record creation. This is a medium-level downside, but still not a big problem. The third downside is the main one we
00:19:32.159 should take into consideration when creating an index: index maintenance is costly.
00:19:37.559 If the indexed values update frequently, the maintenance cost might offset the benefits of
00:19:43.360 having the index, because maintenance takes two steps due to the B+ tree
00:19:49.080 construction: it first deletes the value and then adds it back. So if the application updates the field's value
00:19:57.080 frequently, it's not good to have an index on it. After knowing those downsides and
00:20:04.720 how indexes work internally, we can now draw some conclusions about
00:20:10.919 how to pick indexes. First, it depends on the size of the table. If the size
00:20:17.120 of the table is small (for example, it contains just hundreds or thousands of records), there will be little
00:20:24.720 difference between an index lookup and a full table scan, and a database such as PostgreSQL will choose the full table scan
00:20:32.360 directly, searching the records one by one, even if there is an available index to
00:20:38.440 use. Second is the data distribution: indexes are only effective when the values
00:20:44.760 of the field are diverse. Third is the query rate compared to the
00:20:49.880 update load: because index maintenance is costly, as we just discussed in the downsides,
00:20:55.120 you shouldn't create an index on values that change frequently but are not queried
00:21:01.120 often. Those are the rules of thumb for picking indexes; a small sketch follows.
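
A small sketch of those rules applied; the columns are illustrative, not from the talk:

```ruby
# Dense values, queried constantly, rarely rewritten: a good candidate.
add_index :users, :email, unique: true

# A boolean with two distinct values, or a counter updated on every
# request, is usually a poor candidate: the planner may ignore the index,
# and its maintenance can cost more than it saves.
# add_index :users, :confirmed
# add_index :users, :sign_in_count
```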
00:21:06.440 Now let's turn to transactions. The transaction is another important
00:21:12.360 concept in databases. A transaction is a unit of program execution that accesses and possibly
00:21:20.880 updates data items. Understanding the concept of a transaction is critical for
00:21:26.919 understanding and implementing updates of data in a database. Let's see a real example in Rails
00:21:33.679 first. Suppose we want to update the name field of a user, and the user model
00:21:39.279 has a presence validation on the name field. From the generated SQL we can
00:21:45.279 see that it starts with a BEGIN statement, then does the job of updating
00:21:50.799 the name, and the COMMIT statement at the end means the transaction
00:21:56.840 succeeded and has been saved into the database. The second example ends
00:22:03.080 with a ROLLBACK, because it violated the name constraint, so it gives up
00:22:09.919 the update operation. BEGIN, COMMIT, and ROLLBACK are
00:22:17.320 the control statements of a transaction in the database. These two examples are
00:22:22.880 single statements, but we can use the transaction class method to run multiple
00:22:28.559 statements in one transaction. In this example we use the
00:22:35.720 transaction class method, and it does two things: it issues the BEGIN statement, runs
00:22:42.400 statement one and statement two, and then COMMIT. That way we batch-update
00:22:48.840 the name field and the active project id in
00:22:54.640 one transaction, to speed things up.
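
A hedged sketch of that batch update; the attribute names are assumed from the talk:

```ruby
# One BEGIN ... COMMIT wraps both statements, instead of paying for
# two separate transactions.
User.transaction do
  user.update!(name: "New Name")        # statement one
  user.update!(active_project_id: 42)   # statement two
end
```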
00:23:02.200 What's interesting about transactions is callbacks. I know many developers are confused about why a record saved in after_save is not
00:23:07.279 visible from another process. For example, when we enqueue a background
00:23:12.480 job in after_save and it is executed, say in Sidekiq, it
00:23:20.000 can't find the record by its id, and an error is raised.
00:23:26.480 You may have run into this situation. If we know
00:23:32.600 how transactions work, it's easy to understand why. Save and destroy
00:23:37.919 calls are automatically wrapped in a transaction, and the callback chain
00:23:43.080 looks like this: when we try to save the user, it starts with the BEGIN
00:23:48.799 statement of the transaction, then the before_save callbacks run, the
00:23:55.720 third step is the real save, the fourth step is the after_save callback,
00:24:02.279 and then comes the COMMIT statement. So when after_save is called,
00:24:11.080 the transaction is still not committed. That means another process might
00:24:16.840 not be able to see the change you made to the database, so it can't find the
00:24:22.880 record and will just raise a record-not-found error. Only the final step is the
00:24:28.480 commit. So if you really need a background job doing some work
00:24:35.159 that relies on the record having been updated or created in the database,
00:24:40.960 you should use the after_commit callback.
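
A sketch of the fix; the worker class is hypothetical:

```ruby
class User < ActiveRecord::Base
  # Wrong: at this point the INSERT/UPDATE is still inside the transaction,
  # so a worker running on another connection may not see the row yet.
  # after_save { SyncWorker.perform_async(id) }

  # Right: this runs only after the COMMIT, so the row is visible everywhere.
  after_commit { SyncWorker.perform_async(id) }
end
```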
00:24:48.840 Indeed, transaction management is very complex in database design and
00:24:54.320 implementation. The database system should follow the ACID properties of transactions. A is atomicity, which
00:25:02.720 means either all operations of the transaction happen properly in the database or none happen. In the batch
00:25:11.039 transaction we just used, if one of the statements fails, the whole transaction should be rolled
00:25:17.600 back; a failure cannot leave the database in a state where a transaction is partially executed. C is consistency: the execution
00:25:25.440 of an individual transaction should preserve the consistency of the database. If the state of the database before the
00:25:32.600 transaction is consistent, after the execution the state should still be
00:25:37.960 consistent; the transaction must not break the consistency of the database state between before and after the
00:25:43.559 transaction. I is isolation: it ensures that concurrently executing transactions
00:25:48.919 are isolated from one another, as if no other transaction were executing concurrently. D is durability:
00:25:59.240 it makes sure that once a transaction has completed successfully, the changes it
00:26:05.760 made to the database persist, even if there are system
00:26:11.000 failures. Let's see an example to understand how this works. The
00:26:16.840 complexity comes from concurrent processing: our Rails
00:26:22.520 application might update the same table or record at the same time from different processes or threads. For example, this is
00:26:28.960 transaction one, which tries to update user record three, and meanwhile there is
00:26:38.279 another transaction, T2, trying to read the same user record. As
00:26:44.799 these two transactions are running concurrently, what's the user
00:26:50.120 record returned by T2? Is the result of the first query the same as the
00:26:57.840 second? Does anyone want to guess? If you think it's the same, please raise your
00:27:05.200 hand. So: do the same two query statements
00:27:12.840 return the same results? Okay, I will leave you a few
00:27:20.679 seconds to read the code. Do you think it will return
00:27:27.240 different results? Anyone want to raise a hand for
00:27:35.919 "different"? I think no answer is correct, because the answer is: it depends. It might be the same or might not,
00:27:43.640 depending on the concurrency model, or transaction isolation level. There are several isolation
00:27:50.600 levels we can use. The first is read committed, the
00:27:58.440 level which is the default in PostgreSQL. Read committed means that when I run a
00:28:06.200 statement, the statement can only see rows committed before it
00:28:11.440 began. Let's use the example to
00:28:18.919 explain. When the first find in transaction two is
00:28:24.960 executed, transaction one has not committed yet and has not
00:28:30.360 updated the name value, so T2 will print
00:28:38.640 the old value. After the sleep, T2 finds the
00:28:45.880 record with id three again, but transaction one has already committed,
00:28:53.320 so transaction two gets the newly updated record. The second
00:28:59.080 query will print a different value: the first printed the old name and the second will
00:29:06.480 print the newest name. That's
00:29:15.000 read committed, the default in PostgreSQL. Here is another isolation
00:29:23.600 level, which is the default in MySQL: repeatable read. Repeatable read means all
00:29:28.760 statements of the current transaction can only see the rows committed before the first
00:29:33.960 query or data-modification statement was executed in this transaction. So when the first query is executed, it saves a snapshot
00:29:41.559 in the database's memory, and the following queries just use that
00:29:47.120 snapshot. With that, let's go back to the
00:29:54.120 example. When the first find of transaction two is executed,
00:29:59.919 it saves a snapshot, and queries in the same transaction before the
00:30:05.120 commit statement will always return the same record, no matter whether other transactions are updating the
00:30:11.919 record or not. So in transaction two,
00:30:17.640 the second query will always return the old value from when the first
00:30:24.559 statement executed. So if the transaction isolation level
00:30:31.480 is repeatable read, it will return the same value, and this is the default in
00:30:37.919 MySQL. So it acts differently in MySQL and PostgreSQL; you should know this when
00:30:43.080 you use a particular database, and know how its transactions
00:30:48.159 works the third is the seriz isolation level cable is similar to repeat table
00:30:55.600 read but the transaction just a name the transaction will run one by
00:31:01.039 one so let's use back to the example again so it will as it's seriz so the
00:31:10.480 transaction the other transaction will only start when another transaction is
00:31:15.919 completed so no matter whether transaction one is runfor or transaction
00:31:20.960 two runfor it will always run only one transaction and will it will return the
00:31:27.720 same value but it will rise whether it's it might be the old value or the new
00:31:34.039 value depends on which transtion run
00:31:39.919 first. There is another isolation level, called read uncommitted. Read uncommitted means
00:31:48.399 that a transaction can always read values not yet committed, so the changes are
00:31:54.519 always visible. Back to the example: no matter whether
00:32:00.000 transaction one is committed or not, transaction two can always
00:32:05.080 read the value. It's the most useless isolation level, and we should never use it, because we will get
00:32:13.080 unexpected results. PostgreSQL doesn't support this level and will just
00:32:20.320 treat it the same as read
00:32:25.480 committed. Okay, so that's everything.
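
For completeness, Active Record (Rails 4 and later) lets you request an isolation level per transaction; a minimal sketch:

```ruby
# Runs this transaction at the stricter level; the default otherwise
# depends on the database (READ COMMITTED in PostgreSQL,
# REPEATABLE READ in MySQL).
User.transaction(isolation: :serializable) do
  user = User.find(3)
  user.update!(name: "New Name")
end
```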