Summarized using AI

Chasing Pandas

Daniel Baark • October 13, 2017 • Selangor, Malaysia • Talk

In the RubyConf MY 2017 talk titled "Chasing Pandas," Daniel Baark, a Ruby developer from Zappy Store, discusses the importance of data processing in Ruby programming, particularly for large datasets. He highlights the challenges faced while automating market research surveys, which often require real-time data interrogation and analysis. Daniel reviews multiple data analysis tools, starting with traditional methods such as Excel and R, before introducing Python's pandas library as a significant player because of its high performance and extensive community support. He argues that as Ruby developers, there is a necessity to incorporate data processing capabilities into Ruby applications to ensure the language remains competitive against Python, especially in areas like machine learning and deep learning.

Key Points Discussed:
- Introduction to Data Processing in Ruby: Daniel emphasizes the need for effective data processing tools in Ruby to handle large volumes of information efficiently.
- Current Tools Available: He reviews existing tools like Excel, R, and Python's pandas, indicating that while R is powerful, Python's pandas has gained considerable popularity and support.
- Introducing Quattro: Daniel presents Quattro, a gem developed by his team to serve as a bridge between Ruby and Python's pandas, allowing Ruby applications to leverage pandas' capabilities.
- Performance Comparison: He compares the performance of Quattro with Pandas and another Ruby gem, Daru, to illustrate how Quattro achieves performance levels close to pandas while maintaining simplicity in API design.
- Community Engagement: The talk addresses the importance of community contributions to improve Ruby's data processing solutions, encouraging developers to get involved with gems like Daru and Quattro.
- Future of Data Processing in Ruby: Daniel expresses optimism about the potential of Ruby for data science, pending developments like the open-sourcing of Quattro and further progress in associated gems.

Conclusions and Takeaways:
- For data-heavy applications, current solutions like pandas outperform Ruby's offerings, but Quattro represents significant progress in this area.
- Data analysis is a growing requirement in software development, and Ruby must evolve to compete effectively.
- Community engagement is crucial in advancing Ruby's data processing libraries, and developers are encouraged to actively participate in these initiatives.


Speaker: Daniel Baark

Website: http://rubyconf.my

Produced by Engineers.SG

RubyConf MY 2017

00:00:06.050 hi everybody I'm Daniel I'm a rubyist from the UK this is actually my first
00:00:12.440 conference that I've been to outside the UK so thank you all for the warm welcome
00:00:19.720 right now I'm a developer at a company called Zappy Store based in London and
00:00:25.820 what we do is automate market research end to end so a typical use case for us
00:00:31.669 will be something like a big supermarket sending out a survey to a hundred
00:00:37.310 thousand respondents asking them about say 50 brands maybe 30 attributes that
00:00:42.859 they associate with each one you'll have things like whether they buy it their age their gender or the different
00:00:48.590 segmentation and we basically produce all the reporting on that so that's a
00:00:54.949 ton of data analysis that we need to do we need real-time interrogation of the data so you want to be able to filter on
00:01:03.170 say different segments or different brands and you still want to be able to
00:01:08.900 have all your statistics update in real time so that's quite a challenge because we're a Rails app so this talk is
00:01:17.600 basically going to be looking at how we've done that and also at the other
00:01:23.330 tools that are available in Ruby right now so I know a few people were asking about data processing earlier already
00:01:29.690 hopefully this talk will answer some of those questions
if we take a step back first and think what do we actually need from a language to
00:01:42.930 be able to do data processing well in it the main things that we need are the
00:01:49.740 right data structures for it so we need to be able to read data in from a variety of sources we usually want to be
00:01:56.310 able to index that data on some attribute we want to be able to do fast
00:02:01.350 mathematical operations arithmetic and stuff set theory things like slicing and
00:02:06.600 filtering we usually want to be able to do time series plotting stuff over time and we obviously like to be able to
00:02:14.010 visualize that data nicely so what are
00:02:21.480 people using for this right now a big one although it might seem strange is Excel a strange one to talk about at a
00:02:30.060 developer conference maybe but most of the partners whose IP we automate are
00:02:36.090 still using Excel to do this kind of thing by hand right now and it's pretty much the default for non-programmers R
00:02:44.519 is the traditional heavyweight it's been around since 1997 it's used a lot
00:02:49.920 in universities in academia it's pretty much got every statistical function you could ever think of a ton of charting
00:02:56.610 libraries but it's not the most pleasant language to work with for general computing so that's where Python pandas
00:03:04.890 has really been able to make up a lot of ground it started in sort of 2007-2008 it's got
00:03:13.769 a lot of the same advantages a lot of the same statistical packages now but packaged in the Python language which is
00:03:20.250 obviously nicer to work with for general-purpose programming as I guess quite a few of you will be aware
00:03:25.980 already an interesting newcomer I think is Julia I don't know how many of you've
00:03:32.040 seen that before but it's basically also a language purpose written for
00:03:37.140 high-performance computing while still hopefully having all the kind of nice things that we like about scripting
00:03:42.930 languages like Ruby I've not seen it used in a production environment for this yet it's quite
00:03:49.739 recent if anybody has I'd love to hear about it so next question why do data
00:03:59.700 analysis in Ruby well I mean people
00:04:04.859 often say that you should try to use the right tool for the job and so it might seem counterintuitive to
00:04:10.349 want to do this in Ruby at all when we already have those other products but
00:04:15.540 obviously we're all here because we love working in Ruby we you know enjoy Ruby we want to be able to do these things as
00:04:21.150 well and I think also for many people the data processing is a nice to have
00:04:27.240 but maybe not the main focus a lot of people in here I guess will be building Rails apps web frameworks and so on and
00:04:33.710 for them being web first will be the most important thing but it would be nice to have data processing as well so
00:04:41.250 it is important that we have that and I also think that more and more if those
00:04:51.150 libraries aren't available other languages will start to do the web stuff better and Ruby will lose ground if you
00:04:59.250 look at sort of the recent trends and things like machine learning being a big thing deep learning all these things
00:05:05.460 most people are working with them are working with them probably in Python because they have all the packages
00:05:11.070 available all the community around that and that all kind of starts with the bottom layer of data processing for
00:05:17.639 large data sets so
00:05:23.159 you know so what is pandas it's an open-source Python library the name
00:05:29.650 comes from panel data analysis and it was developed as a high-performance library for financial data but it's been
00:05:36.370 taken over by the open source community it's super fast it uses numpy as its
00:05:42.669 sort of underlying array for fast numeric operations its main data
00:05:48.789 structure is the data frame which you can sort of think of like a table and Excel if you like and the series which
00:05:56.020 is kind of like a single column in that data frame it's got great visualization integration through matplotlib it's got
00:06:04.180 a super active community and yeah it's really really fast a lot of its critical code paths have actually been ported to
00:06:10.599 C via Cython it's driving Python growth
00:06:16.930 so a Stack Overflow blog post at the bottom there that was published quite recently said that Python is the fastest
00:06:24.400 growing of the major languages right now by Stack Overflow metrics and that that
00:06:29.740 growth was actually being driven by pandas being the most active tag within Python
00:06:40.690 so our solution is a gem called Quattro that unfortunately is not open
00:06:49.130 sourced yet but we are looking to open-source it soon which is why I'm talking about it now
00:06:55.940 so the first question is why Quattro specifically and then I'll talk about
00:07:01.250 what it is so again our data as I described is kind of it's very loosely
00:07:06.290 structured surveys and questions can come in many different forms it's not always the same but they're very
00:07:06.290 high-dimensional so in a typical survey we might have you know a few hundred thousand rows and we might have 50 or
00:07:19.940 100 columns all of which can be correlated and all of which we kind of need to be able to compare so we around
00:07:29.570 sort of around 2014 we started running into problems where our datasets were
00:07:34.670 getting too large to be able to handle and the only things that existed at the time were kind of NArray and GSL which
00:07:41.120 we were using but we were finding that rapidly the data modeling was getting far too complex for the speed that it
00:07:48.170 provided and even then it just wasn't really fast enough for what we needed to do so we also didn't want to completely
00:07:57.290 rewrite our Rails app because we had a mature product we had a whole team of Rails developers Ruby developers so we
00:08:03.950 thought what else can we do and what we came up with was effectively a
00:08:10.010 translation layer between Ruby and Python or between Ruby and pandas so
00:08:15.520 it's kind of inspired by the way that Active Record scopes and Lisp work it
00:08:21.350 treats your code as data and it builds up expressions
00:08:29.380 where each expression can then be the input to the next expression and so on
00:08:35.450 so forth until you have like a tree with nodes and that entire tree can be passed
00:08:40.580 via Resque workers through Redis to Python workers which then execute that
00:08:46.100 code in pandas and send it back down the wire and we can evaluate it in Ruby
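As a rough sketch of the "code as data" idea described here, an expression tree of plain hashes can be serialized (as it would be for a message broker like Redis) and evaluated by a worker. The node format and operation names are invented for illustration; this is not Quattro's actual protocol.

```ruby
# Hypothetical sketch of the "code as data" idea described above:
# each operation is a plain data node whose children are evaluated
# first, so the whole tree can be serialized (here to JSON) and
# executed by a worker. Node names are invented for illustration.
require 'json'

def evaluate(node)
  return node unless node.is_a?(Hash)        # leaves are literal values
  args = node.fetch("args", []).map { |a| evaluate(a) }
  case node.fetch("op")
  when "add"    then args.sum
  when "mul"    then args.reduce(:*)
  when "filter" then args[0].select { |x| x > args[1] }
  else raise ArgumentError, "unknown op #{node["op"]}"
  end
end

# A tree like mul(add(2, 3), 4) round-trips through JSON unchanged,
# which is what lets it travel across a broker to a remote worker.
tree = { "op" => "mul", "args" => [{ "op" => "add", "args" => [2, 3] }, 4] }
wire = JSON.generate(tree)
result = evaluate(JSON.parse(wire))   # => 20
```

The point of the design is that the Ruby side never executes anything itself; it only builds and ships the tree, so the worker on the other end is free to be pandas.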
00:08:52.930 the first thing to talk about there really is the performance because basically it is running pandas it's very
00:09:00.160 very close all that we've really got is the overhead of the message brokering which is usually in the order of kind of
00:09:08.260 microseconds per node another thing to mention is that we have in some cases
00:09:16.529 sort of erred on the side of keeping the API simple and clean pandas tends to be sort of
00:09:22.960 super flexible it allows you to do a ton of things so we've kind of tried to keep
00:09:29.650 it as simple as possible while still allowing the maximum functionality we
00:09:35.680 don't have or we won't have a visualization library integrated by default because most of the charting
00:09:41.170 that we do is very custom and so this is
00:09:46.410 this is kind of something I want to show you this is how you can add a new method mapping in literally just two or three
00:09:52.870 lines so this is basically mapping a unique method so our data
00:09:59.680 structure is called the measure table which is kind of equivalent to the data frame in pandas and this would be adding
00:10:05.800 a unique method on the measure table that does the exact same as drop_duplicates in pandas so that makes this
00:10:13.839 gem sort of super easy to extend and to keep up to date with latest developments in pandas because it's literally just a
00:10:19.839 couple of lines to add a new method we also have the ability to inject a kind
00:10:25.990 of custom Python code at runtime which we use for sort of stuff that we don't
00:10:31.360 consider to be sensible to keep in the core Quattro engine things like more machine learning type SciPy libraries
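As a rough illustration of the kind of two-or-three-line method mapping described above, a registry-style DSL might look like this. All names here are invented, since Quattro's real code is not public.

```ruby
# Hypothetical sketch of a method-mapping DSL like the one described:
# one declaration per Ruby method, recording which pandas method it
# translates to. Everything here is invented for illustration;
# Quattro itself is not open source.
class MeasureTable
  PANDAS_METHODS = {}

  def self.maps_to_pandas(ruby_name, pandas_name)
    PANDAS_METHODS[ruby_name] = pandas_name
    # Calling the Ruby method just builds another node for the
    # expression tree, to be executed later by a pandas worker.
    define_method(ruby_name) do |*args|
      { op: pandas_name, args: args, receiver: self }
    end
  end

  # Adding a new mapping really is a single line:
  maps_to_pandas :unique, "drop_duplicates"
end

node = MeasureTable.new.unique
# node[:op] is "drop_duplicates"; nothing has executed yet
```

Because each mapping is declarative, keeping up with new pandas releases is a matter of adding lines like the `maps_to_pandas` call rather than writing new logic.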
another gem I contribute to that I want to talk about because this is available now and it's also I find very
00:10:46.200 useful is the Daru gem which is part of the SciRuby project it was created by Sameer
00:10:52.410 Deshmukh in 2014 stands for data analysis in Ruby although apparently it's also a Hindi
00:10:59.040 word for alcohol which he took some inspiration from and you can kind of
00:11:04.350 think of it as being Ruby's nearest native equivalent to pandas it uses
00:11:10.290 NMatrix as its equivalent to numpy for fast numerical operations its data
00:11:15.810 structures are the data frame and the vector rather than the series and it also has various visualization libraries
00:11:22.200 integrated although I would say that they are probably more fragmented and
00:11:28.800 less complete than what you get with Python so I want to talk a little bit
about the community effect that has really driven pandas forward if you look at the
00:11:44.190 chart there you can see the commits and the contributors that pandas has versus Daru and Quattro Quattro obviously is not
00:11:50.790 particularly meaningful since it's not open source yet but just for comparison's sake I've included it and obviously
00:11:57.990 you can see that pandas hugely outstrips anything in SciRuby in terms of its
00:12:05.250 community adoption this is github issues open and closed and that might seem like
00:12:12.300 a frightening amount of issues for a project to have but I guess what it really shows is that it's being used a
00:12:17.970 lot people are reporting issues and closing them and in fact there's been more issues closed on pandas just on CSV
00:12:25.290 handling than have been opened across all of SciRuby put together which kind of tells you that this is a production
00:12:30.750 ready battle-hardened library and last one is Stack Overflow posts
00:12:38.529 that's the pandas tag I tried to find it but unfortunately the Daru tag doesn't
00:12:43.910 exist yet I did manage to find about 18 or so questions that were
00:12:49.459 related to it but not tagged I will tag them after this talk and about 20 or so
00:12:55.220 more that were tagged SciRuby all of which I would say were answered by people working on the gem so it's not
00:13:04.100 all bad and this you know this kind of is my next point that just because it's a small community doesn't mean it's not
00:13:10.220 a great community everybody working on it is super active they do respond to
00:13:15.620 your pull requests very quickly to issues to questions both on email and
00:13:21.170 Slack channels and on GitHub itself and the gem is picking up within the SciRuby
00:13:27.320 community more and more people do seem to be getting involved
00:13:35.300 so I'm going to attempt to show a demo I
00:13:40.670 was hoping that I would have a slightly higher resolution so that I could do these side-by-side but hopefully I tried
00:13:47.640 it earlier and you couldn't see anything so I'll do them sequentially and hopefully you still get the idea so this
00:13:54.149 is basically just going to be a quick basic example of how you would do something in pandas if I can get my
00:14:00.450 mouse and so what we're doing here is we're reading a CSV that I have found on
00:14:05.730 the internet that contains IMDB Hollywood movie data about sort of social media like the amount of Facebook
00:14:11.610 Likes that the actor and the movie have got we're going to index it on the movie title remove any duplicates from the
00:14:19.260 data set then we're going to merge that on the movie title index with financial
00:14:26.130 data for the movie just a quick example
00:14:31.170 of some filtering so for example on the director name and a string pattern in
00:14:37.649 the genre something you know a simple statistic like the mean gross revenue by
00:14:43.740 director and the top ten and then see a quick visualization and I'll show how that would look in Daru
00:14:49.740 and in Quattro so I was hoping to do this side-by-side but basically things to look out for are kind of the syntax and
00:14:56.579 the performance so you can see these are pretty quick
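The demo pipeline just described (read a CSV, index on movie title, drop duplicates, merge in financial data, then take the mean gross by director) can be sketched in plain Ruby with only the standard library. The data below is invented; this is not the talk's demo code or dataset, just an illustration of the same steps.

```ruby
# Plain-Ruby sketch of the demo pipeline described above, using only
# the standard library. The real demo used an IMDB CSV with pandas
# (read_csv, set_index, drop_duplicates, merge, groupby(...).mean()).
require 'csv'

movies_csv = <<~CSV
  movie_title,director_name,movie_facebook_likes
  Avatar,James Cameron,33000
  Avatar,James Cameron,33000
  Titanic,James Cameron,26000
  Spectre,Sam Mendes,85000
CSV

financial = {
  "Avatar"  => 760_505_847,
  "Titanic" => 658_672_302,
  "Spectre" => 200_074_175,
}

rows = CSV.parse(movies_csv, headers: true).map(&:to_h)

# "Index" on movie_title and drop duplicate titles (keep the first).
by_title = {}
rows.each { |r| by_title[r["movie_title"]] ||= r }

# "Merge" the financial data in on the title index.
merged = by_title.map { |title, r| r.merge("gross" => financial[title]) }

# Group by director and take the mean gross (integer mean, fine for a sketch).
mean_gross = merged
  .group_by { |r| r["director_name"] }
  .transform_values { |rs| rs.sum { |r| r["gross"] } / rs.size }
```

In pandas each of these steps is a single vectorized call, which is where the performance gap over hand-rolled Ruby like this comes from.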
00:15:11.310 just click through so you've got like a
00:15:16.980 nice regression chart line there of sort of movie Facebook Likes against the gross profit we've got our table here of
00:15:24.029 the top ten from a group-by which is a pretty common operation our filtered data
00:15:31.290 our original table merged table sorry and
00:15:39.140 so we can see that was all pretty speedy I'm going to demonstrate the same thing in
00:15:45.420 Daru now which unlike Quattro is open source right
00:15:58.230 now you can get this just by doing gem install daru
00:16:03.770 that's fine okay so we've loaded the CSV we can see
00:16:09.510 that was a little bit slow but still fine this is interesting so Daru actually is
00:16:14.760 a little bit more opinionated on indexes it doesn't let you set an index with duplicates you have to use categorical
00:16:20.910 indexes for that whereas pandas does let you even though it has categoricals as well so we'll
00:16:27.870 clean up the duplicates first so here you can start to see already a little
00:16:38.619 pause before we get our result we can now set
00:16:49.839 the index now that it's a unique index we can again load a data frame from a
00:16:56.350 hash we can merge it this is the one where
00:17:04.850 you can really tell so this takes about two minutes and I'm going to move on to the Quattro one while this is running sorry
00:17:16.339 where's my mouse gone yeah
00:17:28.259 so as you can see the performance here is much closer to what we had with pandas which is pretty much instant and
00:17:38.529 there we go obviously like I said we don't have a visualization library integrated with this because we do custom charting so that's not
00:17:45.789 much to share there let's go back and see whether the Daru one has finished yet and it's still going
in the meantime oh yeah one other thing
00:18:18.770 I wanted to show you on the Quattro example so this is an example of
00:18:27.110 where I say that we've kind of kept the API simple rather than implementing absolutely everything that pandas has so
00:18:34.430 whereas pandas has a join method and a merge method we've just implemented this as a single reset index merge set
00:18:43.220 index which basically means you can use it with multi-index merges even if they don't completely match up if you're just
00:18:49.550 matching on one or two indexes or something like that which you can't do with the straight join method okay this
00:18:59.930 is actually taking awkwardly long so I'm going to move on
so looking at the performance results this is kind of a table of just the methods that I was running there run
00:19:26.280 a hundred times so you can see the differences a bit better and what we find is that Daru is generally at
00:19:32.070 least two orders of magnitude slower Quattro is roughly equivalent to pandas
00:19:38.010 plus as I said the overhead per node and per round trip this might seem very sort
00:19:45.810 of harsh and disappointing on Daru but it is quite a new gem still it's
00:19:52.530 basically at the moment they're working on 1.0 release with a stable API so
00:19:58.740 really getting kind of feature complete has been the main focus and not performance there is work to come that
00:20:04.200 will hopefully improve this quite a bit
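For readers who want to reproduce this kind of per-method comparison, Ruby's standard Benchmark module is enough for a rough version. The workloads below are stand-ins, not the talk's actual pandas/Daru/Quattro benchmark.

```ruby
# Minimal sketch of repeating an operation a hundred times and timing
# it, with Ruby's standard Benchmark module. The workloads are
# stand-ins for the real pandas/Daru/Quattro calls compared in the talk.
require 'benchmark'

data = (1..100_000).to_a

report = Benchmark.bm(12) do |bm|
  bm.report("sum:")    { 100.times { data.sum } }
  bm.report("filter:") { 100.times { data.select { |x| x > 50_000 } } }
end

# Benchmark.bm returns an array of Benchmark::Tms results,
# one per report line, with user/system/real timings.
```

Running the same labelled blocks against each library's equivalent call is how a table like the one in the talk can be produced.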
00:20:12.520 so again that's just that in graph form where we can see the cumulative performance on the log scale and we can
00:20:18.280 see that Quattro actually does pretty well but that's quite a naive
00:20:24.130 benchmarking case there's actually several things that we do in Quattro that make the performance better than you
00:20:29.650 would think still so the first thing is single worker transactions which is that
00:20:35.710 basically the way it's set up by default is kind of similar to what Jim was
00:20:41.590 describing in his talk of a functional setup where each expression can be
00:20:46.809 evaluated in isolation and will always provide the same result by using single
00:20:52.210 worker transactions we can actually sort of say the same workers should process a whole block of expressions so that we
can shard and cache sub-expressions so that if the same sub-expression appears in multiple
00:21:05.470 expressions we don't have to recalculate that portion we can just take that again we can also basically treat it
00:21:13.179 as a compiler because effectively what it is doing is mapping to pandas code so we can do stuff like tree
00:21:19.480 rewrites and index partitioning whereby if you do something like let's say an
00:21:24.490 arithmetic operation you're multiplying everything by scalar and then you're slicing out some index value then quatro
00:21:31.300 can actually be made smart enough to know okay I'm only going to take this slice of the data anyway so rewrite the
00:21:37.809 tree so that I only take that slice from the beginning and only do the arithmetic operation on the subset that I need
00:21:43.270 which obviously means they end up doing your calculations on smaller data sets and leads to better performance so the
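The slice pushdown just described, moving a slice underneath an arithmetic operation so the math only touches the rows that survive, can be sketched on a toy expression tree in plain Ruby. The node shapes are invented for illustration; this is not Quattro's actual rewriter.

```ruby
# Toy sketch of the tree rewrite described above: a slice applied
# after a scalar multiply is pushed underneath it, so the arithmetic
# only runs on the sliced subset. Node shapes are invented.

def rewrite(node)
  if node[:op] == :slice && node[:child][:op] == :mul
    mul = node[:child]
    # slice(mul(data, k)) => mul(slice(data), k)
    { op: :mul, factor: mul[:factor],
      child: { op: :slice, range: node[:range], child: mul[:child] } }
  else
    node
  end
end

def evaluate(node)
  case node[:op]
  when :data  then node[:values]
  when :mul   then evaluate(node[:child]).map { |x| x * node[:factor] }
  when :slice then evaluate(node[:child])[node[:range]]
  end
end

data = { op: :data, values: (1..1_000).to_a }
tree = { op: :slice, range: 0...10,
         child: { op: :mul, factor: 2, child: data } }

# Both forms give the same answer, but the rewritten tree multiplies
# 10 elements instead of 1,000.
evaluate(tree) == evaluate(rewrite(tree))
```

The same equivalence is what lets a real engine rewrite the tree freely: as long as each rewrite preserves the result, it can pick whichever shape does the least work.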
biggest blocker for all of them this is a quote from Wes McKinney's blog sorry he's the guy that
00:21:58.840 created pandas and his rule of thumb is that you should have roughly 5 to 10
00:22:03.910 times as much RAM as the size of your data set which is obviously pretty huge
00:22:08.970 but that's definitely been true of our experience in production Ram has
00:22:14.020 absolutely been the biggest performance blocker so future development across all
the gems the first thing to look out for there is Arrow the Apache Arrow project which
00:22:27.380 is Wes McKinney's sort of newest project and it's basically a memory format specification that supports zero
00:22:34.460 copy data reads for fast access without the serialization overhead so that you
00:22:39.680 can basically avoid this huge RAM spike and it's also designed to be an
00:22:44.900 interoperable standard so that hopefully there'll be bindings to this for R for pandas for Ruby so that you can
00:22:50.600 use the same data across different languages
00:22:55.990 for Daru the next big thing coming up is as I say the version one release and also there's another SciRuby gem
00:23:02.420 called Rubex that I'm going to look at in just a second that will hopefully allow us to rewrite some of its slowest
00:23:08.180 portions as C extensions the same way that pandas has done with Cython pandas
00:23:14.750 has kind of moved to JIT compilation a little bit for places where they found extra optimization needed so they're
00:23:20.450 using Numba which is based on the LLVM stack and for Quattro the biggest thing
00:23:25.550 is going to be open sourcing obviously we're hoping to do that quite soon so if
we just take a quick look at the Rubex example I'm not sure if that's clear but I did put it on separate slides just in
00:23:37.790 case so this is just a simple example of how it should work so this is kind of
00:23:43.940 your standard function that you might write in Ruby this would be what your
00:23:49.190 typical C extension might look like for it not so pleasant to write necessarily
00:23:55.010 if you're not a C developer and this is how you could do it in Rubex basically
00:24:01.580 just by declaring your types and it'll compile down to C code I would say this
00:24:06.620 is still quite early days for example it doesn't handle recursive functions well
00:24:12.740 yet it's not as far as I know been tested anywhere in production we don't
00:24:18.440 even have it in Daru yet but the next thing to look at I think is going to be integrating this into Daru
going back to the demo we can just see
00:24:38.110 yeah I just want to prove this has now returned the rest of these are quite a
00:24:45.220 bit quicker so we can filter fine we can
group by find the mean and sort the top ten and we've got this is just a simple
00:24:58.840 visualization example there giving a similar plot to what we had in pandas of a scatter of the likes versus gross
00:25:12.650 yeah just some closing thoughts on this basically I think it's pretty clear that
00:25:17.840 if you're doing hardcore data analysis and you've got large data sets or
00:25:22.970 extremely high performance requirements right now pandas is gonna be your only option but I think realistically we're
00:25:31.790 not that far away from hopefully having Ruby be good enough that it's not
something you turn away from it's worth bearing in mind that pandas started out ten years after R and has been
00:25:47.059 remarkably successful so the fact that we are maybe a few years behind is not necessarily devastating Quattro being
00:25:57.020 released I would hope will really drive adoption of Ruby for data science I really hope
00:26:02.929 that people get to using it and yeah I would say get involved if you have data
science requirements do check out the Daru gem do contribute do watch out for Quattro being open sourced lastly I would
like to say thank you to the people that have helped make these tools possible at Zappy Store that's Brendan and the rest of
00:26:24.080 the Quattro team at SciRuby that's Sameer and the rest of the team and at SciPy that's Wes I would also like to
00:26:31.880 point you if you're interested in this both Brendan and Sameer have given excellent talks on Quattro and on Daru
00:26:39.080 in the past that are worth googling and yeah obviously feel free to contact me
thank you so much Daniel I'm going to take some questions on data processing sure so
00:27:08.240 we'll start like that all right I mean the performance issue can we solve it like how we solve most Ruby problems
00:27:17.270 can we just throw machines at it or is it just really a fundamental language
00:27:23.000 problem that makes it slow tough question I think ultimately if you look at
00:27:31.250 how pandas has gotten so fast and that's really what you're comparing it to right pandas has gotten so fast by hugely
00:27:38.480 leveraging C code Cython C extensions and I think it's always going to be
00:27:44.960 difficult to compete with that in pure Ruby obviously there may be a way but I
would say realistically probably you would want to go the same route and start using something like Rubex to
00:27:56.780 start compiling some of these down okay so about Rubex right so Rubex is basically just like
00:28:03.040 Ruby with some extensions that you need to write and all that yeah so you
00:28:10.460 basically provide types for it yeah okay wait so then why not move to
00:28:18.080 something like Crystal it's basically Ruby with typing sure
00:28:26.500 I guess because this is easy to integrate into existing Ruby projects so full disclosure I don't know
00:28:34.279 enough about Crystal to make a full comparison but what I would say is that you can just install this as a gem
00:28:40.190 and it'll work with your existing Ruby code so I think that's the primary advantage that's pretty good
00:28:47.330 any more questions Chris
00:29:00.970 so the only way they could release
so I think the first question was is there interest in the community
00:29:14.720 in this being open sourced and I think from the sort of questions that I've heard around these two days and sort of
00:29:20.600 the response that we've had in other places I think it's clear that there is some demand for data processing Ruby and
00:29:25.940 I think that probably there are people that would want to use this we are quite keen in the company to get this open
00:29:32.240 source as quickly as possible I think the blocker at the moment is not so much anything that sort of can be
00:29:38.600 helped with it's really just cleaning out the last of kind of client IP in the code and making it ready to be released
00:29:45.289 which the majority of that work has been done but there is just you know a little bit left yep all right there's a good
00:29:53.090 question one more question yes PC
sorry I didn't hear the first part but how do you set rules for data analysis
00:30:14.029 as in how we make sure that our modeling is sort of statistically valid that way
00:30:35.900 so in our specific context we do that by quite tightly controlling the data so we
00:30:42.419 work with our partners to structure the survey so we work with leading market
00:30:48.090 research agencies and they provide a lot of the methodology that makes sure that you know the way we gather the data and
00:30:53.970 the way we process it is statistically valid that for the most part happens
00:31:04.169 automatically it's that's kind of the business we're in is trying to eliminate as much of the manual process of that as
00:31:10.260 possible there is obviously still some but the vast majority of it is all right
all right one more round of applause for Daniel thank you so much and just a final thank you to
00:31:24.720 Jimmy and the organizers for having me here