Hi everybody, I'm Daniel. I'm a Rubyist from the UK, and this is actually my first conference outside the UK, so thank you all for the warm welcome.

Right now I'm a developer at a company called ZappiStore, based in London, and what we do is automate market research end to end. A typical use case for us will be something like a big supermarket sending out a survey to a hundred thousand respondents, asking them about, say, 50 brands and maybe 30 attributes that they associate with each one. You'll have things like whether they buy it, their age, their gender, all the different segmentation, and we basically produce all the reporting on that. So that's a ton of data analysis that we need to do, and we need real-time interrogation of the data: you want to be able to filter on, say, different segments or different brands, and you still want all your statistics to update in real time. That's quite a challenge, because we're a Rails app.

So this talk is basically going to look at how we've done that, and also at the other tools that are available in Ruby right now. I know a few people were asking about data processing earlier already; hopefully this talk will answer some of those questions.
Let's take a step back first and think: what do we actually need from a language to be able to do data processing well in it? The main thing we need is the right data structures for it. We need to be able to read data in from a variety of sources; we usually want to be able to index that data on some attribute; we want to be able to do fast mathematical operations, arithmetic and so on; set theory things like slicing and filtering; we usually want to be able to do time series, plotting stuff over time; and we obviously like to be able to visualize that data nicely.
So what are people using for this right now? A big one, although it might seem a strange one to talk about at a developer conference, is Excel. Most of the partners whose IP we automate are still using Excel to do this kind of thing by hand right now, and it's pretty much the default for non-programmers.

R is the traditional heavyweight. It's been around since 1997, it's used a lot in universities and academia, and it's got pretty much every statistical function you could ever think of, plus a ton of charting libraries. But it's not the most pleasant language to work with for general computing, and that's where Python with pandas has really been able to make up a lot of ground. It started around 2007-2008, and it now has a lot of the same advantages, a lot of the same statistical packages, but packaged in the Python language, which is obviously nice because it works with your general-purpose programming; I guess quite a few of you will be aware of that already.

An interesting newcomer, I think, is Julia. I don't know how many of you have seen it before, but it's basically a language purpose-written for high-performance computing, while still hopefully having all the nice things that we like about scripting languages like Ruby. I've not seen it used in a production environment for this yet, it's quite recent; if anybody has, I'd love to hear about it.
00:03:59.700
analysis in Ruby well I mean people
00:04:04.859
often say that you should try to use the right tool for the right for the job and so it might seem counterintuitive to
00:04:10.349
want to do this in Ruby at all when we already have those other products but
00:04:15.540
obviously we're all here because we love working in Ruby we you know enjoy Ruby we want to be able to do these things as
00:04:21.150
well and I think also for many people the data processing is a nice to have
00:04:27.240
but maybe not the main focus a lot of people in here I guess we'll be building rails apps web frameworks and so on and
00:04:33.710
for them being web first will be the most important thing but it would be nice to have data processing as well so
00:04:41.250
it is important that we have that and I also think that more and more if those
00:04:51.150
libraries aren't available other languages will start to do the web stuff better and Ruby will lose ground if you
00:04:59.250
look at sort of the recent trends and things like machine learning being a big thing deep learning all these things
00:05:05.460
most people are working with them are working with them probably in Python because they have all the packages
00:05:11.070
available all the community around that and that all kind of starts with the bottom layer of data processing for
00:05:17.639
large data sets so
So what is pandas? It's an open-source Python library; the name comes from "panel data". It was developed as a high-performance library for financial data, but it has since been taken over by the open-source community. It's super fast: it uses NumPy as its underlying array for fast numeric operations. Its main data structures are the DataFrame, which you can think of like a table in Excel if you like, and the Series, which is kind of like a single column in that DataFrame. It's got great visualization integration through matplotlib, it's got a super active community, and yeah, it's really, really fast; a lot of its critical code paths have actually been ported to C via Cython.

It's driving Python's growth, too. There's a Stack Overflow blog post, at the bottom there, published quite recently, which said that Python is the fastest growing of the major languages right now by Stack Overflow's metrics, and that that growth was actually being driven by pandas being the most active tag within Python.
So, our solution is a gem called Quattro, which unfortunately is not open-sourced yet, but we are looking to open-source it soon, which is why I'm talking about it now.

The first question is why Quattro specifically, and then I'll talk about what it is. Again, our data, as I described, is very loosely structured: surveys and questions can come in many different forms, it's not always the same, but they're very dimensional. In a typical survey we might have a few hundred thousand rows and maybe 50 or 100 columns, all of which can be correlated and all of which we need to be able to compare. Around 2014 we started running into problems where our datasets were getting too large to handle with the only things that existed at the time, which were NArray and GSL. We were using those, but we found the data modeling was rapidly getting far too complex for the speed they provided, and even then it just wasn't fast enough for what we needed to do. We also didn't want to completely rewrite our Rails app, because we had a mature product and a whole team of Rails developers, Ruby developers. So we thought: what else can we do?

What we came up with was effectively a translation layer between Ruby and Python, or between Ruby and pandas. It's inspired by the way Active Record scopes work, and by Lisp: it treats your code as data, and it builds up expressions where each expression can be the input to the next expression, and so on, until you have a tree of nodes. That entire tree can be passed, via Resque workers, through Redis to Python workers, which then execute that code in pandas and send the result back down the wire, where we can evaluate it in Ruby.
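To make that concrete, here's a minimal, hypothetical sketch of the idea. Quattro isn't open source yet, so the class and method names below are my illustration of the pattern, not its real API:

    require 'json'

    # Each method call on an expression just records a new node instead of
    # executing anything, building up a tree that a worker can run later.
    class Expr
      attr_reader :op, :args, :child

      def initialize(op, args = [], child = nil)
        @op, @args, @child = op, args, child
      end

      def method_missing(name, *call_args)
        Expr.new(name, call_args, self) # each expression feeds the next one
      end

      def respond_to_missing?(_name, _include_private = false)
        true
      end

      def to_tree
        { op: op, args: args, child: child && child.to_tree }
      end
    end

    query = Expr.new(:read_csv, ['movies.csv']).set_index('movie_title').unique

    # This JSON payload is what would travel through Redis (via Resque) to a
    # Python worker, be executed as pandas calls, and come back as a result.
    puts JSON.pretty_generate(query.to_tree)

The useful properties all fall out of the fact that the tree is plain data: it can be serialized, cached and rewritten before anything runs.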
The first thing to talk about there is the performance, because basically it is running pandas, so it's very, very close. All we've really got on top is the overhead of the message brokering, which is usually in the order of microseconds per node.

Another thing to mention is that in some cases we've erred on the side of keeping the API simple and clean. pandas tends to be super flexible, it allows you to do a ton of things, so we've tried to keep ours as simple as possible while still allowing the maximum functionality. We don't have, and won't have, a visualization library integrated by default, because most of the charting that we do is very custom.

This is something I want to show you: how you can add a new method mapping in literally just two or three lines. Our data structure is called the MeasureTable, which is roughly equivalent to the DataFrame in pandas, and this would be adding a "unique" method on the MeasureTable that does the exact same thing as drop_duplicates in pandas. That makes the gem super easy to extend and to keep up to date with the latest developments in pandas, because it's literally just a couple of lines to add a new method.
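A mapping of that shape could look something like the following; this is an illustrative guess at the pattern, not Quattro's actual DSL:

    # Hypothetical sketch: declare that a Ruby method translates to a named
    # pandas method, so extending the gem is one declaration per method.
    class MeasureTable
      def self.maps(ruby_name, to:)
        define_method(ruby_name) do |*args|
          remote_call(to, *args)
        end
      end

      # MeasureTable#unique behaves like pandas' drop_duplicates
      maps :unique, to: 'drop_duplicates'

      private

      def remote_call(pandas_method, *args)
        # In the real gem this would append a node to the expression tree
        # and hand it to the Resque/Redis pipeline; stubbed here.
      end
    end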
00:10:25.990
of custom Python code at runtime which we use for sort of stuff that we don't
00:10:31.360
consider to be sensible to keep in the core watch our engine things like more machine learning type scifi libraries
00:10:40.310
another gender I contribute to you that I want to talk about because this is available now and it's also I find very
00:10:46.200
useful is the diary gem which is part of the syru library it's created by sama
00:10:52.410
Deshmukh in 2014 stands for data analysis in Ruby although apparently it's also a Hindu
00:10:59.040
word for alcohol which he took some inspiration from and you can kind of
00:11:04.350
think of it as being Ruby's nearest native equivalent to pandas it uses n
00:11:10.290
matrix as its equivalent to numpy for fast numerical operations its data
00:11:15.810
structures of the data frame and the vector rather than the series and it also has various visualization libraries
00:11:22.200
integrated although I would say that they are probably more fragmented and
00:11:28.800
less complete than what you get with Python so I want to talk a little bit
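As a quick taste, here's a minimal sketch of those two structures, assuming the daru API of the 0.1.x era:

    require 'daru' # gem install daru

    # A DataFrame is a table of named columns; each column is a Daru::Vector.
    df = Daru::DataFrame.new(
      { gross: [760.5, 309.4, 658.6], budget: [237.0, 245.0, 165.0] },
      index: ['Avatar', 'Spectre', 'Interstellar']
    )

    df[:gross].mean               # arithmetic over a single Vector
    df.where(df[:budget].gt(200)) # Arel-style row filtering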
So I want to talk a little bit about the community effect that has really driven pandas forward. If you look at the chart there, you can see the commits and the contributors that pandas has versus daru and Quattro. Quattro obviously is not particularly meaningful since it's not open source yet, but I've included it for comparison's sake, and you can see that pandas hugely outstrips anything in SciRuby in terms of community adoption.

This is GitHub issues, open and closed, and that might seem like a frightening amount of issues for a project to have, but I guess what it really shows is that it's being used a lot: people are reporting issues and closing them. In fact, there have been more issues closed on pandas just on CSV handling than have been opened across all of SciRuby put together, which tells you that this is a production-ready, battle-hardened library.

The last one is Stack Overflow posts. That's the pandas tag; I tried to find one, but unfortunately a daru tag doesn't exist yet. I did manage to find about 18 or so questions that were related to it but not tagged (I will tag them after this talk), and about 20 or so more that were tagged SciRuby, all of which, I would say, were answered by people working on the gem. So it's not all bad, and this is my next point: just because it's a small community doesn't mean it's not a great community. Everybody working on it is super active; they respond very quickly to pull requests, to issues, to questions, both on email and Slack channels and on GitHub itself. And the gem is picking up within the SciRuby community; more and more people do seem to be getting involved.
So, I'm going to attempt to show a demo. I was hoping I would have a slightly higher resolution so that I could do these side by side, but I tried it earlier and you couldn't see anything, so I'll do them sequentially and hopefully you still get the idea.

This is basically just going to be a quick, basic example of how you would do something in pandas, if I can get my mouse. What we're doing here is reading a CSV that I found on the internet, containing IMDB Hollywood movie data about social media, things like the amount of Facebook likes that the actors and the movie have got. We're going to index it on the movie title and remove any duplicates from the dataset; then we're going to merge it, on the movie title index, with financial data for the movies; then a quick example of some filtering, for example on the director name and on a string pattern in the genre; then a simple statistic, like the mean gross revenue by director, and the top ten; and then a quick visualization. After that I'll show how the same thing looks in daru and in Quattro. The things to look out for are the syntax and the performance; you can see these are all pretty quick.
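For reference, the same pipeline sketched in daru might look roughly like this; the column names (movie_title, director_name, gross) and the string-keyed access are my assumptions about the dataset, so treat it as a sketch:

    require 'daru'

    movies = Daru::DataFrame.from_csv('movie_metadata.csv')

    # daru wants unique indexes, so drop duplicate titles before indexing.
    seen = {}
    movies = movies.filter_rows do |row|
      title = row['movie_title']
      seen[title] ? false : (seen[title] = true)
    end
    movies = movies.set_index('movie_title')

    # Filtering on an attribute, e.g. one director's movies:
    cameron = movies.where(movies['director_name'].eq('James Cameron'))

    # Mean gross revenue by director, top ten:
    top_ten = movies.group_by(['director_name']).mean
                    .sort(['gross'], ascending: false)
                    .head(10)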
I'll just click through. You've got a nice chart there of movie Facebook likes against the gross profit; we've got our table here of the top ten via a group-by, which is a pretty common operation; our filtered data; our original table, merged table, sorry. And we can see that was all pretty speedy.

I'm going to demonstrate the same thing in daru now. So that was Quattro; this one is open source right now, and you can get it just by doing gem install daru. Okay, so we've loaded the CSV; we can see that was a little bit slow, but still fine. This is interesting: daru is actually a little bit more opinionated about indexes. It doesn't let you set an index with duplicates, you have to use categorical indexes for that, whereas pandas does let you, even though it has categoricals as well. So we'll clean up the duplicates first. Here you can already start to see a little pause before we get our result. Now that it's a unique index, we can set the index; we can again load a data frame from a hash; and we can merge it. This is the one where you can really tell: this takes about two minutes, so I'm going to move on to the Quattro one while this is running. Sorry, where's my mouse gone? Yeah.

So, as you can see, the performance here is much closer to what we had with pandas, pretty much instant, and there we go. Obviously, like I said, we don't have a visualization library integrated with this, because we do custom charting, so there's not much to share there. Let's go back and see whether the daru one has finished yet... and it's still going.

In the meantime, one other thing I wanted to show you on the Quattro example: this is an example of where I said we've kept the API simple rather than implementing absolutely everything that pandas has. Whereas pandas has a join method and a merge method, we've implemented this as a single reset-index, merge, set-index operation, which basically means you can use it with multi-index merges even if they don't completely match up, if you're just matching on one or two indexes or something like that, which you can't do with the straight join method.
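In other words, the single Quattro-side merge corresponds to a sequence along these lines; the wrapper below is illustrative, with hypothetical method names, since the gem isn't public yet:

    # Sketch of the combined operation: reset the indexes, merge on the
    # named key columns, then restore the index on the result. `left` and
    # `right` stand for measure tables (expression trees over DataFrames).
    def merge_on(left, right, keys)
      left.reset_index
          .merge(right.reset_index, on: keys)
          .set_index(keys)
    end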
Okay, this is actually taking awkwardly long, so I'm going to move on.

Looking at the performance, this is a table of just the methods I was running there, each run a hundred times so you can see the differences a bit better. What we find is that daru is generally at least two orders of magnitude slower, while Quattro is roughly equivalent to pandas plus, as I said, the overhead per node and per round trip. This might seem very harsh and disappointing for daru, but it is still quite a new gem. At the moment they're working on a 1.0 release with a stable API, so getting feature-complete has been the main focus rather than performance; there is work to come that will hopefully improve this quite a bit.

Again, that's just the same thing in graph form, where we can see the cumulative performance on a log scale, and we can see that Quattro actually does pretty well. But that's quite a naive benchmarking case; there are actually several things we do in Quattro that make the performance better than you might think. The first is single-worker transactions. The way it's set up by default is similar to the functional setup Jim was describing in his talk, where each expression can be evaluated in isolation and will always produce the same result. By using single-worker transactions we can say that the same worker should process a whole block of expressions, which means we can shard and cache sub-expressions: if the same sub-expression appears in multiple expressions, we don't have to recalculate that portion, we can just take the cached result again.
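A sketch of that caching idea, reusing the illustrative node format from the earlier sketch (again, not Quattro's real implementation):

    require 'digest'
    require 'json'

    # Key each sub-tree by a digest of its serialized form, so a repeated
    # sub-expression is only computed once per transaction.
    CACHE = {}

    def evaluate(node)
      key = Digest::SHA1.hexdigest(JSON.generate(node))
      CACHE[key] ||= begin
        child_result = node[:child] && evaluate(node[:child])
        run_in_pandas_worker(node[:op], node[:args], child_result)
      end
    end

    def run_in_pandas_worker(op, args, child_result)
      # In the real system this call would go over Redis to a pandas
      # worker; it's a placeholder here so the sketch stands alone.
      { op: op, args: args, input: child_result }
    end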
We can also basically treat it as a compiler, because effectively what it's doing is mapping to pandas code, so we can do things like tree rewrites and index partitioning. If you do something like an arithmetic operation, multiplying everything by a scalar, and then you slice out some index value, Quattro can be made smart enough to know: I'm only going to take this slice of the data anyway, so rewrite the tree so that I take that slice from the beginning and only do the arithmetic operation on the subset that I need. That obviously means you end up doing your calculations on smaller data sets, which leads to better performance.
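That rewrite could look something like this, on the same illustrative node format, pushing the slice underneath the arithmetic so the multiplication only ever sees the rows we keep:

    # If a slice sits on top of a scalar multiplication, swap them so the
    # slice happens first. Valid because slicing commutes with
    # element-wise arithmetic.
    def push_down_slice(node)
      return node unless node[:op] == :slice && node.dig(:child, :op) == :multiply

      mult = node[:child]
      { op: :multiply, args: mult[:args],
        child: { op: :slice, args: node[:args], child: mult[:child] } }
    end

    tree = { op: :slice, args: ['UK'],
             child: { op: :multiply, args: [2],
                      child: { op: :read_csv, args: ['sales.csv'] } } }

    push_down_slice(tree) # multiply now applies only to the 'UK' slice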
The biggest blocker for all of them, though, is RAM. This is a quote from Wes McKinney's blog (he's the guy that created pandas), and his rule of thumb is that you should have roughly 5 to 10 times as much RAM as the size of your data set, which is obviously pretty huge. That's definitely been true of our experience in production: RAM has absolutely been the biggest performance blocker.

So, future development across all the gems. The first thing to look out for is Arrow, the Apache Arrow project, which is Wes McKinney's newest project. It's basically a memory format specification that supports zero-copy data reads, for fast access without the serialization overhead, so that you can avoid this huge RAM spike. It's also designed to be an interoperable standard, so that hopefully there will be bindings to it for pandas and for Ruby, and you can use the same data across different languages.

With daru, the next big thing coming up is, as I say, the version 1.0 release. There's also another SciRuby gem called Rubex, which I'm going to look at in just a second, that will hopefully allow us to rewrite some of daru's slowest portions as C extensions, the same way that pandas has done with Cython. pandas has also moved a little bit towards JIT compilation for places where they found extra optimization was needed; they're using Numba, which is based on the LLVM stack. And for Quattro the biggest thing is going to be open-sourcing, obviously; we're hoping to do that quite soon.

If we just take a quick look at the Rubex example (I'm not sure if that's clear, but I did put it on separate slides just in case), this is a simple example of how it should work. This is the kind of standard function that you might write in Ruby; this is what your typical C extension might look like for it, not so pleasant to write if you're not a C developer; and this is how you could do it in Rubex, basically just by declaring your types, and it'll compile down to C code.
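From memory, the Rubex style is roughly the following; check the rubex gem's README for the exact syntax, and treat this as an illustrative sketch rather than verified Rubex code:

    # Ruby-like code with C type declarations on the argument and locals;
    # Rubex compiles this down to a C extension you can require as usual.
    def sum_to(int n)
      int i = 0
      int total = 0
      while i < n do
        i += 1
        total = total + i
      end
      return total
    end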
I would say this is still quite early days: for example, it doesn't handle recursive functions well yet, and as far as I know it hasn't been tested anywhere in production. We don't even have it in daru yet, but I think the next thing to look at is going to be integrating it into daru.

Let's just see... yeah, I just want to prove that this has now returned. The rest of these are quite a bit quicker: we can filter fine, we can group by, find the mean, sort, take the top ten, and we've got a simple visualization example there, giving a similar plot to what we had in pandas of a scatter of the likes versus the gross.

Just some closing thoughts on this. I think it's pretty clear that if you're doing hardcore data analysis, and you've got large data sets or extremely high performance requirements, right now pandas is going to be your only option. But I think realistically we're not that far away from having Ruby be good enough that it's not something you have to turn away from. It's worth bearing in mind that pandas started out ten years after R and has been remarkably successful, so the fact that we are maybe a few years behind is not necessarily devastating. Quattro being released, I would hope, will really drive adoption of Ruby for data science; I really hope that people get to using it. And I would say: get involved. If you have data science requirements, do check out the daru gem, do contribute, and do watch out for Quattro being open-sourced.

Lastly, I would like to say thank you to the people who have helped make these tools possible: at ZappiStore, that's Brendan and the rest of the Quattro team; at SciRuby, that's Sameer and the rest of the team; and at SciPy, that's Wes. If you're interested in this, both Brendan and Sameer have given excellent talks on Quattro and on daru in the past, which are worth googling. And obviously, feel free to contact me.
[Host] Thank you so much, Daniel. I'm going to take some questions on data processing; sure, we'll start like that. All right: the performance issue. Can we solve it the way we solve most Ruby problems, by just throwing machines at it, or is it really a fundamental language problem that makes it slow?

[Daniel] Tough question. I think ultimately, if you look at how pandas has gotten so fast, and that's really what you're comparing it to, right: pandas has gotten so fast by hugely leveraging C code, Cython, C extensions, and I think it's always going to be difficult to compete with that in pure Ruby. Obviously there may be a way, but I would say realistically you'd probably want to go the same route and start using something like Rubex to start compiling some of these things down.

[Host] Okay, so Rubex, right: instead of writing the C yourself, it's basically just Ruby with some type extensions that you need to write, and all that?

[Daniel] Yeah, you basically provide types for it.

[Host] Right, so why not just move to something like Crystal? That's basically Ruby with type annotations.

[Daniel] Sure. I guess because this is easy to integrate into existing Ruby projects. Full disclosure, I don't know enough about Crystal to make a full comparison, but what I would say is that you can just install this as a gem and it'll work with your existing Ruby code, so I think that's the primary advantage.

[Host] That's pretty good. Any more questions? Chris?

[Audience question, partly inaudible, about open-sourcing Quattro.]

[Daniel] So I think the first question was: is there interest in the community in this being open-sourced? From the sort of questions I've heard around these two days, and the response that we've had in other places, I think it's clear that there is some demand for data processing in Ruby, and that there are probably people who would want to use this. We are quite keen in the company to get it open-sourced as quickly as possible. I think the blocker at the moment is not so much anything that can be helped with; it's really just cleaning the last of the client IP out of the code and making it ready to be released. The majority of that work has been done, but there is just a little bit left.

[Host] All right, that's a good question. One more question? Yes, please.

[Audience] Sorry, I didn't hear it fully, but: how do you set rules for data analysis? As in, how do we make sure that our modeling is statistically valid?

[Daniel] In our specific context we do that by quite tightly controlling the data. We work with our partners to structure the surveys; we work with leading market research agencies, and they provide a lot of the methodology that makes sure the way we gather the data, and the way we process it, is statistically valid. For the most part that happens automatically; that's kind of the business we're in, trying to eliminate as much of the manual process of that as possible. There is obviously still some, but the vast majority of it is automatic.

[Host] All right, one more round of applause for Daniel. Thank you so much.

[Daniel] And just a final thank you to Jimmy and the organizers for having me here.