00:00:06.240
Video equipment rental costs were paid for by Peep Code Screencasts.
00:00:19.380
Okay, well this is about Scientific Computing with Ruby and the GSA.
00:00:25.680
This is what I told them I would talk about, and then I decided to name the GSA something that fits better in the Google space.
00:00:31.320
So, Tigu: if you search for it, you won't find much except a lizard or me.
00:00:39.719
That's the reason I changed it. Tigu is just a lizard from South America.
00:00:45.180
I actually have a friend of mine who created that logo for me, so we have a logo too.
00:00:52.920
Anyway, this is me. I'm David Richards, and I wrote Tigu.
00:00:58.500
I've been writing software for about a dozen years or so. I decided I was unhappy, so I went back to school.
00:01:05.100
This time, I'm studying something called System Science, which involves systems, math, computers, and a lot of machine learning.
00:01:11.340
I'm trying to understand complex systems; it's a PhD program.
00:01:16.619
I get to hang out with cool, smart people, and I like it. They teach me a lot.
00:01:22.500
A friend of mine said I should use the quote, "In God We Trust; all others must bring data."
00:01:27.720
I'm finding I have a lot of reason to figure out my data and make it useful.
00:01:33.240
I want to integrate what I'm doing with what's happening in the outside world.
00:01:41.400
And, of course, I want to use Ruby. I have found many great bindings and tools.
00:01:48.659
I've been having fun with some projects, and I was thinking about what to call this. Last night, I just said "the F-word": I'll write a framework.
00:01:54.180
So, what I'm building is a large workbench for complex systems. It's generic in nature and should adapt to what you're doing.
00:02:00.659
If you need data, if you need to think about data, whether you're working in a production environment or looking for a one-off solution, this might be a place to work on things.
00:02:07.320
For instance, if you were doing the Netflix competition and wanted that million dollars, how would you approach it?
00:02:14.760
Maybe you've thought about it, or maybe you haven't. You sit down, and it's a complex problem. We've all been working on it for a while.
00:02:21.420
I registered, and I have the data, but I haven't done anything with it yet. It's a complex problem.
00:02:26.520
The idea with Netflix is that they said, "We'll pay a million dollars to anybody who can improve our recommendation system by 10%."
00:02:31.879
They're recommending what movies to rent, effectively providing added value for the customers.
00:02:37.860
You choose Netflix instead of Blockbuster because they understand your preferences.
00:02:43.920
It was worth a million dollars to them, and after rewriting their engine, they achieved that 10% improvement.
00:02:49.739
They decided to open up the challenge to the community, so a lot of people have signed up to work on that problem.
00:02:56.459
It's a very complex challenge, and the winning team right now consists of two computer scientists and a statistician from Bell Labs.
00:03:01.560
They have put together 107 different models and are using four or five different approaches in their modeling.
00:03:07.379
But they're combining them in interesting ways to achieve better performance.
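As a rough illustration of what combining models can mean, here is a minimal Ruby sketch that takes a weighted average of two stand-in models' predictions and scores the result by root-mean-square error (RMSE), the metric the Netflix Prize uses. The ratings, the model outputs, and the helper names `rmse` and `blend` are all made up for illustration; the talk does not describe how the leading team actually blends its models.

```ruby
# Root-mean-square error between predicted and actual ratings.
def rmse(predicted, actual)
  squared = predicted.zip(actual).map { |p, a| (p - a)**2 }
  Math.sqrt(squared.sum / squared.size.to_f)
end

# Weighted average of several models' predictions, item by item.
# The weights are assumed to sum to 1.
def blend(predictions, weights)
  predictions.transpose.map do |row|
    row.zip(weights).sum { |p, w| p * w }
  end
end

actual  = [4.0, 3.0, 5.0, 2.0]   # made-up true ratings
model_a = [4.5, 3.5, 4.5, 2.5]   # hypothetical model, off by 0.5 on each rating
model_b = [3.5, 2.5, 5.5, 1.5]   # hypothetical model, off by 0.5 the other way

blended = blend([model_a, model_b], [0.5, 0.5])
puts "model A RMSE: #{rmse(model_a, actual)}"
puts "model B RMSE: #{rmse(model_b, actual)}"
puts "blend RMSE:   #{rmse(blended, actual)}"
```

The example is rigged so the two models' errors cancel exactly. Real models rarely do that, but the same principle is why blends of imperfect models can beat each model alone.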
00:03:13.140
But how would you do all that? With Ruby, how would you combine the models and keep track of what you're working on?
00:03:19.379
How can you work without constantly writing one-off scripts?
00:03:25.860
This is why Tigu was invented: to work with large data spaces, in terabytes and above.
00:03:33.480
To perform complex analysis integrated with real infrastructure.
00:03:41.700
In other words, you don't have to do all your transformations before getting to work.
00:03:47.159
You can just start working, perform transformations in Ruby, and hopefully complete your complex analysis before you retire.
00:03:52.620
Then with the resources available, you can achieve the desired results.
00:03:58.319
That's the general framework of what we'd like to work on.
00:04:03.599
Some concrete examples I've worked on with colleagues include research in genomics.
00:04:11.459
There’s a researcher in Portland studying the genetic causes of diabetes.
00:04:17.359
She came to our program to study a specific method to simplify her mathematical models.
00:04:23.220
She completed the class and said her dataset was too large for the available libraries.
00:04:30.360
I had been doing something similar and after discussing her requirements, we were able to help her work with her dataset.
00:04:37.199
The problem space we face is the need for flexibility while being cost-aware.
00:04:43.280
We must integrate with existing resources and scale to meet the size of the problem.
00:04:50.040
There are some great solutions available. Does anybody know about the R language?
00:04:56.460
The R language is designed for statistical analysis, and I love it.
00:05:03.600
Some of the best statisticians in the world are using R, so if your project is statistical in nature, you can likely find a solution in R.
00:05:09.840
You can include libraries easily, and there are incredible alternatives such as Matlab, Mathematica, and Octave.
00:05:15.540
These solutions are flexible; many integrate programming languages.
00:05:21.780
They scale well at times, but cost can be an issue as some tools are commercial.
00:05:28.440
Many engineering labs use Matlab as their default mathematics tool.
00:05:34.740
Mathematica was developed by Stephen Wolfram, who also authored a book on cellular automata.
00:05:40.740
It is another excellent resource.
00:05:46.920
I think Weka fits the problem space better, at least in the areas I consider.
00:05:53.220
With JRuby, you can easily integrate Weka, which is a powerful solution.
00:05:58.800
Mikhail Baryshnikov, a great dancer, rejects comparisons like 'he's the best'—there's no such thing.
00:06:07.320
Similarly, in complex spaces, there isn't a one-size-fits-all solution; different needs require different approaches.
00:06:14.880
My basic idea is that we work in an ecosystem of data and ideas; many inputs come from numerous directions.
00:06:21.720
I wanted to provide a framework that allows me to bring in any algorithm without excessive refactoring.
00:06:28.620
I would like to use it if it's better. I don’t want to reinvent wheels.
00:06:35.720
Currently, I’m able to access many neat tools with Tigu.
00:06:42.240
To give you an idea of how things will work, I’m looking at Rinda and Hadoop.
00:06:49.920
Rinda is Ruby-centric—a Ruby version of Linda, focused on parallel processing.
00:06:56.040
Hadoop is an open-source implementation of MapReduce, the programming model Google published in 2004.
00:07:02.640
They use it to manage major data problems by defining a map function that might count lines in a file.
00:07:10.020
Next, they partition the problem across thousands of nodes and run everything in parallel.
00:07:16.440
The reduce function combines the output from all nodes, leading to a straightforward approach.
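The map, partition, and reduce steps just described can be sketched in plain Ruby. This is not Hadoop's API: threads stand in for cluster nodes, and the method names `map_count_lines`, `reduce_counts`, and `count_lines` are invented for this illustration.

```ruby
# Map step: each "node" counts the lines in its own chunk of text.
def map_count_lines(chunk)
  chunk.lines.count
end

# Reduce step: combine the per-node counts into one total.
def reduce_counts(counts)
  counts.sum
end

# Run the map step on every chunk concurrently (one thread per chunk,
# standing in for one node per partition), then reduce the results.
def count_lines(chunks)
  partial = chunks.map { |c| Thread.new { map_count_lines(c) } }.map(&:value)
  reduce_counts(partial)
end

chunks = ["a\nb\n", "c\n", "d\ne\nf\n"]   # the input, already partitioned
puts count_lines(chunks)                  # prints 6
```

Because the map step never looks outside its own chunk, the chunks can be processed in any order and on any machine, which is what makes the approach easy to scale.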
00:07:23.340
You don’t need a background in distributed computing or functional programming to understand how to do basic tasks.
00:07:30.420
However, the problem with a MapReduce framework is that many older libraries don’t utilize that structure.
00:07:36.900
The libraries tend to be user-friendly, requiring minimal thought as long as parameters are set correctly.
00:07:44.520
So it depends on your dataset and the problem space you're addressing.
00:07:51.540
I've architected things in a way that allows us to explore other directions, which is exciting.
00:07:58.020
Hopefully, with the resources we have, we can achieve optimal results.
00:08:02.880
To clarify, I want to emphasize that this is an MIT-licensed, open-source project.
00:08:10.080
I'm willing to help and collaborate, not trying to sell services.
00:08:16.680
I've been following the Hadoop list, gathering ideas over the last six months.
00:08:24.300
By November, I hope to solidify a plan to bring this onto EC2 and Amazon Web Services.
00:08:30.180
The idea is that once you're set up, it primarily revolves around your workflow.
00:08:35.400
This framework provides a generic way to approach your problems.
00:08:41.579
You start with a job that is essentially a class capable of handling tasks.
00:08:47.700
You will write a directive; we'll review examples shortly.
00:08:53.520
The directive you write will be passed to a workload manager.
00:08:59.760
The workload manager will return what’s relevant, and you can work from there.
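To make that flow concrete, here is a minimal sketch of a job, a directive, and a workload manager in Ruby. Every name here (`SumJob`, `Directive`, `WorkloadManager`, `register`, `submit`) is hypothetical and chosen only for illustration; it is not Tigu's actual API.

```ruby
# Hypothetical names throughout; this is not Tigu's real API.

# A job is a class that knows how to handle one kind of task.
class SumJob
  def run(data)
    data.sum
  end
end

# A directive says which job to run and on what data.
Directive = Struct.new(:job_name, :data)

# The workload manager looks up the job for a directive, runs it,
# and returns what is relevant: here, the job name and its result.
class WorkloadManager
  def initialize
    @jobs = {}
  end

  def register(name, job)
    @jobs[name] = job
  end

  def submit(directive)
    job = @jobs.fetch(directive.job_name)
    { job: directive.job_name, result: job.run(directive.data) }
  end
end

manager = WorkloadManager.new
manager.register(:sum, SumJob.new)
p manager.submit(Directive.new(:sum, [1, 2, 3]))
```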
00:09:04.920
Let's explore the next slide, which contains important ideas about the ontology.
00:09:11.220
There is no one best algorithm for anything, and you don't know them all.
00:09:17.820
In analysis, you're confined to what you know and feel comfortable with.
00:09:23.460
For large problems like the Netflix competition, it's crucial to collaborate on better algorithms.
00:09:29.820
Without collective knowledge, learning halts at whether your code worked.
00:09:37.380
The vision behind Tigu is that if you're using this framework and you have a new algorithm,
00:09:43.920
you can submit it. It will go to Tigu Hub, where I'll test your code against standard datasets and publish benchmarks.
00:09:50.279
The wiki will provide insights into various algorithms' performances, making it accessible.
00:09:58.079
You don't need to start with academic literature to understand how they work. You can read about it in layperson terms.
00:10:04.740
If you prefer additional citations, feel free to Google it or run it to see how it performs with your data.
00:10:11.520
The workload manager keeps track of whether it's up to date on all job signatures.
00:10:17.940
It examines whether it knows all the ways to solve the problem.
00:10:23.520
If you provide specifics about the algorithm you want to run, it determines whether it has the necessary transformation algorithm.
00:10:30.180
It will evaluate whether it can solve the problem using your selected method.
00:10:36.780
You don't always need the specific code; just knowing that you're working with an artificial neural network is enough.
00:10:42.960
The workload manager suggests which methods and data types to consider.
00:10:51.600
We derive this information from the data mining world, which typically avoids programming languages.
00:10:57.600
Instead, they speak in terms of ideas, like tabular data or graphs.
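One way to picture the signature lookup is a small registry keyed by method name and data type. The sketch below is an assumption about how such a lookup could work, with invented names (`Signature`, `AlgorithmRegistry`, and the implementation labels); it is not taken from Tigu.

```ruby
# Hypothetical sketch; the names and fields are assumptions, not Tigu's API.
Signature = Struct.new(:method_name, :data_type, :implementation)

class AlgorithmRegistry
  def initialize
    @signatures = []
  end

  def add(method_name, data_type, implementation)
    @signatures << Signature.new(method_name, data_type, implementation)
  end

  # Every known way to run a method, whatever the data type.
  def ways_to_solve(method_name)
    @signatures.select { |s| s.method_name == method_name }
  end

  # Can the method run on this data type as it stands?
  def can_solve?(method_name, data_type)
    ways_to_solve(method_name).any? { |s| s.data_type == data_type }
  end

  # Data types the method accepts, as a hint for what to transform into.
  def suggested_data_types(method_name)
    ways_to_solve(method_name).map(&:data_type).uniq
  end
end

registry = AlgorithmRegistry.new
registry.add(:artificial_neural_network, :tabular, 'ann_v1')
registry.add(:artificial_neural_network, :graph, 'ann_graph')
registry.add(:decision_tree, :tabular, 'tree_v1')

puts registry.can_solve?(:artificial_neural_network, :tabular)   # prints true
puts registry.can_solve?(:decision_tree, :graph)                 # prints false
p registry.suggested_data_types(:artificial_neural_network)
```

With a lookup like this, naming the method ("artificial neural network") is enough for the manager to tell you which data types it can work with.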
00:11:05.160
At the algorithm level, you'll specify the constraints to run the algorithm.
00:11:14.340
It may require additional parameters, models you developed in Tigu, or Ruby equations.
00:11:20.280
Finally, you'll know the benchmarks, and the workload manager also retains them.
00:11:25.859
What it does is optimize based on your priorities, for example favoring popular algorithms.
00:11:32.760
You can also optimize for server time, calendar time, or trusted sources, based on your needs.
00:11:39.420
If you have a budget of $10 to spend on Amazon this afternoon, you can constrain it to a 10 cents per hour limit.
00:11:45.360
You can determine how many hours you want to allocate to server time.
00:11:50.940
You also have execution time limits; you could be patient and wait two minutes or two years.
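The budget arithmetic is simple enough to sketch: $10 at 10 cents per hour buys 100 server-hours. The `Budget` struct and its `affordable_hours` method below are hypothetical names used only to show the calculation, not part of Tigu.

```ruby
# Hypothetical budget check; amounts are in cents to avoid float rounding.
Budget = Struct.new(:total_cents, :max_cents_per_hour) do
  # Server-hours the budget covers at a given hourly rate,
  # or 0 if the rate is above the per-hour limit.
  def affordable_hours(rate_cents_per_hour)
    return 0 if rate_cents_per_hour > max_cents_per_hour
    total_cents / rate_cents_per_hour
  end
end

budget = Budget.new(1000, 10)        # $10 total, at most 10 cents per hour
puts budget.affordable_hours(10)     # prints 100
puts budget.affordable_hours(40)     # prints 0, the rate is over the limit
```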
00:11:57.840
The post-execution time is particularly interesting.
00:12:04.740
The workload manager uses a temporal difference algorithm to explore paths.
00:12:10.800
It runs your work the best way it knows how and gives you the best answer it finds.
00:12:17.100
After execution, it examines how to achieve a better solution in future runs.
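The talk names temporal difference learning without giving details, so the following is only the textbook TD(0) value update, not necessarily what the workload manager does. The class `TDLearner`, the states `:transform` and `:train`, and the reward values are all assumptions made for this sketch.

```ruby
# TD(0) update: V(s) += alpha * (reward + gamma * V(s_next) - V(s))
# Hypothetical setup: states are steps in a processing path, and the
# reward is a made-up score for how well the run went.
class TDLearner
  attr_reader :values

  def initialize(alpha: 0.1, gamma: 0.9)
    @alpha = alpha
    @gamma = gamma
    @values = Hash.new(0.0)   # unseen states start at 0
  end

  # Update the estimate for `state` after moving to `next_state`.
  # Pass next_state = nil when the run has finished.
  def update(state, reward, next_state)
    next_value = next_state.nil? ? 0.0 : @values[next_state]
    td_error = reward + @gamma * next_value - @values[state]
    @values[state] += @alpha * td_error
  end
end

learner = TDLearner.new(alpha: 0.5, gamma: 1.0)
# One path: transform -> train -> done, with a reward only at the end.
10.times do
  learner.update(:transform, 0.0, :train)
  learner.update(:train, 1.0, nil)
end
p learner.values
```

Each run nudges the estimate for a step toward the reward plus the estimate of the step that followed, so over repeated runs the value of the final reward propagates back to the earlier steps on the path.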
00:12:24.240
This is an ambitious and large-scale project that offers a lot of potential.
00:12:31.020
Now moving back to the inner workings of Tigu.
00:12:38.520
The workload creates a model that provides a result set from the data.
00:12:48.180
The model and result set are the two outcomes you achieve.
00:12:54.780
You can either take the results and move on or reuse the model in production.
00:13:01.920
You can also validate models against new or untested datasets.
00:13:07.320
That's the idea behind plugins.
00:13:13.080
The goal is for everyone to learn, not only individually but also as a community.
00:13:19.920
Tigu as a workload manager will learn too; it's about enhancing our capacity to adapt.
00:13:27.180
Research conducted by organizations like Simon to understand this better also supports these ideas.
00:13:34.920
The more jobs you run, the better the workload manager will be at performing tasks in the future based on past experiences.
00:13:44.520
Much learning will be documented in a lab book, keeping a transactional record.
00:13:51.960
You won’t just be accumulating logs, as we’ll also maintain pertinent records of runs and response times.
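A lab book of this kind might look something like the following sketch: an append-only list of runs, each with its outcome and response time. The `LabBook` class, its `Entry` fields, and the job names are hypothetical, since the talk does not show the real record format.

```ruby
# Hypothetical lab book: an append-only record of runs and response times.
class LabBook
  Entry = Struct.new(:job, :status, :seconds)

  attr_reader :entries

  def initialize
    @entries = []
  end

  # Run the block, time it, and record the outcome even if it fails.
  def record(job)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    status = :error            # assume failure until the block returns
    result = yield
    status = :ok
    result
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    @entries << Entry.new(job, status, elapsed)
  end

  # Failed runs, for analyzing errors later.
  def errors
    @entries.select { |e| e.status == :error }
  end
end

book = LabBook.new
book.record(:sum) { [1, 2, 3].sum }
begin
  book.record(:bad_input) { Integer('not a number') }
rescue ArgumentError
  # the failure is still written to the lab book
end
book.entries.each { |e| puts "#{e.job}: #{e.status} (#{e.seconds.round(6)}s)" }
```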
00:14:01.920
The aim is to analyze errors and devise strategies for future work in a console-like environment.
00:14:09.180
This winter, I'm focusing on a module called Human Elements.
00:14:16.560
The core algorithms should be solid enough for use by different parties.
00:14:22.740
Human Elements will serve as a GUI frontend written in Flex.
00:14:30.060
It will provide essential tools for the academic and business worlds without requiring extensive Ruby experience.
00:14:36.780
The Human Elements module is designed to focus on the lab book aspect.
00:14:43.260
The primary goal is to simplify reaching good answers and enhancing workflows.
00:14:49.680
I get pretty excited when practicing this, which likely stems from fatigue.
00:14:56.040
But I want to share code, ideas, and resources if you're interested.
00:15:05.760
For example, let’s consider the traveling salesman problem.
00:15:10.320
Suppose a salesman must visit ten cities. What's the shortest route that visits every city without revisiting any?
00:15:15.960
This is a classic problem that comes into play in many logistics scenarios.
00:15:20.760
For this example, we’ll compare a couple of algorithms using the console.
00:15:27.120
You might write a simple distance class to sum an array of distances.
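Here is a minimal sketch of such a distance class, with two algorithms to compare: an exhaustive search and a nearest-neighbor heuristic. The class and method names, the city coordinates, and the choice of these two algorithms are assumptions for illustration, not code from Tigu. Routes here are open paths that start at the first city.

```ruby
# Hypothetical helpers for illustration; not code from Tigu.
class Distance
  # Sum an array of leg distances.
  def self.total(legs)
    legs.sum
  end

  # Length of a route through points [[x, y], ...], in visiting order.
  def self.route_length(route)
    legs = route.each_cons(2).map do |(x1, y1), (x2, y2)|
      Math.hypot(x2 - x1, y2 - y1)
    end
    total(legs)
  end
end

# Exact answer: try every ordering of the remaining cities.
def brute_force(cities)
  start, *rest = cities
  rest.permutation
      .map { |perm| [start, *perm] }
      .min_by { |route| Distance.route_length(route) }
end

# Heuristic: always go to the closest unvisited city.
def nearest_neighbor(cities)
  route = [cities.first]
  remaining = cities.drop(1)
  until remaining.empty?
    nearest = remaining.min_by { |c| Distance.route_length([route.last, c]) }
    route << remaining.delete(nearest)
  end
  route
end

cities = [[0, 0], [1, 0], [-2, 0], [4, 0]]   # made-up coordinates
puts Distance.route_length(brute_force(cities))        # prints 8.0
puts Distance.route_length(nearest_neighbor(cities))   # prints 10.0
```

The heuristic is much faster but can miss the best route, as it does here. Brute force checks (n-1)! orderings, which is fine for four cities and already 362,880 for ten.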
00:15:34.860
The directive in the console might appear similar to this: 'directive do budget do'.
00:15:40.740
It defines your budget and connection to features working together in the process.
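The slide's code is not in the transcript, so the following only sketches how a nested `directive do ... budget do ... end end` block could be implemented in Ruby with `instance_eval`. The keywords `dollars`, `hours`, and `algorithm`, and the classes `BudgetSpec` and `DirectiveSpec`, are invented for this example and should not be read as Tigu's actual DSL.

```ruby
# Hypothetical DSL; the keyword names below are assumptions.
class BudgetSpec
  attr_reader :settings

  def initialize
    @settings = {}
  end

  def dollars(amount)
    @settings[:dollars] = amount
  end

  def hours(amount)
    @settings[:hours] = amount
  end
end

class DirectiveSpec
  attr_reader :budget_settings, :algorithm_name

  # Evaluate the nested block against a BudgetSpec and keep its settings.
  def budget(&block)
    spec = BudgetSpec.new
    spec.instance_eval(&block)
    @budget_settings = spec.settings
  end

  def algorithm(name)
    @algorithm_name = name
  end
end

# Entry point: evaluate the block against a fresh directive and return it.
def directive(&block)
  spec = DirectiveSpec.new
  spec.instance_eval(&block)
  spec
end

d = directive do
  budget do
    dollars 10
    hours 2
  end
  algorithm :nearest_neighbor
end

p d.budget_settings
p d.algorithm_name
```

Evaluating each block with `instance_eval` is what lets the bare words inside it resolve to methods on the spec objects, which keeps the directive readable for people who are not Ruby programmers.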
00:15:46.740
After you write this, you can run it through the console, and it should yield results quickly.
00:15:54.480
This is where you start developing ideas further; you can assert how things will function.
00:16:04.500
I would generate a model and result set, providing insights into the calculations I just made.
00:16:10.800
You can either collect data from those insights or advance the model for real-world applications.
00:16:16.080
That way, even if you're initializing a model, adjusting the parameters will yield satisfactory results.
00:16:23.220
The goal is for everyone to learn, not just individually, but also as a community.
00:16:30.720
The workload manager will learn and enhance its knowledge base. This will lead to a better framework.
00:16:35.820
In summary, we can continually improve the algorithms available and how we utilize them.
00:16:41.460
If you're learning or collaborating with Tigu, you'll easily build more encompassing analyses.
00:16:48.960
That's the essence of what I want to achieve with this project.
00:16:55.080
For questions or clarifications, please don’t hesitate to engage with me or share thoughts.
00:17:01.680
I appreciate your involvement and feedback on these topics.
00:17:08.040
Thank you for your attention, and I'm looking forward to the idea of collaboration!
00:17:20.040
That's it from my side!