Scientific Computing with Ruby Tegu (formerly GSA)

by David Richards

The video "Scientific Computing with Ruby Tegu" presented by David Richards at LoneStarRuby Conf 2008 focuses on the development and applications of the Tegu framework for scientific computing using the Ruby programming language. The talk covers a variety of key points related to complex system analysis, data handling, and the integration of various algorithms within a unified framework. Richards highlights his background in software development and system science, emphasizing the need for efficient handling of large datasets and the importance of collaboration in tackling complex problems.

Key Points Discussed:

  • Background on Tegu Framework: Richards renamed GSA to Tegu, after a lizard from South America, so that the project would be easier to find in search results.
  • Integration of Data and Algorithms: Richards emphasizes that effective scientific computing requires combining different models and algorithms efficiently, particularly illustrated through the Netflix recommendation system challenge.
  • Use of Ruby and External Libraries: The framework aims to leverage Ruby for data manipulation while integrating with tools such as R, Matlab, and Weka for enhanced statistical analysis.
  • Workload Management: Tegu will function as a workload manager that optimizes data analysis jobs, allowing users to track and integrate varied algorithms without excessive refactoring.
  • Open-source Collaboration: The project is MIT-licensed, aiming for openness and community participation in algorithm development and analysis.
  • Case Studies: A notable example shared involves assisting a researcher studying diabetes, which showcases the framework's practical application by handling large datasets.
  • Anticipated Features: Richards discusses future prospects for Tegu, including accessibility through a GUI and continuous learning from past algorithms’ performances to enhance future executions.

Key Conclusions:

  • Tegu is intended to be a flexible platform for complex data analysis, promoting community knowledge sharing and collaboration.
  • Effective scientific computing relies heavily on the interplay of various algorithms and the seamless integration of tools suited for different tasks.
  • The ambition behind Tegu is to enable users to tackle intricate problems efficiently while learning and improving continuously through collective input.
00:00:06.240 Video equipment rental costs were paid for by Peep Code Screencasts.
00:00:19.380 Okay, well this is about Scientific Computing with Ruby and the GSA.
00:00:25.680 This is what I told them I would talk about, and then I decided to name the GSA something that fits better in the Google space.
00:00:31.320 So, Tegu: search for it and you won't find much except a lizard or me.
00:00:39.719 That's the reason I changed it. A tegu is just a lizard from South America.
00:00:45.180 I actually have a friend of mine who created that logo for me, so we have a logo too.
00:00:52.920 Anyway, this is me. I'm David Richards, and I wrote Tegu.
00:00:58.500 I've been writing software for about a dozen years or so. I decided I was unhappy, so I went back to school.
00:01:05.100 This time, I'm studying something called System Science, which involves systems, math, computers, and a lot of machine learning.
00:01:11.340 I'm trying to understand complex systems; it's a PhD program.
00:01:16.619 I get to hang out with cool, smart people, and I like it. They teach me a lot.
00:01:22.500 A friend of mine said I should use the quote, "In God We Trust; all others must bring data."
00:01:27.720 I'm finding I have a lot of reason to figure out my data and make it useful.
00:01:33.240 I want to integrate what I'm doing with what's happening in the outside world.
00:01:41.400 And, of course, I want to use Ruby. I have found many great bindings and tools.
00:01:48.659 I've been having fun with some projects, and last night I decided to say "the F-word": I'll write a framework.
00:01:54.180 So, what I'm building is a large workbench for complex systems. It's generic in nature and should adapt to what you're doing.
00:02:00.659 If you need data, if you need to think about data, whether you're working in a production environment or looking for a one-off solution, this might be a place to work on things.
00:02:07.320 For instance, if you were doing the Netflix competition and wanted that million dollars, how would you approach it?
00:02:14.760 Maybe you've thought about it, or maybe you haven't. You sit down, and it's a complex problem. We've all been working on it for a while.
00:02:21.420 I registered, and I have the data, but I haven't done anything with it yet. It's a complex problem.
00:02:26.520 The idea with Netflix is that they said, "We'll pay a million dollars to anybody who can improve our recommendation system by 10%."
00:02:31.879 They're recommending what movies to rent, effectively providing added value for the customers.
00:02:37.860 You choose Netflix instead of Blockbuster because they understand your preferences.
00:02:43.920 It was worth a million dollars to them, and after rewriting their engine, they achieved that 10% improvement.
00:02:49.739 They decided to open up the challenge to the community, so a lot of people have signed up to work on that problem.
00:02:56.459 It's a very complex challenge, and the winning team right now consists of two computer scientists and a statistician from Bell Labs.
00:03:01.560 They have put together 107 different models and are using four or five different approaches in their modeling.
00:03:07.379 But they're combining them in interesting ways to achieve better performance.
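A minimal Ruby sketch of what "combining models" can mean, assuming the simplest possible blend: a weighted average of each model's predictions, scored with root mean squared error (the metric the Netflix Prize used). The ratings, the two models, and the equal weights are invented for illustration and are not taken from the talk or from the leading team's method.

```ruby
# Hypothetical ratings and predictions; none of these numbers come from the talk.
ACTUAL  = [4.0, 3.0, 5.0, 2.0]
MODEL_A = [5.0, 2.0, 4.0, 3.0]
MODEL_B = [3.5, 3.5, 5.5, 1.0]

# Root mean squared error between predictions and actual ratings.
def rmse(predicted, actual)
  squared = predicted.zip(actual).map { |p, a| (p - a)**2 }
  Math.sqrt(squared.sum / squared.size)
end

# Weighted average of several models' predictions, rating by rating.
def blend(models, weights)
  models.transpose.map do |predictions|
    predictions.zip(weights).sum { |p, w| p * w }
  end
end

blended = blend([MODEL_A, MODEL_B], [0.5, 0.5])
puts "A: #{rmse(MODEL_A, ACTUAL)}"
puts "B: #{rmse(MODEL_B, ACTUAL)}"
puts "blend: #{rmse(blended, ACTUAL)}"
```

In this toy data the two models err in opposite directions, so the blend scores better than either alone. Real blends only help to the extent the models' errors differ.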
00:03:13.140 But how would you do all that? With Ruby, how would you combine the models and keep track of what you're working on?
00:03:19.379 How can you work without constantly writing one-off scripts?
00:03:25.860 This is why Tegu was invented: to work with large data spaces, in terabytes and above,
00:03:33.480 and to perform complex analysis integrated with real infrastructure.
00:03:41.700 In other words, you don't have to do all your transformations before getting to work.
00:03:47.159 You can just start working, perform transformations in Ruby, and hopefully complete your complex analysis before you retire.
00:03:52.620 Then with the resources available, you can achieve the desired results.
00:03:58.319 That's the general framework of what we'd like to work on.
00:04:03.599 One concrete example I've worked on with colleagues comes from genomics research.
00:04:11.459 There’s a researcher in Portland studying the genetic causes of diabetes.
00:04:17.359 She came to our program to study a specific method to simplify her mathematical models.
00:04:23.220 She completed the class and said her dataset was too large for the available libraries.
00:04:30.360 I had been doing something similar and after discussing her requirements, we were able to help her work with her dataset.
00:04:37.199 The problem space we face is the need for flexibility while being cost-aware.
00:04:43.280 We must integrate with existing resources and scale to meet the size of the problem.
00:04:50.040 There are some great solutions available. Does anybody know about the R language?
00:04:56.460 The R language is designed for statistical analysis, and I love it.
00:05:03.600 Some of the best statisticians in the world are using R, so if your project is statistical in nature, you can likely find a solution in R.
00:05:09.840 You can include libraries easily, and there are incredible alternatives such as Matlab, Mathematica, and Octave.
00:05:15.540 These solutions are flexible; many integrate programming languages.
00:05:21.780 They scale well at times, but cost can be an issue as some tools are commercial.
00:05:28.440 Many engineering labs use Matlab as their default mathematics tool.
00:05:34.740 Mathematica was developed by Stephen Wolfram, who also authored the book on cellular automata.
00:05:40.740 It's another excellent resource.
00:05:46.920 I think Weka fits the problem space better, at least in the areas I consider.
00:05:53.220 With JRuby, you can easily integrate Weka, which is a powerful solution.
00:05:58.800 Mikhail Baryshnikov, a great dancer, rejects comparisons like 'he's the best'—there's no such thing.
00:06:07.320 Similarly, in complex spaces, there isn't a one-size-fits-all solution; different needs require different approaches.
00:06:14.880 My basic idea is that we work in an ecosystem of data and ideas; many inputs come from numerous directions.
00:06:21.720 I wanted to provide a framework that allows me to bring in any algorithm without excessive refactoring.
00:06:28.620 I would like to use it if it's better. I don’t want to reinvent wheels.
00:06:35.720 Currently, I’m able to access many neat tools with Tegu.
00:06:42.240 To give you an idea of how things will work, I’m looking at Rinda and Hadoop.
00:06:49.920 Rinda is Ruby-centric—a Ruby version of Linda, focused on parallel processing.
00:06:56.040 Hadoop is an open-source MapReduce environment, modeled on the approach Google published in 2004.
00:07:02.640 They use it to manage major data problems by defining a map function that might count lines in a file.
00:07:10.020 Next, they partition the problem into thousands of nodes and run everything in parallel.
00:07:16.440 The reduce function combines the output from all nodes, leading to a straightforward approach.
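The map, partition, and reduce steps described above can be sketched in plain Ruby. This is an illustration of the pattern only, not Hadoop's API: threads stand in for cluster nodes, and the partition contents are made up.

```ruby
# Hypothetical input: each inner array stands in for the slice of a file
# that one node would process.
PARTITIONS = [
  ["alpha", "beta"],
  ["gamma"],
  ["delta", "epsilon", "zeta"]
]

# Map step: a node counts the lines in its own partition.
def map_count(lines)
  lines.size
end

# Reduce step: combine the per-node counts into a single total.
def reduce_counts(counts)
  counts.sum
end

# Run the map step for every partition in parallel, then reduce.
# Threads are a stand-in here; a real job would be spread across machines.
def count_lines(partitions)
  counts = partitions.map { |part| Thread.new { map_count(part) } }.map(&:value)
  reduce_counts(counts)
end

puts count_lines(PARTITIONS)
```

The appeal is that `map_count` and `reduce_counts` contain no knowledge of how the work is distributed. That part belongs to the framework.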
00:07:23.340 You don’t need a background in distributed computing or functional programming to understand how to do basic tasks.
00:07:30.420 However, the problem with a MapReduce framework is that many older libraries don’t utilize that structure.
00:07:36.900 The libraries tend to be user-friendly, requiring minimal thought as long as parameters are set correctly.
00:07:44.520 So it depends on your dataset and the problem space you're addressing.
00:07:51.540 I've architected things in a way that allows us to explore other directions, which is exciting.
00:07:58.020 Hopefully, with the resources we have, we can achieve optimal results.
00:08:02.880 To clarify, I want to emphasize that this is an MIT-licensed, open-source project.
00:08:10.080 I'm willing to help and collaborate, not trying to sell services.
00:08:16.680 I've been following the Hadoop list, gathering ideas over the last six months.
00:08:24.300 By November, I hope to solidify a plan to bring this onto EC2 and Amazon Web Services.
00:08:30.180 The idea is that once you're set up, it primarily revolves around your workflow.
00:08:35.400 This framework provides a generic way to approach your problems.
00:08:41.579 You start with a job that is essentially a class capable of handling tasks.
00:08:47.700 You will write a directive; we'll review examples shortly.
00:08:53.520 The directive you write will be passed to a workload manager.
00:08:59.760 The workload manager will return what’s relevant, and you can work from there.
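A rough sketch of that job, directive, and workload manager flow follows. Every name in it (`Job`, `WorkloadManager`, the `:task` and `:data` keys) is hypothetical and chosen for illustration. The talk does not show Tegu's actual classes.

```ruby
# All names here are hypothetical illustrations, not Tegu's actual API.
class Job
  attr_reader :name

  def initialize(name, handles:, &work)
    @name = name
    @handles = handles # task types this job knows how to perform
    @work = work
  end

  def handles?(task)
    @handles.include?(task)
  end

  def run(data)
    @work.call(data)
  end
end

class WorkloadManager
  def initialize(jobs)
    @jobs = jobs
  end

  # Returns only the jobs relevant to the directive's task.
  def relevant_jobs(directive)
    @jobs.select { |job| job.handles?(directive[:task]) }
  end

  # Runs every relevant job and returns the results keyed by job name.
  def execute(directive)
    relevant_jobs(directive).to_h { |job| [job.name, job.run(directive[:data])] }
  end
end

JOBS = [
  Job.new(:mean, handles: [:summarize]) { |xs| xs.sum.to_f / xs.size },
  Job.new(:max,  handles: [:summarize]) { |xs| xs.max },
  Job.new(:sort, handles: [:order])     { |xs| xs.sort }
]

manager = WorkloadManager.new(JOBS)
p manager.execute(task: :summarize, data: [3, 1, 2])
```

Here the directive is just a hash. The manager filters the registered jobs down to the ones that claim to handle the requested task and hands back their results.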
00:09:04.920 Let's explore the next slide, which contains important ideas about the ontology.
00:09:11.220 There is no one best algorithm for anything, and you don't know them all.
00:09:17.820 In analysis, you're confined to what you know and feel comfortable with.
00:09:23.460 For large problems like the Netflix competition, it's crucial to collaborate on better algorithms.
00:09:29.820 Without collective knowledge, learning halts at whether your code worked.
00:09:37.380 The vision behind Tegu is that if you're using this framework and you have a new algorithm,
00:09:43.920 you can submit it. It will go to Tegu Hub, where I'll test your code against standard datasets and publish benchmarks.
00:09:50.279 The wiki will provide insights into various algorithms' performances, making it accessible.
00:09:58.079 You don't need to start with academic literature to understand how they work. You can read about it in layperson terms.
00:10:04.740 If you prefer additional citations, feel free to Google it or run it to see how it performs with your data.
00:10:11.520 The workload manager keeps track of whether it's up to date on all job signatures.
00:10:17.940 It examines whether it knows all the ways to solve the problem.
00:10:23.520 If you could provide specifics about the algorithm you want to run, it would determine if it has the necessary transformation algorithm.
00:10:30.180 It will evaluate whether it can solve the problem using your selected method.
00:10:36.780 You don't always need the specific code; just knowing that you're working with an artificial neural network is enough.
00:10:42.960 The workload manager suggests which methods and data types to consider.
00:10:51.600 We derive this information from the data mining world, which typically avoids programming languages.
00:10:57.600 Instead, they speak in terms of ideas, like tabular data or graphs.
00:11:05.160 At the algorithm level, you'll specify the constraints to run the algorithm.
00:11:14.340 It may require additional parameters, models you developed in Tegu, or Ruby equations.
00:11:20.280 Finally, you'll know the benchmarks, and the workload manager also retains them.
00:11:25.859 It optimizes based on your priorities, for example by favoring popular algorithms.
00:11:32.760 You can also optimize for server time, calendar time, or trusted sources, based on your needs.
00:11:39.420 If you have a budget of $10 to spend on Amazon this afternoon, you can constrain it to a limit of 10 cents per hour.
00:11:45.360 You can determine how many hours you want to allocate to server time.
00:11:50.940 You also have execution time limits; you could be patient and wait two minutes or two years.
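The arithmetic behind those constraints is simple: $10 at 10 cents per hour buys 100 server hours, which can be spent as one node for 100 hours or as many nodes for a shorter time. A small sketch, assuming a hypothetical `Budget` record (not a Tegu class) and integer cents to avoid floating-point rounding:

```ruby
# Hypothetical constraint record; the field names are illustrative only.
Budget = Struct.new(:budget_cents, :cents_per_node_hour, keyword_init: true) do
  # Total node-hours the money can buy at the given hourly rate.
  def server_hours
    budget_cents / cents_per_node_hour
  end

  # True if running `nodes` machines for `hours` each stays within budget.
  def allows?(nodes:, hours:)
    nodes * hours <= server_hours
  end
end

# $10 to spend at 10 cents per node-hour, as in the example above.
AFTERNOON = Budget.new(budget_cents: 1_000, cents_per_node_hour: 10)

puts AFTERNOON.server_hours                 # 100
puts AFTERNOON.allows?(nodes: 20, hours: 5) # true: exactly 100 node-hours
puts AFTERNOON.allows?(nodes: 30, hours: 4) # false: 120 node-hours
```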
00:11:57.840 The post-execution time is particularly interesting.
00:12:04.740 The workload manager uses a temporal difference algorithm to explore paths.
00:12:10.800 It runs your work the best way it knows how and determines the optimal answer.
00:12:17.100 In post-production, it examines how best to achieve a better solution for future runs.
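The talk does not say which temporal difference variant the workload manager uses. As an assumption, the sketch below shows the simplest one, TD(0), which nudges the value estimate of a state toward the observed reward plus the discounted estimate of the next state. The states, rewards, learning rate, and discount factor are all invented for illustration.

```ruby
# TD(0) update: move the estimate for `state` a fraction `alpha` of the way
# toward (reward + gamma * estimate of the next state).
def td_update(values, state, reward, next_state, alpha: 0.1, gamma: 0.9)
  current = values.fetch(state, 0.0)
  target = reward + gamma * values.fetch(next_state, 0.0)
  values[state] = current + alpha * (target - current)
  values
end

# Hypothetical path through one job: choose an algorithm, run it, finish.
# Each entry is [state, reward, next_state]; the rewards are made up.
EPISODE = [
  [:choose_algorithm, 0.0, :run],
  [:run, 1.0, :done]
]

# Replay the same episode several times and return the learned estimates.
def learn(episode, passes)
  values = Hash.new(0.0)
  passes.times do
    episode.each do |state, reward, next_state|
      td_update(values, state, reward, next_state)
    end
  end
  values
end

p learn(EPISODE, 500)
```

With repeated runs the estimate for `:run` approaches the reward of 1.0 and the estimate for `:choose_algorithm` approaches the discounted value 0.9. In this way, credit for a good outcome flows back to the earlier choice that led to it.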
00:12:24.240 This is an ambitious and large-scale project that offers a lot of potential.
00:12:31.020 Now moving back to the inner workings of Tegu.
00:12:38.520 The workload creates a model that provides a result set from the data.
00:12:48.180 The model and result set are the two outcomes you achieve.
00:12:54.780 You can either take the results and move on or reuse the model in production.
00:13:01.920 You can also validate models against new or untested datasets.
00:13:07.320 That's the idea behind plugins.
00:13:13.080 The goal is for everyone to learn, not only individually but also as a community.
00:13:19.920 Tegu as a workload manager will learn too; it's about enhancing our capacity to adapt.
00:13:27.180 Research conducted by organizations like Simon to understand this better also supports these ideas.
00:13:34.920 The more jobs you run, the better the workload manager will be at performing tasks in the future based on past experiences.
00:13:44.520 Much learning will be documented in a lab book, keeping a transactional record.
00:13:51.960 You won’t just be accumulating logs, as we’ll also maintain pertinent records of runs and response times.
00:14:01.920 The aim is to analyze errors and devise strategies for future work in a console-like environment.
00:14:09.180 This winter, I'm focusing on a module called Human Elements.
00:14:16.560 The core algorithms should be solid enough for use by different parties.
00:14:22.740 Human Elements will serve as a GUI frontend written in Flex.
00:14:30.060 It will provide essential tools for the academic and business worlds without requiring extensive Ruby experience.
00:14:36.780 The Human Elements module is designed to focus on the lab book aspect.
00:14:43.260 The primary goal is to simplify reaching good answers and enhancing workflows.
00:14:49.680 I get pretty excited when practicing this, which likely stems from fatigue.
00:14:56.040 But I want to share code, ideas, and resources if you're interested.
00:15:05.760 For example, let’s consider the traveling salesman problem.
00:15:10.320 Suppose a salesman must visit ten cities—what’s the shortest route that doesn’t revisit any city?
00:15:15.960 This is a classic problem that comes into play in many logistics scenarios.
00:15:20.760 For this example, we’ll compare a couple of algorithms using the console.
00:15:27.120 You might write a simple distance class to sum an array of distances.
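A sketch of what such a comparison might look like, assuming a hypothetical `Distance` class and two stand-in algorithms: an exhaustive search and a nearest-neighbor heuristic. The city coordinates are invented, and five cities are used instead of ten so the exhaustive search stays fast.

```ruby
# Hypothetical city coordinates; the talk does not give any data.
CITIES = { a: [0, 0], b: [0, 3], c: [4, 3], d: [4, 0], e: [2, 5] }

class Distance
  # Straight-line distance between two [x, y] points.
  def self.between(p, q)
    Math.hypot(p[0] - q[0], p[1] - q[1])
  end

  # Sums an array of leg distances.
  def self.total(legs)
    legs.sum
  end

  # Length of an open route visiting the cities in the given order.
  def self.of_route(route, cities)
    total(route.each_cons(2).map { |from, to| between(cities[from], cities[to]) })
  end
end

# Exact answer: try every ordering. Only feasible for a handful of cities.
def brute_force(cities)
  cities.keys.permutation.min_by { |route| Distance.of_route(route, cities) }
end

# Heuristic: from the current city, always go to the closest unvisited one.
def nearest_neighbor(cities, start)
  route = [start]
  remaining = cities.keys - [start]
  until remaining.empty?
    closest = remaining.min_by { |c| Distance.between(cities[route.last], cities[c]) }
    route << closest
    remaining.delete(closest)
  end
  route
end

best = brute_force(CITIES)
greedy = nearest_neighbor(CITIES, :a)
puts "brute force: #{best.inspect} #{Distance.of_route(best, CITIES).round(2)}"
puts "nearest neighbor: #{greedy.inspect} #{Distance.of_route(greedy, CITIES).round(2)}"
```

The exhaustive search is guaranteed to find the shortest route but grows factorially with the number of cities. The heuristic is fast and may return a longer route, which is the kind of trade-off a benchmark comparison would surface.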
00:15:34.860 The directive in the console might appear similar to this: 'directive do budget do'.
00:15:40.740 It defines your budget and connection to features working together in the process.
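The transcript only preserves the words "directive do budget do", so the real syntax is unknown. The sketch below shows one common way to build such a nested block DSL in Ruby with `instance_eval`. The class names and the `problem`, `dollars`, and `hours` settings are assumptions made for illustration.

```ruby
# Hypothetical DSL: only the words `directive` and `budget` come from the talk.
# The class names and every setting inside the blocks are invented.
class BudgetSpec
  attr_reader :settings

  def initialize
    @settings = {}
  end

  def dollars(amount)
    @settings[:dollars] = amount
  end

  def hours(limit)
    @settings[:hours] = limit
  end
end

class DirectiveSpec
  attr_reader :settings

  def initialize
    @settings = {}
  end

  def problem(name)
    @settings[:problem] = name
  end

  # A nested `budget do ... end` block is evaluated against its own object.
  def budget(&block)
    spec = BudgetSpec.new
    spec.instance_eval(&block)
    @settings[:budget] = spec.settings
  end
end

# Entry point: evaluates the block and returns the collected settings.
def directive(&block)
  spec = DirectiveSpec.new
  spec.instance_eval(&block)
  spec.settings
end

SPEC = directive do
  problem :traveling_salesman
  budget do
    dollars 10
    hours 2
  end
end

p SPEC
```

The result is a plain hash that a workload manager could inspect, which keeps the console syntax separate from whatever executes the job.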
00:15:46.740 After you write this, you can run it through the console, and it should yield results quickly.
00:15:54.480 This is where you start developing ideas further; you can assert how things will function.
00:16:04.500 I would generate a model and result set, providing insights into the calculations I just made.
00:16:10.800 You can either collect data from those insights or advance the model for real-world applications.
00:16:16.080 That way, even if you're initializing a model, adjusting the parameters will yield satisfactory results.
00:16:23.220 The goal is for everyone to learn, not just individually, but also as a community.
00:16:30.720 The workload manager will learn and enhance its knowledge base. This will lead to a better framework.
00:16:35.820 In summary, we can continually improve the algorithms available and how we utilize them.
00:16:41.460 If you're learning or collaborating with Tegu, you'll easily build more encompassing analyses.
00:16:48.960 That's the essence of what I want to achieve with this project.
00:16:55.080 For questions or clarifications, please don’t hesitate to engage with me or share thoughts.
00:17:01.680 I appreciate your involvement and feedback on these topics.
00:17:08.040 Thank you for your attention, and I'm looking forward to the idea of collaboration!
00:17:20.040 That's it from my side!