Big Ruby 2013
Treading Water in a Stream of Data
Summarized using AI


by Jeremy Hinegardner

In the talk "Treading Water in a Stream of Data," Jeremy Hinegardner discusses the complexities and methodologies involved in handling streaming data versus static datasets. He emphasizes the necessity of real-time data analysis in various forms, including server logs and social media streams. Hinegardner begins by engaging the audience, asking them to identify their experience level with streaming data and batch processing. This sets the stage for a discussion about the evolving definition of 'big data,' which he suggests is often subjective and relates to individual comfort levels with data processing challenges.

He highlights several key points:

- Definitions and Concepts: Hinegardner reviews definitions of streaming data and contrasts it with batch data. Streaming data is characterized as elements processed one at a time, while batch data is processed in larger groups.
- Acquisition Methods: He outlines four methods for obtaining data: polling, notifications/webhooks, data payloads, and push systems. Each method is scrutinized for its advantages and disadvantages, with an emphasis on the need for contingency plans to recover from data losses, especially in real-time scenarios.
- Real-Time Processing Challenges: While streaming data offers immediate insights, it also brings challenges such as managing constant updates, handling errors, and coping with data omissions during downtime.
- Examples from Industry: Hinegardner references systems like Twitter’s firehose, describing the pitfalls of push systems where missed connections can lead to lost data.
- Best Practices: He advocates for preparing primary, secondary, and tertiary data acquisition methods, reinforcing the importance of maintaining extensive archives for future analysis. Hinegardner summarizes that preparedness and adaptability are crucial in managing big data effectively.

The conclusion draws attention to the parallels between big data and previous data warehousing concepts, advocating for continuous data availability to enable new discoveries and enhance data analysis capabilities. He underscores that having easy access to comprehensive datasets is vital in preventing missed opportunities and facilitating insightful analysis.

Overall, Hinegardner's talk provides a comprehensive overview of how to effectively approach streaming data with the right strategies, illuminating the balance between harnessing immediate data insights while understanding the complexities that come with them.

00:00:19.439 All right, everybody! My name is Jeremy, and I also have copious free time. We're going to start by waking up. Joe told us a couple of different things yesterday that we need to do to get going in the morning. So, I'm going to have everybody raise your hands a little bit.
00:00:30.800 I've got some question-and-answer stuff we're going to do, and we need to make sure both your hands work. So, raise them both up! Come on, both of them up! Now clap them together. Do it a few more times. Come on, all right!
00:00:43.120 Is everybody getting ready and warmed up? All right, so wake up, everybody! I've got my applause early, so that worked out well. If you were hoping for the split face with a bald side and a hairy side, unfortunately, I had to take the mean and just kind of average it out.
00:01:00.719 I'm a bit of a data junkie. I enjoy looking at people's data to see what they have and understand what kinds of things happen. My goal is to help them figure out something to do with it or get some really good information out of it. In general, I'm always looking for interesting data sets. A lot of the time, what we have are streaming datasets or just public data, but most of the really interesting data is people's private corporate data that they're not going to share with anyone, even though that's the data that helps them do better in their business.
00:01:38.880 So, a couple of survey questions. You're already hands-on, and I know you can do this. Raise your hand if you consider yourself someone who processes streaming data. We've got a few. All right, quick examples – anyone? Who here processes the Twitter stream? Who does not process the Twitter stream? Yes, okay! So we have log processing, Twitter stream, all sorts of stuff along those lines.
00:02:00.159 So, how many of you say you process batch data? Just one, two, three, four… all right! For those of you that process your batch data, how many of you think you process your batches more than once a day? All right, more than once an hour? More than once a minute? Maybe anyone processing more than once a second? We'll get into the definition of what some people call streaming data. You'll see a little correlation that's kind of interesting.
00:02:38.879 This is Big Ruby, so we have to use the term 'big data.' Like Matt, I'm not really a fan of the term; I think it's gotten a little bit over-hyped. But there is a new book out from Nathan Marz. Has anyone heard of Storm? There was a company called BackType, and they developed a whole system called Storm to process the Twitter firehose, and then they got bought by Twitter. It's actually a pretty interesting system; I've only played with it a little bit, so further experiments are necessary. Nathan Marz and a co-author have a new book out from Manning called 'Big Data.' It's still in beta, so it's only half there, but it contains some interesting content.
00:03:26.720 I'll probably mention it a couple of times in the talk, but first, we're going to have a little bit of fun with the term 'big data.' This is Wikipedia's definition: a collection of data sets so large and so complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. I don't know who wrote that, but the interesting thing is that I did a talk on the definitions of big data about a year and a half ago, and this wasn't the definition at that time on Wikipedia.
00:04:00.560 For that talk, I tweeted, 'Hey folks, what do you think the term big data actually means?' and I got all sorts of interesting comments. Some of them included ideas like when data can't be processed in real-time or when data doesn't fit in memory. Everybody had different definitions, and to me, that kind of boiled down to one basic thing: if you're scared of it, it's big data! I mean, it sounds good because for me, big data means something that needs to fit on, you know, a dozen machines. I know how to deal with that; I'm comfortable with it. Something that requires 50 data centers to handle? Okay, that scares me. Someone else might get scared by data that fits on more than one machine, and so I think everyone’s term for big data is about the data levels that make them feel uncomfortable.
00:05:08.760 This is a model of a big data system or a data system. So, you start off with some data, and you need to store it. That's okay; that's the start. But when you have this data, you realize that you actually need that data located over there and need to store it over there as well. Now, here's where the real magic comes in: if you have this data and that data, and you put them together, you get other data. This other data is where the real magic happens because you're going to use that everywhere. Of course, as in any recipe requiring a couple of ingredients, what's the final step? Profit!
00:06:12.720 This is the sort of general scheme that describes what happens. You get an initial data feed of something and realize you need something else to complete it. You combine the two and end up producing a data product, something you might be willing to give to someone else or use for internal analysis.
00:06:55.760 So, what do we mean by streaming? Go back to the different batch intervals people mentioned, whether one minute at a time or one second at a time. Wikipedia defines streaming data as a sequence of data elements made available over time, where the elements are processed one at a time rather than in large batches. However, there is no definition of 'large.' What is a data element? A minute's worth of data? A second's worth? It depends. For my purposes, if you're taking data in periodically and consistently, that's streaming data to me.
00:07:31.760 If you look at that asymptotic curve, eventually it all leads to streaming data in one form or another. Also, if you recall Jim's flying robot presentation yesterday, he processed the data line by line; it was data packets. So technically, Jim was processing streaming data with those flying robots! We have robots processing data in our world, so just watch out. In my observations, I've concluded that big data and streaming data go hand in hand because the easiest way to attain a level of data that makes you feel uncomfortable is to have to acquire it all the time. You must deal with it constantly, address all the error cases, and handle everything that comes with accepting data from someone else or even your own systems.
00:08:39.440 This insight is conjectural, so take it for what it's worth, whether you agree or not. We can discuss it later. First things first: we have definitions before we engage in our discussions about streaming data. Any rogue fans in here listening to Ruby Rogues? Josh Susser is always saying that we start with definitions, so we started with some definitions. Now, what's the first thing we need to do when dealing with our data? Anyone?
00:09:02.720 I’ll give you a freebie: we need to get the data. It seems simplistic, but this is actually one of the more critical components of dealing with streaming data. Getting the data affects everything down the road; throwing a rock in a stream will ripple and affect everything downstream. We need to establish our sequence of data elements.
00:09:34.560 I've encountered about four different ways to obtain data from an external source. By 'external,' I mean outside of the data system you’re working with. If you're processing your own Nginx logs, those may be arriving via syslog and then aggregated and dumped into another system. You might also pull from the Twitter stream or other different types of systems. The first and probably oldest method is polling. Who does some kind of polling for data upstream or outside of a system?
00:10:34.960 Polling is a data system's way of asking, 'Are we there yet? Do you have data for me?' You go over and over again, and most of the time the answer is 'no.' Eventually, some data becomes available. One of the problems you encounter with polling is how to know if you’ve already received the data. There are a couple of different strategies I've dealt with, like retrieving scraped RSS feeds. In some cases, the upstream provider gives you a token—a timestamp—every time you grab a chunk of data. For your next request, you would send that token while requesting modified data since that timestamp.
00:11:09.680 Other places don't keep track of your location in the data. You need to request the data, then check it against what you've already retrieved. In some cases, an upstream provider may keep track of your state so you can request data without resubmitting, but I find that more rare. The majority of time, you keep track of your own token or state. There are a few problems with polling. We face similar issues with a couple of other mechanisms. I'll discuss all the negative aspects of the different data acquisition methods when I review each.
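As a rough sketch of that polling pattern in Ruby; the endpoint, the `since` parameter, and the token file below are illustrative assumptions, not details from the talk:

```ruby
require 'net/http'
require 'json'
require 'uri'

# Hypothetical endpoint and token file, for illustration only.
ENDPOINT   = URI("https://example.com/feed")
TOKEN_FILE = "last_token"

def last_token
  File.exist?(TOKEN_FILE) ? File.read(TOKEN_FILE).strip : nil
end

def poll
  uri = ENDPOINT.dup
  uri.query = URI.encode_www_form(since: last_token) if last_token

  response = Net::HTTP.get_response(uri)
  return [] unless response.is_a?(Net::HTTPSuccess)

  payload = JSON.parse(response.body)
  # Persist the provider's token (a timestamp here) so the next request
  # only asks for data modified since then.
  File.write(TOKEN_FILE, payload["token"]) if payload["token"]
  payload.fetch("items", [])
end

loop do
  poll.each { |item| puts item.inspect }
  sleep 60   # "Are we there yet?" -- ask again in a minute
end
```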
00:11:56.960 The next method, which is similar to polling, involves a notification or webhook system. How many people subscribe to RubyGems using webhooks? This is an excellent example of a notification system, or PubSubHubbub. You're saying, 'Hey provider, I want to subscribe to a particular stream of data,' and you tell them how to get in touch with you when they have data for you. When a new data point becomes available on the upstream provider's side, they ping you with a notification.
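For illustration, the consumer side of a webhook might look something like this small Sinatra-style receiver; the route, log file, and payload handling are hypothetical rather than any particular provider's format:

```ruby
require 'sinatra'
require 'json'

# Minimal webhook receiver: the provider POSTs to this URL when it has
# something for you. Whether the body is a bare "come and get it" notice
# or a full payload depends on the provider; here we just record whatever
# JSON arrives and acknowledge it.
post '/webhook' do
  event = JSON.parse(request.body.read)
  File.open("events.log", "a") { |f| f.puts(event.to_json) }
  status 200
end
```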
00:12:15.200 Interestingly, that notification doesn’t include the data you're trying to access; it simply informs you that there's data to collect. It’s like when the FedEx guy leaves a ticket at your door telling you, 'You've got a package. Come down to the store before 7 PM to pick it up.' A third method is quite similar: notifications contain the data you're supposed to retrieve. Most people are probably familiar with email—'Hey, you've got a new message, and here's all the data in it.'
00:13:13.440 Using email as a notification mechanism is very interesting. One of the best on-disk queue storage formats is maildir, since each message resides in a separate file. Once you touch or read a message, it simply moves to another directory, allowing for simple handling of your mail. So maildir as a queue processing system is a fun way of managing the data flow. A payload is essentially a notification but with data; it shares some of the same problems as notifications, which I will discuss later.
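A minimal sketch of using a maildir as an on-disk queue in that spirit; the maildir path and the handle_payload helper are placeholders:

```ruby
require 'fileutils'

# Treating a maildir as a simple on-disk queue: every message is its own
# file under new/, and "acknowledging" a message is just moving it to cur/.
MAILDIR = File.expand_path("~/Maildir")

def handle_payload(body)
  puts "processing #{body.bytesize} bytes"   # stand-in for real processing
end

Dir.glob(File.join(MAILDIR, "new", "*")).each do |message_file|
  handle_payload(File.read(message_file))
  # Moving the file marks it handled; if we crash before this line the
  # message stays in new/ and is picked up again on the next run.
  FileUtils.mv(message_file, File.join(MAILDIR, "cur", File.basename(message_file)))
end
```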
00:14:17.679 The push model is what everyone knows about, particularly with Twitter. How many of us have processed, or half-processed, Twitter in the past? There's a variety, whether it's the public feed, a ten-percent feed from Gnip, or the actual full firehose. Twitter is a push system, as are Netflix and other streaming systems. With a push system, you open up a socket, typically authenticated, and then the data gets thrown at you.
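A push-style consumer might be sketched roughly as follows; the streaming URL and credentials are placeholders, not a real Twitter or Gnip endpoint:

```ruby
require 'net/http'
require 'uri'

# Push-style consumer: open one long-lived, authenticated connection and
# handle data as it is thrown at you.
uri = URI("https://stream.example.com/feed")

Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  request = Net::HTTP::Get.new(uri)
  request.basic_auth(ENV["STREAM_USER"], ENV["STREAM_PASS"])

  http.request(request) do |response|
    response.read_body do |chunk|             # called for each pushed chunk
      chunk.each_line { |line| puts line }    # often one data element per line
    end
  end
end
# While this process is down, nothing is received -- exactly the
# brittleness described below.
```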
00:14:57.760 Let's start with some downsides. We've mentioned these four different methods of acquiring data. Each has its advantages and disadvantages. Polling has the drawback of requiring you to manage your own state; you need to know where you are and what you have retrieved. Notifications and payloads have a different problem: while polling gives you control over everything, with notifications and payloads, you must be there to listen for messages.
00:15:31.200 If you're not there to receive the message, you miss it, so both methods require a persistent setup that's always available. Whether you are the data provider or the consumer, both sides carry overhead; whether that overhead is significant depends on the contractual relationship between the two parties. Notifications and webhooks are effective methods, and payloads are also popular, particularly over email. Email offers a remarkably robust solution in terms of reliability.
00:16:19.440 Push systems, from my perspective, are the most brittle. You connect to a data stream to receive information, and if you're not connected to that data stream, you won't receive any information whatsoever. In the case of Twitter, unless you're paying for Gnip's archival access, any data you miss is lost forever. You should understand that push systems can be demanding and unreliable.
00:17:05.360 One of my clients formerly processed the Twitter firehose. If our collector went down for a minute or two, that data was gone. Most of the Twitter content is essentially junk, so it doesn't matter that much, but your customers may not grasp this. They pay for analysis of the data; they don't care about the reasons behind its absence. Thus, it is crucial to have backup mechanisms in place when acquiring streaming data.
00:18:12.160 When dealing with data acquisition, you should think about primary, secondary, and tertiary methods. It's essential to have these in place; they are fundamental for working with streaming data, or any data in general. While I don't have enough time for code examples here, they're available in my Copious Free Time GitHub repository. I can take a couple of quick questions or add a bonus segment if time allows.
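One way to sketch that primary/secondary/tertiary layering in Ruby; the lambdas below are hypothetical stand-ins for real collectors such as a push socket, a polling client, and an archive download:

```ruby
# Layered acquisition: try the primary source first, then fall back to the
# secondary and tertiary sources if it raises.
acquirers = {
  primary:   ->(window) { raise "stream connection dropped" },   # e.g. push socket
  secondary: ->(window) { "polled data for #{window}" },         # e.g. polling API
  tertiary:  ->(window) { "archived data for #{window}" }        # e.g. nightly archive
}

def acquire(acquirers, window)
  acquirers.each do |name, fetch|
    begin
      return fetch.call(window)
    rescue StandardError => e
      warn "#{name} acquisition failed for #{window}: #{e.message}"
    end
  end
  raise "all acquisition methods failed for #{window}"
end

puts acquire(acquirers, "2013-02-28T10:00")
```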
00:19:34.560 I was reading this big data book, and much of the content resonated with familiar themes from earlier work, namely how the advent of big data echoes themes that were prevalent in data warehousing during the previous decade. This raises some intriguing points, suggesting that the big data phenomenon we've come to recognize is, in essence, a reiteration of long-established data warehousing concepts.
00:20:06.760 Notably, the authors contend that the largest benefit of a data warehouse foundation is its ability to accommodate future, unknown requirements. This strongly mirrors earlier data warehousing literature emphasizing support for unanticipated requirements in large datasets. Ultimately, you cannot generate new insights from data you don't possess. The prudent approach is to maintain robust archival processes that keep all of your data accessible for future inquiries, enhancing your analysis.
00:20:53.759 In short, my point is about the importance of having data readily available: it allows for new discoveries and prevents opportunities from being lost to past oversights. It also serves as a safeguard against erroneous assumptions, further illustrating that preparedness is crucial in managing big data and streaming environments.
00:21:05.000 I appreciate your attention—thank you!