00:00:23.949
My name is Shelby Switzer. As Graham said, I work for a company down in Denver called Notion.
00:00:27.260
We make incredibly smart home sensors that can tell you if a door opened when you're not home, if you have a water leak, and what your temperature is. They can detect ambient light and do a bunch of other stuff.
00:00:35.450
At Notion, I work on the API and some of our web services. As you can imagine, being in the Internet of Things, we deal with lots of data—tons and tons of logs and distributed systems.
00:00:45.800
But today, I'm not going to talk about any of that. I'm going to talk about a project that I work on called API in a Box, which essentially works with small data. The world runs on small data, and as developers, we sometimes forget that most people in their daily lives work with Excel spreadsheets and CSV files.
00:01:01.790
They don't really have the problems of having to understand accelerometer data or know what to do with it. A perfect example of the world running on small data comes up in civic hacking all the time.
00:01:14.270
In my free time, I like to work with Code for America brigades. Civic hacking is how I got into programming years ago in South Carolina. I then moved to Atlanta and worked with the Code for Atlanta Brigade. Now that I'm in Denver, I work with the Code for Denver Brigade a lot. Small data is always a roadblock. It's not even necessarily about getting the data, which you might think is the hardest part; it’s about what to do with the data once you have it.
00:01:41.479
To illustrate, I'm going to tell you a story from Atlanta. We worked on a project called 'Show Me the Food.' Essentially, the Code for America Brigade in Atlanta had a really good relationship with the city of Atlanta, as well as the Atlanta Food Bank. We were looking for projects to help the food bank solve critical food issues in the city, one of which was food deserts, a major problem in Atlanta and also in Denver.
00:02:07.549
The brigade, along with the Atlanta Food Bank, aimed to find a way to identify food deserts, map them out, and build a canvassing tool that would let people in the community, perhaps those living in food deserts themselves, go into convenience stores and record which ones, even if they don't offer a wide variety of healthy options, do stock fresh produce.
00:02:38.680
So, they could start mapping out where fresh produce was available as opposed to just convenience stores. But, does everyone here know what a food desert is? Let me explain.
00:02:48.420
A food desert refers to a large area where people have limited access to healthful food options. For example, they may not have access to vegetables for a nice dinner and have to rely on fast-food places or convenience stores, ultimately resulting in unhealthy diets.
00:03:15.100
Once we built the canvassing tool and identified these food deserts, we wanted to visualize our findings and ultimately change the world, or at least our small part of it. So, where do we start? This is our data: two CSV files that we received from the city.
00:03:35.050
The first file is a tax digest, which lists all the businesses in the city, and the second is a statewide list of businesses that accept EBT, similar to food stamps. As you can imagine, there is going to be duplicate data between these files. They also come from two different governmental organizations, making their format completely different. You can't trust that any of the data is standardized, consistent, or even free of typos.
00:04:13.180
When we got this data, we had to figure out what to do with it. Where do we keep it? How do we interact with it? Do we normalize it? Do we deduplicate it? Which database should we use, if we even need a database at all? And being developers, we all had really strong opinions about how to approach this.
00:04:42.179
The project sort of stalled because we didn't know exactly what to do with the data. Once we figured that out, we had to start building on top of it, and since we were all from different backgrounds with different environments and skill sets, organizing all of this presented a significant challenge for civic hacking.
00:05:09.280
In the process of this project, as well as many others, we identified five big needs when working with government data. The first need was an easy and free way to store and maintain the data. Additionally, we wanted to ensure the data remained unchanged. A conversation that regularly came up was whether we could simply pull the business name, latitude, and longitude, and just ditch the rest of the data.
00:05:39.279
However, just because we only needed certain details for our use case doesn't mean that the other columns wouldn’t be useful for others wanting to interact with the dataset.
00:06:06.650
We also needed to integrate or develop a flexible and robust API to make the data not only accessible but also easy to interact with.
00:06:32.800
We also didn't want hosting costs, because we were a non-profit, and we needed to be able to set it up simply on any server or computer, so that anyone giving up their Tuesday night to hack wouldn't have to spend the whole evening installing Postgres.
00:06:57.550
So, the first step in maintaining and storing the data was to simply pop it in GitHub. The data is open anyway; it belongs to the people, so there's no reason why it can't be in an open repository. It's free, and most of us know how to work with it. It’s definitely better than passing around a thumb drive.
00:07:30.060
Next, once we want to interact with the data and get it into our environment, we need to establish a local data store to facilitate easy interaction, rather than just connecting to the GitHub API. We chose Elasticsearch as our document store for these CSV files.
00:07:54.820
Elasticsearch is schema-less, so we don't have to force the data into tables, and it stores JSON documents, which most web developers are quite comfortable working with. It also has a RESTful API and powerful search capabilities, which is crucial for our purpose.
00:08:12.719
We have to be able to search and understand data that is not normalized, which can have typos and varies between integers, strings, and floats. Elasticsearch provides a lot of functionality around that, so we do a little bit of work on top of it.
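To make that concrete, here is a minimal sketch of a fuzzy search against Elasticsearch using the elasticsearch Ruby gem; the index name, field name, and URL are illustrative assumptions, not the project's actual code.

```ruby
require 'elasticsearch'

# Connect to an Elasticsearch node (URL is an assumption for illustration).
client = Elasticsearch::Client.new(url: ENV.fetch('ELASTICSEARCH_URL', 'http://localhost:9200'))

# A fuzzy match lets us find "Bob's Convenience Store" even when the
# source CSV spells it slightly differently.
results = client.search(
  index: 'resources',
  body: {
    query: {
      match: {
        name: { query: "Bobs Convenince Store", fuzziness: 'AUTO' }
      }
    }
  }
)

results['hits']['hits'].each { |hit| puts hit['_source']['name'] }
```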
00:08:43.120
When we dump the data from the GitHub repository into Elasticsearch, we essentially define the mappings. Mappings represent the columns of our data. We have fields for name, latitude, longitude, business type, and we add a metadata field for the file source, indicating from which CSV file this data originates.
00:09:14.030
We also want to maintain all data as strings. Within Elasticsearch, there are numerous data types available, but we cannot guarantee that any column will consistently be an integer, which is why we keep everything as strings.
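As a rough sketch of what that kind of mapping could look like with the elasticsearch Ruby gem: the index, document type, and field names are assumptions, and it uses the old pre-5.x 'string' type to match the talk's era (newer Elasticsearch versions would use 'text' or 'keyword').

```ruby
require 'elasticsearch'

client = Elasticsearch::Client.new(url: 'http://localhost:9200')

# Every CSV column is mapped as a string, because we can't guarantee that
# a "numeric" column is free of stray text or typos.
client.indices.create(
  index: 'resources',
  body: {
    mappings: {
      resource: {
        properties: {
          name:          { type: 'string' },
          latitude:      { type: 'string' },
          longitude:     { type: 'string' },
          business_type: { type: 'string' },
          file_source:   { type: 'string' } # metadata: which CSV this row came from
        }
      }
    }
  }
)
```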
00:09:48.720
So far, we have solved two of our needs: we have an easy and free way to store and maintain the data in GitHub without changing it, and we keep it in Elasticsearch with some added metadata. We are now on the way to a flexible and robust API, with Elasticsearch providing the underlying functionality.
00:10:20.450
This is where Ruby comes in. We set up a simple Sinatra API on top of Elasticsearch. We also do some work to make this API pleasant to interact with: we want it to be self-descriptive, with plenty of metadata attached to the data we retrieve.
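A minimal sketch of that kind of Sinatra wrapper might look like the following; the route, index, port, and environment variable names are illustrative assumptions rather than the actual API in a Box code.

```ruby
# app.rb
require 'sinatra'
require 'elasticsearch'
require 'json'

set :bind, '0.0.0.0' # so a container's port mapping works
set :port, 9292

ES = Elasticsearch::Client.new(url: ENV.fetch('ELASTICSEARCH_URL', 'http://elasticsearch:9200'))

get '/resources' do
  # Turn any query-string parameters (?name=...&business_type=...) into match clauses.
  clauses = params.map { |field, value| { match: { field => value } } }
  query   = clauses.empty? ? { match_all: {} } : { bool: { must: clauses } }

  hits = ES.search(index: 'resources', body: { query: query })['hits']['hits']

  content_type :json
  # In the real project the response is wrapped in Collection+JSON, described next.
  { items: hits.map { |hit| hit['_source'].merge('id' => hit['_id']) } }.to_json
end
```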
00:10:51.930
This is where Collection+JSON comes into play. Not everyone may be familiar with using hypermedia types in their APIs, but Collection+JSON is a media type designed for representing and describing collections of data.
00:11:16.159
It's authored by a gentleman named Mike Amundsen, who does substantial work in the REST community. Essentially, it gives us a standardized format for reading data and getting metadata about our resources.
00:11:43.200
In this standardized format, we have a top-level object called 'collection,' where we obtain the URL for the collection containing our resources. Then, there is an object called 'items,' which is just an array of all my resources, each with a unique URL and the data associated with them.
00:12:11.920
This resource here, for example, is Bob's Convenience Store with its latitude, longitude, business type, and also the file source it came from. What makes this particularly powerful is a secondary object called 'queries.' This takes my Elasticsearch mappings, which originate from the CSV column headers, and presents them as query parameters.
00:12:40.870
This allows me to search based on name, latitude/longitude, or business type without needing to know anything about the original CSV file. I just know the available queries as they have been provided to me.
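Here is a trimmed-down sketch of what such a Collection+JSON response could look like; the store, field names, host, and port are all made up for illustration.

```json
{
  "collection": {
    "version": "1.0",
    "href": "http://localhost:9292/resources",
    "items": [
      {
        "href": "http://localhost:9292/resources/1",
        "data": [
          { "name": "name",          "value": "Bob's Convenience Store" },
          { "name": "latitude",      "value": "33.7490" },
          { "name": "longitude",     "value": "-84.3880" },
          { "name": "business_type", "value": "convenience store" },
          { "name": "file_source",   "value": "tax_digest.csv" }
        ]
      }
    ],
    "queries": [
      {
        "rel": "search",
        "href": "http://localhost:9292/resources",
        "data": [
          { "name": "name",          "value": "" },
          { "name": "latitude",      "value": "" },
          { "name": "longitude",     "value": "" },
          { "name": "business_type", "value": "" }
        ]
      }
    ]
  }
}
```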
00:13:06.880
Additionally, I can apply meta queries on top of that, leveraging Elasticsearch's functionality, such as deduplication, which collapses duplicates based on different fields. This makes it significantly more powerful than simply displaying raw data.
00:13:40.570
Now that I have my API, how do I make this solution portable, free, and replicable, without needing hosting? This is where Docker comes in.
00:14:00.960
Docker is trendy, and we use it at Notion all the time. It's incredibly useful in the civic hacking space, where you're dealing with lots of people in different environments. With Docker, all anyone needs to install is Docker itself, rather than Postgres, Ruby, or a Ruby version manager.
00:14:33.799
By using Docker, we contain all our API functionality within a container. So, how does it all fit together? Essentially, the process looks like this: you place the data in GitHub, and when you spin up your Docker containers, it runs a Rake task that grabs the data from the GitHub API and dumps it into an Elasticsearch index with the corresponding mappings and metadata.
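A rough sketch of that kind of seeding task is below; the repository name, data path, index name, and document type are assumptions for illustration, not the project's actual Rake task.

```ruby
# Rakefile
require 'csv'
require 'json'
require 'open-uri'
require 'elasticsearch'

task :seed do
  es   = Elasticsearch::Client.new(url: ENV.fetch('ELASTICSEARCH_URL', 'http://elasticsearch:9200'))
  repo = ENV.fetch('DATA_REPO', 'example-org/show-me-the-food') # hypothetical repository

  # List the CSV files in the data repository via the GitHub contents API.
  files = JSON.parse(URI.open("https://api.github.com/repos/#{repo}/contents/data").read)

  files.select { |f| f['name'].end_with?('.csv') }.each do |file|
    csv = URI.open(file['download_url']).read
    CSV.parse(csv, headers: true).each do |row|
      # Keep every value as a string and record which CSV file it came from.
      doc = row.to_h.transform_values(&:to_s).merge('file_source' => file['name'])
      es.index(index: 'resources', type: 'resource', body: doc)
    end
  end
end
```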
00:15:07.260
Then the Sinatra API runs in its own container and serves the hypermedia responses. I like to use Docker Compose for this so that I can spin everything up with just the 'docker-compose up' command.
00:15:38.300
This is an example of what the Compose file looks like. I'm specifying both my API container and my Elasticsearch container, which uses a stock base image, and exposing the ports appropriately.
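Since the slide isn't reproduced here, this is a simplified sketch of such a Compose file; the image tag, ports, and environment variables are assumptions matching the earlier sketches.

```yaml
# docker-compose.yml
version: '2'
services:
  api:
    build: .
    ports:
      - "9292:9292"
    environment:
      - ELASTICSEARCH_URL=http://elasticsearch:9200
      - DATA_REPO=example-org/show-me-the-food   # hypothetical data repository
    depends_on:
      - elasticsearch
  elasticsearch:
    image: elasticsearch:2.4
    ports:
      - "9200:9200"
```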
00:16:03.340
The steps are straightforward: you throw the data in the repository, build the Docker image locally, then run 'docker-compose up' or simply execute the 'docker run' command, specifying the origin repository from which the data is pulled.
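In shell terms, under the same illustrative assumptions as above, the workflow looks something like this:

```sh
# Clone the data repository (hypothetical name) and step into it
git clone https://github.com/example-org/show-me-the-food.git
cd show-me-the-food

# Option 1: bring up the API and Elasticsearch together with Compose
docker-compose up

# Option 2: build and run the API container by hand, pointing it at the
# repository the data should be pulled from (with Elasticsearch running separately)
docker build -t api-in-a-box .
docker run -p 9292:9292 -e DATA_REPO=example-org/show-me-the-food api-in-a-box
```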
00:16:33.460
Then you can start making requests over HTTP with curl or any HTTP library. The data is always exposed at /resources; the resource name stays consistent rather than changing to '/businesses' or '/organizations' for each dataset.
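For example, against the sketch above, requests might look like this; the host, port, and the dedupe parameter name are assumptions, not necessarily the project's actual query interface.

```sh
# List everything; the response comes back as Collection+JSON
curl http://localhost:9292/resources

# Filter using the query parameters advertised in the "queries" object
curl "http://localhost:9292/resources?business_type=convenience+store"

# A meta query such as deduplication might look like this
curl "http://localhost:9292/resources?dedupe=name"
```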
00:17:01.479
So, now I can deduplicate the data or use the other queries that the hypermedia format exposes. In summary, I've met all of my requirements: easy and free data storage and maintenance without altering the raw data, a flexible and robust API, no hosting required, and a setup that anyone can run.
00:17:33.790
A particularly powerful aspect of this setup is that the data stays accessible to everyone working with it. If I have ten people in a room, we can all collaborate on this data, say, building an iOS app for the canvassing tool, while all having access to the same data through the same API. It doesn't need to be hosted anywhere.
00:18:01.700
I encourage you all to get involved. I know the daily grind can take a lot of our time, but consider helping with my project, or find your local brigade and get actively involved, solving issues that directly impact people in your own community.
00:18:29.479
Thank you.