API in a box

by Shelby Switzer

In the presentation "API in a Box", Shelby Switzer discusses the challenges of managing and using local open data, particularly for smaller entities like cities and towns that lack resources. Switzer emphasizes the importance of addressing 'small data' issues in civic hacking initiatives. Key points include:

  • Civic Hacking Importance: Switzer highlights her involvement with Code for America brigades, showcasing how local data can empower communities to solve pressing issues, illustrated by the project "Show Me the Food", which aimed to identify food deserts in Atlanta.
  • Data Challenges: When working with government data, issues such as fragmented formats, duplicates, and lack of standardization create significant hurdles. For example, during the "Show Me the Food" project, two CSV files containing business data from different sources presented compatibility challenges.
  • Key Needs Identified: Switzer outlines five needs for effective data management in civic hacking:
    • Easy and free data storage and maintenance
    • Safeguards to keep the data unchanged
    • A flexible and robust API for interacting with the data
    • The ability to search non-standardized data
    • No hosting costs, given the non-profit nature of the work
  • Implementing Solutions: The initial step involved using GitHub to store open data. Subsequently, Elasticsearch was chosen as a schema-less document store to interact with this data efficiently, allowing for powerful search capabilities.
  • API Creation: A Sinatra API was set up over Elasticsearch to facilitate user interaction. This API included metadata for clear understanding and served responses in Collection+JSON, a user-friendly format that enables queries without knowledge of the original data files.
  • Docker Utilization: To ensure portability and ease of use, Docker containers encapsulate all functionality, allowing users to run the project with minimal setup. The process is streamlined using Docker Compose for easy deployment.

In conclusion, Switzer encourages developers to leverage these tools to maintain open data accessibility within their communities, emphasizing the collaborative potential of this setup without reliance on external hosting. The project reflects a model where data-driven civic initiatives can thrive through community engagement and technological simplicity. She invites others to participate in similar projects and contribute to local problem-solving efforts.

00:00:23.949 My name is Shelby Switzer. As Graham said, I work for a company down in Denver called Notion.
00:00:27.260 We make incredibly smart home sensors that can tell you if a door opened when you're not home, if you have a water leak, and what your temperature is. They can detect ambient light and do a bunch of other stuff.
00:00:35.450 At Notion, I work on the API and some of our web services. As you can imagine, being in the Internet of Things, we deal with lots of data—tons and tons of logs and distributed systems.
00:00:45.800 But today, I'm not going to talk about any of that. I'm going to talk about a project that I work on called API in a Box, which essentially works with small data. The world runs on small data, and as developers, we sometimes forget that most people in their daily lives work with Excel spreadsheets and CSV files.
00:01:01.790 They don't really have the problems of having to understand accelerometer data or know what to do with it. A perfect example of the world running on small data comes up in civic hacking all the time.
00:01:14.270 In my free time, I like to work with Code for America brigades. Civic hacking is how I got into programming years ago in South Carolina. I then moved to Atlanta and worked with the Code for Atlanta Brigade. Now that I'm in Denver, I work with the Code for Denver Brigade a lot. Small data is always a roadblock. It's not even necessarily about getting the data, which you might think is the hardest part; it’s about what to do with the data once you have it.
00:01:41.479 To illustrate, I'm going to tell you a story from Atlanta. We worked on a project called 'Show Me the Food.' Essentially, the Code for America Brigade in Atlanta had a really good relationship with the city of Atlanta, as well as the Atlanta Food Bank. We were looking for projects to help the food bank solve critical food issues in the city, one of which was food deserts, a major problem in Atlanta and also in Denver.
00:02:07.549 The brigade, along with the Atlanta Food Bank, aimed to find a way to identify food deserts, map them out, and build a canvassing tool that would let people in the community, perhaps those living in food deserts themselves, survey convenience stores. They could flag stores that might not offer a full range of healthy options but do stock fresh produce.
00:02:38.680 So, they could start mapping out where fresh produce was available as opposed to just convenience stores. But, does everyone here know what a food desert is? Let me explain.
00:02:48.420 A food desert refers to a large area where people have limited access to healthful food options. For example, they may not have access to vegetables for a nice dinner and have to rely on fast-food places or convenience stores, ultimately resulting in unhealthy diets.
00:03:15.100 After we developed the canvassing tool and identified these food deserts, we wanted to visualize our findings and ultimately change the world, or at least our small part of it. So, where do we start? This is our data: two CSV files that we received from the city.
00:03:35.050 The first file is a tax digest, which lists all the businesses in the city, and the second is a statewide list of businesses that accept EBT, similar to food stamps. As you can imagine, there is going to be duplicate data between these files. They also come from two different governmental organizations, making their format completely different. You can't trust that any of the data is standardized, consistent, or even free of typos.
00:04:13.180 When we got this data, we had to figure out what to do with it. Where do we keep it? How do we interact with it? Do we normalize it? Do we deduplicate it? Which database should we use, if we even needed one? Being developers, we all had really strong opinions about how to approach this.
00:04:42.179 The project sort of stalled because we didn't know exactly what to do with the data. Once we figured that out, we had to start building on top of it, and since we were all from different backgrounds with different environments and skill sets, organizing all of this presented a significant challenge for civic hacking.
00:05:09.280 In the process of this project, as well as many others, we identified five big needs when working with government data. The first need was an easy and free way to store and maintain the data. Additionally, we wanted to ensure the data remained unchanged. A conversation that regularly came up was whether we could simply pull the business name, latitude, and longitude, and just ditch the rest of the data.
00:05:39.279 However, just because we only needed certain details for our use case doesn't mean that the other columns wouldn’t be useful for others wanting to interact with the dataset.
00:06:06.650 We also needed to integrate or develop a flexible and robust API to make the data not only accessible but also easy to interact with.
00:06:32.800 We didn't want hosting costs because we were a non-profit, and we needed something anyone could set up simply on any server or computer; people giving up their Tuesday night to hack shouldn't have to spend the whole night installing Postgres.
00:06:57.550 So, the first step in maintaining and storing the data was to simply pop it in GitHub. The data is open anyway; it belongs to the people, so there's no reason why it can't be in an open repository. It's free, and most of us know how to work with it. It’s definitely better than passing around a thumb drive.
00:07:30.060 Next, once we want to interact with the data and get it into our environment, we need to establish a local data store to facilitate easy interaction, rather than just connecting to the GitHub API. We chose Elasticsearch as our document store for these CSV files.
00:07:54.820 Elasticsearch is schema-less, allowing us to avoid forcing data into tables, and it stores JSON documents, which most web developers are quite familiar with. It also has a RESTful API and offers powerful search capabilities, which is crucial for our purpose.
00:08:12.719 We have to be able to search and understand data that is not normalized, which can have typos and varies between integers, strings, and floats. Elasticsearch provides a lot of functionality around that, so we do a little bit of work on top of it.
00:08:43.120 When we dump the data from the GitHub repository into Elasticsearch, we essentially define the mappings. Mappings represent the columns of our data. We have fields for name, latitude, longitude, business type, and we add a metadata field for the file source, indicating from which CSV file this data originates.
00:09:14.030 We also want to maintain all data as strings. Within Elasticsearch, there are numerous data types available, but we cannot guarantee that any column will consistently be an integer, which is why we keep everything as strings.
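For illustration, a mapping along these lines would capture what's described above: every field typed as a string, plus a `file_source` metadata field. (This is a sketch, not the project's actual mapping; the `string` type matches Elasticsearch versions of the era, while newer versions use `text`/`keyword`.)

```json
{
  "mappings": {
    "resource": {
      "properties": {
        "name":          { "type": "string" },
        "latitude":      { "type": "string" },
        "longitude":     { "type": "string" },
        "business_type": { "type": "string" },
        "file_source":   { "type": "string" }
      }
    }
  }
}
```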
00:09:48.720 So far, we have solved two of our needs: we have an easy and free way to store the data in GitHub without changing it, and we keep it in Elasticsearch with some added metadata. We are now on the way to a flexible and robust API, with Elasticsearch providing the underlying functionality.
00:10:20.450 This is where Ruby comes in. We set up a simple Sinatra API on top of it, making things very simple. We also do some work to ensure that this API is engaging to interact with, and we want to make it self-descriptive with ample metadata attached to the data we retrieve.
00:10:51.930 This is where Collection+JSON comes into play. Not everyone may be familiar with using hypermedia types in their APIs, but Collection+JSON is a format designed for describing collections of data and formatting representations of those collections.
00:11:16.159 It's authored by a gentleman named Mike Amundsen, who does substantial work in the REST community. Essentially, it allows us to create a standardized format for reading data and gathering metadata about our resources.
00:11:43.200 In this standardized format, we have a top-level object called 'collection,' which gives the URL for the collection containing our resources. Then, there is an object called 'items,' which is just an array of all my resources, each with a unique URL and the data associated with it.
00:12:11.920 This resource here, for example, is Bob's Convenience Store with its latitude, longitude, business type, and also the file source it came from. What makes this particularly powerful is a secondary object called 'queries.' This takes my Elasticsearch mappings, which originate from the CSV column headers, and presents them as query parameters.
00:12:40.870 This allows me to search based on name, latitude/longitude, or business type without needing to know anything about the original CSV file. I just know the available queries as they have been provided to me.
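A response shaped as described might look like the following (a hedged sketch: the host, IDs, coordinates, and query list are illustrative, not taken from the actual project):

```json
{
  "collection": {
    "version": "1.0",
    "href": "http://localhost:9292/resources",
    "items": [
      {
        "href": "http://localhost:9292/resources/1",
        "data": [
          { "name": "name",          "value": "Bob's Convenience Store" },
          { "name": "latitude",      "value": "33.7490" },
          { "name": "longitude",     "value": "-84.3880" },
          { "name": "business_type", "value": "convenience store" },
          { "name": "file_source",   "value": "tax_digest.csv" }
        ]
      }
    ],
    "queries": [
      {
        "href": "http://localhost:9292/resources/search",
        "rel": "search",
        "data": [
          { "name": "name",          "value": "" },
          { "name": "business_type", "value": "" }
        ]
      }
    ]
  }
}
```

The `queries` array is what makes the format self-descriptive: clients discover the searchable fields from the response itself rather than from the original CSV headers.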
00:13:06.880 Additionally, I can apply meta queries on top of that, leveraging Elasticsearch's functionality, such as deduplication, which collapses duplicates based on different fields. This makes it significantly more powerful than simply displaying raw data.
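As a rough in-memory illustration of what such a dedupe meta query does (the real project leans on Elasticsearch for this; the method and record names here are hypothetical):

```ruby
# Collapse records that share the same value in a chosen field,
# keeping the first record seen for each value.
def dedupe(records, field)
  records.uniq { |record| record[field] }
end

records = [
  { "name" => "Bob's Convenience Store", "file_source" => "tax_digest.csv" },
  { "name" => "Bob's Convenience Store", "file_source" => "ebt_list.csv" },
  { "name" => "Fresh Mart",              "file_source" => "ebt_list.csv" }
]

dedupe(records, "name").length # => 2
```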
00:13:40.570 Now that I've defined my API requirements, how do I make this solution portable and free, without needing hosting, yet still replicable? This is where Docker comes in.
00:14:00.960 Docker is trendy, and we use it at Notion all the time. It's incredibly useful in the civic hacking space, where you deal with many people in different environments. With Docker, all they need to install is Docker itself, rather than Postgres, Ruby, or a Ruby version manager.
00:14:33.799 By using Docker, we contain all our API functionality within a container. So, how does it all fit together? Essentially, the process looks like this: you place the data in GitHub, and when you spin up your Docker containers, it runs a Rake task that grabs the data from the GitHub API and dumps it into an Elasticsearch index with the corresponding mappings and metadata.
00:15:07.260 Then, we access this Sinatra API in its own container, which provides hypermedia responses. I like to use Docker Compose for this so that I can spin up everything using just the 'docker-compose up' command.
00:15:38.300 This is an example of what the Compose file looks like. I'm specifying both my API container and my Elasticsearch container, which uses a base image while exposing my ports appropriately.
00:16:03.340 The steps are straightforward: you throw the data in the repository, build the Docker image locally, then run 'docker-compose up' or simply execute the 'docker run' command, specifying the origin repository from which the data is pulled.
00:16:33.460 Then, you can start making requests via curl or any library to access it through HTTP. You'd access it at /resources. The resource name remains consistent as 'resources,' rather than changing to '/businesses' or '/organizations.'
00:17:01.479 So, now I can deduplicate the data or utilize other queries that have become accessible through the hypermedia format. In summary, I've managed to meet all of my requirements: easy and free data storage and maintenance without altering the raw data, a flexible and robust API that doesn't require hosting, and anyone can set it up.
00:17:33.790 A particularly powerful aspect of this setup is that if we’re working with this data, it's still accessible. If I have ten people in a room, we can all collaborate on this data, building an iOS app for this canvassing tool, while all having access to the same data through the same API. It doesn’t need to be hosted anywhere.
00:18:01.700 I encourage you all to get involved. I know that the daily grind can take a lot of our time, but consider helping with my project or find your local brigade to become actively involved in the community. Start solving issues that directly impact people in your own community.
00:18:29.479 Thank you.