
Summarized using AI

Tracking COVID 19 with Ruby

Olivier Lacan • December 18, 2020 • Online

In the video titled 'Tracking COVID-19 with Ruby,' Olivier Lacan shares his journey and experiences as a programmer who contributed to tracking COVID-19 data in Florida through the use of Ruby. The presentation highlights the intersection of technology and public health during the pandemic, emphasizing the importance of data accessibility and transparency.

Key Points Discussed:
- Initial Awareness of COVID-19: Olivier recounts his early awareness of COVID-19 from his co-worker whose family was directly affected in Wuhan, China, leading to his growing curiosity about the virus's spread in the U.S.
- Data Discovery: After discovering state and county health websites, Lacan recognized the absence of comprehensive data, prompting him to join the COVID Tracking Project, which was initiated to compile and present accurate testing and case information.
- Volunteer Efforts: He describes his transition from coding to data entry within a volunteer team, emphasizing the collaborative nature of their work through platforms like Slack and Google Sheets. They aimed to ensure reliable and updated data was available for public consumption.
- Ruby Tools Developed: Lacan developed several Ruby-based tools to facilitate data gathering, including Paperboy, which automated news searches for COVID-19 updates, and a parser for accessing JSON data from Florida's health dashboard. His work illustrated how a basic understanding of programming can drive substantial impact in public health initiatives.
- Data Transparency in Florida: Lacan highlights issues around data opacity from Florida's Department of Health and discusses instances of misinformation from the state administration about COVID-19 metrics and hospitalizations. His efforts led him to connect with local journalists and researchers to advocate for more transparent data reporting.
- Public Advocacy: He shares his insights on the importance of funding journalism and supporting accurate reporting, emphasizing the ethical responsibility programmers and technologists hold in times of public crises. Lacan concludes with a call to action, urging technologists to volunteer and contribute their skills for the greater good.

Through this narrative, Olivier emphasizes the power of technology in public health, the necessity for collaboration, and the critical role of accurate data in combating misinformation during the pandemic.


Programmers are not epidemiologists, but epidemiologists have never needed programmers more. Not for our viral opinions but for our ability to retrieve large data sets and make them understandable through expressive code. As the pandemic was silently taking hold in the United States in early 2020, I used simple web and Ruby tools to gather invaluable data from often obscure state data sources in order to understand the extent of the pandemic in my area. I never expected this would lead me to become a contributor to the pirate CDC.

Olivier Lacan
Olivier likes to use computers to help people. He maintained Code School for many years and now builds tools to support exciting new learning modalities at Pluralsight. He created those Shields.io badges that plaster your open source READMEs and tried to make those same projects more accessible with Keep a Changelog. Recently, he contributed to the COVID Tracking Project at The Atlantic and focused on uncovering pandemic data for the state of Florida.

RubyConf 2020

00:00:00.560 Hi RubyConf! I know it feels like COVID-19 is tracking you right now, but let's talk about tracking it with Ruby. I'm Olivier Lacan, I used to build Code School, and I created a project called Keep a Changelog. Today, I work at Pluralsight where I make tools to manage our content.
00:00:20.080 Do you remember when you first heard about COVID-19? For me, it was in December 2019. My co-worker Frank is from China, and his wife lived in Wuhan, yes, that one. During meetings, Frank would give us updates about his wife—how she had to stay home, order groceries, and how weird everything was. He wanted to send her masks by mail. On December 31, 2019, the WHO was informed of the outbreak in Wuhan City.
00:00:38.480 Infections were reported in Japan and South Korea on January 20, then in the US and Taiwan the next day. Oddly, on January 22, the Netflix show 'Pandemic' came out. I didn't watch it then; I thought it felt alarmist. The subtitle was 'How to Prevent an Outbreak.' On January 25, the day of the Lunar New Year, my friends and I went to Chuangloo Garden, a Chinese restaurant in Orlando, hoping to see a parade.
00:01:02.160 When we walked in around lunchtime, the place looked empty. There were maybe two or three tables, and it's not really a small place. It was weird; there was no parade either—that was weirder. I went home with that strange feeling that lingered for days. One question kept creeping back: is it spreading here? A few days later, I stumbled upon the Johns Hopkins University dashboard. This is how it looked on February 2, 2020—42,000 cases worldwide, mostly in mainland China.
00:01:38.560 But a month later, on March 8, there were a hundred thousand cases worldwide. I don't remember February much; I don't know if you do. I just remember the feeling of weirdness not going away. The same question was nagging: is it here in Orlando? No one seemed to know. Eventually, I thought there must be some stats somewhere, either at the state or the county level, and that's when I found the Florida Department of Health website. On March 14, they had a mostly empty page listing 70 positive cases in all of Florida.
00:02:06.800 Four deaths were listed, but it was unclear whether those people died in Florida or somewhere else. Two days later, on March 16, the page had become a large HTML table. Tests from private labs were added, and it showed if a positive case had traveled or been in contact with other cases. Five deaths were reported. On March 17, I saw that the Florida Department of Health had released a new dashboard similar to the one from Johns Hopkins. On GitHub, I also found a project called COVID-19 Tracking, with a crawler written in Ruby that was ingesting data from the older HTML table.
00:02:45.080 I opened an issue, of course, and the next day I got an email from Jeff Hammerbacher, who co-founded the COVID Tracking Project, asking if I wanted to join. I said, 'Absolutely! Sign me up.' The COVID Tracking Project was co-founded by Jeff and two journalists from The Atlantic, Robinson Meyer and Alexis Madrigal. They had just started the project on March 7, after writing a piece about how little testing was happening in the US.
00:03:16.199 I offered my Ruby help and mentioned I'd found a way to reverse engineer the JSON endpoints that were feeding data to the new Florida dashboard. It was incredible to find so many dedicated people who felt that something was going very wrong as well. Within hours, instead of writing Ruby code, I was doing data entry by hand in a spreadsheet with dozens of other volunteers. And this spreadsheet, it was glorious! I've never seen anything like it. The whole API of covidtracking.com is built on top of this one spreadsheet. Back in March, dozens of people worked together on Slack in different shifts, three times a day, watching emergent processes being developed for these data entry shifts.
00:03:59.680 Watching how the Google Sheets format worked was extremely useful because everyone could see who's working on what. When you start working on a state, you mark it with your initials, start gathering data, and fill out each column. If anything's off, you can make notes for whoever does data entry next, or you could just start a Slack thread to discuss it. Every single data entry is double-checked independently by someone else before publication. Now, that might seem overly manual, but there's some automation. There's a Python tool called urlwatch, which is used to watch for updates to data sources.
00:04:37.680 If anything changes on the part of a website we use to pull data, urlwatch will flip a flag so the data entry team knows that there might be an update to check for. This is not to say that my lovely Ruby code wasn't needed. One of the biggest hurdles when the project started back in March was that we simply did not have data for a lot of states, or what little data we had was outdated or extremely inconsistent. To help with that, we had a group of people specifically poring over news reports for any state or territory that had few official updates.
00:05:03.440 On March 21, with the help of my friend Casey Faist, I put together a really simple Ruby tool called Paperboy. Using the free News API, it allowed us to run automated searches against articles mentioning COVID-19 in specific states, with specific keywords like 'positive.' This might sound incredibly simplistic, but it helped us ensure that we didn't miss anything from the local press. The tool itself was incredibly simple. Originally, it was a script I had to run on my own machine, but then I decided to make it a Rack app hosted on Heroku, and it's actually still running on a free dyno to this day.
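A search like the one Paperboy ran can be sketched in a few lines of Ruby. This is a minimal sketch, not Paperboy's actual code: the helper names, the keyword list, and the exact query string are assumptions, and the request itself needs a real News API key.

```ruby
require "json"
require "net/http"
require "uri"

# News API's article search endpoint.
NEWS_API_ENDPOINT = "https://newsapi.org/v2/everything"

# Hypothetical helper: builds a search URL for one state and a keyword list.
def news_search_uri(state:, keywords:, api_key:)
  # Produces a query such as: "COVID-19" AND "Florida" AND (positive OR confirmed)
  query = format('"COVID-19" AND "%s" AND (%s)', state, keywords.join(" OR "))
  uri = URI(NEWS_API_ENDPOINT)
  uri.query = URI.encode_www_form(q: query, sortBy: "publishedAt", apiKey: api_key)
  uri
end

# Pulls title, source, and URL out of a News API-style JSON response body.
def extract_headlines(json_body)
  JSON.parse(json_body).fetch("articles", []).map do |article|
    { title: article["title"], source: article.dig("source", "name"), url: article["url"] }
  end
end

# Performs the actual request (requires network access and a valid key).
def search_news(state:, keywords:, api_key:)
  uri = news_search_uri(state: state, keywords: keywords, api_key: api_key)
  extract_headlines(Net::HTTP.get(uri))
end
```

Keeping the URL building and the response parsing separate from the network call makes each piece easy to check by hand, which matters when nobody has time to write a test suite.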
00:05:40.399 And this is my point: what brought me to this project was, first, my curiosity, and second, my ability to find random data on the internet. A lot of people with more time on their hands than I do can find random data online. However, a much smaller set of people can set up a tiny app to talk to a free API to make something tedious a lot less tedious, and therefore realistic for a group of people to do. That's kind of a magical power.
00:06:11.120 I can tell you, working with volunteers who are very talented in their respective fields of expertise definitely helps you realize how much you can accomplish with basic programming skills. I mentioned how hard it was to find good data. Surprisingly, some of the biggest states in the US had the worst data, with some only communicating data through awkward and unnecessary press conferences, where volunteers had to take screenshots of low-quality video streams. But unexpectedly, Florida was doing well. I mentioned earlier how the Florida dashboard led me to the COVID Tracking Project. That dashboard intrigued me. It was clearly pulling data asynchronously, so I started poking at it.
00:07:07.520 It was making dozens of JSON requests, but it was really hard to tell if it was gathering pre-rendered HTML or just communicating with a back-end API, which would have been really interesting. Then I found it—there was a pattern in the URLs. It was retrieving JSON from a REST endpoint system with clearly named feature layers. After poking around the ArcGIS website where it was hosted for a few days and comparing notes with other scraping-focused programmers in the project, who found similar sources for different states, I realized that there was essentially an open directory of available feature layers from the Florida Department of Health.
00:08:06.800 I found fascinating data about EPA brownfields in my neighborhood, but more importantly, I found a catalog of COVID-related data published by the FDOH, including state, county, and even individual case data. In those feature layers, I discovered a data schema that was sometimes even documented. The data itself was JSON. Obviously, I wanted it all, so I made a parser on March 29, again using basic tools: an HTTP GET, JSON.parse, and some good old-fashioned printing to standard out. The following day, I found a similar source for California and added a way to export to CSV to share with my co-volunteers.
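The parser described above can be sketched roughly as follows. The layer URL is a placeholder and the helper names are hypothetical. The sketch assumes the usual ArcGIS feature layer response shape, where each record sits under `features[].attributes`.

```ruby
require "csv"
require "json"
require "net/http"
require "uri"

# Builds the query URL for a feature layer. layer_url is a placeholder;
# the talk does not give the real FDOH layer addresses.
def layer_query_uri(layer_url)
  uri = URI("#{layer_url}/query")
  # where=1=1 matches every record; f=json asks for plain JSON.
  uri.query = URI.encode_www_form(where: "1=1", outFields: "*", returnGeometry: false, f: "json")
  uri
end

# Turns a feature layer response into an array of attribute hashes.
def parse_features(json_body)
  JSON.parse(json_body).fetch("features", []).map { |feature| feature.fetch("attributes") }
end

# Serializes the rows to CSV, using the first row's keys as the header.
def to_csv(rows)
  return "" if rows.empty?

  headers = rows.first.keys
  CSV.generate do |csv|
    csv << headers
    rows.each { |row| csv << row.values_at(*headers) }
  end
end

# HTTP GET plus JSON.parse (requires network access).
def fetch_layer(layer_url)
  parse_features(Net::HTTP.get(layer_query_uri(layer_url)))
end
```

Printing to standard out is then just `puts to_csv(fetch_layer(url))`, and redirecting that output to a file gives a spreadsheet-friendly export.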
00:08:58.480 I also made a new website and called it Flovid. I generated a simple HTML table of the data aggregated from the raw JSON sources to make them more easily digestible for data entry. I even added some caching to speed things up so people wouldn't have to wait for the large JSON payload, which contained 2,000 positive cases, to download and get parsed. Each metric was listed with its full name, marked by a star if we tracked that particular metric with the COVID Tracking Project, and included descriptions to help people understand what it meant in the context of Florida. For example, many metrics included both residents and non-residents, which wasn't necessarily the case for other states.
00:09:41.480 On April 1, I wanted to add new pages for Utah and Washington, so I needed some sort of router to show different data based on the path in the URL. After trying a few different solutions, I settled on Hanami Router, which worked great and reminded me of Sinatra. I liked it so much that I added Alaska, Georgia, and Texas the same day and refactored the app to be a bit more object-oriented, with a parent state class that all states inherited from.
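The shape of that refactor, one page per state chosen by URL path with every state inheriting from a parent class, might look like the sketch below. It uses a plain Rack-compatible lambda and a hash as a stand-in for Hanami Router, whose configuration is not shown in the talk. The class names, paths, and placeholder metric are assumptions.

```ruby
# Parent class: subclasses only override what differs between states.
class State
  # Subclasses override this to fetch and tabulate their own sources.
  def metrics
    {}
  end

  # Renders the state's name followed by one line per metric.
  def render
    lines = metrics.map { |label, value| "#{label}: #{value}" }
    ([self.class.name] + lines).join("\n")
  end
end

class Florida < State
  def metrics
    { "Positive cases" => "(loaded from the state's JSON source)" }
  end
end

class Utah < State; end
class Washington < State; end

# Path-to-class table standing in for the router.
ROUTES = { "/fl" => Florida, "/ut" => Utah, "/wa" => Washington }.freeze

# A Rack app is any object that responds to call(env) and returns
# [status, headers, body].
APP = lambda do |env|
  state_class = ROUTES[env["PATH_INFO"]]
  if state_class
    [200, { "content-type" => "text/plain" }, [state_class.new.render]]
  else
    [404, { "content-type" => "text/plain" }, ["Unknown state"]]
  end
end
```

With this layout, adding a state is one small subclass plus one entry in the route table.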
00:10:23.679 Each state had roughly equivalent kinds of data: county level, statewide, and rarely case-line detail as extensive as Florida's. I decided to cache each state's data for 15 minutes in Redis to avoid constantly hammering servers for data that was definitely not refreshed more than once a day. But I added a cache buster query parameter that could be triggered by clicking a force reload link—not my best idea, but it did the trick. The next day, I added Louisiana and decided to rename the site to Ovid, since it was no longer specific to Florida.
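The caching behavior described here, a 15-minute expiry plus a force-reload escape hatch, can be sketched with a small in-memory class. This is a stand-in for Redis rather than the site's actual code. The class name and the injectable clock are assumptions made so the sketch runs without a Redis server.

```ruby
# In-memory stand-in for a Redis cache with per-key expiry.
class TtlCache
  def initialize(ttl_seconds:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @ttl = ttl_seconds
    @clock = clock
    @store = {}
  end

  # Returns the cached value unless it is missing, expired, or force is true,
  # in which case the block is called and its result is stored.
  def fetch(key, force: false)
    now = @clock.call
    entry = @store[key]
    if force || entry.nil? || now >= entry[:expires_at]
      value = yield
      entry = { value: value, expires_at: now + @ttl }
      @store[key] = entry
    end
    entry[:value]
  end
end

# Usage: cache each state's payload for 15 minutes; a cache-buster query
# parameter would map to force: true.
CACHE = TtlCache.new(ttl_seconds: 15 * 60)
```

The `force:` flag is the cache buster: it skips the expiry check and overwrites whatever is stored, which is convenient but also lets any visitor trigger a fresh upstream request.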
00:11:11.240 In the following days, expecting more states to join the ranks, I rewrote the routing layer to use an array of state classes dynamically populated using the nifty `inherited` class hook. I also made a sort of DSL to define each metric I tracked, whether it was a raw value or an aggregate sum. Then, I added Minnesota, New Jersey, and Missouri to the list. Almost every day came with fixes as states perpetually changed their metrics, names, data layers, or boolean values without any notice. As infection rates were climbing, I tried to find data about hospital bed availability, but my contact at the FDOH told me that they couldn't even get this information from the state.
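A registry built on Ruby's `inherited` hook, together with a small metric DSL, might look like this. It is a sketch under assumptions: the DSL method names (`raw`, `sum`), the field names, and the data layout are hypothetical and not taken from the real project.

```ruby
class State
  # Every subclass lands in this array as soon as it is defined.
  def self.registry
    @registry ||= []
  end

  # Ruby calls this hook on the parent whenever a class inherits from it.
  def self.inherited(subclass)
    super
    State.registry << subclass
  end

  # Metric definitions are stored per subclass.
  def self.metrics
    @metrics ||= {}
  end

  # DSL: a raw metric reads one field from the statewide record.
  def self.raw(metric_name, field:)
    metrics[metric_name] = ->(data) { data[:statewide][field] }
  end

  # DSL: an aggregate metric sums one field across all county records.
  def self.sum(metric_name, field:)
    metrics[metric_name] = ->(data) { data[:counties].sum { |county| county[field].to_i } }
  end

  # URL path derived from the class name, for example "/minnesota".
  def self.path
    "/#{name.downcase}"
  end

  # Applies every metric definition to the fetched data.
  def self.tabulate(data)
    metrics.transform_values { |extractor| extractor.call(data) }
  end
end

class Minnesota < State
  raw :deaths, field: "Deaths"
  sum :positives, field: "Positive"
end

# The routing table can now be derived instead of maintained by hand.
ROUTE_TABLE = State.registry.to_h { |state| [state.path, state] }
```

Because the registry fills itself, adding a state means defining one class. The catch is that `ROUTE_TABLE` only includes classes that were loaded before it was built.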
00:12:02.800 Towards the end of April, I had to start working on a new project for work and had to scale back my volunteering quite a bit, which made me feel terrible but also provided a much-needed distraction from the mentally draining work of tracking disease and death. It was while I was away from daily data entry volunteering that the scaling issues started happening. I had error monitoring set up with Bugsnag, and I would get timeout reports from Ovid because the case line data was getting so big that the ArcGIS servers were starting to struggle under intense load.
00:12:55.200 At first, I made the cache expire less frequently. Then one day, strangely, I was reporting fewer cases than the data entry folks at the COVID Tracking Project. After scratching my head for a while, I realized that the ArcGIS API endpoints were paginated and returning only 200 or 2,000 items at a time when the response included geographic data. Initially, I managed to get over that limit by no longer requesting unnecessary geographic data, which allowed me to get back more records. But soon enough, even the new maximum was exceeded, as Florida cases started to sharply rise in June, so I had to get creative.
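Walking a paginated endpoint like this one can be sketched as a loop over offsets. The sketch assumes the ArcGIS query parameters `resultOffset`, `resultRecordCount`, and `returnGeometry`, and the `exceededTransferLimit` flag in the response. The helper names are hypothetical, and the page fetcher is injected so the loop runs without network access.

```ruby
# Query parameters for one page; skipping geometry keeps payloads small.
def page_params(offset, limit)
  { where: "1=1", outFields: "*", returnGeometry: false,
    resultOffset: offset, resultRecordCount: limit, f: "json" }
end

# fetch_page is any callable taking (offset, limit) and returning a parsed
# response hash. Pages are requested one after another until the server
# stops reporting that more records remain.
def fetch_all_records(fetch_page, page_size: 2000)
  records = []
  offset = 0
  loop do
    page = fetch_page.call(offset, page_size)
    features = page.fetch("features", [])
    records.concat(features.map { |feature| feature["attributes"] })
    # Stop when the server says nothing is left, or when a page comes back empty.
    break unless page["exceededTransferLimit"] && !features.empty?

    offset += features.size
  end
  records
end
```

Advancing the offset by the number of records actually returned, rather than by the requested page size, keeps the loop correct when the server caps pages below what was asked for.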
00:13:51.240 I did this big refactor, where I extracted the case line data tabulation into a Sidekiq worker so it could work asynchronously, unlike the other easier-to-calculate metrics. I was trying really hard not to use threads; I've been working with Ruby for ten years, and threads still scare me. I struggle to reason about them. I knew threads could dramatically speed up processing different offsets from the paginated API endpoints, assembling everything together whenever the last thread was done, but the pain of the 30 or so seconds it took for the workers to go through all the pages wasn't really enough to warrant a tricky rewrite just yet.
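The threaded approach being weighed here, one request per offset with the pages reassembled once the last thread finishes, can be sketched as below. It assumes the total record count is known up front, and the function name is hypothetical. It illustrates the idea only and is not the code that was eventually written.

```ruby
# Fetches every page concurrently, one thread per offset, then returns the
# records in offset order. fetch_page is a callable taking (offset, limit)
# and returning an array of records.
def fetch_pages_in_threads(fetch_page, total:, page_size:)
  offsets = (0...total).step(page_size).to_a
  threads = offsets.map do |offset|
    Thread.new { fetch_page.call(offset, page_size) }
  end
  # Thread#value waits for the thread to finish and returns its result,
  # re-raising any exception raised inside the thread.
  threads.flat_map(&:value)
end
```

Because the threads array stays in offset order and each thread only returns a value, no shared state is mutated, which avoids most of what makes threads hard to reason about. A real version would likely cap the number of concurrent requests to avoid adding to the load on a struggling server.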
00:14:44.080 But of course, that's when the ArcGIS API, either due to load or the constant breaking changes the FDOH was making, started routinely timing out during one of the many paginated requests I had to make in sequence. That meant I had to ensure that the total count of records matched the final array of cases I was retrieving. I had a few close calls where I felt really bad about not having written tests, but considering how much flux everything was in, it didn't feel like a productive thing to do every week, or sometimes every day, when something would change or break. I was starting to lose sleep worrying about the rising cases and hospitalizations in Florida.
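The safeguard mentioned above, checking that the assembled array matches the total the server reports, pairs naturally with a retry around each request. A minimal sketch, with hypothetical helper and error names: it assumes the expected total comes from a separate count query (ArcGIS supports `returnCountOnly=true` for this).

```ruby
require "timeout"

class IncompleteDataError < StandardError; end

# Runs the block, retrying when it raises one of the given errors, up to
# `attempts` tries in total. Net::OpenTimeout and Net::ReadTimeout both
# inherit from Timeout::Error, so HTTP timeouts are covered by default.
def with_retries(attempts: 3, errors: [Timeout::Error])
  tries = 0
  begin
    tries += 1
    yield
  rescue *errors
    retry if tries < attempts
    raise
  end
end

# Returns the records when complete; raises instead of publishing a short count.
def verify_complete!(records, expected_count)
  return records if records.size == expected_count

  raise IncompleteDataError, "expected #{expected_count} records, got #{records.size}"
end
```

Raising on a mismatch means a timed-out page produces a visible error instead of an undercount that looks like good news.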
00:15:46.080 As Ron DeSantis, the governor, was lying and obfuscating on TV about how everything was going to be fine, it clearly was not, judging by the data and which way it was pointing: up. That's when something weird happened. I realized that through all the work I'd been doing to gather data about COVID-19 in Florida, I knew a lot of stuff. I had intuitions based on patterns in the data, delays in data releases, and some hunches about why some metrics were being withheld. For example, it might seem unbelievable, but until July, six months into the pandemic, there was no way to know how many hospital beds were occupied throughout the state of Florida and how many of those occupied beds were for COVID-19 patients.
00:16:54.560 The early lead Florida had on data transparency had been squandered, and that's no surprise. One of the people I was emailing back and forth with to better understand the Florida data was the GIS (geographic information systems) manager who created the now-famous or infamous Florida dashboard, Rebekah Jones. In early May, she was asked by her leadership to provide data to support the plan to reopen Florida. A standard for community spread and test positivity had been established in order to determine whether it was safe to reopen specific Florida counties.
00:17:41.760 The large, populous, and left-leaning counties of Miami-Dade, Orange, and Duval seemed to be under the established safety threshold, meaning it was probably safe for them to reopen. But several rural counties, heavily right-leaning, were not satisfying the reopening criteria at all. The only way for the DeSantis administration to push reopening in the face of this was to ignore the criteria or make exceptions for counties that didn't satisfy them. Rebekah Jones was fired, and the governor then went on national TV next to Vice President Mike Pence to claim that there were pending criminal charges against her. He called her insubordinate on live TV.
00:18:29.680 This whole episode was mortifying. I worried that my contacts with her contributed in some way to her firing. By that point, I had established contacts with some local Florida journalists, and I didn't know if they would agree that this was a story. Thankfully, they did. Rebekah Jones didn't take the smearing kindly either. A few weeks later, she appeared on CNN and broke down exactly how disingenuous the administration was being about test positivity and repeat testing. This event alone ended up being a tipping point among people who depended on the data published by the FDOH, which, up until that point, had been trusted.
00:19:14.160 To my surprise, many of those people weren't journalists but academic researchers at Florida universities like UCF, USF, UF, and others. Many had thankfully backed up a lot of data, and some, like Jennifer Larson at UCF, were doing incredible work with medical examiners to figure out whether excess mortality in 2020 could tell us anything about deaths that hadn't been properly attributed to COVID-19 in Florida. This is where the kind of community building I did on my first-ever website came in handy. I somehow ended up in touch with people who understood Florida better than anyone else.
00:20:10.800 So, I reached out to more journalists to make sure the researchers, epidemiologists, and the press were talking to each other. Folks at the Orlando Sentinel, the Tampa Bay Times, the Sun Sentinel, the Miami Herald, and even Politico Florida did incredible work to understand and contextualize all the ways in which Florida data was especially confusing because we were missing key metrics. I'm bad at math, but I'm so glad there's a strong fact-checking culture both in local journalism and in the COVID Tracking Project itself.
00:21:01.840 Speaking of local journalism, I never paid for local journalism until this year. I just cursed at the paywalls as someone who was raised on free websites. Many investigative journalists have learned the intricacies of things like epidemiology to make sure they're accurately reporting factual details to their readers. That takes a lot of work, especially when you don't have any scientific training.
00:21:36.560 In early July, the list of troublesome things with Florida COVID-19 data was getting more clearly defined. At that point, Erin Kissane, the managing editor of the COVID Tracking Project, asked me and a fellow volunteer who's a public health researcher in Florida whether it would be okay to write a blog post for the project specifically doing a case study on Florida data issues. We ended up basically writing what felt like a white paper.
00:21:57.840 Once again, my abilities as a programmer came in incredibly handy, despite the fact that it was a written piece. Based on the reopening criteria that had, by then, been published by the state of Florida, I wrote a very convoluted script that ingested Florida data at the county level, something the COVID Tracking Project wasn't doing. The script looked at changes in influenza-like and COVID-like illnesses reported in each county, weekly positive resident cases, test positivity rates, and testing capacity increases. It then produced a list of counties that were satisfying the criteria for the past week or the past two weeks.
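The kind of county-by-county check described above can be sketched as follows. Everything specific in it is an assumption made for illustration: the field names, the week-over-week definition of a downward trend, and the way the rules are combined. None of it reproduces the state's published criteria or the original script.

```ruby
# True when the series fell every week across the last `weeks` weekly changes.
def trending_down?(series, weeks:)
  window = series.last(weeks + 1)
  window.size == weeks + 1 && window.each_cons(2).all? { |prev, curr| curr < prev }
end

# True when the series never dropped across the last `weeks` weekly changes.
def not_decreasing?(series, weeks:)
  window = series.last(weeks + 1)
  window.size == weeks + 1 && window.each_cons(2).all? { |prev, curr| curr >= prev }
end

# Hypothetical rule set: illness visits falling, cases or positivity falling,
# and testing volume flat or rising.
def meets_criteria?(county, weeks:)
  trending_down?(county[:ili_visits], weeks: weeks) &&
    trending_down?(county[:cli_visits], weeks: weeks) &&
    (trending_down?(county[:new_cases], weeks: weeks) ||
      trending_down?(county[:positivity], weeks: weeks)) &&
    not_decreasing?(county[:tests], weeks: weeks)
end

# Names of the counties that satisfy every rule over the chosen window.
def counties_in_the_green(counties, weeks:)
  counties.select { |county| meets_criteria?(county, weeks: weeks) }
          .map { |county| county[:name] }
end
```

Running `counties_in_the_green` once with `weeks: 1` and once with `weeks: 2` gives the two lists mentioned above, one for the past week and one for the past two weeks.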
00:22:43.920 The two-week window was the reopening criterion defined by the state at the time. We ended up publishing the case study; none of Florida's 67 counties were in the green. In our piece, we gave a list of six ways to improve Florida's data release: daily and cumulative data on probable cases and deaths; current COVID-19 patients in hospitals, in ICUs, and on ventilators; more comprehensive case, hospitalization, and death data broken down by race and ethnicity; more frequent updates on long-term care facilities; and separating the kinds of tests so that varied rates of false negatives wouldn't be mixed.
00:23:32.919 Finally, we requested the number of first-time positive cases used to calculate the atypical Florida test positivity metric to be communicated. We also asked for more machine-readable documents since a lot of the data was published as PDF documents, which were difficult to parse. Meanwhile, Florida newspapers had to sue the state of Florida in order to try and get this data out.
00:24:09.840 We're not quite sure what did the trick, but a few days after we published our post, the Florida Agency for Health Care Administration finally released data about COVID-19 hospitalizations, although it restricted it to people with a primary diagnosis of COVID-19, which, for example, excluded pregnant women who were sick with COVID-19 but had been admitted for their pregnancy. Without question, Ruby had a big impact on my ability to produce expressive and productive code for this project.
00:24:52.000 I was able to perpetually churn on it and refactor it. I definitely feel much more comfortable now with threads after creating a threaded PDF downloader and other cool tools, but I want to leave you with some things that I've learned that have little to do with code. Pay for your news. Ever since the pandemic started, most of the journalists in my area were furloughed every other week. It felt surreal to depend on them for critical information that the state government was willfully withholding or misinterpreting.
00:25:47.520 On the national level, although The Atlantic has supported the COVID Tracking Project, I'm sure some of you have not heard of the incredible stories written by folks like Ed Yong, but also Alexis Madrigal and Robinson Meyer. Amazingly dedicated and talented journalists like Florida reporters Adelaide Chen, Naseem Miller, Ben Conarck, Mario Ariza, Aric Chokey, Daniel Chang, and so many others exist in your communities.
00:26:40.000 Vote for science because, as the Netflix series 'Pandemic' shows, it takes a lot of research and funding to protect the world effectively against the next pandemic and to understand this one. Promote healthcare for all because, aside from the absurdity of employer-dependent healthcare during a global recession with massive unemployment, private health insurance just mathematically and statistically makes no sense.
00:27:22.560 And although it's no longer the season for black squares on social media, fight systemic racism because it literally kills Black, Indigenous, and people of color, not just through violence but through health-related outcomes, which are tragically worse than for white people in the United States. If there needed to be another argument against private prisons, the grossly mismanaged incarceration systems in the U.S. alone led to inmates seeing five times the level of infections compared to surrounding populations, and to more preventable deaths as well.
00:27:57.440 The Marshall Project and the Equal Justice Initiative both contribute critical understanding of the impact of COVID-19 in prisons and jails. And finally, contribute your own time and money. You likely are part of a group of people who can still work during this global pandemic, and although jobs were lost in our industry too, our position affords us the ability to lend a hand. So, volunteer. Do it.
00:28:40.480 All problems are not technology problems, but technologists can help if they show up. Show up.
00:29:12.800 Thank you.