Data always plays a critical role in the ability to research, study, and combat public health emergencies, and nowhere is this more true than in a global pandemic. Access to datasets, and to tools that can analyze that data at cloud scale, is increasingly essential to the research process and has been particularly useful in the global response to the novel coronavirus.
The Google Cloud Public Datasets Program (PDP) provides access to nearly 150 high-demand datasets spanning a range of industry verticals, with more added regularly. These datasets are onboarded and maintained by Google Cloud, with input and guidance from a variety of data providers, such as the Census Bureau, the National Weather Service, and the U.S. Geological Survey.
Additional examples include the National Water Model, detailing information about flooding and water movement across the continental United States; Broad References, containing human genomics reference files used for sequencing analytics; and the Global Surface Summary of the Day, which provides meteorological observations of weather stations around the world every day going back over 100 years.
This powerful resource is a playground for analysts and data scientists, who can unlock new insights from their own data by contextualizing it with data provided through the PDP. It helps create a more complete picture of customers, patients, or products by linking data to vast public datasets with a single line of code. That's the power of the PDP.
“We pull data from a lot of public sources that you normally would have to research, source, clean, prep, correctly format, and download,” says Michael Hamamoto Tribble, Head of Google Cloud Datasets. “Traditionally, that data would need to be sorted and uploaded to a database before end-users could work with it.”
“Our Public Datasets Program takes care of all the prep work, wrangling, cleaning, and aligning the data for easy access in BigQuery tables. It can also easily connect different databases together. It’s remarkable.”
Michael Hamamoto Tribble | Head of Google Cloud Datasets
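As a rough illustration of the kind of single-query join the PDP enables, the sketch below builds a BigQuery SQL statement that links a customer-owned table to Google's public COVID-19 Open Data table. The table name `my_project.sales.daily_orders` and its columns are hypothetical; the join columns on the public table (`date`, `country_code`, `new_confirmed`) are assumed from the dataset's published schema.

```python
# Sketch: linking your own data to a PDP public dataset in one BigQuery query.
# `own_table` is a hypothetical customer-owned table with order_date and
# country_code columns; the public table is Google's COVID-19 Open Data.

def build_join_query(own_table: str) -> str:
    """Return SQL joining a customer table to the public COVID-19 table."""
    return f"""
    SELECT o.order_date, o.country_code, c.new_confirmed
    FROM `{own_table}` AS o
    JOIN `bigquery-public-data.covid19_open_data.covid19_open_data` AS c
      ON c.date = o.order_date AND c.country_code = o.country_code
    """

sql = build_join_query("my_project.sales.daily_orders")
# To execute (requires the google-cloud-bigquery library and credentials):
#   from google.cloud import bigquery
#   rows = bigquery.Client().query(sql).result()
```

Because the public table lives in BigQuery alongside the customer's own data, no download, cleaning, or upload step is needed before running the join.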
To help organizations adapt and meet their customers’ changing needs during the COVID-19 pandemic, SADA and Google Cloud set out to create a series of public domain COVID-19 datasets to aid researchers, data scientists, and analysts in developing data-driven models to better understand the spread of COVID-19.
Knowing that COVID-19 datasets would rapidly evolve at high volumes, SADA and Google Cloud wanted to update the PDP infrastructure to address the unique challenges of the COVID-19 datasets, such as increasing capabilities for data validation and alerts and implementing quality controls to ensure datasets remain up-to-date.
SADA, a Google Cloud Premier Partner and three-time Google Cloud Reseller Partner of the Year, partnered with Google Cloud to develop COVID-19 dataset pipelines for ten key states, leveraging information from healthcare organizations such as the American Hospital Association and The COVID Tracking Project.
SADA developed a framework to obtain the required information and to autogenerate specific schemas and tables that could be easily applied to the remaining forty state government health-related websites.
Through this partnership, SADA’s team of technical and professional services experts worked with Google Cloud to design and implement new, backend data pipelines for the PDP to capture data from public data sources. SADA developed a reference implementation for the COVID-19 dataset with reusable code to refactor the existing PDP pipeline for other datasets in a fully automated way, requiring minimal configuration.
This project required a quick turnaround due to the fast-moving nature of the pandemic, with customers' needs changing virtually overnight. Meeting three times a week, the SADA and Google Cloud technical teams worked closely together to deliver the comprehensive COVID-19 dataset in only 90 days.
SADA also delivered the new standardized pipeline framework as a reference implementation that enables customers to publish datasets using a small virtual machine running Python code, Dataflow, or Kubernetes, depending on the size of the dataset to be loaded into BigQuery.
“Now anyone can refer to the ingestion framework and build pipelines that collect data from disparate sources and push it into BigQuery simply by providing endpoints and a few configuration details, requiring only minimal coding effort.”
Michael Hamamoto Tribble | Head of Google Cloud Datasets
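To make the "endpoints plus a few configuration details" idea concrete, here is a minimal, hypothetical sketch of what one step of such a config-driven pipeline could look like. This is not SADA's actual framework: the config keys, the example URL, and the destination table name are all illustrative assumptions.

```python
# Illustrative sketch of a config-driven ingestion step (not the actual
# SADA/PDP framework): a pipeline is described by an endpoint plus a
# schema mapping, and the framework handles fetch -> transform -> load.
import csv
import io

PIPELINE_CONFIG = {
    "endpoint": "https://example.org/state_health/daily.csv",  # hypothetical URL
    "destination": "covid19_demo.state_daily",                 # hypothetical table
    "schema": {"report_date": "date", "positives": "cases"},   # dest_col: source_col
}

def transform(raw_csv: str, schema: dict) -> list:
    """Rename source columns per the schema mapping and drop everything else."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    return [{dest: row[src] for dest, src in schema.items()} for row in reader]

# A real run would fetch PIPELINE_CONFIG["endpoint"], call transform(), then
# load the rows into the destination BigQuery table.
sample = "date,cases,notes\n2020-06-01,42,ok\n"
rows = transform(sample, PIPELINE_CONFIG["schema"])
```

The point of this shape is that onboarding a new state's data source means writing a new config entry, not new pipeline code.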
SADA also achieved advanced automation by implementing the underlying infrastructure as code, in addition to the data pipeline, making it simple to replicate the complete environment in support of customers’ evolving needs.
Overall, SADA’s partnership with Google Cloud helped customers by:
- Delivering the public domain COVID-19 Open Data datasets in only 90 days
- Accelerating a shared understanding of how the coronavirus spreads
- Developing reusable code to easily support legacy data sources
- Replacing the existing PDP pipeline with a modern architecture to support the unique challenges of COVID-19 datasets and beyond
By making COVID-19 data open and available in BigQuery, the partnership has enabled researchers and public health officials to better understand, study, and analyze the impact of this disease.