Google Cloud partners with SADA to build COVID-19 public dataset pipeline

AT A GLANCE

SADA and Google Cloud created a series of public domain datasets to help researchers, data scientists, and analysts better understand the spread of COVID-19.

INDUSTRY

Software & Technology

DEVELOPED

New public datasets and modern pipeline architecture

ENABLED

Access to reusable code for legacy data sources

Data always plays a critical role in the ability to research, study, and combat public health emergencies, and nowhere is this more true than in the case of a global pandemic. Access to datasets, and to tools that can analyze that data at cloud scale, is increasingly essential to the research process and has been particularly useful in the global response to the novel coronavirus.

The Google Cloud Public Datasets Program (PDP) facilitates access to nearly 150 high-demand datasets from different industry verticals, and new datasets are added continually. These datasets are onboarded and maintained by Google Cloud, with input and guidance from a variety of data providers, such as the Census Bureau, the National Weather Service, and the U.S. Geological Survey.

Additional examples include the National Water Model, detailing information about flooding and water movement across the continental United States; Broad References, containing human genomics reference files used for sequencing analytics; and the Global Surface Summary of the Day, which provides daily meteorological observations from weather stations around the world, going back more than 100 years.

This powerful resource is a playground for analysts and data scientists to unlock new insights from their own data by contextualizing it with data provided by the PDP. It helps create a more complete picture of customers, patients, or products by linking data to vast public datasets with a single line of code. That’s the power of PDP.

“We pull data from a lot of public sources that you normally would have to research, source, clean, prep, correctly format, and download,” says Michael Hamamoto Tribble, Head of Google Cloud Datasets. “Traditionally, that data would need to be sorted and uploaded to a database before end-users could work with it.” 

“Our Public Datasets Program takes care of all the prep work, wrangling, cleaning, and aligning the data for easy access in BigQuery tables. Also, it can easily connect different databases together. It’s remarkable.”

Michael Hamamoto Tribble | Head of Google Cloud Datasets

Business challenge

To help organizations adapt and meet their customers’ changing needs during the COVID-19 pandemic, SADA and Google Cloud set out to create a series of public domain COVID-19 datasets to aid researchers, data scientists, and analysts in developing data-driven models to better understand the spread of COVID-19.

Knowing that COVID-19 datasets would evolve rapidly and at high volume, SADA and Google Cloud wanted to update the PDP infrastructure to address the unique challenges of these datasets, such as increasing capabilities for data validation and alerts and implementing quality controls to ensure datasets remain up to date.

Solution

SADA, a Google Cloud Premier Partner and three-time Google Cloud Reseller Partner of the Year, partnered with Google Cloud to develop COVID-19 dataset pipelines for ten states, including a number of key states, while also leveraging information from organizations such as the American Hospital Association and The COVID Tracking Project.

SADA developed a framework to obtain the required information and to autogenerate specific schemas and tables that could be easily applied to the remaining forty state government health-related websites.
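As a rough illustration of what autogenerating a schema can involve, the sketch below infers BigQuery-style column names, types, and modes from a handful of sample records. It is not SADA's framework: the function names (`infer_type`, `merge_types`, `infer_schema`), the sample field names, and the rule that conflicting types fall back to STRING are all assumptions made for this example.

```python
from typing import Any, Dict, List


def infer_type(value: Any) -> str:
    """Map a Python value to a BigQuery-style type name (illustrative mapping)."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "INTEGER"
    if isinstance(value, float):
        return "FLOAT"
    return "STRING"


def merge_types(current: str, observed: str) -> str:
    """Reconcile two observed types for the same column."""
    if current == observed:
        return current
    if {current, observed} == {"INTEGER", "FLOAT"}:
        return "FLOAT"  # widen integers to floats
    return "STRING"  # assumed fallback for any other conflict


def infer_schema(records: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Infer one schema entry per field seen in the sample records."""
    types: Dict[str, str] = {}
    non_null: Dict[str, int] = {}
    for record in records:
        for name, value in record.items():
            non_null.setdefault(name, 0)
            if value is None:
                continue
            non_null[name] += 1
            observed = infer_type(value)
            types[name] = merge_types(types[name], observed) if name in types else observed
    return [
        {
            "name": name,
            # Fields that were only ever null default to STRING.
            "type": types.get(name, "STRING"),
            # REQUIRED only if every sample record had a non-null value.
            "mode": "REQUIRED" if count == len(records) else "NULLABLE",
        }
        for name, count in non_null.items()
    ]


# Hypothetical sample rows, as might be scraped from a state health website.
sample = [
    {"state": "CA", "cases": 100, "positivity": 5},
    {"state": "NY", "cases": 200, "positivity": 7.5, "note": None},
    {"state": "WA", "cases": None, "positivity": 6.1},
]
schema = infer_schema(sample)
```

A schema produced this way could then be reviewed and reused for each additional state source, which is the kind of repeatability the framework is described as providing.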

Through this partnership, SADA’s team of technical and professional services experts worked with Google Cloud to design and implement new backend data pipelines for the PDP to capture data from public data sources. SADA developed a reference implementation for the COVID-19 dataset, with reusable code for refactoring the existing PDP pipeline for other datasets in a fully automated way that requires minimal configuration.

Results

This project required a quick turnaround due to the emerging nature of the pandemic, with customers’ needs changing virtually overnight. Meeting three times a week, SADA and Google Cloud technical teams worked closely together to deliver the comprehensive COVID-19 dataset in only 90 days.

SADA also delivered the new standardized pipeline framework as a reference implementation that would enable customers to publish work on a small virtual machine running Python code, on Dataflow, or on Kubernetes, depending on the size of the dataset to be loaded into BigQuery.

“Now anyone can refer to the ingestion framework and build pipelines for collecting data from disparate sources and push data into BigQuery by providing endpoints and a few configuration details, requiring only minimal coding efforts.”

Michael Hamamoto Tribble | Head of Google Cloud Datasets
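To make the idea of building a pipeline from endpoints and a few configuration details more concrete, here is a minimal sketch of what such a configuration might look like, along with a check that picks a runtime based on dataset size. Everything in it is hypothetical: the `PipelineConfig` fields, the size cutoffs, the runner labels, the ordering of Dataflow and Kubernetes, and the example URL and table name are placeholders for illustration, not values from the actual framework.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical size cutoffs in megabytes; the real framework's criteria are not published here.
SMALL_VM_MAX_MB = 500
DATAFLOW_MAX_MB = 50_000


@dataclass(frozen=True)
class PipelineConfig:
    name: str
    endpoint: str             # URL of the public data source
    destination_table: str    # BigQuery table as "project.dataset.table"
    expected_size_mb: float   # rough size of one load


def validate(config: PipelineConfig) -> List[str]:
    """Return a list of human-readable problems; empty means the config looks usable."""
    problems = []
    if not config.endpoint.startswith(("http://", "https://")):
        problems.append("endpoint must be an HTTP(S) URL")
    if len(config.destination_table.split(".")) != 3:
        problems.append("destination_table must look like project.dataset.table")
    if config.expected_size_mb < 0:
        problems.append("expected_size_mb must not be negative")
    return problems


def choose_runner(config: PipelineConfig) -> str:
    """Pick a runtime by dataset size (assumed ordering: VM, then Dataflow, then Kubernetes)."""
    if config.expected_size_mb <= SMALL_VM_MAX_MB:
        return "python-vm"
    if config.expected_size_mb <= DATAFLOW_MAX_MB:
        return "dataflow"
    return "kubernetes"


# Example configuration with placeholder values.
config = PipelineConfig(
    name="state_daily_cases",
    endpoint="https://example.com/covid/daily.csv",
    destination_table="my-project.covid19.state_daily_cases",
    expected_size_mb=12.5,
)
problems = validate(config)
runner = choose_runner(config)
```

Keeping the source endpoint, destination, and sizing in one declarative object means a new data source needs only a new configuration entry rather than new pipeline code.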

SADA also achieved advanced automation by implementing the underlying infrastructure as code, in addition to the data pipeline, making it simple to replicate the complete environment in support of customers’ evolving needs.

Overall, SADA’s partnership with Google Cloud helped customers by:

  • Delivering the public domain COVID-19 Open Data datasets in only 90 days
  • Accelerating a shared understanding of how coronavirus spreads
  • Developing reusable code to easily support legacy data sources
  • Replacing the existing PDP pipeline with a modern architecture to support the unique challenges of COVID-19 datasets and beyond

By making COVID-19 data open and available in BigQuery, researchers and public health officials have been able to better understand, study, and analyze the impact of this disease.

“The team at SADA played an invaluable role developing data pipelines in support of customers. Additionally, our work together serves as a framework that can facilitate processing of other datasets that we want to onboard for this project–as well as some potential legacy datasets.”

Michael Hamamoto Tribble | Head of Google Cloud Datasets
