Google Cloud Big Data and Machine Learning Blog Innovation in data processing and machine learning technology

New healthcare and population datasets now available in Google BigQuery

Friday, May 19, 2017

By Juan Marquez, M.D., Brain Clinical Fellow, and Ming Jack Po, M.D., Ph.D., Brain and Cloud Clinical Product Manager

We’ve just added several publicly available healthcare datasets to the collection of public datasets on Google BigQuery (the cloud-native data warehouse for analytics at petabyte scale), including RxNorm (maintained by NLM) and the Healthcare Common Procedure Coding System (HCPCS) Level II. While it’s not technically a healthcare dataset, we also added the 2000 and 2010 Decennial census counts broken down by age, gender and zip code tabular areas, which we hope will assist healthcare utilization and population health analysis (as we’ll discuss below). Anyone with a Google Cloud Platform (GCP) account can explore these datasets. You can also find relevant sample queries on the documentation pages for RxNorm, diagnosis and procedure coding, and census datasets, including:

What are all the different forms (pill, powder, etc.) in which a drug ingredient can be found?
What are the RXCUI codes of the related ingredients when mapping from drug brand names, generic pack, semantic branded drug, etc.?
What are the most populous ZIP code tabular areas?
What are the ZIP codes with the highest number of physicians per capita?

This is only a tiny subset of all the questions you can answer using these datasets. (As with all BigQuery public datasets, this data is fully anonymized/contains no personally-identifying information.)

Advantages of hosting the datasets on BigQuery

You might be wondering: What’s so special about querying these data sets with BigQuery? After all, these datasets are already available to the public and there are free tools available (Census Data Tools, RxNav) to explore them. To answer this question, we’d like to focus on two datasets: RxNorm and U.S. Census Data (joined with Medicare data also available on BigQuery).RxNorm

RxNorm was created by the U.S. National Library of Medicine (NLM) to provide a normalized naming system for clinical drugs and provide structured information such as brand names, ingredients and so on for each drug. Drug information is made available as a single “concepts” table while the relationships that map entities to each other (ingredient to brand name, for example) is made available as a separate “relationships” table. Navigation across these relationships requires more than 200 pathways.

NLM provides very helpful tools on its website including a web interface for exploring these relationships and APIs to assist with this mapping. These tools, however, are more difficult to use when mapping entire datasets of drugs.

To improve the experience for users, Google’s Brain and Healthcare Cloud teams worked with NLM to replicate these numerous pathways, which required more than 700 joins between tables. Simple pathways such as Brand Name (BN) to Ingredient (IN) require two tables to be joined but more complex pathways (such as PIN => SCDC => SCD => SBD => SBDG) require up to 9 tables to be joined. These pathways were combined into a large final table (rxn_all_pathways) that comprises approximately 1.87 million rows. This image reflects the amount of work required to build this table:

Table structure showing some detail of individual joins

The final outcome of all this work is that a user can enter the names or codes (RXCUI) of any type of drug (brand name, ingredient, generic pack, etc.) and return the names/codes of the desired drug type(s). For example, let’s find out what the ingredients are for the following list of drugs:

SELECT
  *
FROM 
  [bigquery-public-data:nlm_rxnorm.rxn_all_pathways_current]
WHERE 
  SOURCE_NAME IN (
    'Tylenol',
    'Ketorolac',
    'Motrin',
    'Lidocaine Hydrochloride 0.04 MG/MG / Menthol 0.01 MG/MG Medicated Patch')
AND
  TARGET_TTY= 'IN'
ORDER BY
  SOURCE_RXCUI

SOURCE_RXCUI	SOURCE_TTY	SOURCE_NAME	TARGET_RXCUI	TARGET_TTY	TARGET_NAME
1373130	SCD	Lidocaine Hydrochloride 0.04 MG/MG / Menthol 0.01 MG/MG Medicated Patch	6387	IN	Lidocaine
1373130	SCD	Lidocaine Hydrochloride 0.04 MG/MG / Menthol 0.01 MG/MG Medicated Patch	6750	IN	Menthol
202433	BN	Tylenol	161	IN	Acetaminophen
202488	BN	Motrin	5640	IN	Ibuprofen
35827	IN	Ketorolac	35827	IN	Ketorolac

This query mapped the names of drugs (belonging to a wide range of types of drug entities) to the names and codes of their related ingredients (TTY=IN). These names could have been mapped to dose form (DF) or any type just as easily. While this query appears to be simple, that’s only because of the hundreds of joins that went into building this final table.

All the tables, including the rxn_all_pathways table, are updated monthly so they always provide the most up-to-date information. However, as some of the drugs are deprecated or change each month, we also include all the archived tables so that once you map your database of drugs, you can continue to use that same table for as long as you’d like.

You can learn more about how to use this dataset on our documentation page here.

U.S. Census and Medicare data

The second dataset we would like to highlight is the United States Census data. The United States census count (also known as the Decennial Census of Population and Housing) is a count of every resident in the U.S. that occurs every 10 years by the United States Census Bureau. While this data is publicly available, it’s typically only as graphs and summary tables: The raw data is often difficult to find and even when available, it’s typically organized into tables divided by region (thus, analyzing data across the whole U.S. requires additional processing).

The U.S. Census Bureau kindly provided us with two very interesting U.S. census datasets. These datasets (one for 2000 and another for 2010) break down the entire U.S. population by age, gender, ZIP code tabular areas (ZCTAs), and GEOIDs. GEOIDs are numeric codes that uniquely identify all administrative/legal and statistical geographic areas for which the Census Bureau tabulates data, and are useful for correlating this data with other censuses and surveys. ZCTAs are generalized representations of ZIP codes, and often, though not always, are the same as the ZIP code for an area.

It’s often more helpful to understand locations in the context of cities and states rather than ZIP codes and ZCTAs. Looking up each ZIP code individually is a cumbersome process. Therefore, in addition to the census datasets, we have acquired mapping tables from the Census Bureau that link the ZCTAs to the appropriate city and state. By joining these mapping tables, we’ve expanded the census datasets to include the city and state.

These U.S. census datasets are available on BigQuery and easily allow for analysis of population by gender, age, ZIP code, city and state. Using these datasets you can answer the questions such as, what are the most populous ZIP codes in the US? Or, which cities grew the most over the last two census counts?

Although those questions are extremely interesting, the power of using these census datasets is magnified when combined with other datasets that are either available publicly on BigQuery or that you chose to upload.

One such example can be illustrated by joining the U.S. census data with Medicare datasets already available on BigQuery, courtesy of the Center for Medicare and Medicaid Services (CMS). These include data on inpatient and outpatient services, prescription drugs and this table, which provides information on the services and procedures provided to Medicare beneficiaries by physicians and other healthcare professionals in 2012. The table also conveniently lists the geographic location of these professionals’ practices.

For example, by joining this data with the latest 2010 U.S. census data, we analyzed how the number of physicians (M.D.’s / D.O.’s) located in each state compared to the population found in these states, specifically looking at the states with the highest ratio of physicians to population:

#standardSQL
SELECT
  state,
  provider_count,
  population_count,
  ROUND(provider_count/ NULLIF(population_count,
      0),3) AS ratio
FROM (
  SELECT
    t3.state_code AS state,
    SUM(t4.provider_count) AS provider_count,
    SUM(t4.pop) AS population_count
  FROM
    `bigquery-public-data.utility_us.zipcode_area` AS t3
  INNER JOIN (
    SELECT
      t1.zip5 AS zip5,
      t1.provider_cnt AS provider_count,
      t2.population AS pop
    FROM (
      SELECT
        CASE
          WHEN LENGTH(nppes_provider_zip)=5 THEN nppes_provider_zip
          WHEN LENGTH(nppes_provider_zip)=9 THEN SUBSTR(nppes_provider_zip,0,5)
          ELSE '0'
        END AS zip5,
        COUNT(*) AS provider_cnt
      FROM
        `bigquery-public-data.cms_medicare.physicians_and_other_supplier_2012`
      WHERE
        REGEXP_CONTAINS(nppes_credentials, r'(\W|^)[mM]\.*[Dd]')
      GROUP BY
        zip5 ) AS t1
    INNER JOIN (
      SELECT
        population,
        zipcode
      FROM
        `bigquery-public-data.census_bureau_usa.population_by_zip_2010`
      WHERE
        gender ='') AS t2
    ON
      t2.zipcode=t1.zip5) AS t4
  ON
    t3.zipcode=t4.zip5
  GROUP BY
    state_code)
WHERE
  REGEXP_CONTAINS(state, r'^[a-zA-Z][a-zA-Z]$')--Include only zipcodes that are within single state
ORDER BY
  ratio DESC
LIMIT
  10

state	provider_count	population_count
SD	23911	575802
AL	135784	3796925
ND	17159	503949
ME	31025	916108
AR	78058	2387739
TN	179852	5451597
NE	46772	1423094
WV	40829	1308309
KY	109155	3648989
SC	119330	4018575

This is simply a starting point in the analysis; there are several caveats to these results, including the difference in the years of the datasets used in the comparison and the fact that Medicare data doesn’t take all physicians into account. Furthermore, the census ZIP codes are only approximate, so aggregation of ZIP codes into larger areas such as Hospital Service Areas (HSAs) may provide a more accurate analysis. In addition, the query was limited to only include ZIP codes that are fully contained within a single state so as to make the results more understandable.

However, as this example shows, providing access to current U.S. census data without the need for additional processing or data cleanup allows researchers and healthcare organizations to easily gain insights into the ever-growing publicly available healthcare datasets available on BigQuery and/or into their own data uploaded onto BigQuery.

Next steps

If you haven’t tried BigQuery, follow this tutorial to learn more about using it (includes 10GB of free storage and 1TB of free queries). Otherwise, you can jump right into exploring these healthcare datasets!

Source:

https://cloud.google.com/blog/big-data/2017/05/new-healthcare-and-population-datasets-now-available-in-google-bigquery

Select Your Style