New healthcare and population datasets now available in Google BigQuery
By Juan Marquez, M.D., Brain Clinical Fellow, and Ming Jack Po, M.D., Ph.D., Brain and Cloud Clinical Product Manager
We’ve just added several publicly available healthcare datasets to the collection of public datasets on Google BigQuery (the cloud-native data warehouse for analytics at petabyte scale), including RxNorm (maintained by NLM) and the Healthcare Common Procedure Coding System (HCPCS) Level II. While it’s not technically a healthcare dataset, we also added the 2000 and 2010 Decennial Census counts broken down by age, gender and ZIP Code Tabulation Area (ZCTA), which we hope will assist healthcare utilization and population health analysis (as we’ll discuss below). Anyone with a Google Cloud Platform (GCP) account can explore these datasets. You can also find relevant sample queries on the documentation pages for RxNorm, diagnosis and procedure coding, and census datasets, including:
- What are all the different forms (pill, powder, etc.) in which a drug ingredient can be found?
- What are the RXCUI codes of the related ingredients when mapping from drug brand names, generic pack, semantic branded drug, etc.?
- What are the most populous ZIP Code Tabulation Areas?
- What are the ZIP codes with the highest number of physicians per capita?
This is only a tiny subset of all the questions you can answer using these datasets. (As with all BigQuery public datasets, this data is fully anonymized/contains no personally-identifying information.)
Advantages of hosting the datasets on BigQuery
You might be wondering: What’s so special about querying these datasets with BigQuery? After all, these datasets are already available to the public and there are free tools available (Census Data Tools, RxNav) to explore them. To answer this question, we’d like to focus on two datasets: RxNorm and U.S. Census Data (joined with Medicare data also available on BigQuery).
RxNorm
RxNorm was created by the U.S. National Library of Medicine (NLM) to provide a normalized naming system for clinical drugs and to provide structured information such as brand names, ingredients and so on for each drug. Drug information is made available as a single “concepts” table, while the relationships that map entities to each other (ingredient to brand name, for example) are made available as a separate “relationships” table. Navigation across these relationships requires more than 200 pathways.
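To make the concepts/relationships layout described above concrete, here is a minimal sketch of a single Brand Name (BN) to Ingredient (IN) pathway. The two inline tables, their column names (rxcui, tty, name, rxcui1, rxcui2) and their contents are hypothetical stand-ins rather than the actual BigQuery schema; only the RXCUI codes and names are taken from the query results shown later in this post.

```sql
#standardSQL
-- Toy stand-ins for the "concepts" and "relationships" tables (hypothetical schema).
WITH concepts AS (
  SELECT '202433' AS rxcui, 'BN' AS tty, 'Tylenol' AS name UNION ALL
  SELECT '161', 'IN', 'Acetaminophen' UNION ALL
  SELECT '202488', 'BN', 'Motrin' UNION ALL
  SELECT '5640', 'IN', 'Ibuprofen'
),
relationships AS (
  -- Each row links one concept (rxcui1) to a related concept (rxcui2).
  SELECT '202433' AS rxcui1, '161' AS rxcui2 UNION ALL
  SELECT '202488', '5640'
)
SELECT
  src.rxcui AS source_rxcui,
  src.tty AS source_tty,
  src.name AS source_name,
  tgt.rxcui AS target_rxcui,
  tgt.tty AS target_tty,
  tgt.name AS target_name
FROM
  relationships AS rel
JOIN
  concepts AS src
ON
  src.rxcui = rel.rxcui1
JOIN
  concepts AS tgt
ON
  tgt.rxcui = rel.rxcui2
WHERE
  src.tty = 'BN'    -- start from brand names
  AND tgt.tty = 'IN'  -- end at ingredients
ORDER BY
  source_name
```

Longer pathways repeat the same pattern, adding one more hop through the relationships table for each intermediate term type, which is why the number of joins grows so quickly.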
NLM provides very helpful tools on its website including a web interface for exploring these relationships and APIs to assist with this mapping. These tools, however, are more difficult to use when mapping entire datasets of drugs.
To improve the experience for users, Google’s Brain and Healthcare Cloud teams worked with NLM to replicate these numerous pathways, which required more than 700 joins between tables. Simple pathways such as Brand Name (BN) to Ingredient (IN) require two tables to be joined but more complex pathways (such as PIN => SCDC => SCD => SBD => SBDG) require up to 9 tables to be joined. These pathways were combined into a large final table (rxn_all_pathways) that comprises approximately 1.87 million rows. This image reflects the amount of work required to build this table:
The final outcome of all this work is that a user can enter the names or codes (RXCUI) of any type of drug (brand name, ingredient, generic pack, etc.) and return the names/codes of the desired drug type(s). For example, let’s find out what the ingredients are for the following list of drugs:
#legacySQL
SELECT
  *
FROM
  [bigquery-public-data:nlm_rxnorm.rxn_all_pathways_current]
WHERE
  SOURCE_NAME IN (
    'Tylenol',
    'Ketorolac',
    'Motrin',
    'Lidocaine Hydrochloride 0.04 MG/MG / Menthol 0.01 MG/MG Medicated Patch')
  AND TARGET_TTY = 'IN'
ORDER BY
  SOURCE_RXCUI
SOURCE_RXCUI | SOURCE_TTY | SOURCE_NAME | TARGET_RXCUI | TARGET_TTY | TARGET_NAME |
1373130 | SCD | Lidocaine Hydrochloride 0.04 MG/MG / Menthol 0.01 MG/MG Medicated Patch | 6387 | IN | Lidocaine |
1373130 | SCD | Lidocaine Hydrochloride 0.04 MG/MG / Menthol 0.01 MG/MG Medicated Patch | 6750 | IN | Menthol |
202433 | BN | Tylenol | 161 | IN | Acetaminophen |
202488 | BN | Motrin | 5640 | IN | Ibuprofen |
35827 | IN | Ketorolac | 35827 | IN | Ketorolac |
This query mapped the names of drugs (belonging to a wide range of types of drug entities) to the names and codes of their related ingredients (TTY=IN). These names could have been mapped to dose form (DF) or any type just as easily. While this query appears to be simple, that’s only because of the hundreds of joins that went into building this final table.
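As a sketch of that last point, the same lookup can target dose forms instead of ingredients by changing only the TARGET_TTY filter. This version is written in standard SQL and assumes the table is still published under the same name and that BN-to-DF rows exist for these brand names; the selected columns are the ones shown in the result above.

```sql
#standardSQL
SELECT
  SOURCE_NAME,
  SOURCE_TTY,
  TARGET_RXCUI,
  TARGET_NAME
FROM
  `bigquery-public-data.nlm_rxnorm.rxn_all_pathways_current`
WHERE
  SOURCE_NAME IN ('Tylenol', 'Motrin')
  AND TARGET_TTY = 'DF'  -- dose form (pill, powder, etc.) rather than ingredient
ORDER BY
  SOURCE_NAME,
  TARGET_NAME
```

Swapping 'DF' for another term type code (for example 'SBD' for semantic branded drug) should retarget the mapping in the same way.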
All the tables, including the rxn_all_pathways table, are updated monthly so they always provide the most up-to-date information. However, as some of the drugs are deprecated or change each month, we also include all the archived tables so that once you map your database of drugs, you can continue to use that same table for as long as you’d like.
You can learn more about how to use this dataset on our documentation page.
U.S. Census and Medicare data
The second dataset we would like to highlight is the United States Census data. The United States census count (also known as the Decennial Census of Population and Housing) is a count of every resident in the U.S., conducted every 10 years by the United States Census Bureau. While this data is publicly available, it’s typically published only as graphs and summary tables. The raw data is often difficult to find, and even when available it’s typically organized into tables divided by region (thus, analyzing data across the whole U.S. requires additional processing).
The U.S. Census Bureau kindly provided us with two very interesting U.S. census datasets. These datasets (one for 2000 and another for 2010) break down the entire U.S. population by age, gender, ZIP Code Tabulation Areas (ZCTAs), and GEOIDs. GEOIDs are numeric codes that uniquely identify all administrative/legal and statistical geographic areas for which the Census Bureau tabulates data, and are useful for correlating this data with other censuses and surveys. ZCTAs are generalized representations of ZIP codes, and often, though not always, are the same as the ZIP code for an area.
It’s often more helpful to understand locations in the context of cities and states rather than ZIP codes and ZCTAs. Looking up each ZIP code individually is a cumbersome process. Therefore, in addition to the census datasets, we have acquired mapping tables from the Census Bureau that link the ZCTAs to the appropriate city and state. By joining these mapping tables, we’ve expanded the census datasets to include the city and state.
These U.S. census datasets are available on BigQuery and make it easy to analyze population by gender, age, ZIP code, city and state. Using these datasets, you can answer questions such as: What are the most populous ZIP codes in the U.S.? Or, which cities grew the most over the last two census counts?
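The first of those questions can be sketched as a short query. It uses only the table and columns that appear in the Medicare query later in this post, and it carries over that query's assumption that rows with an empty gender value hold the total population for a ZIP code; treat both as assumptions to verify against the dataset's documentation.

```sql
#standardSQL
SELECT
  zipcode,
  population
FROM
  `bigquery-public-data.census_bureau_usa.population_by_zip_2010`
WHERE
  gender = ''  -- assumed to mark the all-gender total row for each ZIP code
ORDER BY
  population DESC
LIMIT
  10
```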
Although those questions are extremely interesting, the power of using these census datasets is magnified when combined with other datasets that are either available publicly on BigQuery or that you choose to upload.
One such example can be illustrated by joining the U.S. census data with Medicare datasets already available on BigQuery, courtesy of the Centers for Medicare & Medicaid Services (CMS). These include data on inpatient and outpatient services, prescription drugs, and a table that provides information on the services and procedures provided to Medicare beneficiaries by physicians and other healthcare professionals in 2012. The table also conveniently lists the geographic location of these professionals’ practices.
For example, by joining this data with the latest 2010 U.S. census data, we analyzed how the number of physicians (M.D.’s / D.O.’s) located in each state compared to the population found in these states, specifically looking at the states with the highest ratio of physicians to population:
#standardSQL
SELECT
  state,
  provider_count,
  population_count,
  ROUND(provider_count / NULLIF(population_count, 0), 3) AS ratio
FROM (
  -- Roll the per-ZIP counts up to the state level
  SELECT
    t3.state_code AS state,
    SUM(t4.provider_count) AS provider_count,
    SUM(t4.pop) AS population_count
  FROM
    `bigquery-public-data.utility_us.zipcode_area` AS t3
  INNER JOIN (
    SELECT
      t1.zip5 AS zip5,
      t1.provider_cnt AS provider_count,
      t2.population AS pop
    FROM (
      -- Normalize provider ZIP codes to five digits and count matching rows per ZIP
      SELECT
        CASE
          WHEN LENGTH(nppes_provider_zip) = 5 THEN nppes_provider_zip
          WHEN LENGTH(nppes_provider_zip) = 9 THEN SUBSTR(nppes_provider_zip, 1, 5)
          ELSE '0'
        END AS zip5,
        COUNT(*) AS provider_cnt
      FROM
        `bigquery-public-data.cms_medicare.physicians_and_other_supplier_2012`
      WHERE
        REGEXP_CONTAINS(nppes_credentials, r'(\W|^)[mM]\.*[Dd]')
      GROUP BY
        zip5) AS t1
    INNER JOIN (
      -- 2010 census population per ZIP code
      SELECT
        population,
        zipcode
      FROM
        `bigquery-public-data.census_bureau_usa.population_by_zip_2010`
      WHERE
        gender = '') AS t2
    ON
      t2.zipcode = t1.zip5) AS t4
  ON
    t3.zipcode = t4.zip5
  GROUP BY
    state_code)
WHERE
  REGEXP_CONTAINS(state, r'^[a-zA-Z][a-zA-Z]$')  -- Include only ZIP codes that lie within a single state
ORDER BY
  ratio DESC
LIMIT
  10
state | provider_count | population_count |
SD | 23911 | 575802 |
AL | 135784 | 3796925 |
ND | 17159 | 503949 |
ME | 31025 | 916108 |
AR | 78058 | 2387739 |
TN | 179852 | 5451597 |
NE | 46772 | 1423094 |
WV | 40829 | 1308309 |
KY | 109155 | 3648989 |
SC | 119330 | 4018575 |
This is simply a starting point in the analysis; there are several caveats to these results, including the difference in the years of the datasets used in the comparison and the fact that Medicare data doesn’t take all physicians into account. Furthermore, the census ZIP codes are only approximate, so aggregation of ZIP codes into larger areas such as Hospital Service Areas (HSAs) may provide a more accurate analysis. In addition, the query was limited to only include ZIP codes that are fully contained within a single state so as to make the results more understandable.
However, as this example shows, providing access to current U.S. census data without the need for additional processing or data cleanup allows researchers and healthcare organizations to easily gain insights into the ever-growing collection of public healthcare datasets on BigQuery and/or into their own data uploaded to BigQuery.
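When adapting the physician query above to your own data, it can help to check how its two helper expressions treat individual values. The sketch below runs the same credential pattern and the same ZIP normalization over a few made-up strings; the sample rows are purely illustrative and do not come from the Medicare table.

```sql
#standardSQL
-- Toy rows only: these credential and ZIP strings are invented for illustration.
WITH samples AS (
  SELECT 'M.D.' AS credentials, '57101' AS zip UNION ALL
  SELECT 'MD', '571011234' UNION ALL
  SELECT 'D.O.', '5710' UNION ALL
  SELECT 'DMD', '57101'
)
SELECT
  credentials,
  -- Same pattern as the query above: an "M" at the start of the string or after
  -- a non-word character, then optional periods, then a "D".
  REGEXP_CONTAINS(credentials, r'(\W|^)[mM]\.*[Dd]') AS matches_pattern,
  zip,
  -- Keep 5-digit ZIPs, truncate 9-digit ZIPs to 5 digits, map anything else to '0'.
  CASE
    WHEN LENGTH(zip) = 5 THEN zip
    WHEN LENGTH(zip) = 9 THEN SUBSTR(zip, 1, 5)
    ELSE '0'
  END AS zip5
FROM
  samples
```

On these rows the pattern matches 'M.D.' and 'MD' but not 'D.O.' or 'DMD', and the 4-character ZIP falls into the '0' bucket, so it never joins to a census row. Both behaviors are worth keeping in mind when interpreting provider_count.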
Next steps
If you haven’t tried BigQuery, follow this tutorial to learn more about using it (includes 10GB of free storage and 1TB of free queries). Otherwise, you can jump right into exploring these healthcare datasets!
Source:
https://cloud.google.com/blog/big-data/2017/05/new-healthcare-and-population-datasets-now-available-in-google-bigquery