vignettes/analyzing-census-data.Rmd
By the end of this vignette, you will be able to work with the `datacommons` R package.

Census data provides an ideal introduction to Data Commons for several reasons:
- Familiar territory: Many analysts have worked with census data, making it easier to focus on learning the Data Commons approach rather than the data itself.
- Rich relationships: Census data showcases the power of the knowledge graph through natural hierarchies (country → state → county → city), demonstrating how Data Commons connects entities.
- Integration opportunities: Census demographics become even more valuable when combined with health, environmental, and economic data, showing the true power of Data Commons.
- Real-world relevance: The examples we’ll explore address actual policy questions that require integrated data to answer properly.
The R ecosystem has excellent packages for specific data sources. For example, the tidycensus package provides fantastic functionality for working with U.S. Census data, with deep dataset-specific features and conveniences.
So why use Data Commons? The real value is in data integration.
Data Commons is part of Google’s philanthropic initiatives, designed to democratize access to public data by combining datasets from organizations like the UN, World Bank, and U.S. Census into a unified knowledge graph.
Imagine you’re a policy analyst studying the social determinants of health. You need to analyze relationships between:
With traditional approaches, you’d need to:
Data Commons solves this by providing a unified knowledge graph that links all these datasets together. One API, one set of geographic identifiers, one consistent way to access everything. The `datacommons` R package is your gateway to this integrated data ecosystem, enabling reproducible analysis pipelines that seamlessly combine diverse data sources.
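As a rough illustration of why a single set of identifiers matters, the sketch below joins two small tables on a shared place identifier. The data frames, column names, and values are hypothetical placeholders invented for this example and are not Data Commons output. `geoId/48` is assumed here to be Texas, based on its state FIPS code.

```r
# Two hypothetical tables keyed by the same place identifier.
# All values are made-up placeholders, not real statistics.
demographics <- data.frame(
  place_dcid = c("geoId/06", "geoId/48"),
  placeholder_demographic = c(1, 2),
  stringsAsFactors = FALSE
)
health <- data.frame(
  place_dcid = c("geoId/48", "geoId/06"),
  placeholder_health = c(20, 10),
  stringsAsFactors = FALSE
)

# Because both tables use one identifier scheme, a single merge aligns them,
# regardless of the row order in either source
combined <- merge(demographics, health, by = "place_dcid")
print(combined)
```

Without shared identifiers, the same join would first require reconciling each source's own place codes.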
Data Commons organizes information as a graph, similar to how web pages link to each other. Here’s the key terminology:
Every entity in Data Commons has a unique identifier called a DCID. Think of it like a social security number for data:
- `country/USA` = United States
- `geoId/06` = California (using FIPS code 06; Federal Information Processing Standards codes are essentially ZIP codes for governments, providing standard numeric identifiers for states, counties, and other geographic areas)
- `Count_Person` = the statistical variable for population count

Entities are connected by relationships, following the Schema.org standard, a collaborative effort to create structured data vocabularies that help machines understand web content. For Data Commons, this means consistent, machine-readable relationships between places and data:
- California --(`containedInPlace`)--> United States
- California --(`typeOf`)--> State
- A county in California --(`containedInPlace`)--> California

This structure lets us traverse the graph to find related information. Want all counties in California? Follow the `containedInPlace` relationships backward.
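To make backward traversal concrete, here is a minimal sketch that stores a few relationships as a local edge list and looks up everything pointing at California. It does not call Data Commons. The `edges` table and the `incoming()` helper are hypothetical names made up for this example, and the two county identifiers (`geoId/06037`, `geoId/06075`) are assumed to follow the same `geoId/` plus FIPS pattern as the state.

```r
# A toy edge list: each row is one subject --(predicate)--> object relationship
edges <- data.frame(
  subject = c("geoId/06", "geoId/06",
              "geoId/06037", "geoId/06037",
              "geoId/06075", "geoId/06075"),
  predicate = c("containedInPlace", "typeOf",
                "containedInPlace", "typeOf",
                "containedInPlace", "typeOf"),
  object = c("country/USA", "State",
             "geoId/06", "County",
             "geoId/06", "County"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: follow a relationship backward, i.e. return every
# subject that points at `target` through `predicate`
incoming <- function(edges, target, predicate) {
  edges$subject[edges$predicate == predicate & edges$object == target]
}

# Places contained in California, restricted to those typed as County
in_california <- incoming(edges, "geoId/06", "containedInPlace")
counties <- intersect(in_california, incoming(edges, "County", "typeOf"))
print(counties)
```

The real knowledge graph is far larger and is queried through the API, but the lookup has the same shape: fix the object and the relationship, then collect the subjects.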
These are the things we can measure:
- `Count_Person` = total population
- `Median_Age_Person` = median age
- `UnemploymentRate_Person` = unemployment rate
- `Mean_Temperature` = average temperature

The power comes from being able to query any variable for any place using the same consistent approach through the `datacommons` R package.
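The "any variable for any place" idea can be sketched offline by pairing variable identifiers with place identifiers. The code below only builds the combinations and makes no API call. `state_dcid()` is a hypothetical helper written for this example, and FIPS code 48 is assumed to be Texas.

```r
# Hypothetical helper: build a state DCID from a numeric FIPS code,
# zero-padding to two digits (6 becomes "geoId/06")
state_dcid <- function(fips) {
  paste0("geoId/", sprintf("%02d", as.integer(fips)))
}

places <- state_dcid(c(6, 48))
variables <- c("Count_Person", "Median_Age_Person")

# Pair every variable with every place: one row per value we would ask for
requests <- expand.grid(
  variable_dcid = variables,
  place_dcid = places,
  stringsAsFactors = FALSE
)
print(requests)
```

Adding a variable or a place only adds rows to this grid; the way a request is expressed stays the same.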
You’ll need a free API key from https://apikeys.datacommons.org/
```r
library(datacommons)

# Set your API key
dc_set_api_key("YOUR_API_KEY_HERE")

# Or manually set DATACOMMONS_API_KEY in your .Renviron file
```
After setting the key, restart your R session so that the key is loaded automatically.
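One way to confirm that the key was picked up after the restart is to check the environment variable directly. This is a minimal sketch using only base R; `has_dc_key()` is a hypothetical helper defined here, not a function from the `datacommons` package.

```r
# Hypothetical helper: TRUE when DATACOMMONS_API_KEY holds a non-empty value
has_dc_key <- function() {
  nzchar(Sys.getenv("DATACOMMONS_API_KEY"))
}

if (has_dc_key()) {
  message("Data Commons API key found.")
} else {
  message("No API key found; set DATACOMMONS_API_KEY and restart R.")
}
```

The check only inspects whether a value is present; it does not verify that the key is valid.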