Cancer Data Analysis in R

Cancer is "a disease in which some of the body’s cells grow uncontrollably and spread to other parts of the body" (National Cancer Institute 2021). There are many different types of cancer, and those different types are usually named for the part of the body where the cancer starts. There is a one in three chance that you will have to deal with cancer somewhere in your body in your lifetime (American Cancer Society 2021).

Current and former industrial activities almost always result in the release of toxic chemicals into the environment, some of which are possible or known carcinogens (National Institutes of Health 2018). While plant operators can implement technical and operational procedures to minimize these releases, and governments can implement policies to motivate plant operators to do so, no industrial process can be 100% clean, and industrially-produced toxins are a cost of living in an industrialized society.

Publicly-available geospatial data can be used to investigate potential links between industrial activities and cancer hot spots. While the exact causes of individual cancers are complex and usually unknowable with any meaningful level of certainty, analysis of aggregated data can provide useful insights and pathways for targeting additional (and often scarce) investigatory resources.

This tutorial will cover basic analysis of ZIP Code level cancer registry data, behavioral risk factors, and industrial toxins in the environment using R.

Acquire the Data

Illinois State Cancer Registry

There are three common statistics associated with health conditions like cancer:

The Illinois State Cancer Registry makes both incidence and mortality data available, although mortality is only published on a state-wide basis. Data is always aggregated (combined) into geographic areas to preserve privacy, and to facilitate accurate reporting and analysis.

The Illinois State Cancer Registry makes both incidence and mortality data available, although mortality is only published on a state-wide basis. To get the most spatially-detailed data available (ZIP code), we use the incidence data for this tutorial.

  1. Go to the Cancer in Illinois: Statistics page.
  2. Roll over Cancer Incidence and select By ZIP Code.
  3. Under Select Data, click the top entry, scroll down to the bottom of the list, hold the shift key down, and click the bottom entry to select all year ranges. Note that because cancer is (thankfully) somewhat rare and because there is randomness to where and when cancer occurs, cancer data is commonly aggregated over five-year periods to smooth out this short-term variability.
  4. Under Cancer Group, select all types of cancer.
  5. Click Go.
  6. Ignore the Please enter a zip code message as the page will display all ZIP codes.
  7. If data rows do not appear, repeat the steps above.
  8. Under the Actions, click the Excel icon (the "X") to download the .csv file. Depending on your browser configuration, it may automatically place it in your default Downloads directory.
  9. Create a new working directory (if you haven't already), and move the .csv file to that directory. You may wish to give it a more meaningful name, such as 2021-illinois-cancer-zip-codes.
Downloading ZIP Code level incidence data from the Illinois State Cancer Registry

American Community Survey

The Census Bureau's American Community Survey (ACS) is an ongoing survey that provides information on an annual basis about people in the United States beyond the basic information collected in the decennial census. The ACS is commonly used by a wide variety of researchers when they need information about the general public.

For this tutorial, we will use a GeoJSON file of commonly-used variables for Illinois ZIP Codes from the 2015-2019 American Community Survey five-year estimates from the US Census Bureau. This data also contains polygons that can be joined with the cancer data for choropleth mapping.

You can download the data and view the metadata here.

Median age by Illinois ZIP Code 2015-2019 ACS five-year estimates

Process the Data

Load the Cancer Data

You import the cancer data into R with the read.csv() function.

cancer = read.csv("1994-2018-il-cancer-zip.csv", as.is=T)

The names() function will show you the field names available in the data.

names(cancer)

	[1] "Year"         "ZIP.Code"     "Cancer.Group" "Male.Count"   "Female.Count"
	[6] "Total.Count"  "X"           

Subset the Cancer Data

This data contains multiple values for each ZIP code representing the different types of cancer during the different time periods of data collection. The data as downloaded from the website is arranged in long format, where there is one column for all data values, with adjacent columns used to specify location (ZIP code) as well as category (type of cancer).

Long data format

Using the unique() function on the Year and Cancer.Group fields will show you the available year ranges and cancer types, respectively.

The dollar sign ($) is used after the name of the data frame object to select a specific field from that data frame.

unique(cancer$Cancer.Group)

	 [1] "All Cancers Combined"  "All Other Cancers"     "Breast-in situ"       
	 [4] "Breast-invasive"       "Cervix"                "Colorectal"           
	 [7] "Leukemias & Lymphomas" "Lung & Bronchus"       "Nervous System"       
	[10] "Oral Cavity"           "Prostate"              "Urinary System"       

unique(cancer$Year)

	[1] "2014-2018" "2009-2013" "2004-2008" "1999-2003" "1994-1998"

In order to get a data frame of cancer rates for a specific type of cancer, you will need to take a subset of the rows in the data associated with the specific year and type of cancer.

cancer.count = cancer[(cancer$Cancer.Group == "All Cancers Combined") & (cancer$Year == "2014-2018"), 
	c("ZIP.Code", "Total.Count")]

head(cancer.count)

	  ZIP.Code Total.Count
	1    60002         763
	2    60004        1583
	3    60005         941
	4    60006           5
	5    60007        1160
	6    60008         619

Load the ACS Data

The American Community Survey data downloaded above provides both spatial polygons associated with each ZIP Code in the cancer data, as well as demographic information for each of those ZIP Code areas.

The file can be loaded with the st_read() function from the sf library.

library(sf)

acs = st_read("2015-2019-acs-il_zcta.geojson", stringsAsFactors=F)

A plot() of one of the fields gives you a preview of the data.

plot(acs["Median.Age"], border=NA, breaks="jenks", 
	pal=colorRampPalette(c("navy", "lightgray", "red")))
Median age by ZIP Code in 2015-2019

Join the Cancer and ACS Data

A join is an operation where two data frames are combined based on common key values.

Join

The merge() function can be used to join the cancer data and the ACS data based on a common key of ZIP Code specified in the by parameters. In this case, the ACS data uses the Name field while the cancer data uses the ZIP.Code field.

zip.codes = merge(x = acs, y = cancer.count, by.x = "Name", by.y = "ZIP.Code")

As with the ACS data, you can now plot the total counts of cancer by ZIP Code.

Note that this is largely a map of population, since places with more people tend to have more cancer. Conversion to rates for comparison is discussed below.

plot(zip.codes["Total.Count"], border=NA, breaks="jenks", 
	pal=colorRampPalette(c("navy", "lightgray", "red")))
Counts of all cancers by ZIP Code

Calculating Crude Rates

The map of cancer counts above shows that most cancer occur where there are more people. This data on its own is not particularly useful, since it is largely data about where more people live.

Of greater interest in epidemiology is the rate of cancer rather than the absoute count, since we usually want to know where there are more cancer cases than you would expect for the size of the population.

Rates in general are often expressed as percents (per 100 residents). But with comparatively rare health conditions like cancer, disease rates are expressed in annual cases per 100,000 residents (per 100k) so the numbers are more readable as whole numbers (e.g. 200 per 100k is easier to read than 0.2%).

Since the ACS data contains population values, we can calculate the rate by dividing by population, multiplying by 100,000 (to get the per 100k) and dividing by five (since the counts are total counts over a five year period).

This rate is called a crude rate because it is calculated with simple division and does not take into account any differences in demographic characteristics between areas.

There may be some cases where cancer cases are reported in ZIP Codes where there is zero population. Dividing by zero in R results in an infinite value, which can confuse the statistical functions used below. The second line below sets any infinite values to NA.

zip.codes$Crude.Rate = zip.codes$Total.Count * 100000 / zip.codes$Total.Population / 5

zip.codes$Crude.Rate[is.infinite(zip.codes$Crude.Rate)] = NA

This map of rates shows a much more complex landscape, with the populous (and comparatively affluent) suburbs around Chicago actually showing lower rates than some sparsely-populated rural areas around the state.

plot(zip.codes["Crude.Rate"], border=NA, breaks="quantile",
	pal=colorRampPalette(c("navy", "lightgray", "red")))
Crude rates of all cancers by ZIP Code

Estimated Rates

Disease rates ordinarily follow a normal distribution (bell curve). However, when we use the hist() function to create a histogram showing the distribution of values, we see the curve is strongly skewed to the right.

hist(zip.codes$Crude.Rate)
Histogram of crude rates of all cancers by ZIP Code

A plot() of a smoothed density() curve is another way to visualize a distribution that makes the skewing clearer.

plot(density(zip.codes$Crude.Rate, na.rm=T))
Density curve of crude rates of all cancers by ZIP Code

Because cancer is rare and many ZIP Codes have small populations, one or two cancer cases in a ZIP Code can cause the cancer rate to be exceptionally high, and an absence of cases can make an area appear to be a cancer-free paradise. This issue is referred to as the small numbers problem (BioMedware 2011) or variance instability (Anselin 2018).

This exaggeration of variance in sparsely populated areas is visible in an X/Y scatter chart comparing population and crude rates. Note that the most extreme rates (high and low) stretch along the bottom of the chart in low population ZIP Codes.

plot(zip.codes$Crude.Rate, zip.codes$Total.Population)
Wide variation in low population ZIP Codes (left side) due to the small numbers problem

Because areas with high crude rates may indeed be areas of concern, we do not want to discard those areas as outliers or use a log transform to indiscriminantly squeeze all high rate areas into normality.

There are a variety of techniques for addressing this issue, although none can be seen as universal solutions for making high certainty inferences from the limited data available in low population areas ( Wang, Guo, and McLafferty 2012; Przbysz and Bunch 2017).

One technique developed by Marshall (1990) and available in the spdep library is empirical Beyes estimation, which calculates overall distribution (priors) from the data, and then shrinks the values in low population areas to conform to that distribution around the global mean. For a more detailed explanation of empirical Beyes estimation, see Robinson (2015), Anselin (2018), and Klaas (2019).

If you have not already installed the spdep package, you should do so.

install.packages("spdep")

The EBest() function from the spdep package creates a smoothed estimate of counts based on population values and raw counts.

library(spdep)

estimator = zip.codes[zip.codes$Total.Population > 0, ]

estimate = EBest(estimator$Total.Count, estimator$Total.Population)

estimator$Estimated.Rate = 100000 * estimate$estmm / 5

zip.codes = merge(zip.codes, st_drop_geometry(estimator[c("Name", "Estimated.Rate")]))

A plot() of the density() curve shows a distribution that is much closer to the expected normal distribution.

plot(density(zip.codes$Estimated.Rate, na.rm=T))
Density curve of empirical Beyes estimated rates of all cancers by ZIP Code

Estimating Age-Adjusted Rates

Like most diseases, cancer risk tends to increase as you get older (National Institutes of Health 2014).

We can see this fairly clearly on an X/Y scatter plot comparing the percent of 65+ with the (estimated) crude rate, with a rough pattern going upward from left to right.

The lm() function can be used to create a simple linear model of cancer rates as a function of age (weighted by population). The output of the model is passed to the abline() function to draw a regression line on the plot.

plot(x = zip.codes$Median.Age, y = zip.codes$Estimated.Rate, col="#00000040")

model = lm(Estimated.Rate ~ Median.Age, weight = Total.Population, data=zip.codes)

abline(model, col="navy", lwd=3)
Correlation between cancer rates and median age

Accordingly, areas with high cancer crude rates tend to simply be areas with larger numbers of older residents, and this obscures other issues that may be increasing cancer risk in different areas.

A common technique used in epidemiology to address this issue is to modify rates based on the proportions of residents in different age groups, so that rates across areas can be compared to consider factors other than whether some areas have unusually high numbers of older (or younger) residents.

Age-adjusted rates have been modified to reflect what the rate would be if the distribution of ages in the population matched a standard population distribution, usually the 2000 U.S. standard population for rates calculated in the USA.

Rigorous calculation of age-adjusted rates requires a detailed breakdown of the ages of people diagnosed with cancer, which is not publicly available in Illinois. However, the linear model output from the lm() function can be used to create a estimated age adjustment.

In this case the coefficient from median age term in the model represents the number of cases per 100k that increase for each year that the median age in an area increases. Basing our formula off the weighted mean for the median age, for every year that the median age in a ZIP Code is above the mean for the state of Illinois (38.6), cases per 100k increase by 23.

model = lm(Estimated.Rate ~ Median.Age, weight = Total.Population, data=zip.codes)

national.median.age = 35.3

age.adjustment = model$coefficients[2] * (zip.codes$Median.Age - national.median.age)

zip.codes$Age.Adjusted.Rate = zip.codes$Estimated.Rate - age.adjustment

zip.codes$Age.Adjusted.Rate[zip.codes$Age.Adjusted.Rate < 0] = NA

weighted.mean(zip.codes$Age.Adjusted.Rate, zip.codes$Total.Population, na.rm=T)

	[1] 485.3436

The weighted mean of this adjusted rate for both sexes across the state is consistent with the published rates of 503.6 for men and 444.5 for women (IDPH 2021).

A map of the estimated age-adjusted rates is much more complex than the map of crude rates.

plot(zip.codes["Age.Adjusted.Rate"], border=NA, breaks="quantile",
	pal=colorRampPalette(c("navy", "lightgray", "red")))
Estimated age-adjusted rates for all cancers by ZIP Code

Save a GeoJSON File

It may be helpful to save this processed data to a GeoJSON file in case you need it in the future and either can't get the original data files or don't want to re-run all the processing steps.

st_write(zip.codes, "2021-minn-all-cancers.geojson", delete_dsn=T)

Data Summary

You can view a summary of the fields in the combined data set using the summary() function.

print(summary(zip.codes))

	     Name           FactFinder.GEOID       FIPS                ST           
	 Length:1375        Length:1375        Length:1375        Length:1375       
	 Class :character   Class :character   Class :character   Class :character  
	 Mode  :character   Mode  :character   Mode  :character   Mode  :character  

	    Latitude       Longitude       Square.Miles   Pop.per.Square.Mile
	 Min.   :37.07   Min.   :-91.42   Min.   :  0.0   Min.   :    0.14   
	 1st Qu.:39.26   1st Qu.:-89.74   1st Qu.:  8.0   1st Qu.:   20.90   
	 Median :40.56   Median :-88.99   Median : 32.0   Median :   55.42   
	 Mean   :40.38   Mean   :-89.01   Mean   : 40.2   Mean   : 1317.70   
	 3rd Qu.:41.68   3rd Qu.:-88.13   3rd Qu.: 58.0   3rd Qu.:  695.70   
	 Max.   :42.48   Max.   :-87.55   Max.   :265.0   Max.   :44498.00   
							  NA's   :61         
				...

	Percent.No.Vehicle Percent.No.Vehicle.MOE  Total.Count       Crude.Rate      
	 Min.   :  0.000    Min.   :  0.300        Min.   :   1.0   Min.   :   40.57  
	 1st Qu.:  1.800    1st Qu.:  1.600        1st Qu.:  25.0   1st Qu.:  541.01  
	 Median :  3.800    Median :  2.500        Median :  65.0   Median :  656.28  
	 Mean   :  5.801    Mean   :  5.805        Mean   : 257.4   Mean   :  716.98  
	 3rd Qu.:  6.800    3rd Qu.:  4.400        3rd Qu.: 334.0   3rd Qu.:  783.69  
	 Max.   :100.000    Max.   :100.000        Max.   :2226.0   Max.   :16000.00  
	 NA's   :1          NA's   :1                                                 

	 Estimated.Rate             geometry   
	 Min.   :  59.81   MULTIPOLYGON :1375  
	 1st Qu.: 546.02   epsg:4326    :   0  
	 Median : 630.34   +proj=long...:   0  
	 Mean   : 627.79                       
	 3rd Qu.: 706.67                       
	 Max.   :1294.51                       

Analyze the Data

Mean Rate

With health conditions, we commonly want to know what the normal rates are so that we can give added attention to groups of people or areas where the rates are above normal.

With a collection of values, the common approach is to find the mean (average) value, which can be done in R using the mean() function.

mean(zip.codes$Crude.Rate, na.rm=T)

	[1] 716.983

However, because the population is not evenly distributed across different ZIP Codes, this mean may not be valid. One option for addressing this situation is to use the weighted.mean() function that increases or decreases each value's contribution to the mean calculation based on a weighting value. In this case we can use the Total.Population field as a weight.

weighted.mean(x = zip.codes$Crude.Rate, w = zip.codes$Total.Population)

	[1] 554.2852

The weighted mean of the estimated rates is slightly different because of the modifications introduced by the empirical Beyes estimation algorithm.

weighted.mean(x = zip.codes$Estimated.Rate, w = zip.codes$Total.Population, na.rm=T)

	[1] 550.3392

And, as indicated earler, the weighted mean of the age-adjusted rates is significantly lower because age-adjustment lowers the rates in ZIP Codes with high percentages of older residents.

weighted.mean(x = zip.codes$Age.Adjusted.Rate, w = zip.codes$Total.Population, na.rm=T)

	[1] 485.3436

Correlations

One of the primary advantages of spatial data is being able to identify correlation between different characteristics of different areas.

While simple correlation between two characteristics does not prove that one characteristic causes the other, identification of correlation can be used as a starting point to more deeply examine relationships between characteristics that can lead to useful insights and proposals for addressing social challenges.

This analysis lists correlation between our cancer rate and all other numeric variables in the data set.

numerics = Filter(is.numeric, zip.codes)

numerics = numerics[,!grepl("MOE", names(numerics))]

numerics = st_drop_geometry(numerics)

correlations = cor(numerics$Estimated.Rate, numerics, use="pairwise.complete.obs")

correlations = t(round(correlations, 3))

correlations = correlations[order(correlations, decreasing=T),]

print(head(correlations, n=10))

             Estimated.Rate                  Median.Age 
                      1.000                       0.494 
            Percent.65.Plus    Old.age.Dependency.Ratio 
                      0.445                       0.422 
                 Crude.Rate           Age.Adjusted.Rate 
                      0.420                       0.408 
       Age.Dependency.Ratio            Percent.Disabled 
                      0.328                       0.303 
         Percent.Homeowners Percent.Drove.Alone.to.Work 
                      0.296                       0.281 


print(tail(correlations, n=10))

	   Pop.per.Square.Mile       Total.Households Average.Household.Size 
			-0.335                 -0.337                 -0.344 
	Grad.School.Enrollment        K.12.Enrollment       Total.Population 
			-0.352                 -0.355                 -0.364 
	  Percent.Foreign.Born   Undergrad.Enrollment     Enrolled.in.School 
			-0.375                 -0.376                 -0.397 
	   Women.Aged.15.to.50 
			-0.405 

Notably, the strongest positive correlations are associated with higher percentages of older residents, such as Median.Age and Percent.65.Plus.

Conversely, the strongest negative correlations are associated with higher percentages of younger residents, such as school enrollment, renters, and women of reproductive age.

Spatially, rates correlate positively with ZIP Code area size, and negatively with overall population and population density. This may reflect a tendency toward higher rates in low-population rural areas, although since rural areas also tend to be older, this may just be another correlation wit age.

Hot Spots

Similar disease rates tend to cluster together in space. These clusters can represent common social norms for risky behaviors (like smoking or diet) or from naturally-occurring or man-made carcinogens in the environment. The randomness of disease incidence can obscure such clustering.

Autocorrelation is the formal term for clustering of similar values, and spatial autocorrelation is the formal term for clustering of similar values in close proximity on the surface of the earth.

The Getis-Ord GI* statistic was developed by Arthur Getis and J.K. Ord and introduced in 1992. This algorithm locates areas with high values that are also surrounded by high values. However, unlike simple observation or techniques like kernel density analysis, this algorithm uses statistical comparisons of all areas to create p-values indicating how probable it is that clusters of high values in a specific areas could have occurred by chance. This rigorous use of statistics makes it possible to have a higher level of confidence (but not complete confidence) that the patterns are more than just a visual illusion and represent clustering that is worth investigating.

library(spdep)

getis.ord = zip.codes[is.finite(zip.codes$Age.Adjusted.Rate),]

centroids = st_centroid(st_geometry(getis.ord))

neighbors = dnearneigh(as_Spatial(centroids), d1 = 0, d2 = 50)

weights = nb2listw(neighbors, zero.policy=T)

getis.ord$Getis.Ord = localG(getis.ord$Age.Adjusted.Rate, weights)

The Getis-Ord GI* statistic is a z-score on a normal probability density curve indicating the probability that an area is in a cluster that did not occur by random chance. A z-score of 1.282 represents a 90% probability that the cluster did not occur randomly, and a z-score of 1.645 represents a 95% probability.

The Getis-Ord GI* statistic is commonly converted to classes for mapping.

breaks=c(-4, -1.645, -1.282, 1.282, 1.645, 4)

getis.ord$Hot.Spots = as.numeric(cut(getis.ord$Getis.Ord, breaks))

The hot and cold spots can then be mapped.

colors = colorRampPalette(c("navy", "lightgray", "red"))(5)

plot(getis.ord["Hot.Spots"], border=NA, pal=colors, key.pos=NULL)

legend(x="topright", fill=rev(colors), 
	legend=c("95% Hot", "90% Hot", "Not significant", "90% Cold", "95% Cold"))
Hot spots and cold spots for all cancers combined in Illinois

Note that being in a hot spot does not necessarily mean that all rates within the hot spot are high. Rather, this indicates that the general level across a hot spot is high, while some areas within the hot spot may be low. This box plot showing the distribution of rates within the five hot/cold spot categories varies widely, although the values in the 95% hot spots (category 5) are generally higher than the values in the 95% cold spots (category 1).

boxplot(Age.Adjusted.Rate ~ Hot.Spots, data=getis.ord, ylim=c(0, 1000))
Hot spots and cold spots for all cancers combined in Illinois

Mapping Hot Spots over a Base Map

Since hot and cold spots are specific areas that deserve attention, a map needs some geographic context so readers know where those spots are.

A common technique for providing geographic context for area maps is to plot the areas on top of a base map that shows notable human and environmental features like cities and rivers.

The OpenStreetMap library provides functions for accessing base map tiles from OpenStreetMap (OSM), which is a common open source geographic information source. OpenStreetMap is like Wikipedia for geospatial data.

library(OpenStreetMap)

basemap = openmap(upperLeft = c(43, -95.5), lowerRight = c(37, -83.5), zoom=6)

plot(basemap)

getis.ord = st_transform(getis.ord, osm()@projargs)

colors = c("#000080a0", "#00008040", "#00000000", "#ff000040", "#ff0000a0")

plot(getis.ord["Hot.Spots"], border=NA, pal=colors, key.pos=NULL, add=T)

legend(x="bottomright", fill=rev(colors), bg="white",
	legend=c("95% Hot", "90% Hot", "Not significant", "90% Cold", "95% Cold"))
Hot spot map overlaid on OpenStreetMap

Mapping Hot Spots in ArcGIS Online

You can export your hot spots to a geoJSON file and import them into ArcGIS Online for interactive mapping and exploration. The st_transform() function is needed here to put the hot spot features back into latitude and longitude rather than the projection used by OSM.

getis.ord = st_transform(getis.ord, 4326)

st_write(getis.ord, "2021_minn_cancer_hot_spots.geojson", delete_dsn=T)

When you have exported your GeoJSON file, you can upload it into a new ArcGIS Online map.

Mapping hot spots in ArcGIS Online