Toxics Release Analysis in R

Federal and state environmental protection agencies monitor and regulate the release of toxins into the environment. They also evaluate the toxic legacies of abandoned industrial sites (brownfields) and, in some cases, assist in the remediation of those sites so they can be returned to productive uses.

US facilities in different industry sectors are required by law to report annually to the US EPA the quantities of chemicals that they release into the environment (air, water, or land disposal) and/or manage through recycling, energy recovery, and treatment.

The Toxics Release Inventory (TRI) is a compilation by the US EPA of reported toxic chemicals that may pose a threat to human health and the environment. The TRI is intended to support informed decision-making by companies, government agencies, non-governmental organizations, and the public.

This tutorial covers analysis of TRI data in the US state of Illinois relative to demographics and cancer rates.

Acquire the Data

EPA Toxics Release Inventory

The US EPA's Toxics Release Inventory (TRI) is a program that collects information reported annually by U.S. facilities in different industry sectors on the quantities of toxic chemicals released to the air, water, or land, and/or managed through recycling, energy recovery, and treatment. The TRI was created by section 313 of the Emergency Planning and Community Right-to-Know Act (EPCRA) of 1986 (EPA 2021).

Summary "basic data files" can be download for individual states from the EPA TRI basic data files page.

This tutorial uses the 2019 file for the state of Illinois, which includes the latitudes and longitudes of reporting facilities, along with multiple rows for each facility describing separate groups of chemicals and the estimated quantities of releases.

Downloading Illinois data from the EPA Toxic Release Inventory

Cancer and Demographic Data

A copy of the demographic and cancer data used in this tutorial is available here.

The Illinois State Cancer Registry

Process the Data

Load Demographic and Cancer Data

The precompiled data file above can be loaded with the st_read() function from the sf (Simple Features) package. In this tutorial, the loaded object is named zips.

library(sf)

zips = st_read("2021-minn-all-cancers.geojson")

The resulting object is a simple features polygon data frame with demographic and cancer incidence data by ZIP Code in Illinois. Of the numerous fields, this tutorial uses the ZIP Code (Name), the cancer rate fields (Estimated.Rate and Age.Adjusted.Rate), and demographic fields such as Median.Age, Median.Household.Income, and Percent.Foreign.Born.
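All available fields can be listed with names():

names(zips)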

Load the Toxic Release Inventory Basic Data

The toxic release data is a CSV file that can be imported with the read.csv() function.

tri = read.csv("tri_2019_il.csv")

names(tri)


	  [1] "X1..YEAR"                        "X2..TRIFD"                      
	  [3] "X3..FRS.ID"                      "X4..FACILITY.NAME"              
	  [5] "X5..STREET.ADDRESS"              "X6..CITY"                       
	  [7] "X7..COUNTY"                      "X8..ST"                         
	  [9] "X9..ZIP"                         "X10..BIA"                       
	 [11] "X11..TRIBE"                      "X12..LATITUDE"                  
	 [13] "X13..LONGITUDE"                  "X14..HORIZONTAL.DATUM"          
	 [15] "X15..PARENT.CO.NAME"             "X16..PARENT.CO.DB.NUM"          
	 [17] "X17..STANDARD.PARENT.CO.NAME"    "X18..FEDERAL.FACILITY"          
	 [19] "X19..INDUSTRY.SECTOR.CODE"       "X20..INDUSTRY.SECTOR"           

Because the columns have numbered names, the names brought in by read.csv() are a bit unwieldy. They can be modified with regular expressions using gsub().

names(tri) = gsub("X\\d+\\.\\.", "", names(tri))

names(tri) = gsub("[0-9\\d\\.ABCD]+\\.\\.\\.", "", names(tri))


	  [1] "YEAR"                       "TRIFD"                     
	  [3] "FRS.ID"                     "FACILITY.NAME"             
	  [5] "STREET.ADDRESS"             "CITY"                      
	  [7] "COUNTY"                     "ST"                        
	  [9] "ZIP"                        "BIA"                       
	 [11] "TRIBE"                      "LATITUDE"                  
	 [13] "LONGITUDE"                  "HORIZONTAL.DATUM"          
	 [15] "PARENT.CO.NAME"             "PARENT.CO.DB.NUM"          
	 [19] "INDUSTRY.SECTOR.CODE"       "INDUSTRY.SECTOR"           

Each row in the TRI data frame represents a single type of emission from a facility. Because facilities can release multiple chemicals, one facility can have multiple rows.
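One way to see this is to count the number of rows each facility contributes; a quick sketch using table():

# facilities with the largest number of reported chemical rows
head(sort(table(tri$FACILITY.NAME), decreasing=T))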

For our purposes, we will use the facility name and location fields (FACILITY.NAME, CITY, LATITUDE, LONGITUDE), the unit of measure (UNIT.OF.MEASURE), and the release quantity fields (FUGITIVE.AIR, STACK.AIR, WATER, TOTAL.POUNDS, and CARCINOGEN.POUNDS).

Convert to Pounds

Some values are given in grams rather than pounds. We convert those rows to pounds to make it possible to sum them together with other rows.

multiplier = ifelse(tri$UNIT.OF.MEASURE == "Grams", 1 / 453.592, 1)

tri$FUGITIVE.AIR = multiplier * tri$FUGITIVE.AIR

tri$STACK.AIR = multiplier * tri$STACK.AIR

tri$WATER = multiplier * tri$WATER

tri$UNIT.OF.MEASURE = "Pounds"

Create Overall Sums

To make it possible to total values across the whole data frame, we set any missing (NA) values to zero.



tri$FUGITIVE.AIR[$FUGITIVE.AIR)] = 0

tri$STACK.AIR[$STACK.AIR)] = 0

tri$WATER[$WATER)] = 0

tri$TOTAL.POUNDS[$TOTAL.POUNDS)] = 0

tri$CARCINOGEN.POUNDS[$CARCINOGEN.POUNDS)] = 0



Convert the Rows to Spatial Points

The data includes latitudes and longitudes for the facility locations, which can be converted to simple features points using the st_as_sf() function.


tri = st_as_sf(tri, coords = c("LONGITUDE", "LATITUDE"), crs = 4326)

These can now be mapped with plot(). Although the map clearly shows the bulk of the sites around Chicagoland, there are facilities emitting carcinogens in all populated areas around the state.

plot(tri["CARCINOGEN.POUNDS"], logz=T, pal=colorRampPalette(c("navy", "lightgray", "red")), reset=F)

plot(st_geometry(st_union(zips)), add=T)
Locations of facilities releasing carcinogens from the Toxic Release Inventory (EPA 2021)

Estimate Exposure by ZIP Code

For the purposes of this simplified exploratory example, we use a simple distance decay model that assumes that the amount of exposure of a ZIP Code to a toxic release is proportional to the inverse of the distance from the release source to the center of the ZIP Code.

This operates on the assumptions that toxics disperse evenly around the facility and that all carcinogens are equally carcinogenic. If more rigor is needed, a transport model should be developed to capture wind and water flow patterns and how they carry the toxics. Such models can be quite complex (Beyea and Hatch 1999).

distance_matrix = st_distance(tri, st_centroid(zips))

units(distance_matrix) = NULL

weight_matrix = 1 / (1 + (distance_matrix / 100))

zips$Total.Pounds = colSums(tri$TOTAL.POUNDS * weight_matrix)

zips$Carcinogen.Pounds = colSums(tri$CARCINOGEN.POUNDS * weight_matrix)

zips$Fugitive.Air = colSums(tri$FUGITIVE.AIR * weight_matrix)

zips$Stack.Air = colSums(tri$STACK.AIR * weight_matrix)

zips$Water = colSums(tri$WATER * weight_matrix)
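Because st_distance() returns distances in meters for geographic coordinates, dividing by 100 puts the decay on a 100-meter scale. A quick sketch of how the weight falls off with distance:

# weight at 0 m, 100 m, 1 km, 10 km, and 100 km from a facility
round(1 / (1 + c(0, 100, 1000, 10000, 100000) / 100), 4)

# 1.0000 0.5000 0.0909 0.0099 0.0010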

Mapping the carcinogen releases, we effectively see a heat map highlighting major industrial agglomerations in Illinois, notably along the Illinois River southwest of Chicago.

plot(["Carcinogen.Pounds"], border=NA, breaks="jenks", 
	pal=colorRampPalette(c("navy", "lightgray", "red")))
Estimates of carcinogen release exposure across ZIP Code areas

Save a GeoJSON File

It may be helpful to save this processed data to a GeoJSON file in case you need it in the future and either can't get the original data files or don't want to re-run all the processing steps.

st_write(, "2021-minn-toxics-zip.geojson", delete_dsn=T)
st_write(tri, "2021-minn-toxics-facilities.geojson", delete_dsn=T)

Analyze the Data


The histogram of the distribution shows that most ZIP Codes have comparatively low exposure to carcinogen releases.

hist(zips$Carcinogen.Pounds, breaks=20)
Histogram of estimated carcinogen release exposure by ZIP Code

Because the distribution is heavily skewed, the mean() is not a meaningful summary. The quantile() function shows that, with this model, half of all ZIP Codes are exposed to roughly half a ton (1,164 pounds) or less, and 75% are exposed to less than one ton.

print(round(quantile(zips$Carcinogen.Pounds, na.rm=T)))

   0%   25%   50%   75%  100% 
  447   904  1164  1881 35828 

Top ZIP Codes

By sorting the data in decreasing order of exposure with order(), we can see that the top six ZIP Codes are each exposed to multiple tons of carcinogenic releases.

These highly localized areas of high releases are also visible on the map above.

zips = zips[order(zips$Carcinogen.Pounds, decreasing=T),]

head(st_drop_geometry(zips[,c("Name", "Carcinogen.Pounds")]))

	     Name Carcinogen.Pounds
	308 60633         35827.723
	221 60501         24646.025
	905 62084          8324.801
	338 60804          7036.734
	312 60638          6553.254
	246 60534          5891.762

Top Sources

The list of facilities can be ordered to list the top emitters of toxics and the top emitters of carcinogens.

tri = tri[order(tri$CARCINOGEN.POUNDS, decreasing=T),]

head(st_drop_geometry(tri[, c("FACILITY.NAME", "CITY", "CARCINOGEN.POUNDS")]), n=10)

	                    FACILITY.NAME         CITY CARCINOGEN.POUNDS
	3725              TATE & LYLE DECATUR      DECATUR             91887
	776          INGREDION INC ARGO PLANT BEDFORD PARK             85891
	439  SABIC INNOVATIVE PLASTICS US LLC       OTTAWA             83393
	438  SABIC INNOVATIVE PLASTICS US LLC       OTTAWA             78414
	188                 SWAN SURFACES LLC    CENTRALIA             56900
	1993              WOOD RIVER REFINERY       ROXANA             41861

To get more details on carcinogens from a specific facility, you can subset based on the name.
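For example, a sketch of such a subset for the Ingredion plant listed above, assuming the CHEMICAL and CARCINOGEN fields from the TRI basic data file:

# carcinogenic chemicals reported by a single facility
ingredion = st_drop_geometry(tri)

ingredion = ingredion[(ingredion$FACILITY.NAME == "INGREDION INC ARGO PLANT") &
	(ingredion$CARCINOGEN == "YES"), c("CHEMICAL", "TOTAL.POUNDS")]

print(ingredion)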


	776                     Acetaldehyde      8.589100e+04
	770 Dioxin and dioxin-like compounds      9.184465e-04


Correlations

An initial investigation of relationships often starts by looking for correlations.

The listing of the top correlations with carcinogen exposure begins with the release variables themselves, followed by demographic variables associated with densely populated urban areas.

numerics = Filter(is.numeric, zips)

numerics = numerics[,!grepl("MOE", names(numerics))]

numerics = st_drop_geometry(numerics)

correlations = cor(numerics$Carcinogen.Pounds, numerics, use="pairwise.complete.obs")

correlations = t(round(correlations, 3))

correlations = correlations[order(correlations, decreasing=T),]

print(head(correlations, n=20))

	      Carcinogen.Pounds               Stack.Air            Fugitive.Air 
			  1.000                   0.783                   0.780 
		   Total.Pounds    Percent.Foreign.Born 
			  0.650                   0.461 
		      Longitude                Latitude         K.12.Enrollment 
			  0.395                   0.393                   0.392 
	Median.Monthly.Mortgage             Total.Count        Total.Households 
			  0.374                   0.372                   0.371 
	    Pop.per.Square.Mile     Median.Monthly.Rent         Percent.Renters 
			  0.336                   0.334                   0.269 
	 Grad.School.Enrollment    Undergrad.Enrollment 
			  0.254                   0.249 

print(tail(correlations, n=10))

	   Percent.Health.Insurance             Percent.65.Plus 
			     -0.130                      -0.158 
	       Percent.Vacant.Units            Percent.Disabled 
			     -0.173                      -0.177 
			 Median.Age              Estimated.Rate 
			     -0.181                      -0.222 
		   Percent.Veterans                Square.Miles 
			     -0.227                      -0.253 
		 Percent.Homeowners 
			     -0.269 

The same listing can be created for correlations with the estimated cancer rate.

correlations = cor(numerics$Estimated.Rate, numerics, use="pairwise.complete.obs")

correlations = t(round(correlations, 3))

correlations = correlations[order(correlations, decreasing=T),]

print(head(correlations, n=20))

print(tail(correlations, n=20))

The direct correlation between estimated carcinogen exposure and the crude cancer rate is weak and negative:

cor(zips$Carcinogen.Pounds, zips$Estimated.Rate, use="pairwise.complete.obs")

	[1] -0.2224945

The correlation almost disappears with age-adjusted rates, suggesting that much of the crude-rate association reflects differences in age structure:

cor(zips$Carcinogen.Pounds, zips$Age.Adjusted.Rate, use="pairwise.complete.obs")

	[1] -0.01146388

Multiple Regression

One way to try to tease out the interrelationships between these variables is the use of multiple regression, which creates a model that considers multiple explanatory (independent) variables to predict the outcome of a response (dependent) variable.

Multiple regression can be considered a form of supervised machine learning where you provide a set of variables that determine the model output, and let the software (machine) figure out the relationship between them that best fits the training data.

For example, this code models crude rates for all types of cancer combined as a function of median age, percent foreign-born, median household income, and carcinogen release exposure.

model = lm(Estimated.Rate ~ Median.Age + Percent.Foreign.Born + 
	Median.Household.Income + Carcinogen.Pounds, data = zips)

The output of the summary() function describes the quality of the model.

summary(model)


	Call:
	lm(formula = Estimated.Rate ~ Median.Age + Percent.Foreign.Born + 
	    Median.Household.Income + Carcinogen.Pounds, data = zips)

	Residuals:
	    Min      1Q  Median      3Q     Max 
	-435.22  -67.92    1.59   64.08  410.51 

	Coefficients:
				  Estimate Std. Error t value Pr(>|t|)    
	(Intercept)              3.195e+02  2.164e+01  14.764  < 2e-16 ***
	Median.Age               8.535e+00  4.683e-01  18.227  < 2e-16 ***
	Percent.Foreign.Born    -3.545e+00  4.015e-01  -8.829  < 2e-16 ***
	Median.Household.Income -3.336e-04  1.246e-04  -2.677  0.00752 ** 
	Carcinogen.Pounds       -2.354e-03  2.209e-03  -1.065  0.28693    
	---
	Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

	Residual standard error: 105.2 on 1330 degrees of freedom
	  (40 observations deleted due to missingness)
	Multiple R-squared:  0.3253,    Adjusted R-squared:  0.3232 
	F-statistic: 160.3 on 4 and 1330 DF,  p-value: < 2.2e-16

Spatial Lag Regression

One of the major issues with using multiple regression with geospatial data is that the ordinary least squares algorithm used to build the model assumes that the variables are not autocorrelated, meaning that the value of a variable at one location should not be correlated with its values at nearby locations.

Geographical phenomena commonly cluster in space, so geospatial variables frequently exhibit spatial autocorrelation. This biases multiple regression coefficients and standard errors, making model outputs untrustworthy.

One technique for addressing this issue is spatial lag regression, which performs multiple regression modeling in a way that compensates for the clustering of similar values across neighboring areas (spatial lag).


library(spdep)

library(spatialreg)

neighbors = dnearneigh(as_Spatial(st_centroid(zips)), d1 = 0, d2 = 30)

weights = nb2listw(neighbors, zero.policy=T)
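Before fitting the spatial model, you can verify that spatial autocorrelation is actually present in the residuals of the multiple regression model; a sketch using spdep's lm.morantest(), assuming the lm model object from the previous section is still in scope:

# Moran's I test on the OLS residuals; a significant p-value
# indicates spatial autocorrelation that OLS cannot account for
lm.morantest(model, weights, zero.policy=T)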

model = lagsarlm(Estimated.Rate ~ Median.Age + Percent.Foreign.Born + 
	Median.Household.Income + Carcinogen.Pounds, data = zips, 
	listw=weights, tol.solve=1.0e-30, zero.policy=T)

summary(model)


	Call:lagsarlm(formula = Estimated.Rate ~ Median.Age + Percent.Foreign.Born + 
	    Median.Household.Income + Carcinogen.Pounds, data = zips, 
	    listw = weights, zero.policy = T, tol.solve = 1e-30)

	Residuals:
	     Min       1Q   Median       3Q      Max 
	-404.671  -65.148    1.680   62.854  401.244 

	Type: lag 
	Coefficients: (asymptotic standard errors) 
				   Estimate  Std. Error z value  Pr(>|z|)
	(Intercept)              1.3273e+02  4.5391e+01  2.9242  0.003453
	Median.Age               8.2942e+00  4.6601e-01 17.7984 < 2.2e-16
	Percent.Foreign.Born    -2.8289e+00  4.3846e-01 -6.4518 1.105e-10
	Median.Household.Income -1.4350e-04  1.2872e-04 -1.1149  0.264898
	Carcinogen.Pounds       -3.1802e-04  2.2402e-03 -0.1420  0.887111

	Rho: 0.28023, LR test value: 20.755, p-value: 5.2207e-06
	Asymptotic standard error: 0.060193
	    z-value: 4.6555, p-value: 3.2313e-06
	Wald statistic: 21.674, p-value: 3.2313e-06

	Log likelihood: -8097.145 for lag model
	ML residual variance (sigma squared): 10817, (sigma: 104)
	Number of observations: 1335 
	Number of parameters estimated: 7 
	AIC: 16208, (AIC for lm: 16227)
	LM test for residual autocorrelation
	test value: 1.4391, p-value: 0.23029

Spatial Error Regression

A similar type of model that accounts for spatial autocorrelation is the spatial error model, which treats spatial autocorrelation as an error term in the model and improves the estimates of standard errors.

model = errorsarlm(Estimated.Rate ~ Median.Age + Median.Household.Income + 
	Percent.Foreign.Born + Carcinogen.Pounds, data = zips, 
	listw=weights, tol.solve=1.0e-30, zero.policy=T)

summary(model)


	Call:errorsarlm(formula = Estimated.Rate ~ Median.Age + Median.Household.Income + 
	    Percent.Foreign.Born + Carcinogen.Pounds, data = zips, 
	    listw = weights, zero.policy = T, tol.solve = 1e-30)

	Residuals:
	       Min         1Q     Median         3Q        Max 
	-418.56991  -65.20729   -0.27079   63.65532  407.00169 

	Type: error 
	Coefficients: (asymptotic standard errors) 
				   Estimate  Std. Error z value  Pr(>|z|)
	(Intercept)              3.2174e+02  2.1973e+01 14.6427 < 2.2e-16
	Median.Age               8.3719e+00  4.6863e-01 17.8647 < 2.2e-16
	Median.Household.Income -2.4866e-04  1.3017e-04 -1.9102   0.05611
	Percent.Foreign.Born    -3.6955e+00  4.5584e-01 -8.1069 4.441e-16
	Carcinogen.Pounds       -2.4945e-03  2.3141e-03 -1.0779   0.28107

	Lambda: 0.34119, LR test value: 21.626, p-value: 3.3128e-06
	Asymptotic standard error: 0.072435
	    z-value: 4.7103, p-value: 2.4734e-06
	Wald statistic: 22.187, p-value: 2.4734e-06

	Log likelihood: -8096.709 for error model
	ML residual variance (sigma squared): 10788, (sigma: 103.86)
	Number of observations: 1335 
	Number of parameters estimated: 7 
	AIC: 16207, (AIC for lm: 16227)

Principal Component Analysis

A challenge with multiple regression is that it is difficult to know which variables to include in your model, and whether the variables you have chosen give you the best model possible. In addition, if there is correlation between two or more independent variables (making them not independent), the importance of those variables to the model (as indicated by the coefficients) may be distorted (biased).

One technique for addressing that issue is principal component analysis (PCA), which uses linear algebra to reduce a matrix of all the possible variables to a smaller set of principal component variables.

This reduction in the number of variables (dimensions) needed for analysis is called dimensionality reduction. Because this technique makes a determination of the importance of different variables without direct human intervention or judgement, it can also be considered a form of unsupervised machine learning.

field.complete = sapply(names(zips), function(name) (sum([,name])) < 50))

numerics = zips[,field.complete]

numerics = Filter(is.numeric, numerics)

numerics = numerics[,!grepl("MOE", names(numerics))]

numerics = st_drop_geometry(numerics)

numerics = na.omit(numerics)

numerics[,c("Percent.No.Vehicle", "Total.Count", "Crude.Rate", 
	"Estimated.Rate", "Age.Adjusted.Rate",
	"Total.Pounds", "Fugitive.Air", "Stack.Air", "Water")] = NULL

pca = prcomp(numerics, scale=T)
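To see how much of the total variance the leading components capture, summary() on the prcomp() object reports the proportion of variance explained:

# standard deviation, proportion of variance, and cumulative proportion
# for the first five components
summary(pca)$importance[, 1:5]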

The predict() function can be used with the principal component object created by prcomp() and new data to generate a new set of principal component scores. Those components can then be used to create a model with lm().

dependent.variable = zips$Estimated.Rate

# assemble a data frame of component scores (called components here)
components = predict(pca, st_drop_geometry(zips))

components =

components$Estimated.Rate = dependent.variable

model = lm(Estimated.Rate ~ ., data = components)

summary(model)


	Call:
	lm(formula = Estimated.Rate ~ ., data = components)

	Residuals:
	    Min      1Q  Median      3Q     Max 
	-324.74  -56.09   -2.51   53.99  392.49 

	Coefficients:
		      Estimate Std. Error t value Pr(>|t|)    
	(Intercept)  6.351e+02  2.540e+00 250.070  < 2e-16 ***
	PC1         -1.979e+01  7.564e-01 -26.165  < 2e-16 ***
	PC2         -1.106e+01  2.227e+00  -4.964 7.83e-07 ***
	PC3          1.151e+01  1.404e+00   8.198 5.88e-16 ***
	PC4         -2.088e+01  2.606e+00  -8.012 2.51e-15 ***
	PC5         -5.663e+00  1.776e+00  -3.189 0.001464 ** 
	PC6         -6.140e-01  4.877e+00  -0.126 0.899845    
	PC7         -1.549e-01  4.352e+00  -0.036 0.971616    
	PC8          1.564e+01  2.390e+00   6.545 8.56e-11 ***
	PC9          1.094e+01  8.165e+00   1.340 0.180555    
	PC10         1.270e+00  8.052e+00   0.158 0.874736    
	PC11         1.727e+01  1.912e+01   0.903 0.366500    
	PC12        -4.247e+00  9.966e+00  -0.426 0.670097    
	PC13        -4.048e+00  3.466e+00  -1.168 0.243092    
	PC14         6.386e+00  8.466e+00   0.754 0.450813    
	PC15        -1.365e+01  3.760e+00  -3.629 0.000296 ***
	PC16         2.054e+01  1.177e+01   1.746 0.081109 .  
	PC17         1.439e+01  6.301e+00   2.283 0.022585 *  
	PC18         2.817e+01  1.550e+01   1.817 0.069509 .  
	PC19         1.276e+01  1.168e+01   1.093 0.274624    
	PC20         8.240e+00  5.412e+00   1.523 0.128119    
	PC21        -9.390e+00  7.160e+00  -1.311 0.189971    
	PC22        -4.843e+00  4.343e+00  -1.115 0.264977    
	PC23         1.261e+01  9.283e+00   1.358 0.174672    
	PC24         1.418e+01  2.502e+01   0.567 0.570984    
	PC25         7.404e+00  6.008e+00   1.232 0.218015    
	PC26         1.333e+01  1.429e+01   0.933 0.351112    
	PC27         1.360e+01  8.496e+00   1.601 0.109732    
	PC28         9.290e+00  1.499e+01   0.620 0.535463    
	PC29         2.299e+01  1.508e+01   1.524 0.127703    
	PC30         1.093e+01  8.779e+00   1.245 0.213411    
	PC31         2.107e+00  7.737e+00   0.272 0.785366    
	PC32        -5.705e+00  9.134e+00  -0.625 0.532340    
	PC33         1.268e+02  7.138e+01   1.776 0.075991 .  
	PC34         2.588e+02  6.956e+01   3.721 0.000207 ***
	PC35        -1.121e+01  5.727e+01  -0.196 0.844839    
	PC36         9.217e+01  9.737e+01   0.947 0.344025    
	PC37         3.204e+02  1.250e+02   2.563 0.010495 *  
	PC38         2.905e+01  2.251e+02   0.129 0.897349    
	PC39        -1.021e+03  1.248e+03  -0.818 0.413434    
	PC40         4.697e+03  1.140e+04   0.412 0.680427    
	PC41        -2.493e+16  1.889e+16  -1.320 0.187211    
	Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

	Residual standard error: 92.48 on 1284 degrees of freedom
	  (49 observations deleted due to missingness)
	Multiple R-squared:  0.4943,    Adjusted R-squared:  0.4781 
	F-statistic: 30.61 on 41 and 1284 DF,  p-value: < 2.2e-16

In this case, the model built with principal components actually fits somewhat better than the multiple regression model (R2 of 0.494 vs. 0.325), and only a handful of components are statistically significant, so we can focus on what makes up those significant principal components.
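The contribution of each original variable to a component is given by its loading in the rotation matrix returned by prcomp(); a sketch for examining the strongest loadings on the first component:

# variables with the largest absolute loadings on PC1
loadings = pca$rotation[, "PC1"]

head(round(loadings[order(abs(loadings), decreasing=T)], 3), n=10)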