Visualizing Foreclosure: About This Project
Data Sources
Parcel Data
Parcel data was obtained from the Maricopa County Assessor's Office on an uncertain date and possibly through an intermediary. These dates and chain of custody should be clarified as the project proceeds. The data is available to the general public for a fee through the Assessor's website:
http://mcassessor.maricopa.gov/reports-data-sales/data-sales/data-policies-pricing/
Two data series were used: parcel shapefiles and the ST42030 residential master data file.
The parcel shapefile data is in:
/projects/oa/housing_bubble/vector/phoenix/parcel_shapefiles
The residence data is in:
/projects/oa/housing_bubble/vector/phoenix/parcel_data
Parcel data is organized into five separate books numbered 100 to 500 that represent separate areas of the county. The 2010 shapefiles contained 1,545,616 records and the book division presumably exists as a legacy hierarchical scheme to make the data more manageable.
The parcel shapefiles contain minimal data that varies over the different years. The only field from the shapefiles that was used was the parcel number (APN), which was then used as a primary key for a join with the residential data.
Parcel shapefiles were only provided to the RA for the years 2002-2006 and 2013. For years 2007-2012, the 2013 shapefile was used. The use of yearly shapefiles for 2002-2006 bounds the analysis for those years to the parcels that were extant in those years.
Data on residences was obtained from the ST42030 Residential Master Data File. This file has parcel number (APN) as a primary key and gives residential component information used to calculate property values (e.g., livable square footage and construction year). The only fields used from this file are livable area (SQFT) and number of floors (STORIES).
Residential data was only provided to the RA for the years 2007 - 2013. Data was joined with shapefiles using APN as a key. Shapefiles for 2002-2006 were joined with the 2007 data, which should limit the parcels to those in existence in those years, but which may give erroneous data for parcels that were redeveloped in the years prior to the 2007 data.
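The year pairing described above can be summarized in a short sketch. It is written in Python purely for illustration (the project's own scripts are in R), and the function name source_years is hypothetical:

```python
def source_years(analysis_year):
    """Return (shapefile_year, residential_year) for an analysis year.

    Illustrative only: follows the availability described in the text
    (shapefiles for 2002-2006 and 2013, residential data for 2007-2013).
    """
    if not 2002 <= analysis_year <= 2013:
        raise ValueError("analysis year outside 2002-2013")
    # Shapefiles exist for 2002-2006 and 2013; 2007-2012 fall back to 2013.
    shapefile_year = analysis_year if analysis_year <= 2006 else 2013
    # Residential data exists for 2007-2013; earlier years fall back to 2007.
    residential_year = max(analysis_year, 2007)
    return shapefile_year, residential_year
```

For example, source_years(2004) pairs the 2004 shapefile with the 2007 residential data, while source_years(2010) pairs the 2013 shapefile with the 2010 residential data.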
Estimated lawn area for each parcel was calculated by subtracting the product of SQFT and STORIES from the parcel area, which was calculated from the shapefile data reprojected to the EPSG 2868 projection used with MODIS satellite data. From 2010 onward, the stories value was only listed as single or multiple (S or M); single is assumed to be one floor and multiple is assumed to be two. Parcels where this estimated lawn area was below zero were given a lawn area of zero.
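A minimal sketch of the lawn area calculation as described, written in Python for illustration rather than taken from the project's R scripts. The function name is hypothetical, and the sketch assumes the parcel area and SQFT have already been converted to the same unit:

```python
def estimated_lawn_area(parcel_area, sqft, stories):
    """Parcel area minus (SQFT x STORIES), floored at zero, per the text."""
    # From 2010 onward STORIES is coded "S" (single) or "M" (multiple),
    # treated here as one and two floors respectively.
    story_codes = {"S": 1, "M": 2}
    if isinstance(stories, str) and stories.strip().upper() in story_codes:
        floors = story_codes[stories.strip().upper()]
    else:
        floors = float(stories)
    # Assumption: parcel_area and sqft are in the same unit.
    return max(parcel_area - sqft * floors, 0.0)
```

With a 10,000-unit parcel and 2,000 units of livable area, a single-story record gives 8,000 and an "M" record gives 6,000; a 3,000-unit parcel with the same "M" record is clamped to zero.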
Additional parcel data made available to the RA but not used includes:
- ST42060 Secured Master - All detail information regarding parcel including owner, property address, legal description, legal classes, valuation and others. (TYPE, APN, STATUS, ...)
- ST42076 Premium Secured Master Data Set - This is the most detailed subset of data offered from the standard secured master. The fields included are parcel numbers, owner name/address, type of property, deed information, valuation, rental indicator, basic component data and sales information. This set also includes Basic improvement components of livable square footage, land size, pool size, and construction year. (APN, OWNER, OWNERADDR, ...)
Foreclosure Data
Foreclosure data was purchased in late May 2013 from Jim Patterson at The Information Market (http://www.theinformationmarket.com). Data was made available to the RA as separate Excel (.xlsx) files for the years 2002-2012. Those files were exported to CSV for import into R and analysis with the parcel data described above.
Each record in the database represents a separate foreclosure process, which leaves some ambiguity about what should be considered foreclosure for the purposes of this project.
Where a parcel has multiple records, in most cases there appear to be one or more cancelled sales (STATUS == "CT") followed by one or more later completed sales (STATUS == "TD"). It is presumed that a sale can be cancelled either when the resident catches up on their payments or when the auction does not result in an acceptable bid. There are also cases with multiple completed sales, where properties appear to move through a sequence of foreclosures.
Since the primary concern of this project is abandonment and/or cessation of lawn maintenance, the question arises about what should be inferred from this data about when post-eviction / post-abandonment vacancy starts and ends. The Notice of Trustee Sale (NOS, DATE field) is when the process officially starts, but the eviction presumably doesn't occur until after a successful sale (SALEDATE), which must be at least 90 days after the NOS. And if the sale is cancelled, it is not clear whether an eviction into vacancy ever takes place.
For this preliminary processing of the data, it is assumed that for any specific raster date (the date of the associated MODIS image), any parcel that has a foreclosure record (regardless of STATUS) where the raster date is between the NOS date (DATE) and the completion or cancellation of the sale (SALEDATE) is in "foreclosure."
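The working definition can be expressed as a simple date test. The sketch below is in Python for illustration (the project's scripts are in R); the function names are hypothetical, and both the inclusive endpoints and the counting of days from the NOS date are assumptions rather than details confirmed by the source:

```python
from datetime import date

def in_foreclosure(raster_date, nos_date, sale_date):
    """True if the raster date falls between DATE (NOS) and SALEDATE.

    Assumption: both endpoints are treated as inclusive.
    """
    return nos_date <= raster_date <= sale_date

def days_in_foreclosure(raster_date, nos_date, sale_date):
    """Days elapsed since the NOS, or None if not in foreclosure.

    Assumption: days are counted from the NOS date to the raster date.
    """
    if not in_foreclosure(raster_date, nos_date, sale_date):
        return None
    return (raster_date - nos_date).days
```

For a record with an NOS on 2009-05-01 and a sale on 2009-08-15, a raster dated 2009-06-10 counts as in foreclosure with 40 elapsed days, while a raster dated 2009-09-01 does not.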
The presumption would then be that there would be a 90-day window where lawn maintenance might stop, although this assumption might not be accurate if the sale is cancelled. This also might not capture situations where abandonment occurred and the parcel was vacant through a sequence of temporally-separated cancelled sales.
However, if the sale is completed, the data does not give a clear indication of when occupancy and / or lawn maintenance would resume. If the property is sold to an individual, they presumably would move in almost immediately, leaving little or no window of vacancy. But if the property was purchased by some investment entity, maintenance might resume or might be deferred until the property was ultimately sold to an entity (rental landlord or individual buyer) that would then facilitate a resumption of occupancy.
Statistics:
- Across all of the annual files there are 450,980 records, representing 337,059 distinct parcels
- Around 24% of the parcels in the foreclosure database have two or more records across the annual foreclosure CSV data files, with 7% having three or more and one particularly forlorn property (that appears to have been involved in criminal activity) having 31 separate records.
- 236,170 of those records have a status of "TD," indicating a completed sale (Trustee's Deed), representing 226,599 distinct parcels
- 217,338 parcels (96%) have only one TD record. This implies that they were either never sold again (abandoned?) or that they were immediately sold to a new resident. This would be consistent with the idea that lenders would not be inclined to evict an owner and subject the property to the vicissitudes of vacancy until there was a new owner to move in
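Tallies of this kind can be reproduced by counting records per parcel number. The following Python sketch is illustrative only (the project's analysis was done in R), and the function name and the toy APN values in the usage note are hypothetical:

```python
from collections import Counter

def record_count_shares(apns):
    """Summarize how many foreclosure records each parcel has.

    apns: a list with one APN per foreclosure record.
    """
    per_parcel = Counter(apns)
    n_parcels = len(per_parcel)
    counts = list(per_parcel.values())
    return {
        "records": len(apns),
        "parcels": n_parcels,
        "share_two_or_more": sum(1 for c in counts if c >= 2) / n_parcels,
        "share_three_or_more": sum(1 for c in counts if c >= 3) / n_parcels,
        "max_records": max(counts),
    }
```

Called on the toy list ["A", "A", "B", "C", "C", "C", "D"], this reports 7 records over 4 parcels, with half of the parcels having two or more records and a quarter having three or more.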
A more formal definition of foreclosure is the period between the sale that transfers ownership from the defaulting owner to an intermediary and the sale that transfers ownership to the next individual owner. However, for the single-record properties it does not appear possible to infer such a foreclosure period, which necessitates using the more liberal definition as any parcel with an active NOS.
Rasterization Methodology
The data was processed from the original shapefile and residential data using R. R was chosen because of its robust spatial and statistical capabilities, as well as the availability of R expertise on the project team. Processing was performed on the Illinois Campus Cluster.
The script is located in:
/projects/oa/housing_bubble/vector/phoenix/rasterize_foreclosures.r
While the initial plan was to perform a simple spatial join between the parcel data and a grid representing the MODIS pixels, this represents N^2 complexity that is computationally intractable with the large number of parcels and pixels. Additional complication is added because many parcels overlap pixel boundaries, necessitating numerous computationally-intensive intersection operations.
The final methodology reversed the initial idea, going from parcel to pixel rather than from pixel to parcel:
- A matrix of lists of parcel areas, foreclosed areas and foreclosed dates for each pixel is initialized
- For each parcel, the range of potentially overlapped pixels is calculated. This is a simple arithmetic calculation rather than a computationally-intensive full-dataset spatial join, which reduces the complexity from N^2 to N
- The parcel is intersected with a grid of potentially overlapping pixels, yielding a set of parcel fragments. The areas and foreclosure dates (where applicable) of these fragments are then added to the matrix of lists
- After all parcels have been processed, the matrix lists are processed in high-speed vector operations to create the raster data
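The arithmetic in the second step can be sketched as follows. This is a Python illustration under stated assumptions, not the code in the project's R script: the function name and argument layout are hypothetical, the parcel bounding box is assumed to be in the raster's coordinate system, and rows are assumed to be numbered from the top (ymax) of the raster:

```python
import math

def pixel_range(bbox, xmin, ymax, res, nrow, ncol):
    """Return (row0, row1, col0, col1), inclusive, of pixels a parcel may touch.

    bbox is (bxmin, bymin, bxmax, bymax). Returns None when the bounding
    box lies entirely outside the raster.
    """
    bxmin, bymin, bxmax, bymax = bbox
    # Columns count rightward from xmin; rows count downward from ymax.
    col0 = math.floor((bxmin - xmin) / res)
    col1 = math.floor((bxmax - xmin) / res)
    row0 = math.floor((ymax - bymax) / res)
    row1 = math.floor((ymax - bymin) / res)
    # Clamp to the raster so partially outside parcels are still handled.
    col0, col1 = max(col0, 0), min(col1, ncol - 1)
    row0, row1 = max(row0, 0), min(row1, nrow - 1)
    if col0 > col1 or row0 > row1:
        return None
    return row0, row1, col0, col1
```

Because the range depends only on the parcel's own bounding box, each parcel is intersected with a handful of candidate pixels rather than with the whole grid.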
Each raster set for a single date can be processed in around one hour when the process has unrestricted access to a core.
The current analysis is memory intensive, requiring around 8.5 GB for a year's worth of parcel shapefiles and data, which restricts processing to three simultaneous years per 32 GB node. Future iterations of the analysis script may reduce this by removing numerous unneeded columns from the residential data, or by loading the parcel data on demand, which would permit harnessing additional simultaneous cores but add significant additional I/O time.
Directory Structure
Data for this project is stored on the Taub Campus Cluster in:
/projects/oa/housing_bubble/vector/phoenix
The following subdirectories are used:
- boundaries: Vector data, primarily from the US Census Bureau, defining county and city boundaries
- cache: Data preprocessed from original sources with preprocess.r and saved as GeoTIFF and R Data Serialization (RDS) files that can be easily and quickly loaded for analysis
- census: Raw geospatial and text data for Decennial census and American Community Survey. Processed with preprocess.r and saved in the cache directory
- foreclosures: Foreclosure CSV data from the Information Market. Processed with preprocess.r and saved in the cache directory
- golf_courses: Golf course shapefile data
- graphics: Graphics files other than project data visualizations that can be used for presentations
- old: Safety copies of old versions of data
- parcel_data: ST42030 residential parcel data in DBF format from the Maricopa County Assessor's Office
- parcel_shapefiles: Parcel shapefile data from the Maricopa County Assessor's Office
- scratch: Scratch directory for intermediate and debugging files
- usps-vacancy: USPS vacancy data
- visualizations: Analysis in the form of maps, charts and animations
Three R scripts are present at the root:
- preprocess.r: Run to process the data from original data files into cached data that can be easily and quickly loaded by other scripts
- rasterize.r: Generates the rasterized parcel data saved in cache
- visualize.r: Analysis script with numerous routines for visualizing data analysis
Cached Data Files
The cache directory contains files that have been preprocessed for faster loading.
Many files are stored as R Data Serialization (RDS) files. These can be loaded with readRDS() and written with saveRDS(). These files can be loaded and saved significantly faster than many generalized data formats (especially shapefiles), although they can only be read by R and are not considered appropriate for long-term archival purposes or interchange with other software.
Rasterized Parcel Data
A set of ten GeoTIFF files is available for each analogous MODIS image date over the analysis period:
- yyyy-mm-dd-parcel-count.tif: The count of all parcels that are contained within or intersect the boundaries of each pixel
- yyyy-mm-dd-parcels-foreclosed.tif: The count of all parcels that are in foreclosure during that date that are contained within or intersect the boundaries of each pixel.
- yyyy-mm-dd-lawn-area.tif: The total estimated lawn area in square meters within a pixel on that date. Each pixel covers approximately 53,665 square meters. Estimation methodology is described in the Parcel Data section above
- yyyy-mm-dd-lawn-area-foreclosed.tif: The total estimated lawn area under foreclosure in square meters on that date
- yyyy-mm-dd-foreclosed-days-mean.tif: The arithmetic mean of the number of days in foreclosure for all parcels in foreclosure during that date that are contained within or intersect the boundaries of each pixel
- yyyy-mm-dd-foreclosed-days-stdev.tif: The standard deviation of the number of days in foreclosure for all parcels in foreclosure during that date that are contained within or intersect the boundaries of each pixel
- yyyy-mm-dd-foreclosed-days-median.tif: The median number of days in foreclosure for all parcels in foreclosure during that date that are contained within or intersect the boundaries of each pixel
- yyyy-mm-dd-foreclosed-days-weighted-mean.tif: The mean number of days in foreclosure weighted by parcel area for all parcels in foreclosure during that date that are contained within or intersect the boundaries of each pixel
- yyyy-mm-dd-foreclosed-days-weighted-stdev.tif: The standard deviation of the number of days in foreclosure weighted by parcel area for all parcels in foreclosure during that date that are contained within or intersect the boundaries of each pixel
- yyyy-mm-dd-foreclosed-days-weighted-median.tif: The median number of days in foreclosure weighted by parcel area for all parcels in foreclosure during that date that are contained within or intersect the boundaries of each pixel
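The area-weighted statistics can be illustrated with a short Python sketch. The source does not state which weighted variants the R script uses, so the definitions below (a population-style weighted standard deviation, and a weighted median taken as the smallest value at which cumulative weight reaches half the total) are assumptions, and the function names are hypothetical:

```python
import math

def weighted_mean(values, weights):
    """Mean of values (days in foreclosure) weighted by parcel area."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total

def weighted_stdev(values, weights):
    """Population-style weighted standard deviation (an assumed variant)."""
    mu = weighted_mean(values, weights)
    total = sum(weights)
    variance = sum(w * (v - mu) ** 2 for v, w in zip(values, weights)) / total
    return math.sqrt(variance)

def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half the total."""
    half = sum(weights) / 2.0
    cumulative = 0.0
    for v, w in sorted(zip(values, weights)):
        cumulative += w
        if cumulative >= half:
            return v
```

For three parcel fragments with 10, 20 and 30 days in foreclosure and relative areas of 1, 1 and 2, the weighted mean is 22.5 days and the weighted median under this definition is 20 days.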
The raster files have the following specifications:
- dimensions : 746, 1536, 1145856, 1 (nrow, ncol, ncell, nlayers)
- resolution : 231.6564, 231.6564 (x, y)
- extent : -10628625, -10272801, 3613839, 3786655 (xmin, xmax, ymin, ymax)
- coord. ref. : +proj=sinu +lon_0=0 +x_0=0 +y_0=0 +a=6371007.181 +b=6371007.181 +units=m +no_defs
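These specifications are mutually consistent, which can be checked with a few lines of arithmetic. The Python sketch below uses only the values listed above; the variable names are illustrative:

```python
# Values copied from the raster specifications listed above.
res = 231.6564
xmin, xmax, ymin, ymax = -10628625, -10272801, 3613839, 3786655
nrow, ncol = 746, 1536

# Extent divided by resolution should reproduce the raster dimensions.
ncol_from_extent = round((xmax - xmin) / res)
nrow_from_extent = round((ymax - ymin) / res)

# Cell count and per-pixel area (square meters) follow directly.
ncell = nrow * ncol
pixel_area_m2 = res * res
```

The extent yields 1536 columns and 746 rows, the cell count is 1,145,856, and the pixel area rounds to the 53,665 square meters quoted for the lawn area rasters.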
MODIS Data
A set of MODIS data GeoTIFFs for the analysis period are cached under the names yyyy-mm-dd-raw-modis-ndvi.tif
American Community Survey Data
Data for the 2008-2012 American Community Survey (ACS) is joined with a polygon shapefile of the 2013 Census Tracts for the county and serialized in: cache/2012-acs.rds
Data fields include:
- TRACTID: The unique GEOID used to reference the tract
- MEDHHINC: Median Household Income
- MEDIANAGE: Median Age
- NAME: Text name of the tract
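- ALAND: Land area (square meters, if unchanged from the Census TIGER/Line attribute)
- AWATER: Water area (square meters, if unchanged from the Census TIGER/Line attribute)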
- CENTERLAT: Latitude of centroid (WGS 84)
- CENTERLON: Longitude of centroid (WGS 84)
Parcel Centroids
Two sets of serialized centroid point shapefiles are saved for each analysis period year.
yyyy-parcel-centroids.rds are the centroids for the parcels.
yyyy-parcel-fragment-centroids.rds are centroids for all fragments of parcels that are formed when the parcels are split on pixel boundaries.
Foreclosure Data
Two serialized data frame files are used to cache foreclosure data.
cache/distressed.rds contains data for parcels that were included in a Notice of Sale (NOS), regardless of whether the sale was completed or cancelled.
cache/foreclosed.rds contains data for parcels where a sale was completed. This is a subset of distressed.rds.
Both files contain the same fields:
- PARCELID is the county parcel ID that can be used to join with parcel data
- DATESTART is the date (yyyy-mm-dd) of NOS
- DATEEND is the date the sale was completed or cancelled
Greater Phoenix OpenStreetMap
The three-color raster of the Phoenix urban area is serialized in cache/greater_phoenix_osm.rds
Template Raster
All rasterized and MODIS data is saved using the projection used by MODIS (EPSG 2868) with the same geographic extents and pixel dimensions. A blank template is used during preprocessing and saved as cache/template_raster.tif
USPS Vacancy Data
HUD Aggregated USPS Administrative Data On Address Vacancies as a data frame is serialized and saved in cache/tract-vacancy.rds
Data is aggregated by Census Tract and calendar quarter.
Data is included in five different sets of columns:
- TRACTID is the census tract GEOID
- AMSyyyymm is the total number of addresses on the given date (AMS = address matching system)
- VACyyyymm is the total number of addresses listed as vacant on the given date
- RESyyyymm is the total number of residential addresses on the given date. Residential addresses are only tabulated separately from March 2008 onward
- VREyyyymm is the total number of residential addresses listed as vacant on the given date
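Because the period is encoded in the column names, a vacancy rate series for a tract can be derived by pairing the AMS and VAC columns for each period. The Python sketch below is illustrative (the cached file itself is an R data frame); the function name is hypothetical, and the row is represented as a plain dictionary of column name to value:

```python
import re

def vacancy_rates(row):
    """Map 'yyyymm' to VAC / AMS for one tract row.

    Periods with a missing VAC column or a zero address total are skipped.
    """
    rates = {}
    for column, total in row.items():
        match = re.fullmatch(r"AMS(\d{6})", column)
        if not match:
            continue
        period = match.group(1)
        vacant = row.get("VAC" + period)
        if vacant is not None and total:
            rates[period] = vacant / total
    return rates
```

For a row with AMS200803 = 2000 and VAC200803 = 150, the rate for 200803 is 0.075. The same pairing applies to the RES and VRE columns for residential vacancy.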