Visualizing Foreclosure: About This Project

Data Sources

Parcel Data

Parcel data was obtained from the Maricopa County Assessor's Office on an uncertain date and possibly through an intermediary. These dates and chain of custody should be clarified as the project proceeds. The data is available to the general public for a fee through the Assessor's website:

http://mcassessor.maricopa.gov/reports-data-sales/data-sales/data-policies-pricing/

Two data series were used: parcel shapefiles and the ST42030 residential master data file.

The parcel shapefile data is in:

/projects/oa/housing_bubble/vector/phoenix/parcel_shapefiles

The residence data is in:

/projects/oa/housing_bubble/vector/phoenix/parcel_data

Parcel data is organized into five separate books numbered 100 to 500 that represent separate areas of the county. The 2010 shapefiles contained 1,545,616 records and the book division presumably exists as a legacy hierarchical scheme to make the data more manageable.

The parcel shapefiles contain minimal data that varies over the different years. The only field from the shapefiles that was used was the parcel number (APN), which was then used as a primary key for a join with the residential data.

Parcel shapefiles were only provided to the RA for the years 2002-2006 and 2013. For years 2007-2012, the 2013 shapefile was used. The use of yearly shapefiles for 2002-2006 bounds the analysis for those years to the parcels that were extant in those years.

Data on residences was obtained from the ST42030 Residential Master Data File. This file has parcel number (APN) as a primary key and gives residential component information used to calculate property values (e.g., livable square footage and construction year). The only fields used from this file are livable area (SQFT) and number of floors (STORIES).

Residential data was only provided to the RA for the years 2007-2013. Data was joined with shapefiles using APN as a key. Shapefiles for 2002-2006 were joined with the 2007 data, which should limit the parcels to those in existence in those years, but which may give erroneous data for parcels that were redeveloped between the shapefile year and 2007.

The estimated lawn area for each parcel was calculated by subtracting the product of SQFT and STORIES from the parcel area, which was calculated from the shapefile data reprojected to the EPSG 2868 projection used with MODIS satellite data. From 2010 onward, the stories value was only listed as single or multiple (S or M); single is assumed to mean one floor and multiple is assumed to mean two. Parcels where this estimated lawn area was below zero were given a lawn area of zero.
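The lawn-area estimate described above can be sketched in a few lines. This is a Python illustration only, not the project's R code; the function names are hypothetical, and it assumes the parcel area and SQFT have already been converted to the same units.

```python
def stories_to_floors(stories):
    """Map a STORIES value to a floor count.

    Numeric values are used as-is. From 2010 onward the field holds
    "S" (single, assumed one floor) or "M" (multiple, assumed two floors).
    """
    if isinstance(stories, str):
        code = stories.strip().upper()
        if code == "S":
            return 1.0
        if code == "M":
            return 2.0
        return float(code)  # numeric value stored as text
    return float(stories)


def estimated_lawn_area(parcel_area, sqft, stories):
    """Parcel area minus SQFT * STORIES, with negative results set to zero.

    Assumption: parcel_area and sqft are expressed in the same units.
    """
    building_area = sqft * stories_to_floors(stories)
    return max(0.0, parcel_area - building_area)
```

For example, estimated_lawn_area(8000, 1500, "M") treats the residence as two floors and subtracts 3000 from the parcel area, while a parcel smaller than the computed building area is assigned a lawn area of zero.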

Additional parcel data made available to the RA but not used includes:

Foreclosure Data

Foreclosure data was purchased in late May 2013 from Jim Patterson at The Information Market (http://www.theinformationmarket.com). Data was made available to the RA as separate Excel (.xlsx) files for the years 2002-2012. Those files were exported to CSV for import into R and analysis with the parcel data described above.

Each record in the database represents a separate foreclosure process, which leaves some ambiguity about what should be considered foreclosure for the purposes of this project.

In most multiple-record cases there appear to be one or more cancelled sales (STATUS == "CT") followed by one or more later completed sales (STATUS == "TD"). It is presumed that a sale can be cancelled either because the resident catches up on their payments or because the auction does not result in an acceptable bid. There are also cases with multiple completed sales, where properties appear to move through a sequence of foreclosures.

Since the primary concern of this project is abandonment and/or cessation of lawn maintenance, the question arises about what should be inferred from this data about when post-eviction / post-abandonment vacancy starts and ends. The Notice of Trustee Sale (NOS, DATE field) is when the process officially starts, but the eviction presumably doesn't occur until after a successful sale (SALEDATE), which must be at least 90 days after the NOS. And if the sale is cancelled, it is not clear whether an eviction into vacancy ever takes place.

For this preliminary processing of the data, it is assumed that for any specific raster date (the date of the associated MODIS image), any parcel that has a foreclosure record (regardless of STATUS) where the raster date is between the NOS date (DATE) and the completion or cancellation of the sale (SALEDATE) is in "foreclosure."
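A minimal Python sketch of this test follows. It is an illustration rather than the project's R code: the function name and the representation of records as (DATE, SALEDATE) pairs are hypothetical, and treating both endpoints as inclusive is an assumption the text above does not settle.

```python
from datetime import date


def in_foreclosure(raster_date, records):
    """Return True if any record's window contains raster_date.

    records is an iterable of (nos_date, sale_date) pairs taken from the
    DATE and SALEDATE fields. STATUS is deliberately ignored.
    Assumption: both ends of the window are inclusive.
    """
    return any(nos <= raster_date <= sale for nos, sale in records)


# Two hypothetical records for one parcel.
parcel_records = [
    (date(2009, 1, 15), date(2009, 4, 20)),
    (date(2010, 6, 1), date(2010, 9, 5)),
]
flag = in_foreclosure(date(2009, 3, 1), parcel_records)
```

A parcel with no records is never flagged, and a raster date that falls between two separate windows is not flagged either, which is the gap in coverage discussed in the following paragraphs.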

The presumption would then be that there would be a 90-day window where lawn maintenance might stop, although this assumption might not be accurate if the sale is cancelled. This also might not capture situations where abandonment occurred and the parcel was vacant through a sequence of temporally-separated cancelled sales.

However, if the sale is completed, the data does not give a clear indication of when occupancy and / or lawn maintenance would resume. If the property is sold to an individual, they presumably would move in almost immediately, leaving little or no window of vacancy. But if the property was purchased by some investment entity, maintenance might resume or might be deferred until the property was ultimately sold to an entity (rental landlord or individual buyer) that would then facilitate a resumption of occupancy.

Statistics:

A more formal definition of foreclosure is the period between the sale that transfers ownership from the defaulting owner to an intermediary and the sale that transfers ownership to the next individual owner. However, for the single-record properties it does not appear possible to infer such a foreclosure period, which necessitates using the more liberal definition as any parcel with an active NOS.

Rasterization Methodology

The data was processed from the original shapefile and residential data using R. R was chosen because of its robust spatial and statistical capabilities, as well as the availability of R expertise on the project team. Processing was performed on the Illinois Campus Cluster.

The script is located in:

/projects/oa/housing_bubble/vector/phoenix/rasterize_foreclosures.r

While the initial plan was to perform a simple spatial join between the parcel data and a grid representing the MODIS pixels, this represents N^2 complexity that is computationally intractable with the large number of parcels and pixels. Additional complication is added because many parcels overlap pixel boundaries, necessitating numerous computationally-intensive intersection operations.

The final methodology reversed the initial idea, going from parcel to pixel rather than from pixel to parcel:

  1. A matrix of lists of parcel areas, foreclosed areas and foreclosure dates for each pixel is initialized
  2. For each parcel, the range of potentially overlapped pixels is calculated. This is a simple arithmetic calculation rather than a computationally-intensive full-dataset spatial join, which reduces the complexity from N^2 to N
  3. The parcel is intersected with a grid of potentially overlapping pixels, yielding a set of parcel fragments. The areas and foreclosure dates (where applicable) of these fragments are then added to the matrix of lists
  4. After all parcels have been processed, the matrix lists are processed in high-speed vector operations to create the raster data
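The arithmetic in step 2 can be sketched as follows. This is a Python illustration under stated assumptions, not the project's R script: the function name and parameters are hypothetical, the raster is assumed to be north-up with square pixels and an upper-left origin, and the parcel is represented only by its bounding box.

```python
import math


def pixel_range(bbox, origin_x, origin_y, pixel_size, ncols, nrows):
    """Return (col_min, col_max, row_min, row_max) of pixels a bounding box may touch.

    bbox is (xmin, ymin, xmax, ymax). origin_x and origin_y give the
    raster's upper-left corner, and rows count downward from the top edge.
    """
    xmin, ymin, xmax, ymax = bbox
    col_min = math.floor((xmin - origin_x) / pixel_size)
    col_max = math.floor((xmax - origin_x) / pixel_size)
    row_min = math.floor((origin_y - ymax) / pixel_size)
    row_max = math.floor((origin_y - ymin) / pixel_size)
    # Clamp to the raster so parcels on the edge do not index outside it.
    col_min, col_max = max(col_min, 0), min(col_max, ncols - 1)
    row_min, row_max = max(row_min, 0), min(row_max, nrows - 1)
    return col_min, col_max, row_min, row_max
```

Because the range comes from four divisions per parcel, only the handful of pixels in that range need to be intersected with the parcel geometry in step 3. The bounding box may include pixels the parcel does not actually touch; those simply produce empty intersections.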

Each raster set for a single date can be processed in around one hour when the process has unrestricted access to a core.

The current analysis is memory intensive, requiring around 8.5 GB for a year's worth of parcel shapefiles and data, which restricts processing to three simultaneous years per 32 GB node. Future iterations of the analysis script may reduce this by removing numerous unneeded columns from the residential data, or by loading the parcel data on demand, which would permit harnessing additional simultaneous cores but add significant additional I/O time.
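The three-years-per-node limit follows directly from the figures above. A small Python sketch of the arithmetic (the function name is hypothetical, and the memory figures are the approximate ones quoted in the text):

```python
def concurrent_jobs(node_memory_gb, job_memory_gb):
    """Number of jobs that fit in a node's memory at the same time."""
    return int(node_memory_gb // job_memory_gb)


jobs = concurrent_jobs(32, 8.5)   # three years at a time
headroom_gb = 32 - jobs * 8.5     # memory left over for everything else
```

A fourth simultaneous year would need about 34 GB, which exceeds the node, so any reduction in per-year memory below 8 GB would allow one more concurrent job.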

Directory Structure

Data for this project is stored on the Taub Campus Cluster in:

/projects/oa/housing_bubble/vector/phoenix

The following subdirectories are used:

Three R scripts are present at the root:

Cached Data Files

The cache directory contains files that have been preprocessed for faster loading.

Many files are stored as R Data Serialization (RDS) files. These can be loaded with readRDS() and written with saveRDS(). These files can be loaded and saved significantly faster than many generalized data formats (especially shapefiles), although they can only be read by R and are not considered appropriate for long-term archival purposes or interchange with other software.

Rasterized Parcel Data

A set of ten GeoTIFF files is available for each analogous MODIS image date over the analysis period:

  1. yyyy-mm-dd-parcel-count.tif: The count of all parcels that are contained within or intersect the boundaries of each pixel
  2. yyyy-mm-dd-parcels-foreclosed.tif: The count of all parcels that are in foreclosure during that date that are contained within or intersect the boundaries of each pixel.
  3. yyyy-mm-dd-lawn-area.tif: The total estimated lawn area in square meters within a pixel on that date. Each pixel covers approximately 53,665 square meters. The estimation methodology is described in the Parcel Data section above
  4. yyyy-mm-dd-lawn-area-foreclosed.tif: The total estimated lawn area under foreclosure in square meters on that date
  5. yyyy-mm-dd-foreclosed-days-mean.tif: The arithmetic mean of the number of days in foreclosure for all parcels in foreclosure during that date that are contained within or intersect the boundaries of each pixel
  6. yyyy-mm-dd-foreclosed-days-stdev.tif: The standard deviation of the number of days in foreclosure for all parcels in foreclosure during that date that are contained within or intersect the boundaries of each pixel
  7. yyyy-mm-dd-foreclosed-days-median.tif: The median number of days in foreclosure for all parcels in foreclosure during that date that are contained within or intersect the boundaries of each pixel
  8. yyyy-mm-dd-foreclosed-days-weighted-mean.tif: The mean number of days in foreclosure weighted by parcel area for all parcels in foreclosure during that date that are contained within or intersect the boundaries of each pixel
  9. yyyy-mm-dd-foreclosed-days-weighted-stdev.tif: The standard deviation of the number of days in foreclosure weighted by parcel area for all parcels in foreclosure during that date that are contained within or intersect the boundaries of each pixel
  10. yyyy-mm-dd-foreclosed-days-weighted-median.tif: The median number of days in foreclosure weighted by parcel area for all parcels in foreclosure during that date that are contained within or intersect the boundaries of each pixel
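The area-weighted statistics in items 8 to 10 can be illustrated with a short Python sketch. This is not the project's R code and the exact formulas used there are not documented here; the sketch assumes the population form of the weighted standard deviation and defines the weighted median as the smallest value at which the cumulative weight reaches half of the total. All function names are hypothetical.

```python
import math


def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of weights."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


def weighted_stdev(values, weights):
    """Population-form weighted standard deviation (an assumption)."""
    mean = weighted_mean(values, weights)
    total = sum(weights)
    variance = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return math.sqrt(variance)


def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half the total (an assumption)."""
    half = sum(weights) / 2.0
    cumulative = 0.0
    for v, w in sorted(zip(values, weights)):
        cumulative += w
        if cumulative >= half:
            return v


# Hypothetical pixel with three foreclosed parcels.
days = [30, 60, 90]       # days in foreclosure
areas = [100, 100, 200]   # parcel areas used as weights
mean_days = weighted_mean(days, areas)
```

In this example the largest parcel has been in foreclosure longest, so the weighted mean (67.5 days) is higher than the unweighted mean (60 days).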

The raster files have the following specifications:

MODIS Data

A set of MODIS data GeoTIFFs for the analysis period are cached under the names yyyy-mm-dd-raw-modis-ndvi.tif

American Community Survey Data

Data for the 2008-2012 American Community Survey (ACS) is joined with a polygon shapefile of the 2013 Census Tracts for the county and serialized in: cache/2012-acs.rds

Data fields include:

Parcel Centroids

Two sets of serialized centroid point shapefiles are saved for each analysis period year.

yyyy-parcel-centroids.rds are the centroids for the parcels.

yyyy-parcel-fragment-centroids.rds are centroids for all fragments of parcels that are formed when the parcels are split on pixel boundaries.

Foreclosure Data

Two serialized data frame files are used to cache foreclosure data.

cache/distressed.rds contains data for parcels that were included in a Notice of Sale (NOS), regardless of whether the sale was completed or cancelled.

cache/foreclosed.rds contains data for parcels where a sale was completed. This is a subset of distressed.rds.

Both files contain the same fields:

Greater Phoenix OpenStreetMap

The three-color raster of the Phoenix urban area is serialized in cache/greater_phoenix_osm.rds

Template Raster

All rasterized and MODIS data is saved using the projection used by MODIS (EPSG 2868) with the same geographic extents and pixel dimensions. A blank template is used during preprocessing and saved as cache/template_raster.tif

USPS Vacancy Data

HUD Aggregated USPS Administrative Data On Address Vacancies as a data frame is serialized and saved in cache/tract-vacancy.rds

Data is aggregated by Census Tract and calendar quarter.

Data is included in five different sets of columns: