Today, most poor renting families spend at least half of their income on housing costs, with one in four of those families spending over 70 percent of their income just on rent and utilities. Only one in four families who qualifies for affordable housing programs gets any kind of help. Under those conditions, it has become harder for low-income families to keep up with rent and utility costs, and a growing number are living one misstep or emergency away from eviction.
Performance Metric
We’re predicting a numeric quantity, so this is a regression problem. To measure regression performance, we’ll use a metric known as R-squared, also called the coefficient of determination. It is a quantity between -∞ and 1, where a higher value is better.
$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$
where $\hat{y}$ is the predicted number of evictions, $y$ is the actual number of evictions, and $\bar{y}$ is the average of the actual number of evictions. A score of 1 means predictions exactly match the test values.
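As a minimal sketch of how this metric behaves, the snippet below computes R-squared from its standard definition (one minus the residual sum of squares over the total sum of squares). The function name `r_squared` and the eviction counts are made up for illustration and are not part of the competition data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # ss_tot is 0 when every actual value is identical; the score is undefined then
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when all actual values are equal")
    return 1 - ss_res / ss_tot


actual = [10, 20, 30, 40]        # made-up eviction counts
perfect = [10, 20, 30, 40]       # predictions that match exactly
mean_only = [25, 25, 25, 25]     # always predicts the average of the actuals
reversed_order = [40, 30, 20, 10]

print(r_squared(actual, perfect))         # 1.0
print(r_squared(actual, mean_only))       # 0.0
print(r_squared(actual, reversed_order))  # -3.0
```

A model that always predicts the average scores 0, and a model that does worse than the average scores below 0, which is why the lower bound is -∞.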
Features
There are 47 variables in this dataset. Each row in the dataset represents a United States county, and the dataset we are working with covers two particular years, denoted a and b. We provide a unique identifier for an individual county, but note that the counties in the test set are distinct from counties in the train set. In other words, no county that appears in the train set will appear in the test set. Thus, county-specific features (i.e. county dummy variables) will not be an option. However, the counties in the test set still share similar patterns with those in the train set, so other feature engineering will work the same as usual.
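Because test counties never appear in the train set, a local validation split arguably should keep all rows of a county on the same side as well. The sketch below is one way to do that with the standard library; `split_by_county`, the `test_fraction` default, and the county codes are hypothetical and only illustrate the idea.

```python
import random


def split_by_county(rows, test_fraction=0.25, seed=0):
    """Split rows so that no county_code appears in both parts."""
    counties = sorted({row["county_code"] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(counties)
    # Hold out whole counties, at least one
    n_held_out = max(1, round(len(counties) * test_fraction))
    held_out = set(counties[:n_held_out])
    train = [r for r in rows if r["county_code"] not in held_out]
    valid = [r for r in rows if r["county_code"] in held_out]
    return train, valid


# Four hypothetical counties, each observed in years a and b
rows = [
    {"county_code": code, "year": year}
    for code in ["c1", "c2", "c3", "c4"]
    for year in ["a", "b"]
]
train, valid = split_by_county(rows)
print(len(train), len(valid))  # 6 2
```

Splitting rows at random instead would let the same county appear in both parts (once per year), which tends to make validation scores look better than the scores on unseen counties.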
The variables are as follows:
ID
county_code – Unique identifier for each county
year – Year, denoted as a or b
state – Unique identifier for each state
population – Total population
HOUSING
renter_occupied_households – Count of renter-occupied households
pct_renter_occupied – Percent of occupied housing units that are renter-occupied
median_gross_rent – Median cost of rent
median_household_income – Median household income
median_property_value – Median property value
rent_burden – Median gross rent as a percentage of household income
ETHNICITY
pct_white – Percent of population that is White alone and not Hispanic or Latino
pct_af_am – Percent of population that is Black or African American alone and not Hispanic or Latino
pct_hispanic – Percent of population that is of Hispanic or Latino origin
pct_am_ind – Percent of population that is American Indian and Alaska Native alone and not Hispanic or Latino
pct_asian – Percent of population that is Asian alone and not Hispanic or Latino
pct_nh_pi – Percent of population that is Native Hawaiian and Other Pacific Islander alone and not Hispanic or Latino
pct_multiple – Percent of population that is two or more races and not Hispanic or Latino
pct_other – Percent of population that is other race alone and not Hispanic or Latino
ECONOMIC
poverty_rate – Percent of the population with income in the past 12 months below the poverty level
rucc – Rural-Urban Continuum Codes “form a classification scheme that distinguishes metropolitan counties by the population size of their metro area, and nonmetropolitan counties by degree of urbanization and adjacency to a metro area. The official Office of Management and Budget (OMB) metro and nonmetro categories have been subdivided into three metro and six nonmetro categories. Each county in the U.S. is assigned one of the 9 codes.” (USDA Economic Research Service)
urban_influence – Urban Influence Codes “form a classification scheme that distinguishes metropolitan counties by population size of their metro area, and nonmetropolitan counties by size of the largest city or town and proximity to metro and micropolitan areas.” (USDA Economic Research Service)
economic_typology – County Typology Codes “classify all U.S. counties according to six mutually exclusive categories of economic dependence and six overlapping categories of policy-relevant themes. The economic dependence types include farming, mining, manufacturing, Federal/State government, recreation, and nonspecialized counties. The policy-relevant types include low education, low employment, persistent poverty, persistent child poverty, population loss, and retirement destination.” (USDA Economic Research Service)
pct_civilian_labor – Civilian labor force, annual average, as percent of population
pct_unemployment – Unemployment, annual average, as percent of population
HEALTH
pct_uninsured_adults – Percent of adults without health insurance
pct_uninsured_children – Percent of children without health insurance
pct_adult_obesity – Percent of adults who meet clinical definition of obese
pct_adult_smoking – Percent of adults who smoke
pct_diabetes – Percent of population with diabetes
pct_low_birthweight – Percent of babies born with low birth weight
pct_excessive_drinking – Percent of adult population that engages in excessive consumption of alcohol
pct_physical_inactivity – Percent of adult population that is physically inactive
air_pollution_particulate_matter_value – Fine particulate matter in µg/m³
homicides_per_100k – Deaths by homicide per 100,000 population
motor_vehicle_crash_deaths_per_100k – Deaths by motor vehicle crash per 100,000 population
heart_disease_mortality_per_100k – Deaths from heart disease per 100,000 population
pop_per_dentist – Population per dentist
pop_per_primary_care_physician – Population per Primary Care Physician
DEMOGRAPHIC
pct_female – Percent of population that is female
pct_below_18_years_of_age – Percent of population that is below 18 years of age
pct_aged_65_years_and_older – Percent of population that is aged 65 years or older
pct_adults_less_than_a_high_school_diploma – Percent of adult population that does not have a high school diploma
pct_adults_with_high_school_diploma – Percent of adult population which has a high school diploma as highest level of education achieved
pct_adults_with_some_college – Percent of adult population which has some college as highest level of education achieved
pct_adults_bachelors_or_higher – Percent of adult population which has a bachelor’s degree or higher as highest level of education achieved
birth_rate_per_1k – Births per 1,000 of population
death_rate_per_1k – Deaths per 1,000 of population
Example Row
Here’s an example of one of the rows in the dataset so that you can see the kinds of values you might expect in the dataset. Most are numeric, a few are categorical, and there can be missing values.
variable | value |
---|---|
county_code | a4e2211 |
year | b |
state | d725a95 |
population | 45009 |
renter_occupied_households | 6944 |
pct_renter_occupied | 37.218 |
median_gross_rent | 643 |
median_household_income | 33315 |
median_property_value | 98494 |
rent_burden | 33.389 |
pct_white | 0.41207 |
pct_af_am | 0.493459 |
pct_hispanic | 0.0701932 |
pct_am_ind | 0.00258823 |
pct_asian | 0.00457455 |
pct_nh_pi | 0.000200638 |
pct_multiple | 0.0159206 |
pct_other | 0.000993158 |
poverty_rate | 18.451 |
rucc | Nonmetro – Urban population of 20,000 or more, adjacent to a metro area |
urban_influence | Micropolitan adjacent to a large metro area |
economic_typology | Nonspecialized |
pct_civilian_labor | 0.407 |
pct_unemployment | 0.093 |
pct_uninsured_adults | 0.239 |
pct_uninsured_children | 0.068 |
pct_adult_obesity | 0.332 |
pct_adult_smoking | 0.277 |
pct_diabetes | 0.145 |
pct_low_birthweight | 0.12 |
pct_excessive_drinking | 0.077 |
pct_physical_inactivity | 0.313 |
air_pollution_particulate_matter_value | 12.1653 |
homicides_per_100k | 14.01 |
motor_vehicle_crash_deaths_per_100k | 18.21 |
heart_disease_mortality_per_100k | 318 |
pop_per_dentist | 2420 |
pop_per_primary_care_physician | 1960 |
pct_female | 0.532 |
pct_below_18_years_of_age | 0.252 |
pct_aged_65_years_and_older | 0.153 |
pct_adults_less_than_a_high_school_diploma | 0.233 |
pct_adults_with_high_school_diploma | 0.375 |
pct_adults_with_some_college | 0.278 |
pct_adults_bachelors_or_higher | 0.114 |
birth_rate_per_1k | 12.9151 |
death_rate_per_1k | 11.2051 |
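The categorical columns (such as `rucc`, `urban_influence`, and `economic_typology`) usually need to be turned into numbers before most regression models can use them. The sketch below shows plain one-hot encoding with the standard library; the helper `one_hot`, the `column=category` naming, and the three sample rows are assumptions for illustration. A missing category is encoded as all zeros here, which is only one of several reasonable choices.

```python
def one_hot(rows, column):
    """Replace a categorical column with 0/1 indicator columns."""
    categories = sorted({r[column] for r in rows if r[column] is not None})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != column}
        for cat in categories:
            # A missing value (None) matches no category, so it becomes all zeros
            new_row[f"{column}={cat}"] = 1 if r[column] == cat else 0
        encoded.append(new_row)
    return encoded


# Illustrative rows; the county codes and category labels are made up
sample = [
    {"county_code": "c1", "economic_typology": "Nonspecialized"},
    {"county_code": "c2", "economic_typology": "Farming"},
    {"county_code": "c3", "economic_typology": None},
]
encoded = one_hot(sample, "economic_typology")
for row in encoded:
    print(row)
```

Note that `county_code` itself should not be encoded this way, since test counties never appear in the train set.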
Process:
Based on the reference study titled “Losing home – the human cost of eviction in Seattle”, I extracted many new features by dividing the population by ethnicity.
I applied the same reasoning to renter households, distributing them across the variables related to ethnicity, health and economics.
In total, the feature count increased from 47 to 209.
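One plausible reading of this step is that each count column was multiplied by each share column to estimate group sizes. The sketch below follows that reading. It is an assumption, not the exact set of 209 features. The column lists, the `_x_` naming scheme, and the helper `add_distributed_features` are hypothetical; the values come from the example row above.

```python
# Count columns to distribute, and share columns stored as fractions (0 to 1).
# Columns on a 0 to 100 scale, such as pct_renter_occupied, would need
# dividing by 100 first and are left out of this sketch.
BASE_COUNTS = ["population", "renter_occupied_households"]
SHARE_COLUMNS = [
    "pct_white",
    "pct_af_am",
    "pct_hispanic",
    "pct_uninsured_adults",
    "pct_unemployment",
]


def add_distributed_features(row):
    """Return a copy of row with count-times-share features added."""
    out = dict(row)
    for base in BASE_COUNTS:
        for share in SHARE_COLUMNS:
            name = f"{base}_x_{share}"  # hypothetical naming scheme
            if row.get(base) is None or row.get(share) is None:
                out[name] = None  # keep missing values missing
            else:
                out[name] = row[base] * row[share]
    return out


example = {
    "population": 45009,
    "renter_occupied_households": 6944,
    "pct_white": 0.41207,
    "pct_af_am": 0.493459,
    "pct_hispanic": 0.0701932,
    "pct_uninsured_adults": 0.239,
    "pct_unemployment": 0.093,
}
features = add_distributed_features(example)
print(len(example), "->", len(features))  # 7 -> 17
```

Two count columns crossed with five share columns add ten features, so the feature count grows quickly once more share columns are included.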
Before feature engineering, I cleaned the data: missing values in the variables whose names contain “median” were filled with the column median, and the remaining null values were replaced using PCA.
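The median-filling part of this cleaning step can be sketched as follows. This is a minimal illustration that assumes missing values are represented as `None`; the helper `impute_median` and the sample rows are made up, and the PCA-based replacement of the remaining nulls is not shown.

```python
from statistics import median


def impute_median(rows, columns):
    """Fill None in the given columns with the median of the observed values."""
    filled = [dict(r) for r in rows]  # copy so the input rows are not modified
    for col in columns:
        observed = [r[col] for r in rows if r.get(col) is not None]
        if not observed:
            continue  # nothing to compute a median from
        fill_value = median(observed)
        for r in filled:
            if r.get(col) is None:
                r[col] = fill_value
    return filled


# Made-up rows with one missing rent value
sample = [
    {"median_gross_rent": 643},
    {"median_gross_rent": None},
    {"median_gross_rent": 700},
    {"median_gross_rent": 810},
]
median_columns = [c for c in sample[0] if c.startswith("median")]
cleaned = impute_median(sample, median_columns)
print(cleaned[1]["median_gross_rent"])  # 700
```

In practice the medians would normally be computed on the train set and reused for the test set, so that no information from test rows leaks into the cleaning step.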
Result:
I started with linear regression, which had a coefficient of determination (CoD) of 0.76, and worked my way toward getting the most out of neural networks.
In the results shown below, the neural network with a single parameter yielded the best result.