Livelihoods have been a major focus area for most anti-poverty programs. With a rural population far exceeding the urban one, lack of access to work and credit has been an impediment to the socio-economic development of the local population.
Using data on the financial, economic and educational circumstances of individuals, I attempt to predict the probability that a given person would be classified as poor.
Executive Summary
The dataset has 54 variables in total, 53 of which are used to predict the probability of poverty. It contains 12,600 records, and most of the variables are True/False flags.
| variable | Count | Mean | STD | Min | Max |
| --- | --- | --- | --- | --- | --- |
| poverty_probability | 12364 | 0.61 | 0.29 | 0 | 1 |
The significant features found in the initial data exploration and correlation analysis were:
education_level: the higher the education level, the less likely a person is to be poor.
is_urban: a person living in an urban area has a lower probability of being poor.
Variables related to awareness of technology and the banking system were all negatively correlated with poverty probability.
Initial Data Exploration
The count column reveals that there are 12,600 records in total, but the following variables contain null values:
- education_level : removed rows with NaN
- bank_interest_rate: removed column
- mm_interest_rate: removed column
- mfi_interest_rate: removed column
- other_fsp_interest_rate: removed column
The columns were removed because they demonstrated poor correlation.
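The two clean-up steps listed above (dropping rows with a missing education_level, then dropping the four interest-rate columns) can be sketched in plain Python. This is only an illustration under assumptions: the `clean` helper and the three toy records are hypothetical, `None` stands in for NaN, and only the column names come from the dataset.

```python
# Columns dropped entirely (names taken from the list above).
DROPPED_COLUMNS = {
    "bank_interest_rate",
    "mm_interest_rate",
    "mfi_interest_rate",
    "other_fsp_interest_rate",
}


def clean(records):
    """Drop rows with a missing education_level, then drop the interest-rate columns."""
    cleaned = []
    for record in records:
        if record["education_level"] is None:  # None stands in for NaN
            continue
        cleaned.append({k: v for k, v in record.items() if k not in DROPPED_COLUMNS})
    return cleaned


# Hypothetical toy records, for illustration only.
records = [
    {"education_level": 2, "bank_interest_rate": None, "mm_interest_rate": None,
     "mfi_interest_rate": None, "other_fsp_interest_rate": None,
     "poverty_probability": 0.35},
    {"education_level": None, "bank_interest_rate": None, "mm_interest_rate": 5.0,
     "mfi_interest_rate": None, "other_fsp_interest_rate": None,
     "poverty_probability": 0.80},
    {"education_level": 0, "bank_interest_rate": 12.0, "mm_interest_rate": None,
     "mfi_interest_rate": None, "other_fsp_interest_rate": None,
     "poverty_probability": 0.95},
]

cleaned = clean(records)
print(len(cleaned), sorted(cleaned[0]))
```

With pandas, the same two steps would typically be `df.dropna(subset=["education_level"])` followed by `df.drop(columns=[...])` with the four interest-rate column names.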
Table 1.1
| variable | count | mean | std | min | max |
| --- | --- | --- | --- | --- | --- |
| row_id | 12600 | 6299.5 | 3637.451031 | 0 | 12599 |
| age | 12600 | 36.28071429 | 15.1459448 | 15 | 115 |
| education_level | 12364 | 1.316240699 | 0.905441988 | 0 | 3 |
| share_hh_income_provided | 12295 | 2.888165921 | 1.56428404 | 1 | 5 |
| num_times_borrowed_last_year | 12600 | 0.657698413 | 0.924598074 | 0 | 3 |
| borrowing_recency | 12600 | 0.866428571 | 0.960866117 | 0 | 2 |
| bank_interest_rate | 289 | 9.843079585 | 15.03308922 | 0 | 100 |
| mm_interest_rate | 151 | 9.02102649 | 13.62016125 | 0 | 100 |
| mfi_interest_rate | 201 | 10.90920398 | 10.3532979 | 0 | 100 |
| other_fsp_interest_rate | 239 | 8.216736402 | 10.64953801 | 0 | 100 |
| num_shocks_last_year | 12600 | 1.10015873 | 1.190071892 | 0 | 5 |
| avg_shock_strength_last_year | 12600 | 2.11276455 | 2.019238924 | 0 | 5 |
| phone_technology | 12600 | 1.208730159 | 1.09306016 | 0 | 3 |
| phone_ownership | 12600 | 1.468253968 | 0.776638297 | 0 | 2 |
| num_formal_institutions_last_year | 12600 | 0.714126984 | 0.805877951 | 0 | 6 |
| num_informal_institutions_last_year | 12600 | 0.188968254 | 0.473696287 | 0 | 4 |
| num_financial_activities_last_year | 12600 | 1.55968254 | 2.043831136 | 0 | 10 |
| poverty_probability | 12600 | 0.611271667 | 0.291475931 | 0 | 1 |
Pearson Correlation
At first glance, Table 1.2 does not reveal any strong positive correlation with poverty_probability (the largest is about 0.14, for num_shocks_last_year), while the following variables show weak negative correlations:
- education_level
- is_urban
- phone_technology
- can_use_internet
- can_text
What this does not reveal is which categorical value within each correlated variable correlates more strongly with poverty_probability. To find out, we have to expand the table using one-hot encoding.
Table 1.2
| variable | poverty_probability |
| --- | --- |
| num_shocks_last_year | 0.135471975 |
| avg_shock_strength_last_year | 0.129479306 |
| income_ag_livestock_last_year | 0.103012885 |
| married | 0.098293529 |
| female | 0.057990184 |
| borrowed_for_daily_expenses_last_year | 0.045577829 |
| borrowing_recency | 0.04469322 |
| mm_interest_rate | 0.03937198 |
| num_times_borrowed_last_year | 0.033996431 |
| nonreg_active_mm_user | 0.033492093 |
| borrowed_for_emergency_last_year | 0.032834789 |
| age | 0.007226039 |
| can_calc_compounding | -0.025874815 |
| active_formal_nbfi_user | -0.033173989 |
| income_government_last_year | -0.033373439 |
| reg_formal_nbfi_account | -0.033983273 |
| employed_last_year | -0.041584232 |
| can_divide | -0.04384036 |
| cash_property_savings | -0.046705604 |
| share_hh_income_provided | -0.059546055 |
| borrowed_for_home_or_biz_last_year | -0.06027447 |
| can_calc_percents | -0.062347686 |
| has_insurance | -0.062468349 |
| mfi_interest_rate | -0.064025153 |
| num_informal_institutions_last_year | -0.080436461 |
| can_add | -0.08638597 |
| active_informal_nbfi_user | -0.086387684 |
| informal_savings | -0.087468568 |
| other_fsp_interest_rate | -0.098450351 |
| income_own_business_last_year | -0.103902662 |
| income_public_sector_last_year | -0.103931945 |
| bank_interest_rate | -0.11736729 |
| reg_mm_acct | -0.121792293 |
| income_friends_family_last_year | -0.124148797 |
| income_private_sector_last_year | -0.146884578 |
| can_call | -0.152492698 |
| has_investment | -0.155311643 |
| active_mm_user | -0.155668324 |
| financially_included | -0.192211702 |
| literacy | -0.198561362 |
| num_formal_institutions_last_year | -0.211129674 |
| can_make_transaction | -0.223611969 |
| reg_bank_acct | -0.235280003 |
| active_bank_user | -0.245175169 |
| advanced_phone_use | -0.247681975 |
| phone_ownership | -0.252786628 |
| formal_savings | -0.252813286 |
| num_financial_activities_last_year | -0.261384588 |
| can_text | -0.261725975 |
| can_use_internet | -0.284346245 |
| phone_technology | -0.289281728 |
| is_urban | -0.290158708 |
| education_level | -0.345486915 |
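Each coefficient in Table 1.2 is a Pearson correlation between one variable and poverty_probability: the covariance of the two columns divided by the product of their spreads. The sketch below computes it by hand in plain Python. The `pearson` helper and the six toy values are hypothetical and are not taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance of xs and ys divided by the product of their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical toy sample: education_level (0-3) and poverty_probability.
education_level = [0, 1, 1, 2, 3, 3]
poverty_probability = [0.9, 0.8, 0.6, 0.5, 0.3, 0.2]

r = pearson(education_level, poverty_probability)
# A negative r means higher education goes with lower poverty probability.
print(round(r, 3))
```

The result always lies between -1 and 1, and its sign gives the direction of the relationship, which is how the positive and negative rows of the table should be read.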
Pearson Correlation after One-Hot Encoding (+/-0.2)
One-hot encoding increased the number of variables from 54 to 111. Unlike Table 1.2, Table 1.3 shows variables with weak positive as well as weak negative correlations. Only correlations with an absolute value above 0.2 are listed.
Table 1.3
| variable | poverty_probability |
| --- | --- |
| is_urban_False | 0.289286814 |
| can_use_internet_False | 0.282023475 |
| can_text_False | 0.259107513 |
| phone_technology_0 | 0.254060755 |
| formal_savings_False | 0.249416935 |
| active_bank_user_False | 0.243164929 |
| country_2 | 0.241324978 |
| reg_bank_acct_False | 0.232604408 |
| can_make_transaction_False | 0.220259181 |
| country_0 | 0.201765418 |
| num_formal_institutions_last_year | -0.206546614 |
| education_level_2 | -0.208348105 |
| can_make_transaction | -0.220259181 |
| can_make_transaction_True | -0.220259181 |
| reg_bank_acct | -0.232604408 |
| reg_bank_acct_True | -0.232604408 |
| education_level_3 | -0.240237826 |
| active_bank_user | -0.243164929 |
| active_bank_user_True | -0.243164929 |
| country | -0.246903618 |
| formal_savings | -0.249416935 |
| formal_savings_True | -0.249416935 |
| phone_ownership | -0.249417137 |
| phone_ownership_2 | -0.253870112 |
| num_financial_activities_last_year | -0.25700145 |
| can_text | -0.259107513 |
| can_text_True | -0.259107513 |
| phone_technology_3 | -0.264542853 |
| can_use_internet | -0.282023475 |
| can_use_internet_True | -0.282023475 |
| phone_technology | -0.287777386 |
| is_urban_True | -0.289286814 |
| is_urban | -0.289286814 |
| education_level | -0.345486915 |
Regression: Predicting probability of being poor
Using a Boosted Decision Tree algorithm, I trained the model on 80% of the data and tested it on the remaining 20%. The results were as follows:
| Metric | Value |
| --- | --- |
| Mean Absolute Error | 0.177574 |
| Root Mean Squared Error | 0.221758 |
| Relative Absolute Error | 0.706912 |
| Relative Squared Error | 0.577702 |
| Coefficient of Determination | 0.422298 |
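The five metrics above are related. Under their usual definitions, the relative errors compare the model with a baseline that always predicts the mean of the label, and the coefficient of determination is one minus the relative squared error. The reported values are consistent with this: 0.577702 + 0.422298 = 1. The sketch below computes all five in plain Python. It assumes those common definitions, and the `regression_metrics` helper and toy values are hypothetical.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, RMSE, RAE, RSE and the coefficient of determination (CoD)."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    sq_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    # Baseline: always predicting the mean of the actual values.
    baseline_abs = sum(abs(a - mean_actual) for a in actual)
    baseline_sq = sum((a - mean_actual) ** 2 for a in actual)
    rse = sum(sq_errors) / baseline_sq
    return {
        "mae": sum(abs_errors) / n,
        "rmse": math.sqrt(sum(sq_errors) / n),
        "rae": sum(abs_errors) / baseline_abs,
        "rse": rse,
        "cod": 1 - rse,
    }


# Hypothetical toy values, for illustration only.
actual = [0.2, 0.5, 0.9, 0.7, 0.4]
predicted = [0.3, 0.4, 0.8, 0.6, 0.6]

metrics = regression_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

A relative error below 1 means the model beats the mean baseline. A CoD of 0 would mean it does no better than that baseline.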
The plot above shows a scatterplot of the scored labels against the label, poverty_probability.
On the training data the model had a coefficient of determination (CoD) of 0.42228, which was higher than required; on the testing data it scored a CoD of 0.4000.
Conclusion
The analysis has shown that standard machine learning techniques achieve a CoD of 0.40 on the testing data, which means the model explains about 40% of the variance in poverty_probability. Deep learning techniques may achieve a higher score.
Featured Image: City photo created by rawpixel.com – www.freepik.com