Livelihoods have been a major focus area for most anti-poverty programs. With a rural population far exceeding the urban one, lack of access to work and credit has been an impediment to the socio-economic development of the local population.

Using data on the financial, economic and education levels of individuals, I attempt to predict the probability that a given person would be classified as poor.

Executive Summary

The dataset has 54 variables, 53 of which are used to predict the probability of poverty. It contains 12,600 records, and most of the variables are Boolean (True/False).

 

| Variable | Count | Mean | STD | Min | Max |
|---|---|---|---|---|---|
| poverty_probability | 12364 | 0.61 | 0.29 | 0 | 1 |

 

The significant features found in the initial data exploration and correlation examination were:

education_level: The higher a person's education level, the less likely they are to be poor.

is_urban: A person who lives in an urban area has a lower probability of being poor.

Variables related to awareness of technology and the banking system are all negatively correlated with poverty probability.

 

Initial Data Exploration

The count column in Table 1.1 shows 12,600 records in total, but the following variables contain null values:

  1. education_level : removed rows with NaN
  2. bank_interest_rate: removed column
  3. mm_interest_rate: removed column
  4. mfi_interest_rate: removed column
  5. other_fsp_interest_rate: removed column

The four interest-rate columns were removed because they demonstrated poor correlation with poverty_probability.
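The two cleaning steps listed above can be sketched in a few lines. This is only an illustration using the standard library: the column names come from the list, while the `clean` helper, the use of `None` for missing values, and the three sample rows are assumptions made for the example.

```python
# Column names come from the list above; the helper and sample rows are made up.
INTEREST_RATE_COLUMNS = [
    "bank_interest_rate",
    "mm_interest_rate",
    "mfi_interest_rate",
    "other_fsp_interest_rate",
]


def clean(records):
    """Drop rows with no education_level, then drop the interest-rate columns."""
    cleaned = []
    for row in records:
        if row.get("education_level") is None:
            continue  # item 1: remove rows with a missing education_level
        # items 2-5: keep every column except the four interest rates
        cleaned.append(
            {k: v for k, v in row.items() if k not in INTEREST_RATE_COLUMNS}
        )
    return cleaned


sample = [
    {"row_id": 0, "education_level": 1.0, "bank_interest_rate": None, "poverty_probability": 0.5},
    {"row_id": 1, "education_level": None, "bank_interest_rate": 12.0, "poverty_probability": 0.9},
    {"row_id": 2, "education_level": 3.0, "bank_interest_rate": None, "poverty_probability": 0.1},
]

cleaned_sample = clean(sample)
print(len(cleaned_sample))        # number of rows left after cleaning
print(sorted(cleaned_sample[0]))  # remaining column names
```

With pandas, the same steps would typically be written with `DataFrame.dropna(subset=["education_level"])` followed by `DataFrame.drop(columns=[...])`.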

 

Table 1.1

| Variable | count | mean | std | min | max |
|---|---|---|---|---|---|
| row_id | 12600 | 6299.5 | 3637.451031 | 0 | 12599 |
| age | 12600 | 36.28071429 | 15.1459448 | 15 | 115 |
| education_level | 12364 | 1.316240699 | 0.905441988 | 0 | 3 |
| share_hh_income_provided | 12295 | 2.888165921 | 1.56428404 | 1 | 5 |
| num_times_borrowed_last_year | 12600 | 0.657698413 | 0.924598074 | 0 | 3 |
| borrowing_recency | 12600 | 0.866428571 | 0.960866117 | 0 | 2 |
| bank_interest_rate | 289 | 9.843079585 | 15.03308922 | 0 | 100 |
| mm_interest_rate | 151 | 9.02102649 | 13.62016125 | 0 | 100 |
| mfi_interest_rate | 201 | 10.90920398 | 10.3532979 | 0 | 100 |
| other_fsp_interest_rate | 239 | 8.216736402 | 10.64953801 | 0 | 100 |
| num_shocks_last_year | 12600 | 1.10015873 | 1.190071892 | 0 | 5 |
| avg_shock_strength_last_year | 12600 | 2.11276455 | 2.019238924 | 0 | 5 |
| phone_technology | 12600 | 1.208730159 | 1.09306016 | 0 | 3 |
| phone_ownership | 12600 | 1.468253968 | 0.776638297 | 0 | 2 |
| num_formal_institutions_last_year | 12600 | 0.714126984 | 0.805877951 | 0 | 6 |
| num_informal_institutions_last_year | 12600 | 0.188968254 | 0.473696287 | 0 | 4 |
| num_financial_activities_last_year | 12600 | 1.55968254 | 2.043831136 | 0 | 10 |
| poverty_probability | 12600 | 0.611271667 | 0.291475931 | 0 | 1 |

 

Pearson Correlation

At first glance, the data in Table 1.2 does not reveal any notable positive correlation with poverty_probability, while the following variables do show a weak negative correlation:

  1. education_level
  2. is_urban
  3. phone_technology
  4. can_use_internet
  5. can_text

What this does not reveal is which categorical value within each correlated variable has the stronger correlation with poverty_probability. To find that out, we have to expand the table using one-hot encoding.

 

Table 1.2

| Variable | poverty_probability |
|---|---|
| num_shocks_last_year | 0.135471975 |
| avg_shock_strength_last_year | 0.129479306 |
| income_ag_livestock_last_year | 0.103012885 |
| married | 0.098293529 |
| female | 0.057990184 |
| borrowed_for_daily_expenses_last_year | 0.045577829 |
| borrowing_recency | 0.04469322 |
| mm_interest_rate | 0.03937198 |
| num_times_borrowed_last_year | 0.033996431 |
| nonreg_active_mm_user | 0.033492093 |
| borrowed_for_emergency_last_year | 0.032834789 |
| age | 0.007226039 |
| can_calc_compounding | -0.025874815 |
| active_formal_nbfi_user | -0.033173989 |
| income_government_last_year | -0.033373439 |
| reg_formal_nbfi_account | -0.033983273 |
| employed_last_year | -0.041584232 |
| can_divide | -0.04384036 |
| cash_property_savings | -0.046705604 |
| share_hh_income_provided | -0.059546055 |
| borrowed_for_home_or_biz_last_year | -0.06027447 |
| can_calc_percents | -0.062347686 |
| has_insurance | -0.062468349 |
| mfi_interest_rate | -0.064025153 |
| num_informal_institutions_last_year | -0.080436461 |
| can_add | -0.08638597 |
| active_informal_nbfi_user | -0.086387684 |
| informal_savings | -0.087468568 |
| other_fsp_interest_rate | -0.098450351 |
| income_own_business_last_year | -0.103902662 |
| income_public_sector_last_year | -0.103931945 |
| bank_interest_rate | -0.11736729 |
| reg_mm_acct | -0.121792293 |
| income_friends_family_last_year | -0.124148797 |
| income_private_sector_last_year | -0.146884578 |
| can_call | -0.152492698 |
| has_investment | -0.155311643 |
| active_mm_user | -0.155668324 |
| financially_included | -0.192211702 |
| literacy | -0.198561362 |
| num_formal_institutions_last_year | -0.211129674 |
| can_make_transaction | -0.223611969 |
| reg_bank_acct | -0.235280003 |
| active_bank_user | -0.245175169 |
| advanced_phone_use | -0.247681975 |
| phone_ownership | -0.252786628 |
| formal_savings | -0.252813286 |
| num_financial_activities_last_year | -0.261384588 |
| can_text | -0.261725975 |
| can_use_internet | -0.284346245 |
| phone_technology | -0.289281728 |
| is_urban | -0.290158708 |
| education_level | -0.345486915 |
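A ranking like the one in Table 1.2 comes from computing the Pearson coefficient of every column against poverty_probability and sorting the results. The sketch below shows the calculation with the standard library only. The `pearson` helper and the toy columns are assumptions for illustration, not the survey data, and Boolean columns would first be treated as 1/0.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up toy columns, not the survey data.
columns = {
    "education_level": [0, 1, 2, 3, 3, 1],
    "num_shocks_last_year": [3, 2, 1, 0, 1, 2],
}
poverty_probability = [0.9, 0.7, 0.4, 0.1, 0.2, 0.6]

# Sort from the most positive to the most negative coefficient.
ranking = sorted(
    ((name, pearson(values, poverty_probability)) for name, values in columns.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, r in ranking:
    print(f"{name:30s} {r:+.3f}")
```

With pandas, `DataFrame.corr()` computes the same coefficient (Pearson is its default method) for every pair of numeric columns at once.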

 

 

Pearson Correlation after One-Hot Encoding (+/-0.2)

The number of variables increased from 54 to 111 after one-hot encoding. Compared to Table 1.2, Table 1.3 now shows that some variables have a weak positive correlation with poverty_probability and others a weak negative one. Only correlations with an absolute value above 0.2 are listed.

Table 1.3

| Variable | poverty_probability |
|---|---|
| is_urban_False | 0.289286814 |
| can_use_internet_False | 0.282023475 |
| can_text_False | 0.259107513 |
| phone_technology_0 | 0.254060755 |
| formal_savings_False | 0.249416935 |
| active_bank_user_False | 0.243164929 |
| country_2 | 0.241324978 |
| reg_bank_acct_False | 0.232604408 |
| can_make_transaction_False | 0.220259181 |
| country_0 | 0.201765418 |
| num_formal_institutions_last_year | -0.206546614 |
| education_level_2 | -0.208348105 |
| can_make_transaction | -0.220259181 |
| can_make_transaction_True | -0.220259181 |
| reg_bank_acct | -0.232604408 |
| reg_bank_acct_True | -0.232604408 |
| education_level_3 | -0.240237826 |
| active_bank_user | -0.243164929 |
| active_bank_user_True | -0.243164929 |
| country | -0.246903618 |
| formal_savings | -0.249416935 |
| formal_savings_True | -0.249416935 |
| phone_ownership | -0.249417137 |
| phone_ownership_2 | -0.253870112 |
| num_financial_activities_last_year | -0.25700145 |
| can_text | -0.259107513 |
| can_text_True | -0.259107513 |
| phone_technology_3 | -0.264542853 |
| can_use_internet | -0.282023475 |
| can_use_internet_True | -0.282023475 |
| phone_technology | -0.287777386 |
| is_urban_True | -0.289286814 |
| is_urban | -0.289286814 |
| education_level | -0.345486915 |
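The expansion and the 0.2 cut-off behind Table 1.3 can be sketched as follows. This is a minimal standard-library illustration: the `one_hot` and `pearson` helpers, the `THRESHOLD` name, and the toy values are assumptions, and only the column naming pattern (`<name>_<value>`) mirrors the table.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def one_hot(name, values):
    """Expand one column into 0/1 indicator columns named '<name>_<value>'."""
    return {
        f"{name}_{category}": [1 if v == category else 0 for v in values]
        for category in sorted(set(values))
    }


# Made-up toy data, not the survey data.
is_urban = [True, False, False, True, False, True]
phone_technology = [3, 0, 1, 2, 0, 3]
poverty_probability = [0.2, 0.9, 0.7, 0.3, 0.8, 0.1]

encoded = {}
encoded.update(one_hot("is_urban", is_urban))
encoded.update(one_hot("phone_technology", phone_technology))

THRESHOLD = 0.2  # keep only correlations stronger than +/-0.2
correlations = {
    name: pearson(column, poverty_probability) for name, column in encoded.items()
}
strong = {name: r for name, r in correlations.items() if abs(r) > THRESHOLD}

for name, r in sorted(strong.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:22s} {r:+.3f}")
```

For a Boolean variable the two indicator columns are exact complements, so their coefficients have the same magnitude and opposite signs. This is why pairs such as is_urban_False and is_urban_True mirror each other in Table 1.3. In pandas, `pandas.get_dummies` performs the same expansion.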

 

 

Regression: Predicting probability of being poor

Using a Boosted Decision Tree algorithm, I trained the model on 80% of the data and then tested it on the remaining 20%. The results were as follows:

| Metric | Value |
|---|---|
| Mean Absolute Error | 0.177574 |
| Root Mean Squared Error | 0.221758 |
| Relative Absolute Error | 0.706912 |
| Relative Squared Error | 0.577702 |
| Coefficient of Determination | 0.422298 |

 

The plot above is a scatterplot of the scored labels (the model's predictions) against poverty_probability, the true label.

On the training data the model had a CoD of 0.42228, which was higher than what was required. On the testing data the model scored a CoD of 0.4000.
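For reference, the five reported metrics can be computed from the true labels and the scored labels as sketched below. The formulas are the standard textbook definitions, which I assume are the ones the modelling tool uses. The reported figures are consistent with them, since 1 - 0.577702 = 0.422298. The `regression_metrics` helper and the toy values are made up for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return the five evaluation metrics for a regression model."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    abs_err = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    sq_err = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Baseline: a model that always predicts the mean of the true values.
    abs_base = sum(abs(t - mean_true) for t in y_true)
    sq_base = sum((t - mean_true) ** 2 for t in y_true)
    rse = sq_err / sq_base
    return {
        "mean_absolute_error": abs_err / n,
        "root_mean_squared_error": math.sqrt(sq_err / n),
        "relative_absolute_error": abs_err / abs_base,
        "relative_squared_error": rse,
        "coefficient_of_determination": 1 - rse,
    }


# Made-up labels and predictions, not the survey data.
y_true = [0.9, 0.7, 0.4, 0.1, 0.2, 0.6]
y_pred = [0.8, 0.6, 0.5, 0.3, 0.3, 0.5]

metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name:30s} {value:.6f}")
```

Read this way, the CoD is the share of the variance in poverty_probability that the model explains relative to always predicting the mean, not a classification accuracy.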

Conclusion

The analysis has shown that, with conventional machine learning techniques, we were able to achieve a CoD of 0.40, which means the model explains about 40% of the variance in poverty_probability. A higher score may be achievable with deep learning techniques.

