The dataset of this project is taken from Kiva.org which is online crowdfunding platform to extend financial services to poor and financially excluded people around the world. Kiva lenders have provided over $1 billion dollars in loans to over 2 million people. In order to set investment priorities, help inform lenders, and understand their target communities, knowing the level of poverty of each borrower is critical. However, this requires inference based on a limited set of information for each borrower. (kaggle)
The business model is to operate a platform (crowd investing) where people who have a business idea but don't have the money they need can register and collect money for their project within a given time.
On the other hand, investors would like to invest their money in projects and are looking for investments.
As an intermediary, this platform brings borrowers and financiers together.
The database is the history of your platform.
Assumptions made regarding the business model
All projects are completed projects, i.e. the time to collect money for your project has expired. The business model stipulates that the funds collected will be paid out even if the target amount has not been reached.
The lender receives interest for borrowing money.
- funded_amount ... amount received/amount paid out in USD at the end of the "crowding" time
- loan_amount ... Target amount (amount you wanted to achieve for the project) in USD
- activity ... Subcategory to which the goal of the crowd project belongs thematically
- sector ... main category in which the crowd project topic falls
- use ... Short description of what the money should be used for
- country_code ... Country code according to ISO standard
- country ... Country name according to ISO standard
- region ... region
- currency ... Currency in which the funded_amount was then paid out
- term in months ... Duration over which the loan should be paid out
- lender_count ...lender (i.e. how many people gave money for the project)
- borrower_genders ... gender and number of borrowers, i.e. those who initiated the crowd project
- repayment interval ... contractually agreed repayment modalities/frequency
# import required libraries
import pandas as pd
# for visualization
import plotly.express as px
# CRISP-DM Model
# Step1: Business Understanding
# Step2: Data Understanding
# Step3: Data Preparation
# There was a data entry error in line 11 in dataset. A backslash sign ('/') in 'activity' column prevented the data
# to be read by Python. It had to be removed manually.
df_kiva = pd.read_csv('data_abschlussprojekt.csv', sep='#')
df_kiva
Unnamed: 0 | funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 300.0 | 300.0 | Fruits & Vegetables | Food | To buy seasonal, fresh fruits to sell. | PK | Pakistan | Lahore | PKR | 12.0 | 12 | female | irregular |
1 | 1 | 575.0 | 575.0 | Rickshaw | Transportation | to repair and maintain the auto rickshaw used ... | PK | Pakistan | Lahore | PKR | 11.0 | 14 | female, female | irregular |
2 | 2 | 150.0 | 150.0 | Transportation | Transportation | To repair their old cycle-van and buy another ... | IN | India | Maynaguri | INR | 43.0 | 6 | female | bullet |
3 | 3 | 200.0 | 200.0 | Embroidery | Arts | to purchase an embroidery machine and a variet... | PK | Pakistan | Lahore | PKR | 11.0 | 8 | female | irregular |
4 | 4 | 400.0 | 400.0 | Milk Sales | Food | to purchase one buffalo. | PK | Pakistan | Abdul Hakeem | PKR | 14.0 | 16 | female | monthly |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
671200 | 671200 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'para compara: cemento, arenya y ladri... | PY | Paraguay | Concepción | USD | 13.0 | 0 | female | monthly |
671201 | 671201 | 25.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | NaN | KES | 13.0 | 1 | female | monthly |
671202 | 671202 | 0.0 | 25.0 | Games | Entertainment | NaN | KE | Kenya | NaN | KES | 13.0 | 0 | NaN | monthly |
671203 | 671203 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | NaN | KES | 13.0 | 0 | female | monthly |
671204 | 671204 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | NaN | KES | 13.0 | 0 | female | monthly |
671205 rows × 14 columns
# dealing with duplicates
duplicate_rows = df_kiva[df_kiva.duplicated()]
# Print the result
print("Duplicate rows:")
print(duplicate_rows)
Duplicate rows: Empty DataFrame Columns: [Unnamed: 0, funded_amount, loan_amount, activity, sector, use, country_code, country, region, currency, term_in_months, lender_count, borrower_genders, repayment_interval] Index: []
# Calculate the amount of missing values
df_kiva.isnull().sum()
# according to the result we have four set of missing values
Unnamed: 0 0 funded_amount 0 loan_amount 0 activity 0 sector 0 use 4232 country_code 8 country 0 region 56800 currency 0 term_in_months 0 lender_count 0 borrower_genders 4221 repayment_interval 0 dtype: int64
df_kiva.columns
Index(['Unnamed: 0', ' funded_amount', ' loan_amount', ' activity', ' sector', ' use', ' country_code', ' country', ' region', ' currency', ' term_in_months', ' lender_count', ' borrower_genders', ' repayment_interval'], dtype='object')
# There is a space before the name of each column. First we remove them to avoid future errors
df_kiva.columns = df_kiva.columns.str.strip()
df_kiva.columns
Index(['Unnamed: 0', 'funded_amount', 'loan_amount', 'activity', 'sector', 'use', 'country_code', 'country', 'region', 'currency', 'term_in_months', 'lender_count', 'borrower_genders', 'repayment_interval'], dtype='object')
# we replcae the with correct value
missing_country_code_rows = df_kiva[df_kiva['country_code'].isnull()]
# Display the resulting DataFrame
missing_country_code_rows
Unnamed: 0 | funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
202537 | 202537 | 4150.0 | 4150.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NaN | Namibia | EEnhana | NAD | 6.0 | 162 | female | bullet |
202823 | 202823 | 4150.0 | 4150.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NaN | Namibia | Rundu | NAD | 6.0 | 159 | male | bullet |
344929 | 344929 | 3325.0 | 3325.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NaN | Namibia | EEnhana | NAD | 7.0 | 120 | female | bullet |
351177 | 351177 | 3325.0 | 3325.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NaN | Namibia | Rundu | NAD | 7.0 | 126 | male | bullet |
420953 | 420953 | 3325.0 | 3325.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NaN | Namibia | EEnhana | NAD | 7.0 | 118 | female | bullet |
421218 | 421218 | 4000.0 | 4000.0 | Wholesale | Wholesale | purchase solar lighting products for sale to l... | NaN | Namibia | Rundu | NAD | 7.0 | 150 | male | bullet |
487207 | 487207 | 5100.0 | 5100.0 | Renewable Energy Products | Retail | to pay for stock of solar lights and cell phon... | NaN | Namibia | Katima Mulilo | NAD | 7.0 | 183 | male | bullet |
487653 | 487653 | 5000.0 | 5000.0 | Wholesale | Wholesale | to maintain a stock of solar lights and cell p... | NaN | Namibia | Oshakati | NAD | 7.0 | 183 | female | bullet |
# Namibia is missing its country_code we replace it with NM
df_kiva.loc[missing_country_code_rows.index, 'country_code'] = "NM"
# verify the 'country_code' replacement
df_kiva[df_kiva['country'] == 'Namibia']
Unnamed: 0 | funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
202537 | 202537 | 4150.0 | 4150.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NM | Namibia | EEnhana | NAD | 6.0 | 162 | female | bullet |
202823 | 202823 | 4150.0 | 4150.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NM | Namibia | Rundu | NAD | 6.0 | 159 | male | bullet |
344929 | 344929 | 3325.0 | 3325.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NM | Namibia | EEnhana | NAD | 7.0 | 120 | female | bullet |
351177 | 351177 | 3325.0 | 3325.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NM | Namibia | Rundu | NAD | 7.0 | 126 | male | bullet |
420953 | 420953 | 3325.0 | 3325.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NM | Namibia | EEnhana | NAD | 7.0 | 118 | female | bullet |
421218 | 421218 | 4000.0 | 4000.0 | Wholesale | Wholesale | purchase solar lighting products for sale to l... | NM | Namibia | Rundu | NAD | 7.0 | 150 | male | bullet |
487207 | 487207 | 5100.0 | 5100.0 | Renewable Energy Products | Retail | to pay for stock of solar lights and cell phon... | NM | Namibia | Katima Mulilo | NAD | 7.0 | 183 | male | bullet |
487653 | 487653 | 5000.0 | 5000.0 | Wholesale | Wholesale | to maintain a stock of solar lights and cell p... | NM | Namibia | Oshakati | NAD | 7.0 | 183 | female | bullet |
# there are a number of values in 'borrower_genders' column
df_kiva['borrower_genders'].unique()
array(['female', 'female, female', 'female, female, female', ..., 'female, female, male, female, female, female, female, female, female, female, male, male, female, female, male, female, female, female, female, female, female, female', 'male, female, female, female, female, female, female, female, male, male, female, male, female, male, male, male', 'female, female, female, male, female, female, female, male, female, female, female, male, female, male, female, female, female, female, female, female, female, female, female, female, female, female, female, female, male'], dtype=object)
# we choose four unique values for this column: male, female, mixed (when there are both male and female), unknown for
# missing values
def simplify_genders(gender_string):
if pd.isna(gender_string):
return 'unknown'
genders = gender_string.split(', ')
if 'male' in genders and 'female' not in genders:
return 'male'
elif 'female' in genders and 'male' not in genders:
return 'female'
else:
return 'mixed'
# Apply the function to the 'borrower_genders' column
df_kiva['borrower_genders'] = df_kiva['borrower_genders'].apply(simplify_genders)
# Display the unique values in the simplified 'borrower_genders' column
print(df_kiva['borrower_genders'].unique())
['female' 'male' 'mixed' 'unknown']
# There are a lot of similar data in column 'use' in DataFrame that reprsent the same value.
# Thay could be converted to one value with Text Mining methods.
df_kiva['use'].mode()
value_counts = df_kiva['use'].value_counts()
# Print the result
print("Value counts:")
value_counts
Value counts:
use to buy a water filter to provide safe drinking water for their family. 5217 to buy a water filter to provide safe drinking water for her family. 4082 To buy a water filter to provide safe drinking water for their family. 2141 to build a sanitary toilet for her family. 1708 to build a sanitary toilet for her family 1599 ... to acquire furniture, equipment and beauty products (makeup, an iron, sprays, hair creams, etc.) 1 to purchase materials like cloth, needles, zippers, and duck feed 1 to buy farm tools as well as manure and fertilizer to maintain his crops. 1 to launch her own shoe store in her village 1 to buy differently sized pots 1 Name: count, Length: 424912, dtype: int64
# Handle NaN values in 'use' column
df_kiva['use'] = df_kiva['use'].fillna("unknown")
df_kiva['region'] = df_kiva['region'].fillna("unknown")
df_kiva.isnull().sum()
Unnamed: 0 0 funded_amount 0 loan_amount 0 activity 0 sector 0 use 0 country_code 0 country 0 region 0 currency 0 term_in_months 0 lender_count 0 borrower_genders 0 repayment_interval 0 dtype: int64
df_kiva_s = df_kiva.copy(deep=True)
df_kiva_s
Unnamed: 0 | funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 300.0 | 300.0 | Fruits & Vegetables | Food | To buy seasonal, fresh fruits to sell. | PK | Pakistan | Lahore | PKR | 12.0 | 12 | female | irregular |
1 | 1 | 575.0 | 575.0 | Rickshaw | Transportation | to repair and maintain the auto rickshaw used ... | PK | Pakistan | Lahore | PKR | 11.0 | 14 | female | irregular |
2 | 2 | 150.0 | 150.0 | Transportation | Transportation | To repair their old cycle-van and buy another ... | IN | India | Maynaguri | INR | 43.0 | 6 | female | bullet |
3 | 3 | 200.0 | 200.0 | Embroidery | Arts | to purchase an embroidery machine and a variet... | PK | Pakistan | Lahore | PKR | 11.0 | 8 | female | irregular |
4 | 4 | 400.0 | 400.0 | Milk Sales | Food | to purchase one buffalo. | PK | Pakistan | Abdul Hakeem | PKR | 14.0 | 16 | female | monthly |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
671200 | 671200 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'para compara: cemento, arenya y ladri... | PY | Paraguay | Concepción | USD | 13.0 | 0 | female | monthly |
671201 | 671201 | 25.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | unknown | KES | 13.0 | 1 | female | monthly |
671202 | 671202 | 0.0 | 25.0 | Games | Entertainment | unknown | KE | Kenya | unknown | KES | 13.0 | 0 | unknown | monthly |
671203 | 671203 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | unknown | KES | 13.0 | 0 | female | monthly |
671204 | 671204 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | unknown | KES | 13.0 | 0 | female | monthly |
671205 rows × 14 columns
df_kiva_s['repayment_interval'].unique()
array(['irregular', 'bullet', 'monthly', 'weekly'], dtype=object)
# This figure shows there is a outlier in our data
fig_sct = px.scatter(df_kiva_s, x="lender_count", y="funded_amount", color="repayment_interval",
title="Relation between the number of lender and requested funding")
# If you print the figure, you'll see that it's just a regular figure with data and layout
# print(fig)
fig_sct.show()
# The figure without outliers
# The figure shows there is no project with requested amount of higher than 10K USD without any lender
# So lenders are tent to invest on more expensive projects
df_kiva_no_outlier = df_kiva_s[df_kiva_s['lender_count'] != 2986]
fig_sct2 = px.scatter(df_kiva_no_outlier,
x="lender_count",
y="funded_amount",
color="repayment_interval",
title="Relation between the number of lender and requested funding"
)
# If you print the figure, you'll see that it's just a regular figure with data and layout
# print(fig)
fig_sct2.show()
# Business Question: Are the reuested amount for the project (funded_amount) was allocated to the project (loan_amount)?
# Wurde der beantragte Betrag für das Projekt (funded_amount) dem Projekt (loan_amount) zugewiesen?
fig_sct3 = px.scatter(df_kiva_no_outlier,
x="loan_amount",
y="funded_amount",
color="lender_count",
title="Was the target amount funded as desired?"
).update_layout(xaxis_title="Target amount (USD)", yaxis_title="Funded Amount")
# If you print the figure, you'll see that it's just a regular figure with data and layout
# print(fig)
fig_sct3.show()