Developing a Credit Scoring Model (FICO Score)

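Take Alipay's Sesame Credit as an example: its scores range from 350 to 950. A higher score is generally taken to mean better credit and a lower default rate on personal business. Sesame Credit is likewise a personal credit scoring tool similar to the FICO score.

The main idea behind FICO scoring is to collect, analyze, and transform a large volume of user data with many attributes, then use statistical indicators (such as correlation coefficients, chi-square tests, and variance inflation factors) to select, copy, or combine attributes, finally producing a single quantified, composite score that can be compared across users. The score reflects how good the user's credit history has been, and it also hints at how likely a future default is.

The raw data used here mainly consists of the customer's personal information (gender, age, job, marital status, education, and so on), account information (the number of accounts of each kind, deposit and loan balances, and so on), and a class label indicating whether the customer has defaulted.

After preprocessing, the dataset goes through data binning, attribute selection, and the conversion from the discrete class label to a continuous credit score.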

Downloading the dataset
Link: https://pan.baidu.com/s/1xh1kjo6waZZEPrxJFrwCng Password: u744

1. Importing the data

# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Inspect the raw data
data = pd.read_csv('credit.csv')
print('Overview of the raw data')
data.info()

The data takes up 12.6 MB and has 11 attributes; two of them, MonthlyIncome and NumberOfDependents, contain null values.
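2. Data cleaning
Because MonthlyIncome and NumberOfDependents contain missing values, these need to be handled. Two common approaches are (1) filling each gap with the neighboring value above or below it, and (2) filling each gap with the attribute's mean. Each has its drawbacks. Here a function set_missing is defined that fills the missing values using a random forest regression algorithm.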

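# Data cleaning
# Fill the missing MonthlyIncome values with random forest predictions; set_missing is defined here
from sklearn.ensemble import RandomForestRegressor
def set_missing(df):
    print('Filling 0 values with random forest regression:')
    # Move column 5 (MonthlyIncome) to position 0 so it serves as the label when splitting the data
    process_df = df.iloc[:,[5,0,1,2,3,4,6,7,8,9]]
    # Split into rows with a known value and rows treated as missing (MonthlyIncome == 0).
    # Rows where MonthlyIncome is NaN are expected to be gone already, because
    # outlier_processing is called first and its comparisons drop NaN rows.
    known = process_df.loc[process_df['MonthlyIncome']!=0].values
    unknown = process_df.loc[process_df['MonthlyIncome']==0].values
    X = known[:,1:]
    y = known[:,0]
    # Train the random forest regressor on X, y
    rfr = RandomForestRegressor(random_state=0,n_estimators=200,max_depth=3,n_jobs=-1)
    rfr.fit(X,y)
    # Predict the missing values with the fitted model
    predicted = rfr.predict(unknown[:,1:]).round(0)
    # Write the predictions back over the missing entries
    df.loc[df['MonthlyIncome'] == 0,'MonthlyIncome'] = predicted
    return df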
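
For comparison, the two simpler fills mentioned above can be sketched on a toy column. This is only an illustration: the values in `income` are made up and do not come from credit.csv. Forward fill depends on the order of the rows, and mean fill ignores every other attribute, which is presumably why a model-based fill is preferred here.

```python
import pandas as pd

# Toy income column with two gaps (hypothetical values, not from credit.csv)
income = pd.Series([3000.0, None, 5000.0, None, 4000.0])

# Option 1: carry the previous observed value forward
filled_ffill = income.ffill()

# Option 2: replace every gap with the column mean (mean() skips NaN)
filled_mean = income.fillna(income.mean())

print(filled_ffill.tolist())  # [3000.0, 3000.0, 5000.0, 5000.0, 4000.0]
print(filled_mean.tolist())   # [3000.0, 4000.0, 5000.0, 4000.0, 4000.0]
```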

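The outlier_processing function is defined to delete outlying data points in an attribute.
Outliers are identified with quartile-based thresholds (see the comments below). Data binning, which is used later in this walkthrough, is a separate step: the range of an attribute is split into several segments (bins), and values falling in the same bin are replaced with one common value.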

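# Delete outliers in an attribute: first compute the minimum and maximum thresholds used as the deletion criteria
# minimum threshold = first quartile - 1.5*(third quartile - first quartile)
# maximum threshold = third quartile + 1.5*(third quartile - first quartile)
# rows with values < minimum threshold or > maximum threshold are deleted
# outlier_processing handles the outlying data points
def outlier_processing(df,cname):
    s = df[cname]
    onequater = s.quantile(0.25)
    threequater = s.quantile(0.75)
    irq = threequater - onequater
    # named lower/upper rather than min/max to avoid shadowing the Python built-ins
    lower = onequater - 1.5*irq
    upper = threequater + 1.5*irq
    df = df[df[cname]<=upper]
    df = df[df[cname]>=lower]
    return df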
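
A minimal sketch of the same thresholds on a toy column (hypothetical values, with one extreme value and one missing value). It also shows a side effect worth knowing: comparisons against NaN evaluate to False, so filtering this way removes rows with missing values as well.

```python
import pandas as pd

# Toy column: nine ordinary values, one extreme value, one missing value
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100, None])

q1 = values.quantile(0.25)   # quantile() skips NaN
q3 = values.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only rows inside [lower, upper]; NaN fails both comparisons
kept = values[(values >= lower) & (values <= upper)]

print(lower, upper)   # -3.5 14.5
print(kept.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
```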

Distribution of MonthlyIncome before and after processing

# Clean up the MonthlyIncome column
print('Original distribution of MonthlyIncome, including outliers:')
data[['MonthlyIncome']].boxplot()
plt.savefig('MonthlyIncome1.png',dpi = 300,bbox_inches = 'tight')
plt.show()
print('Removing outliers and filling missing data:')
data = outlier_processing(data, 'MonthlyIncome')  # remove outliers
data = set_missing(data)  # fill missing data
print('Data overview after processing MonthlyIncome:')
data.info()  # inspect the cleaned data
# Show the plot
data[['MonthlyIncome']].boxplot()  # box plot
plt.savefig('MonthlyIncome2.png',dpi = 300,bbox_inches = 'tight')
plt.show()


After removing the outliers and filling in the missing data, the dataset is about 2 MB smaller.

In the same way, handle the outliers of the other attributes

# In the same way, handle the outliers of the other attributes
data = outlier_processing(data, 'age')
data = outlier_processing(data, 'RevolvingUtilizationOfUnsecuredLines')
data = outlier_processing(data, 'DebtRatio')
data = outlier_processing(data, 'NumberOfOpenCreditLinesAndLoans')
data = outlier_processing(data, 'NumberRealEstateLoansOrLines')
data = outlier_processing(data, 'NumberOfDependents')

Three attributes whose values are overly concentrated are handled manually

# The values of these three attributes are so concentrated that their three quartiles are equal.
# Calling outlier_processing directly would then delete every row whose value differs from that single quartile value,
# so these three attributes are handled manually
features = ['NumberOfTime30-59DaysPastDueNotWorse','NumberOfTime60-89DaysPastDueNotWorse','NumberOfTimes90DaysLate']
features_labels = ['30-59days','60-89days','90+days']
print('Original distribution of the three attributes:')
data[features].boxplot()
plt.xticks([1,2,3],features_labels)
plt.savefig('三个属性的原始分布', dpi = 300 ,bbox_inches = 'tight')
plt.show()
print('After removing outliers:')
data = data[data['NumberOfTime30-59DaysPastDueNotWorse']<90]
data = data[data['NumberOfTime60-89DaysPastDueNotWorse']<90]
data = data[data['NumberOfTimes90DaysLate']<90]
data[features].boxplot()
plt.xticks([1,2,3],features_labels)
plt.savefig('三个属性的整理后分布', dpi = 300 ,bbox_inches = 'tight')
plt.show()
print('Data overview after outlier handling:')
data.info()


# Create the training and test sets
from sklearn.model_selection import train_test_split
# In the raw data 0 means normal and 1 means default. By convention a higher credit score means a lower chance of default, so 0 and 1 are swapped
data['SeriousDlqin2yrs'] = 1-data['SeriousDlqin2yrs']
Y = data['SeriousDlqin2yrs']
X = data.iloc[:,1:]
# Split into training and test sets
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.3,random_state = 0)
train = pd.concat([Y_train,X_train],axis = 1)
test = pd.concat([Y_test,X_test],axis = 1)
clasTest = test.groupby('SeriousDlqin2yrs')['SeriousDlqin2yrs'].count()
print('Training set')
print(train.shape)
print('Test set')
print(test.shape)

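3. Attribute selection
Besides using correlation analysis to rule out attributes whose correlation has a small absolute value, two indicators, WoE (Weight of Evidence) and IV (Information Value), can be used to examine how strongly an attribute influences the target variable, and thus to decide whether to keep it.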

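The two indicators are computed per bin as follows, where pctlGood and pctlBad are the shares of all good and all bad customers that fall into the bin:
WoE = ln(pctlGood/pctlBad)
MIV = WoE*(pctlGood-pctlBad)
IV = ∑ MIV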
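
As a worked example of these formulas, the sketch below computes WoE, MIV, and IV for three bins. The counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

# Hypothetical counts of good and bad customers in three bins
good_counts = [50, 30, 20]
bad_counts = [10, 20, 70]

total_good = sum(good_counts)
total_bad = sum(bad_counts)

iv = 0.0
for g, b in zip(good_counts, bad_counts):
    pct_good = g / total_good          # share of all good customers in this bin
    pct_bad = b / total_bad            # share of all bad customers in this bin
    woe = math.log(pct_good / pct_bad)
    miv = woe * (pct_good - pct_bad)   # this bin's contribution to IV
    iv += miv
    print(f'WoE={woe:.3f}  MIV={miv:.3f}')

print(f'IV={iv:.3f}')  # IV=1.311
```

A bin that contains no good or no bad customers makes the logarithm undefined, so in practice such bins are usually merged with a neighbor before computing WoE.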

The IV value indicates how strongly an attribute is related to the target variable:
0<IV<0.02 very weak
0.02<=IV<0.1 weak
0.1<=IV<0.3 moderate
0.3<=IV<0.5 strong
0.5<=IV<1.0 very strong

Bin the attributes and compute their WoE and IV values

# Bin an attribute into n bins and compute its WoE and IV values
def mono_bin(res,feat,n = 10):
    good = res.sum()
    bad = res.count()-good
    # pd.cut with an integer n splits feat into n equal-width bins
    d1 = pd.DataFrame({'feat':feat,'res':res,'Bucket':pd.cut(feat,n)})
    d2 = d1.groupby('Bucket',as_index = True)
    d3 = pd.DataFrame(d2.feat.min(),columns = ['min'])
    d3['min'] = d2.min().feat
    d3['max'] = d2.max().feat
    d3['sum'] = d2.sum().res
    d3['total'] = d2.count().res
    d3['rate'] = d2.mean().res
    d3['woe'] = np.log((d3['rate']/(1-d3['rate']))/(good/bad))
    d3['goodattribute'] = d3['sum']/good
    d3['badattribute'] = (d3['total']-d3['sum'])/bad
    iv = ((d3['goodattribute']-d3['badattribute'])*d3['woe']).sum()
    d4 = (d3.sort_values(by = 'min'))
    # Note: the cut points returned below are quantiles of feat, whereas the WoE values
    # above were computed on the equal-width bins from pd.cut, so the two binnings differ
    cut = []
    cut.append(float('-inf'))
    for i in range(1,n):
        qua = feat.quantile(i/(n))
        cut.append(round(qua,4))
    cut.append(float('inf'))
    woe = list(d4['woe'].round(3))
    return d4,iv,cut,woe

# Same computation, but with the bin edges given explicitly in cat
def self_bin(res,feat,cat):
    good = res.sum()
    bad = res.count()-good
    d1 = pd.DataFrame({'feat':feat,'res':res,'Bucket':pd.cut(feat,cat)})
    d2 = d1.groupby('Bucket',as_index = True)
    d3 = pd.DataFrame(d2.feat.min(),columns = ['min'])
    d3['min'] = d2.min().feat
    d3['max'] = d2.max().feat
    d3['sum'] = d2.sum().res
    d3['total'] = d2.count().res
    d3['rate'] = d2.mean().res
    d3['woe'] = np.log((d3['rate']/(1-d3['rate']))/(good/bad))
    d3['goodattribute'] = d3['sum']/good
    d3['badattribute'] = (d3['total']-d3['sum'])/bad
    iv = ((d3['goodattribute']-d3['badattribute'])*d3['woe']).sum()
    d4 = (d3.sort_values(by = 'min'))
    woe = list(d4['woe'].round(3))
    return d4,iv,woe

Bin each attribute at the specified intervals; the cut points cutx3/6/7/8/9/10 are defined here

pinf = float('inf')
ninf = float('-inf')
dfx1,ivx1,cutx1,woex1 = mono_bin(train['SeriousDlqin2yrs'],train['RevolvingUtilizationOfUnsecuredLines'],n = 10)
# Show the bins and WoE information for RevolvingUtilizationOfUnsecuredLines
print('='*60)
print('Bins and WoE information for RevolvingUtilizationOfUnsecuredLines:')
print(dfx1)
dfx2,ivx2,cutx2,woex2 = mono_bin(train['SeriousDlqin2yrs'],train['age'],n = 10)
dfx4,ivx4,cutx4,woex4 = mono_bin(train['SeriousDlqin2yrs'],train['DebtRatio'],n = 10)
dfx5,ivx5,cutx5,woex5 = mono_bin(train['SeriousDlqin2yrs'],train['MonthlyIncome'],n = 10)
# Bin columns 3, 6, 7, 8, 9 and 10 at specified intervals
cutx3 = [ninf,0,1,3,5,pinf]
cutx6 = [ninf,1,2,3,5,pinf]
cutx7 = [ninf,0,1,3,5,pinf]
cutx8 = [ninf,0,1,2,3,pinf]
cutx9 = [ninf,0,1,3,pinf]
cutx10 = [ninf,0,1,2,3,5,pinf]
# Split NumberOfTime30-59DaysPastDueNotWorse into 5 segments at the intervals given by cutx3
dfx3,ivx3,woex3 = self_bin(train['SeriousDlqin2yrs'],train['NumberOfTime30-59DaysPastDueNotWorse'],cutx3)
# Show the bins and WoE information for NumberOfTime30-59DaysPastDueNotWorse
print('='*60)
print('Bins and WoE information for NumberOfTime30-59DaysPastDueNotWorse:')
print(dfx3)
dfx6,ivx6,woex6 = self_bin(train['SeriousDlqin2yrs'],train['NumberOfOpenCreditLinesAndLoans'],cutx6)
dfx7,ivx7,woex7 = self_bin(train['SeriousDlqin2yrs'],train['NumberOfTimes90DaysLate'],cutx7)
dfx8,ivx8,woex8 = self_bin(train['SeriousDlqin2yrs'],train['NumberRealEstateLoansOrLines'],cutx8)
dfx9,ivx9,woex9 = self_bin(train['SeriousDlqin2yrs'],train['NumberOfTime60-89DaysPastDueNotWorse'],cutx9)
dfx10,ivx10,woex10 = self_bin(train['SeriousDlqin2yrs'],train['NumberOfDependents'],cutx10)
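
How explicit cut points turn into bins can be checked on a few toy values. The sketch assumes the default behavior of pd.cut, where each interval is open on the left and closed on the right; the values in `counts` are made up, and `edges` simply mirrors the shape of cutx3.

```python
import pandas as pd

ninf, pinf = float('-inf'), float('inf')
edges = [ninf, 0, 1, 3, 5, pinf]         # same shape as cutx3 in the text
counts = pd.Series([0, 1, 2, 3, 4, 6])   # toy values

# labels=False returns the integer position of the bin instead of the interval
codes = pd.cut(counts, edges, labels=False)

# Bins are (-inf, 0], (0, 1], (1, 3], (3, 5], (5, inf]
print(codes.tolist())  # [0, 1, 2, 2, 3, 4]
```

Note that a value equal to a cut point, such as 1 or 3, lands in the bin that ends at that cut point, not in the one that starts there.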


Plot the IV obtained for each attribute

# Select attributes by IV
ivlist = [ivx1,ivx2,ivx3,ivx4,ivx5,ivx6,ivx7,ivx8,ivx9,ivx10]
index = ['x1','x2','x3','x4','x5','x6','x7','x8','x9','x10']
fig1 = plt.figure(1)
ax1 = fig1.add_subplot(1,1,1)
x = np.arange(len(index))+1
ax1.bar(x,ivlist,width = 0.48,color = 'yellow',alpha = 0.5)
ax1.set_xticks(x)
ax1.set_xticklabels(index,rotation = 0,fontsize = 12)
ax1.set_ylabel('IV(information value)',fontsize = 14)
for a,b in zip(x,ivlist):
    plt.text(a,b+0.01,'%.4f'%b,ha = 'center',va = 'bottom',fontsize = 10)
plt.savefig('iv取值.png', dpi = 300,bbox_inches = 'tight')
plt.show()

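4. Model training
The get_woe function is defined to convert raw values into WoE values, which improves the model's training results. get_woe is then called to convert the attributes of the training set and the test set into WoE values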

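# Model training
# Look up the WoE value that corresponds to each attribute value
def get_woe(feat,cut,woe):
    res = []
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for row in feat.items():
        value = row[1]
        j = len(cut)-2
        m = len(cut)-2
        while j>=0:
            # strict > so that a value equal to a cut point stays in the bin that ends there,
            # matching the right-closed intervals (a, b] produced by pd.cut
            if value>cut[j]:
                j=-1
            else:
                j-=1
                m-=1
        res.append(woe[m])
    return res
# Call get_woe to convert the attribute values of the training set and the test set into WoE values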
woe_train = pd.DataFrame()
woe_train['SeriousDlqin2yrs'] = train['SeriousDlqin2yrs']
woe_train['RevolvingUtilizationOfUnsecuredLines'] = get_woe(train['RevolvingUtilizationOfUnsecuredLines'], cutx1, woex1)
woe_train['age'] = get_woe(train['age'], cutx2, woex2)
woe_train['NumberOfTime30-59DaysPastDueNotWorse'] = get_woe(train['NumberOfTime30-59DaysPastDueNotWorse'], cutx3, woex3)
woe_train['DebtRatio'] = get_woe(train['DebtRatio'], cutx4, woex4)
woe_train['MonthlyIncome'] = get_woe(train['MonthlyIncome'], cutx5, woex5)
woe_train['NumberOfOpenCreditLinesAndLoans'] = get_woe(train['NumberOfOpenCreditLinesAndLoans'], cutx6, woex6)
woe_train['NumberOfTimes90DaysLate'] = get_woe(train['NumberOfTimes90DaysLate'], cutx7, woex7)
woe_train['NumberRealEstateLoansOrLines'] = get_woe(train['NumberRealEstateLoansOrLines'], cutx8, woex8)
woe_train['NumberOfTime60-89DaysPastDueNotWorse'] = get_woe(train['NumberOfTime60-89DaysPastDueNotWorse'], cutx9, woex9)
woe_train['NumberOfDependents'] = get_woe(train['NumberOfDependents'], cutx10, woex10)
# Replace the attributes of the test set with WoE values (read from test, not train)
woe_test = pd.DataFrame()
woe_test['SeriousDlqin2yrs'] = test['SeriousDlqin2yrs']
woe_test['RevolvingUtilizationOfUnsecuredLines'] = get_woe(test['RevolvingUtilizationOfUnsecuredLines'], cutx1, woex1)
woe_test['age'] = get_woe(test['age'], cutx2, woex2)
woe_test['NumberOfTime30-59DaysPastDueNotWorse'] = get_woe(test['NumberOfTime30-59DaysPastDueNotWorse'], cutx3, woex3)
woe_test['DebtRatio'] = get_woe(test['DebtRatio'], cutx4, woex4)
woe_test['MonthlyIncome'] = get_woe(test['MonthlyIncome'], cutx5, woex5)
woe_test['NumberOfOpenCreditLinesAndLoans'] = get_woe(test['NumberOfOpenCreditLinesAndLoans'], cutx6, woex6)
woe_test['NumberOfTimes90DaysLate'] = get_woe(test['NumberOfTimes90DaysLate'], cutx7, woex7)
woe_test['NumberRealEstateLoansOrLines'] = get_woe(test['NumberRealEstateLoansOrLines'], cutx8, woex8)
woe_test['NumberOfTime60-89DaysPastDueNotWorse'] = get_woe(test['NumberOfTime60-89DaysPastDueNotWorse'], cutx9, woex9)
woe_test['NumberOfDependents'] = get_woe(test['NumberOfDependents'], cutx10, woex10)
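
The value-to-WoE lookup can also be sketched with the standard-library bisect module. This is an alternative illustration, not the author's code: `lookup_woe` is a hypothetical helper, the WoE numbers are made up, and bins are assumed to be right-closed as in pd.cut.

```python
import bisect

def lookup_woe(value, cut, woe):
    """Return the WoE of the right-closed bin (cut[i], cut[i+1]] containing value."""
    # bisect_left finds the first cut point >= value; the bin index is one less
    i = bisect.bisect_left(cut, value) - 1
    return woe[i]

cut = [float('-inf'), 0, 1, 3, 5, float('inf')]
woe = [0.5, -0.2, -0.9, -1.4, -2.0]   # hypothetical WoE values, one per bin

print([lookup_woe(v, cut, woe) for v in [0, 1, 2, 6]])  # [0.5, -0.2, -0.9, -2.0]
```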

import statsmodels.api as sm
from sklearn.metrics import roc_curve,auc
Y = woe_train['SeriousDlqin2yrs']
X = woe_train.drop(['SeriousDlqin2yrs','DebtRatio','MonthlyIncome','NumberOfOpenCreditLinesAndLoans','NumberRealEstateLoansOrLines','NumberOfDependents'],axis = 1)
X1 = sm.add_constant(X)
logit = sm.Logit(Y,X1)
Logit_model = logit.fit()
print('Fitted coefficients:')
print(Logit_model.params)

Plot the model's ROC curve and its AUC

Y_test = woe_test['SeriousDlqin2yrs']
X_test = woe_test.drop(['SeriousDlqin2yrs','DebtRatio','MonthlyIncome','NumberOfOpenCreditLinesAndLoans','NumberRealEstateLoansOrLines','NumberOfDependents'],axis=1)
X3 = sm.add_constant(X_test)
resu = Logit_model.predict(X3)
fpr,tpr,threshold = roc_curve(Y_test,resu)
rocauc = auc(fpr,tpr)
plt.plot(fpr,tpr,'y',label='AUC=%0.2f' % rocauc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'p--')
plt.xlim([0,1])
plt.ylim([0,1])
plt.ylabel('TruePositive')
plt.xlabel('FalsePositive')
plt.savefig('模型AUC曲线.png',dpi=300,bbox_inches='tight')
print('Model ROC curve:')
plt.show()
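
The AUC shown in the legend has a simple reading: it is the probability that a randomly chosen positive (label 1) receives a higher predicted value than a randomly chosen negative (label 0). A small pure-Python sketch on made-up labels and scores illustrates this; `pairwise_auc` is a hypothetical helper, not part of scikit-learn.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 1, 0]                 # toy labels
scores = [0.9, 0.8, 0.7, 0.4, 0.2]       # toy predicted values

# 5 of the 6 positive/negative pairs are ordered correctly
print(round(pairwise_auc(labels, scores), 4))  # 0.8333
```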


# get_score computes the base score of each bin
def get_score(coe,woe,factor):
    scores = []
    for w in woe:
        score = round(coe*w*factor,0)
        scores.append(score)
    return scores
# compute_score looks up the base score that corresponds to a specific attribute value
def compute_score(feat,cut,score):
    res = []
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for row in feat.items():
        value = row[1]
        j = len(cut)-2
        m = len(cut)-2
        while j>=0:
            # strict > keeps a value equal to a cut point in the bin that ends there,
            # matching the right-closed intervals (a, b] produced by pd.cut
            if value>cut[j]:
                j=-1
            else:
                j-=1
                m-=1
        res.append(score[m])
    return res
import math
coe = Logit_model.params
p = 20/math.log(2)
q = 600-20*math.log(20)/math.log(2)
# .iloc gives positional access; coe itself is indexed by attribute name
baseScore = round(q+p*coe.iloc[0],0)
x1 = get_score(coe.iloc[1],woex1,p)
print('Scores of attribute 1 for each bin')
print(x1)
x2 = get_score(coe.iloc[2],woex2,p)
x3 = get_score(coe.iloc[3],woex3,p)
x7 = get_score(coe.iloc[4],woex7,p)
x9 = get_score(coe.iloc[5],woex9,p)
#print(x2)
#print(x3)
#print(x3)
# Compute the scores
test['BaseScore'] = np.zeros(len(test))+baseScore
test['x1'] = compute_score(test['RevolvingUtilizationOfUnsecuredLines'],cutx1,x1)
test['x2'] = compute_score(test['age'],cutx2,x2)
test['x3'] = compute_score(test['NumberOfTime30-59DaysPastDueNotWorse'],cutx3,x3)
test['x7'] = compute_score(test['NumberOfTimes90DaysLate'],cutx7,x7)
test['x9'] = compute_score(test['NumberOfTime60-89DaysPastDueNotWorse'],cutx9,x9)
test['Score'] = test['x1']+test['x2']+test['x3']+test['x7']+test['x9']+baseScore

Scores of attribute 1 for each bin
[20.0, 10.0, 4.0, -2.0, -7.0, -13.0, -19.0, -21.0, -41.0, -38.0]
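
One way to read the constants p and q in the code above: the score is q + p*ln(odds), where odds is the good-to-bad odds implied by the model. Under that reading, odds of 20:1 map to 600 points and each doubling of the odds adds 20 points. The names `pdo`, `base_score`, and `base_odds` below are my own labels for the three constants, not the author's.

```python
import math

pdo = 20          # points needed to double the odds (the 20 in p)
base_score = 600  # score assigned at the reference odds
base_odds = 20    # reference good:bad odds (the 20 inside the log in q)

p = pdo / math.log(2)
q = base_score - p * math.log(base_odds)

def score_from_odds(odds):
    """Map good:bad odds to a score on the scale defined by p and q."""
    return q + p * math.log(odds)

print(round(score_from_odds(20)))  # 600
print(round(score_from_odds(40)))  # 620
print(round(score_from_odds(10)))  # 580
```

Because the model's log-odds is a constant plus a sum of coefficient times WoE terms, the same scaling splits into a base score (from the constant) and one additive score per attribute bin, which is what get_score computes.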

Normal = test.loc[test['SeriousDlqin2yrs']==1]
Charged = test.loc[test['SeriousDlqin2yrs']==0]
print('Credit score statistics for normal customers in the test set')
print(Normal['Score'].describe())
print('Credit score statistics for defaulting customers in the test set')
print(Charged['Score'].describe())
import seaborn as sns
plt.figure(figsize = (10,4))
sns.kdeplot(Normal['Score'],label = 'normal',linewidth = 2,linestyle = '--')
sns.kdeplot(Charged['Score'],label = 'charged',linewidth = 2,linestyle = '-')
plt.xlabel('Score',fontdict = {'size':10})
plt.ylabel('probability',fontdict = {'size':10})
plt.title('normal/charged',fontdict={'size':18})
plt.savefig('违约与正常客户的信用分布情况.png',dpi = 300,bbox_inches = 'tight')
plt.show()


Credit score distributions of defaulting and normal customers

Use the trained model to score a customer

# Use the trained model to score a customer
cusInfo = {'RevolvingUtilizationOfUnsecuredLines':0.248537,'age':48,'NumberOfTime30-59DaysPastDueNotWorse':0,'NumberOfTime60-89DaysPastDueNotWorse':0,'DebtRatio':0.177586,'MonthlyIncome':4166,'NumberOfOpenCreditLinesAndLoans':11,'NumberOfTimes90DaysLate':0,'NumberRealEstateLoansOrLines':1,'NumberOfDependents':0}
custData = pd.DataFrame(cusInfo,pd.Index(range((1))))
# drop returns a new DataFrame, so the result is assigned back
custData = custData.drop(['DebtRatio','MonthlyIncome','NumberOfOpenCreditLinesAndLoans','NumberRealEstateLoansOrLines','NumberOfDependents'],axis = 1)
custData['x1'] = compute_score(custData['RevolvingUtilizationOfUnsecuredLines'], cutx1,x1)
custData['x2'] = compute_score(custData['age'], cutx2,x2)
custData['x3'] = compute_score(custData['NumberOfTime30-59DaysPastDueNotWorse'], cutx3,x3)
custData['x7'] = compute_score(custData['NumberOfTimes90DaysLate'], cutx7,x7)
custData['x9'] = compute_score(custData['NumberOfTime60-89DaysPastDueNotWorse'], cutx9,x9)
custData['Score'] = custData['x1']+custData['x2']+custData['x3']+custData['x7']+custData['x9']+baseScore
print('Credit score of this customer:')
print(custData.loc[0,'Score'])