Predicting Potential Credit Card Customers with Python

article/2025/3/18 19:52:16

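I. Dataset

Happy Customer Bank is a mid-sized private bank that deals in all kinds of banking products, such as savings accounts, current accounts, investment products, and credit products.

The bank also cross-sells products to its existing customers through different communication channels, such as phone calls, e-mail, internet banking recommendations, and mobile banking.

In this case, Happy Customer Bank wants to cross-sell its credit cards to existing customers. The bank has already identified a set of customers who are eligible for these cards.

The bank now wants to identify which of these customers show a higher intent toward the recommended credit card.

Dataset: dataset

The dataset mainly includes:

  1. Customer details (gender, age, region, etc.)

  2. Details of the customer's relationship with the bank (Channel_Code, Vintage, Avg_Asset_Value, etc.)

Our task here is to build a model that can identify the customers who are interested in the credit card.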

II. Models Implemented in This Article

Predictions were made with logistic regression (LR), random forest (RF), LightGBM, and XGBoost.
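Every model in this article is scored the same way: ROC AUC, a classification report, and a confusion matrix, on both the train and the test split. As a minimal sketch, that repeated block could be wrapped in a helper. The function name `evaluate` and the synthetic data from `make_classification` are assumptions made for illustration and are not part of the original code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate(model, x, y, label):
    """Print the metrics used in this article and return the ROC AUC."""
    pred = model.predict(x)
    prob = model.predict_proba(x)[:, 1]  # probability of the positive class
    auc = roc_auc_score(y, prob)
    print(f'ROC score for {label} is : {auc:.4f}')
    print(f'Classification report for {label}:\n')
    print(classification_report(y, pred))
    print(confusion_matrix(y, pred))
    return auc


# Synthetic stand-in for the bank data (hypothetical, imbalanced on purpose)
X_demo, y_demo = make_classification(n_samples=2000, n_features=8,
                                     weights=[0.76, 0.24], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

demo_model = LogisticRegression(solver='liblinear').fit(X_tr, y_tr)
train_auc = evaluate(demo_model, X_tr, y_tr, 'train')
test_auc = evaluate(demo_model, X_te, y_te, 'test')
```

Any fitted classifier that exposes `predict` and `predict_proba` can be passed in, so the same helper would cover the random forest, XGBoost, and LightGBM models as well.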

III. Code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
# Import dataset
df_train = pd.read_csv(r"C:\Users\Administrator\Desktop\DATA\Credit-Card-Lead-Prediction-main\train_s3TEQDk.csv")
df_train.head()
# Shape of the data
df_train.shape
# There are about 2.45 lakh (245,000) rows and 11 columns.

# Data types of the dataset
df_train.info()

# Five-point summary for numerical variables
df_train.describe(exclude='object')
# The minimum customer age is 23 years and the maximum is 85 years.

# Five-point summary for categorical variables
df_train.describe(include='object')

# ----- Univariate analysis -----
# Count plot for the Gender variable
plt.figure(figsize=(6,5))
sns.countplot(x=df_train['Gender'])  # pass the column as x=; the positional form fails in recent seaborn
plt.show()
# The dataset contains more male than female observations.

# Unique region codes
df_train['Region_Code'].unique()

# Distribution of Age
plt.figure(figsize=(8,5))
sns.histplot(df_train['Age'], kde=True, stat='density')  # replaces the deprecated sns.distplot
plt.show()
# Age is right-skewed.
# Most customers fall between 26-28 and 46-49 years of age.

# Distribution of Vintage (the starting year of the investment)
plt.figure(figsize=(8,5))
sns.histplot(df_train['Vintage'], kde=True, stat='density')
plt.show()
# Vintage is right-skewed.

# Occupation of customers
plt.figure(figsize=(10,5))
sns.countplot(x=df_train['Occupation'])
plt.show()
# Most customers are self-employed; entrepreneurs are the smallest group.

# Unique channel codes
df_train['Channel_Code'].unique()
# There are 4 different channel codes in the dataset.

# Credit product of customers
plt.figure(figsize=(10,5))
sns.countplot(x=df_train['Credit_Product'])
plt.show()
# Most customers do not have a credit product.

# Customer activity status
plt.figure(figsize=(10,5))
sns.countplot(x=df_train['Is_Active'])
plt.show()
# Most customers were not active in the last 3 months.

# Customer interest in purchasing the credit card (target)
plt.figure(figsize=(10,5))
sns.countplot(x=df_train['Is_Lead'])
plt.show()
# Comparatively few customers show interest in buying the credit card.

# ----- Bivariate analysis -----
# Gender vs. target
# (crosstab.plot opens its own figure, so the size is passed to plot directly)
pd.crosstab(df_train['Gender'], df_train['Is_Lead']).plot(kind='bar', figsize=(15,5))
plt.show()
# Males are more interested in buying a credit card than females.

df_train.groupby(by=['Is_Lead']).mean(numeric_only=True)
# Interested customers have an average age of about 50 years.
# Customers with a higher account balance are more interested in the product.

# Age vs. target
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18,5))
df_train[df_train['Is_Lead'] == 1]['Age'].plot(kind='kde', ax=ax1)
ax1.set_title('Density of Age for interested customers', fontsize=15)
df_train[df_train['Is_Lead'] == 0]['Age'].plot(kind='kde', ax=ax2)
ax2.set_title('Density of Age for customers not interested', fontsize=15)
plt.show()
# Age of interested customers is approximately normally distributed.
# Age of customers who are not interested is right-skewed.

# Avg_Account_Balance vs. target
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18,5))
df_train[df_train['Is_Lead'] == 1]['Avg_Account_Balance'].plot(kind='kde', ax=ax1)
ax1.set_title('Density of Avg_Account_Balance for interested customers', fontsize=12)
df_train[df_train['Is_Lead'] == 0]['Avg_Account_Balance'].plot(kind='kde', ax=ax2)
ax2.set_title('Density of Avg_Account_Balance for customers not interested', fontsize=12)
plt.show()
# Both plots show a right-skewed distribution,
# so Avg_Account_Balance does not help much in separating the target.

# Vintage vs. target
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18,5))
df_train[df_train['Is_Lead'] == 1]['Vintage'].plot(kind='kde', ax=ax1)
ax1.set_title('Density of Vintage for interested customers', fontsize=12)
df_train[df_train['Is_Lead'] == 0]['Vintage'].plot(kind='kde', ax=ax2)
ax2.set_title('Density of Vintage for customers not interested', fontsize=12)
plt.show()

# Occupation vs. target
pd.crosstab(df_train['Occupation'], df_train['Is_Lead']).plot(kind='bar', figsize=(25,6))
plt.legend()
plt.show()
# Within the Entrepreneur group, more customers are interested than not.
# In the other occupations, most customers are not interested in the credit product.

# Is_Active vs. target
pd.crosstab(df_train['Is_Active'], df_train['Is_Lead']).plot(kind='bar', figsize=(25,6))
plt.legend()
plt.show()
# Among both active and inactive customers, most are not interested in the credit product.

# Heat map of correlations (numeric columns only)
plt.figure(figsize=(10,6))
sns.heatmap(df_train.corr(numeric_only=True), annot=True)
plt.show()
# Vintage and Age are positively correlated (r = 0.63).

# ----- Null value treatment -----
df_train.isnull().sum() / df_train.shape[0] * 100
# Credit_Product has almost 12% null values.

# Forward-fill the missing values (replaces the deprecated fillna(method='ffill'))
df_train['Credit_Product'] = df_train['Credit_Product'].ffill()
df_train.isnull().sum().sum()

# Outliers in the dataset
plt.figure(figsize=(14,5))
df_train.boxplot()
plt.show()

# Outlier detection using the IQR method, and treatment
q1 = df_train['Avg_Account_Balance'].quantile(0.25)
q3 = df_train['Avg_Account_Balance'].quantile(0.75)
IQR = q3 - q1
upper_limit = q3 + 1.5*IQR
lower_limit = q1 - 1.5*IQR

# Presence of outliers
df_train[(df_train['Avg_Account_Balance'] > upper_limit) | (df_train['Avg_Account_Balance'] < lower_limit)]
# Almost 6% of the rows are outliers.
# To avoid data loss, they are kept and Avg_Account_Balance is log-transformed instead.

# Transformation using the log method
df_train['Avg_Account_Balance'] = np.log(df_train['Avg_Account_Balance'])
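To see why a log transform is a reasonable alternative to dropping rows, the IQR rule can be applied before and after the transform. The sketch below uses hypothetical log-normal balances rather than the real Avg_Account_Balance column, and the helper name `iqr_outlier_share` is an assumption made for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical right-skewed balances (log-normal), standing in for Avg_Account_Balance
balance = pd.Series(rng.lognormal(mean=13.5, sigma=0.8, size=10_000))


def iqr_outlier_share(s):
    """Fraction of values outside [q1 - 1.5*IQR, q3 + 1.5*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    outside = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
    return outside.mean()


share_raw = iqr_outlier_share(balance)          # skewed data: long upper tail
share_log = iqr_outlier_share(np.log(balance))  # roughly symmetric after the log
print(f'outlier share before log: {share_raw:.3f}, after log: {share_log:.3f}')
```

On right-skewed data the upper tail pushes many points past the upper fence. After the log the distribution is closer to symmetric, so far fewer points are flagged and no rows have to be removed.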
# Outliers after the transformation
plt.figure(figsize=(14,5))
df_train.boxplot()
plt.show()

# Drop variables such as ID and Region_Code, which are not expected to improve model performance
df_train.drop(columns=['ID', 'Region_Code'], inplace=True)

# Convert all categorical columns into numerical ones (Is_Active is dropped here as well)
df_train = pd.get_dummies(df_train.drop('Is_Active', axis=1), drop_first=True)

# Save the modified train set after EDA
# (this only needs to be done once)
#df_train.to_csv(r"C:\Users\Administrator\Desktop\DATA\Credit-Card-Lead-Prediction-main\df_train.csv")

# ----- EDA for the test set -----
# Read the test data
df_test = pd.read_csv(r"C:\Users\Administrator\Desktop\DATA\Credit-Card-Lead-Prediction-main\test_mSzZ8RL.csv")
df_test.head()

# Apply the same operations that were performed on the train set
# Null value imputation
df_test['Credit_Product'] = df_test['Credit_Product'].ffill()
df_test.isnull().sum().sum()

# Transform Avg_Account_Balance using the log transformation
df_test['Avg_Account_Balance'] = np.log(df_test['Avg_Account_Balance'])

# Drop variables such as ID and Region_Code, which are not expected to improve model performance
df_test.drop(columns=['ID', 'Region_Code'], inplace=True)

# Convert all categorical columns into numerical ones
df_test = pd.get_dummies(df_test.drop('Is_Active', axis=1), drop_first=True)

# Save the modified test set after EDA
# (this only needs to be done once)
#df_test.to_csv(r'C:\Users\Administrator\Desktop\DATA\Credit-Card-Lead-Prediction-main\df_test.csv')

#-----------------------------Model building--------------------------------
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report, confusion_matrix
import lightgbm as lgb
import xgboost as xgb
from scipy.stats import randint as sp_randint

# Split the train set into train and test parts in a 70:30 ratio
x = df_train.drop('Is_Lead', axis=1)
y = df_train['Is_Lead']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

#-----------------------------LR (logistic regression)--------------------------------
loc = LogisticRegression(solver='liblinear')
loc.fit(x_train, y_train)

y_train_pred = loc.predict(x_train)
y_train_prob = loc.predict_proba(x_train)[:, 1]
print('ROC score for train is :', roc_auc_score(y_train, y_train_prob))
print('Classification report for train:\n')
print(classification_report(y_train, y_train_pred))
print(confusion_matrix(y_train, y_train_pred))

y_test_pred = loc.predict(x_test)
y_test_prob = loc.predict_proba(x_test)[:, 1]
print('ROC score for test is :', roc_auc_score(y_test, y_test_prob))
print('Classification report for test :\n')
print(classification_report(y_test, y_test_pred))
print(confusion_matrix(y_test, y_test_pred))
# The model performs well overall, but recall is low because of class imbalance.

#--------------------------------Random forest----------------------------------------------
# Random forest without tuning
rfc = RandomForestClassifier()
rfc.fit(x_train, y_train)

y_train_pred = rfc.predict(x_train)
y_train_prob = rfc.predict_proba(x_train)[:, 1]
print('ROC score for train is :', roc_auc_score(y_train, y_train_prob))
print('Classification report for train:\n')
print(classification_report(y_train, y_train_pred))
print(confusion_matrix(y_train, y_train_pred))

y_test_pred = rfc.predict(x_test)
y_test_prob = rfc.predict_proba(x_test)[:, 1]
print('ROC score for test is :', roc_auc_score(y_test, y_test_prob))
print('Classification report for test :\n')
print(classification_report(y_test, y_test_pred))
print(confusion_matrix(y_test, y_test_pred))
# The untuned model overfits, so it is tuned to improve its scores on unseen data.
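One way to make the overfitting visible is to compare the train and test ROC AUC of a fully grown forest with those of a depth-limited one. The sketch below runs on synthetic data from `make_classification`. The helper `train_test_auc` and the depth value 5 are assumptions chosen for illustration, not settings taken from the article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with some label noise (flip_y), so memorising the train set does not generalise
X_demo, y_demo = make_classification(n_samples=3000, n_features=10, n_informative=4,
                                     flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)


def train_test_auc(max_depth):
    """Fit a forest with the given depth limit and return (train AUC, test AUC)."""
    model = RandomForestClassifier(n_estimators=100, max_depth=max_depth, random_state=0)
    model.fit(X_tr, y_tr)
    train_auc = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
    test_auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    return train_auc, test_auc


results = {}
for depth in (None, 5):  # None = trees grown until the leaves are pure
    results[depth] = train_test_auc(depth)
    tr, te = results[depth]
    print(f'max_depth={depth}: train AUC={tr:.3f}, test AUC={te:.3f}, gap={tr - te:.3f}')
```

A large gap between train and test AUC for the unrestricted forest is the symptom the comment above describes. Limiting depth is one of the knobs that the randomized search explores.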
# Tuning of random forest
rfc = RandomForestClassifier()
params = {'criterion': ['gini', 'entropy'],
          'max_depth': sp_randint(3, 20),
          'min_samples_split': sp_randint(2, 20),
          'max_features': ['sqrt', 'log2'],  # 'auto' was removed in recent scikit-learn (it meant 'sqrt')
          'n_estimators': sp_randint(50, 200)}
rscv = RandomizedSearchCV(rfc, param_distributions=params, cv=5, scoring='roc_auc', n_iter=10, n_jobs=-1, verbose=3)
rscv.fit(x_train, y_train)  # search on the training split only, so that x_test stays unseen
# Get the best parameters
rscv.best_params_

# Random forest with the tuned parameters
rfc = RandomForestClassifier(**rscv.best_params_)
rfc.fit(x_train, y_train)

y_train_pred = rfc.predict(x_train)
y_train_prob = rfc.predict_proba(x_train)[:, 1]
print('ROC score for train is :', roc_auc_score(y_train, y_train_prob))
print('Classification report for train:\n')
print(classification_report(y_train, y_train_pred))
print(confusion_matrix(y_train, y_train_pred))

y_test_pred = rfc.predict(x_test)
y_test_prob = rfc.predict_proba(x_test)[:, 1]
print('ROC score for test is :', roc_auc_score(y_test, y_test_prob))
print('Classification report for test :\n')
print(classification_report(y_test, y_test_pred))
print(confusion_matrix(y_test, y_test_pred))
# Compare these scores with the untuned ones above to check whether tuning reduced the overfitting.

# Test data prediction: predict on the target file and write the submission
df = pd.read_csv(r"C:\Users\Administrator\Desktop\DATA\Credit-Card-Lead-Prediction-main\test_mSzZ8RL.csv")
df.head(2)
sample_submission = df.iloc[:, [0]].copy()  # keep only the ID column; copy() avoids SettingWithCopyWarning
# Train and test were one-hot encoded separately, so align the test columns with the training features
sample_submission['Is_Lead'] = rfc.predict(df_test.reindex(columns=x.columns, fill_value=0))
sample_submission.to_csv(r"C:\Users\Administrator\Desktop\DATA\Credit-Card-Lead-Prediction-main\sample_submission.csv", index=False)

#----------------------------------------------XGBoost----------------------------------------
xg = xgb.XGBClassifier()
xg.fit(x_train, y_train)

y_train_pred = xg.predict(x_train)
y_train_prob = xg.predict_proba(x_train)[:, 1]
print('ROC score for train is :', roc_auc_score(y_train, y_train_prob))
print('Classification report for train:\n')
print(classification_report(y_train, y_train_pred))
print(confusion_matrix(y_train, y_train_pred))

y_test_pred = xg.predict(x_test)
y_test_prob = xg.predict_proba(x_test)[:, 1]
print('ROC score for test is :', roc_auc_score(y_test, y_test_prob))
print('Classification report for test :\n')
print(classification_report(y_test, y_test_pred))
print(confusion_matrix(y_test, y_test_pred))

#-------------------------------------LightGBM----------------------------------------
lg = lgb.LGBMClassifier()
lg.fit(x_train, y_train)

y_train_pred = lg.predict(x_train)
y_train_prob = lg.predict_proba(x_train)[:, 1]
print('ROC score for train is :', roc_auc_score(y_train, y_train_prob))
print('Classification report for train:\n')
print(classification_report(y_train, y_train_pred))
print(confusion_matrix(y_train, y_train_pred))

y_test_pred = lg.predict(x_test)
y_test_prob = lg.predict_proba(x_test)[:, 1]
print('ROC score for test is :', roc_auc_score(y_test, y_test_prob))
print('Classification report for test :\n')
print(classification_report(y_test, y_test_pred))
print(confusion_matrix(y_test, y_test_pred))

# Tuning of LightGBM
lg = lgb.LGBMClassifier()
params = {'boosting_type': ['gbdt', 'dart'],  # 'gdbt' was a typo; 'rf' is left out because it needs extra bagging settings
          'max_depth': sp_randint(-1, 20),
          'learning_rate': [0.1, 0.2, 0.3, 0.4, 0.5],
          'n_estimators': sp_randint(50, 400)}
rscv = RandomizedSearchCV(lg, param_distributions=params, cv=5, scoring='roc_auc', n_iter=10, n_jobs=-1)
rscv.fit(x_train, y_train)  # search on the training split only, so that x_test stays unseen
rscv.best_params_

# LightGBM with the tuned parameters
lg = lgb.LGBMClassifier(**rscv.best_params_)
lg.fit(x_train, y_train)

y_train_pred = lg.predict(x_train)
y_train_prob = lg.predict_proba(x_train)[:, 1]
print('ROC score for train is :', roc_auc_score(y_train, y_train_prob))
print('Classification report for train:\n')
print(classification_report(y_train, y_train_pred))
print(confusion_matrix(y_train, y_train_pred))

y_test_pred = lg.predict(x_test)
y_test_prob = lg.predict_proba(x_test)[:, 1]
print('ROC score for test is :', roc_auc_score(y_test, y_test_prob))
print('Classification report for test :\n')
print(classification_report(y_test, y_test_pred))
print(confusion_matrix(y_test, y_test_pred))

# Write the submission from the tuned LightGBM model
sample_submission = df.iloc[:, [0]].copy()  # keep only the ID column
# Align the test columns with the training features before predicting
sample_submission['Is_Lead'] = lg.predict(df_test.reindex(columns=x.columns, fill_value=0))
sample_submission.to_csv(r"C:\Users\Administrator\Desktop\DATA\Credit-Card-Lead-Prediction-main\sample_submission.csv", index=False)
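The logistic regression section notes that recall is low because of class imbalance. One common remedy, which the code above does not use, is to reweight the classes with scikit-learn's `class_weight='balanced'` option. The sketch below shows the effect on synthetic imbalanced data. The data and the 85:15 class ratio are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (hypothetical 85:15 split between the classes)
X_demo, y_demo = make_classification(n_samples=5000, n_features=8,
                                     weights=[0.85, 0.15], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          stratify=y_demo, random_state=1)

# Same model twice: default weights vs. weights inversely proportional to class frequency
plain = LogisticRegression(solver='liblinear').fit(X_tr, y_tr)
balanced = LogisticRegression(solver='liblinear', class_weight='balanced').fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_balanced = recall_score(y_te, balanced.predict(X_te))
print(f'recall (default weights):  {recall_plain:.3f}')
print(f'recall (balanced weights): {recall_balanced:.3f}')
```

Balanced weights usually raise recall on the minority class at the cost of precision. Whether that trade-off is worthwhile depends on how costly a missed lead is compared with an unnecessary contact.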

Brief results:

