Introduction to Data Mining

Linear Regression Model

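Linear regression is a form of regression analysis that models the relationship between one or more independent variables and a dependent variable, using a least-squares function known as the linear regression equation.
A linear model can be built directly with sklearn: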

from sklearn.linear_model import LinearRegression
# The normalize argument was removed in scikit-learn 1.2; scale the features separately if needed.
model = LinearRegression()
model.fit(train_X, train_y)
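As a minimal sketch of what fitting does, the snippet below builds a small synthetic dataset and checks that `LinearRegression` recovers the coefficients used to generate it. The names `X_demo`, `y_demo`, `true_coef` and the numbers are made up for illustration and are not taken from the used-car data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 3*x0 - 2*x1 + 5 plus a little noise (illustrative values only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
true_coef = np.array([3.0, -2.0])
y_demo = X_demo @ true_coef + 5.0 + rng.normal(scale=0.01, size=200)

# Ordinary least squares fit
demo_model = LinearRegression().fit(X_demo, y_demo)

# The fitted coefficients and intercept should be close to the generating values
print(demo_model.coef_, demo_model.intercept_)
```

On real data the true coefficients are unknown, which is why the following sections rely on plots and validation scores instead.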

Once the model is trained, we can inspect how well it fits along a single feature.

import matplotlib.pyplot as plt

# subsample_index is assumed to hold the row labels of a small sample of the training set
plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
plt.scatter(train_X['v_9'][subsample_index], model.predict(train_X.loc[subsample_index]), color='blue')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price', 'Predicted Price'], loc='upper right')
print('The predicted price is obviously different from the true price')
plt.show()

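The plot shows that the label (price) follows a long-tailed distribution, which hurts our modeling. Many models assume that the error term is normally distributed, and long-tailed data violates that assumption.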

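import numpy as np
import seaborn as sns
print('It is clear to see the price shows a typical exponential distribution')
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
sns.histplot(train_y, kde=True)  # histplot replaces the deprecated distplot
plt.subplot(1, 2, 2)
# Same plot restricted to prices below the 90th percentile
sns.histplot(train_y[train_y < np.quantile(train_y, 0.9)], kde=True)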

Transform price so that it is closer to a normal distribution:

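import seaborn as sns
# train_y_ln is assumed to be the log-transformed price computed beforehand
print('The transformed price looks like a normal distribution')
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
sns.histplot(train_y_ln, kde=True)  # histplot replaces the deprecated distplot
plt.subplot(1, 2, 2)
sns.histplot(train_y_ln[train_y_ln < np.quantile(train_y_ln, 0.9)], kde=True)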
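The text does not show how `train_y_ln` is produced. One common choice, assumed here, is `np.log1p` (the log of 1 + price), which can be undone with `np.expm1`. The sketch below uses synthetic long-tailed prices (`price_demo` is a hypothetical stand-in for the real label) to show how the transform reduces skewness.

```python
import numpy as np

def skewness(values):
    """Sample skewness: third central moment divided by variance**1.5."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

# Synthetic long-tailed "prices" (log-normal), for illustration only
rng = np.random.default_rng(0)
price_demo = rng.lognormal(mean=8.0, sigma=1.0, size=10_000)

# Assumed transform: log(1 + price); expm1 maps predictions back to prices
price_demo_ln = np.log1p(price_demo)
recovered = np.expm1(price_demo_ln)

print('skewness before:', skewness(price_demo))
print('skewness after: ', skewness(price_demo_ln))
```

A model trained on the transformed label predicts log prices, so its predictions need the inverse transform before they are reported as prices.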


Model Performance Validation

Cross-Validation

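When fitting parameters, people usually split the full dataset into three parts (the MNIST handwritten digit dataset is one example): a training set (train_set), a validation set (valid_set), and a test set (test_set). The split exists to make the training result trustworthy. The test set is easy to understand: it takes no part in training at all and is used only to observe the final performance. The training and validation sets relate to the idea described next.
In practice a trained model usually fits the training set quite well (and can be sensitive to initial conditions), but often fits data outside the training set much less well. So we normally do not train on all the data. We hold out a part that takes no part in training and use it to test the parameters learned from the rest, which gives a relatively objective estimate of how well those parameters fit unseen data. This idea is called cross-validation. In k-fold cross-validation the data is split into k parts, each part serves as the held-out set once, and the k scores are averaged.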

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error, make_scorer

def log_transfer(func):
    # Wrap a metric so that it is computed on log-transformed values
    def wrapper(y, yhat):
        # nan_to_num guards against invalid values, e.g. the log of a negative prediction
        result = func(np.log(y), np.nan_to_num(np.log(yhat)))
        return result
    return wrapper

# Linear regression, five-fold cross-validation on the untransformed label (Error 1.36)
scores = cross_val_score(model, X=train_X, y=train_y, verbose=1, cv=5, scoring=make_scorer(log_transfer(mean_absolute_error)))
print('AVG:', np.mean(scores))

# Cross-validation on the transformed label
scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=1, cv=5, scoring=make_scorer(mean_absolute_error))
print('AVG:', np.mean(scores))

# Show the result of each of the five folds
scores = pd.DataFrame(scores.reshape(1, -1))
scores.columns = ['cv' + str(x) for x in range(1, 6)]
scores.index = ['MAE']
scores
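To make the mechanics of k-fold validation concrete, here is a rough sketch of the loop that `cross_val_score` performs internally, written out by hand with `KFold`. The data is synthetic (`X_demo`, `y_demo` are illustrative names), and shuffling is enabled here as a choice of this sketch.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)

fold_mae = []
splitter = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, valid_idx in splitter.split(X_demo):
    # Fit on four folds, score on the held-out fold
    fold_model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    pred = fold_model.predict(X_demo[valid_idx])
    fold_mae.append(mean_absolute_error(y_demo[valid_idx], pred))

print('MAE per fold:', fold_mae)
print('AVG:', np.mean(fold_mae))
```

Every sample is used for validation exactly once, so the averaged score uses all the data without ever scoring a model on rows it was trained on.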


Simulating the Real Business Setting

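In reality, though, we cannot foresee the future, so five-fold cross-validation can give an unrealistic picture on time-dependent datasets. Using 2018 used-car prices to predict 2017 used-car prices is clearly unreasonable, so we can also split the dataset in time order. In this example we take the earliest 4/5 of the samples as the training set and the latest 1/5 as the validation set. The final result differs little from five-fold cross-validation.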
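The time-ordered split described above can be sketched as follows. The helper `time_ordered_split` and the sample dates are hypothetical; which column of the real dataset carries the timestamp is not stated here, so the sketch just takes an array of sortable date values.

```python
import numpy as np

def time_ordered_split(timestamps, train_fraction=0.8):
    """Return (train_idx, valid_idx): the earliest rows train, the latest rows validate."""
    order = np.argsort(timestamps, kind="stable")  # row positions from oldest to newest
    split_at = int(len(order) * train_fraction)
    return order[:split_at], order[split_at:]

# Illustrative dates encoded as YYYYMMDD integers
dates_demo = np.array([20180301, 20170105, 20171220, 20160815, 20180110,
                       20170630, 20160102, 20171101, 20180520, 20170909])

train_idx, valid_idx = time_ordered_split(dates_demo)
print('train rows:', train_idx)
print('valid rows:', valid_idx)
```

The returned positions can index the feature matrix and the label alike, so the model is only ever validated on rows later than anything it was trained on.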

Plotting Learning Curves and Validation Curves

from sklearn.model_selection import learning_curve, validation_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=1, train_size=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel('Training example')
    plt.ylabel('score')
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_size, scoring=make_scorer(mean_absolute_error))
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    # Shaded bands: mean plus or minus one standard deviation across folds
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color='r', label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
    plt.legend(loc="best")
    return plt
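The heading also mentions validation curves, and `validation_curve` is imported above but not used. As a rough sketch of how it could be applied, the snippet below varies the `alpha` of a `Ridge` model on synthetic data and reports the training and validation MAE for each value. The estimator, the parameter range, and the data are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import validation_curve

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 5))
y_demo = X_demo @ np.array([1.0, 0.5, 0.0, -2.0, 3.0]) + rng.normal(scale=0.1, size=120)

# Score the model for several values of one hyperparameter
alphas = np.array([0.01, 1.0, 100.0, 10000.0])
train_scores, valid_scores = validation_curve(
    Ridge(), X_demo, y_demo,
    param_name="alpha", param_range=alphas,
    cv=5, scoring="neg_mean_absolute_error",
)

# Scores are negated MAE, so flip the sign and average over the folds
train_mae = -train_scores.mean(axis=1)
valid_mae = -valid_scores.mean(axis=1)
for alpha, tr, va in zip(alphas, train_mae, valid_mae):
    print(f"alpha={alpha:>8}: train MAE={tr:.3f}, validation MAE={va:.3f}")
```

A learning curve varies the amount of training data, while a validation curve varies one hyperparameter. Both compare training and validation scores to reveal underfitting or overfitting.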


Model Comparison

Comparing Common Models

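In filter and wrapper feature selection, the selection step is clearly separate from training the learner. Embedded feature selection instead performs feature selection automatically while the learner is trained. The most common embedded methods are L1 and L2 regularization. Adding L2 regularization to linear regression gives ridge regression, and adding L1 regularization gives Lasso regression.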

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

models = [LinearRegression(), Ridge(), Lasso()]

model = LinearRegression().fit(train_X, train_y_ln)
print('intercept:' + str(model.intercept_))
# continuous_feature_names is assumed to list the columns of train_X in order
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)


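L2 regularization tends to make the weights as small as possible during fitting, ending up with a model whose parameters are all fairly small. Models with small parameter values are generally considered simpler, adapt to different datasets, and avoid overfitting to some degree. Consider a linear regression equation: if the parameters are large, a small shift in the data has a large effect on the result, but if the parameters are small enough, even a larger shift has little effect. The technical way to put it is that the model is robust to perturbations.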

model = Ridge().fit(train_X, train_y_ln)
print('intercept:' + str(model.intercept_))
# Keyword arguments are required by recent seaborn releases
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)
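The shrinking effect of L2 regularization can be checked numerically. This small sketch, on synthetic data with illustrative names, fits `Ridge` with increasing `alpha` and measures the Euclidean norm of the coefficient vector.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=100)

# A stronger penalty (larger alpha) should give a smaller coefficient norm
alphas = [0.1, 10.0, 1000.0]
coef_norms = [
    float(np.linalg.norm(Ridge(alpha=a).fit(X_demo, y_demo).coef_))
    for a in alphas
]
for a, norm in zip(alphas, coef_norms):
    print(f"alpha={a:>7}: ||coef|| = {norm:.3f}")
```

The weights shrink toward zero as `alpha` grows, but none of them is forced to be exactly zero, which is the main contrast with L1 regularization.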

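L1 regularization helps produce a sparse weight matrix, which can then be used for feature selection. In the resulting plot, the power and used_time features turn out to be very important.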

model = Lasso().fit(train_X, train_y_ln)
print('intercept:' + str(model.intercept_))
# Features whose bar has zero length were dropped by the L1 penalty
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)
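To illustrate the sparsity claim, the following sketch fits `Ridge` and `Lasso` on synthetic data in which only the first two of six features influence the target, then counts coefficients that are exactly zero. The data and the `alpha` values are assumptions chosen for the demonstration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Only the first two features matter; the other four are pure noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
true_coef = np.array([4.0, -3.0, 0.0, 0.0, 0.0, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.1, size=200)

ridge_coef = Ridge(alpha=1.0).fit(X_demo, y_demo).coef_
lasso_coef = Lasso(alpha=0.5).fit(X_demo, y_demo).coef_

# Count coefficients that are exactly zero under each penalty
n_zero_ridge = int(np.sum(ridge_coef == 0))
n_zero_lasso = int(np.sum(lasso_coef == 0))
print('ridge coefficients:', np.round(ridge_coef, 3), 'zeros:', n_zero_ridge)
print('lasso coefficients:', np.round(lasso_coef, 3), 'zeros:', n_zero_lasso)
```

The features that keep a nonzero Lasso coefficient form the selected subset. A larger `alpha` removes more features, and it also shrinks the ones that remain.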

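In addition, when a decision tree picks split nodes by information entropy or the Gini index, the features it prefers to split on are the more important ones, which is also a form of feature selection. The feature importance scores in XGBoost and LightGBM (exposed as `feature_importances_` in their scikit-learn style wrappers) are computed on this basis.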
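As a small illustration of tree-based importance, the sketch below uses scikit-learn's `DecisionTreeRegressor` rather than XGBoost or LightGBM, so that it runs with scikit-learn alone. For regression, scikit-learn splits on squared-error reduction instead of entropy or Gini, but the importance idea is the same. The data is synthetic and only feature 0 drives the target.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Only feature 0 influences the target; features 1 and 2 are noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = 5.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_demo, y_demo)

# Importances are normalized impurity reductions and sum to 1
importances = tree.feature_importances_
print('feature importances:', np.round(importances, 3))
```

Ranking features by these scores and keeping the top ones is one simple way to use a tree model for feature selection.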

Common Nonlinear Models

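Besides linear models there are many commonly used nonlinear models, listed below. Space does not allow explaining how each one works. We picked some common models and compared their results against the linear model.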

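from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

models = [LinearRegression(),
          DecisionTreeRegressor(),
          RandomForestRegressor(),
          GradientBoostingRegressor(),
          MLPRegressor(solver='lbfgs', max_iter=100),
          XGBRegressor(n_estimators=100, objective='reg:squarederror'),
          LGBMRegressor(n_estimators=100)]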
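The comparison loop itself is not shown above. A minimal sketch of how such a list of models might be scored fold by fold is given below. It uses only scikit-learn estimators and synthetic nonlinear data, so the names `demo_models`, `X_demo`, `y_demo` and the resulting numbers are illustrative and say nothing about the used-car results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with a nonlinear relationship, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

demo_models = [
    LinearRegression(),
    DecisionTreeRegressor(random_state=0),
    RandomForestRegressor(n_estimators=50, random_state=0),
]

# Five-fold MAE for every model, keyed by class name
results = {}
for m in demo_models:
    neg_mae = cross_val_score(m, X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error")
    results[type(m).__name__] = -neg_mae  # flip the sign to get MAE

for name, fold_scores in results.items():
    print(f"{name:>24}: {np.round(fold_scores, 3)}  mean={fold_scores.mean():.3f}")
```

Keeping the per-fold scores, not just the mean, makes it possible to check whether one model wins consistently or only on average.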


The random forest model achieves better results than the other models in every fold.

Hyperparameter Tuning

Greedy Tuning

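Tune the parameter that currently affects the model most until it is optimal, then tune the next most influential parameter, and so on until every parameter has been tuned. The drawback is that this may end in a local optimum rather than the global one, but it saves so much time and effort that it is worth trying.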

# The candidate lists (objective, num_leaves, max_depth, n_estimators, learning_rate)
# and the data X, Y_ln are assumed to be defined beforehand.

def cv_mae(model):
    # Mean five-fold cross-validated MAE of the model on (X, Y_ln)
    return np.mean(cross_val_score(model, X, Y_ln, verbose=0, cv=5, scoring=make_scorer(mean_absolute_error)))

def best_key(scores):
    # Candidate with the lowest MAE
    return min(scores.items(), key=lambda x: x[1])[0]

# Start with a dictionary of scores per candidate and tune objective first
best_obj = dict()
for obj in objective:
    best_obj[obj] = cv_mae(LGBMRegressor(objective=obj))

# With objective fixed at its best value, tune num_leaves
best_leaves = dict()
for leaves in num_leaves:
    best_leaves[leaves] = cv_mae(LGBMRegressor(objective=best_key(best_obj), num_leaves=leaves))

# With the two best values above, tune max_depth
best_depth = dict()
for depth in max_depth:
    best_depth[depth] = cv_mae(LGBMRegressor(objective=best_key(best_obj),
                                             num_leaves=best_key(best_leaves),
                                             max_depth=depth))

# Tune n_estimators
best_nstimators = dict()
for nstimator in n_estimators:
    best_nstimators[nstimator] = cv_mae(LGBMRegressor(objective=best_key(best_obj),
                                                      num_leaves=best_key(best_leaves),
                                                      max_depth=best_key(best_depth),
                                                      n_estimators=nstimator))

# Tune learning_rate
best_lr = dict()
for lr in learning_rate:
    best_lr[lr] = cv_mae(LGBMRegressor(objective=best_key(best_obj),
                                       num_leaves=best_key(best_leaves),
                                       max_depth=best_key(best_depth),
                                       n_estimators=best_key(best_nstimators),
                                       learning_rate=lr))


Grid Search Tuning

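GridSearchCV exists to automate tuning: give it the parameter candidates and it returns the best result and the best parameters. It suits small datasets, though. Once the data volume grows, it becomes hard to get a result. It offers little advantage here because the dataset is large and the search is unlikely to finish, but it is included for completeness since it is sometimes very handy.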

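from sklearn.model_selection import GridSearchCV

# Every combination of the candidate values is evaluated with five-fold cross-validation
parameters = {'objective': objective, 'num_leaves': num_leaves, 'max_depth': max_depth}
model = LGBMRegressor()
clf = GridSearchCV(model, parameters, cv=5)
clf = clf.fit(train_X, train_y)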
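Reading the outcome of a grid search is not shown above. The sketch below runs `GridSearchCV` on a small synthetic problem with a `Ridge` model (both are assumptions chosen so that the example runs quickly) and inspects `best_params_`, `best_score_`, and the number of combinations tried.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data with a nonzero intercept, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 2.0 + rng.normal(scale=0.1, size=150)

# 3 x 2 = 6 parameter combinations, each scored with five-fold CV
grid = {"alpha": [0.01, 1.0, 100.0], "fit_intercept": [True, False]}
search = GridSearchCV(Ridge(), grid, cv=5, scoring="neg_mean_absolute_error")
search.fit(X_demo, y_demo)

n_combinations = len(search.cv_results_["params"])
print('combinations tried:', n_combinations)
print('best parameters:', search.best_params_)
print('best MAE:', -search.best_score_)
```

The number of fits is the product of all candidate list lengths times the number of folds, which is why the cost grows so quickly on large datasets.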

Bayesian Tuning

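Bayesian optimization applied to hyperparameter tuning works as follows. Given an objective function to optimize (a function in the broad sense: only its inputs and outputs need to be specified, not its internal structure or mathematical properties), it keeps adding sample points to update the posterior distribution of the objective function (a Gaussian process) until the posterior roughly matches the true distribution. Put simply, it uses the information from previous parameter evaluations to choose the current parameters better.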

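from bayes_opt import BayesianOptimization

# Define the function to optimize
def rf_cv(num_leaves, max_depth, subsample, min_child_samples):
    model = LGBMRegressor(objective='regression_l1', num_leaves=int(num_leaves),
                          max_depth=int(max_depth), subsample=subsample,
                          min_child_samples=int(min_child_samples))
    val = cross_val_score(model, X, Y_ln, verbose=0, cv=5, scoring=make_scorer(mean_absolute_error)).mean()
    return 1 - val  # the optimizer maximizes, so a lower MAE gives a higher value

# Define the search space
rf_bo = BayesianOptimization(rf_cv, {
    'num_leaves': (2, 100),
    'max_depth': (2, 100),
    'subsample': (0.1, 1),
    'min_child_samples': (2, 100),
})
# Run the optimization
num_iter = 25
init_points = 5
rf_bo.maximize(init_points=init_points, n_iter=num_iter)
# Show the best result (old bayes_opt API; releases from 1.0 on expose it as rf_bo.max)
rf_bo.res["max"]
# Search near known points (useful once decent parameter values exist).
# explore() belongs to the old bayes_opt API, and its keys should match the search
# space above; the keys below do not match the arguments of rf_cv.
rf_bo.explore({'n_estimators': [10, 100, 200], 'min_samples_split': [2, 10, 20], 'max_features': [0.1, 0.5, 0.9], 'max_depth': [5, 10, 15]})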