Multivariate Statistical Analysis: Multiple Linear Regression

1. Classical linear regression and cross-validation

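import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Study hours and the corresponding exam scores
examDict = {'hours': [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
                      2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50],
            'score': [10, 22, 13, 43, 20, 22, 33, 50, 62,
                      48, 55, 75, 62, 73, 81, 76, 64, 82, 90, 93]}
examDf = pd.DataFrame(examDict)
exam_X = examDf[['hours']]  # features must be two-dimensional
exam_Y = examDf['score']
X_train, X_test, Y_train, Y_test = train_test_split(exam_X, exam_Y, train_size=0.8)
model = LinearRegression()
model.fit(X_train, Y_train)
a = model.intercept_  # intercept
b = model.coef_  # regression coefficient
y_train_pred = model.predict(X_train)  # predictions on the training set
score = model.score(X_test, Y_test)  # coefficient of determination; 0.8866470295386657 in one run, varies with the random split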
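The score above comes from a single random split, so it changes from run to run. The following is a minimal sketch of repeating the split and checking the coefficient of determination by hand. The toy data and all variable names are assumptions made for illustration; they are not the exam scores above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy data: y is roughly 2x + 1 plus noise
rng = np.random.default_rng(0)
X = np.linspace(0, 5, 40).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.5, size=40)

manual_scores = []
sklearn_scores = []
for run in range(10):
    # A different random_state per run gives a different 80/20 split
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=run)
    model = LinearRegression().fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    ss_res = np.sum((y_te - y_pred) ** 2)       # residual sum of squares
    ss_tot = np.sum((y_te - y_te.mean()) ** 2)  # total sum of squares
    manual_scores.append(1.0 - ss_res / ss_tot)
    sklearn_scores.append(model.score(X_te, y_te))

print([round(s, 3) for s in manual_scores])
```

The hand-computed values agree with `model.score`, which is defined as 1 minus the ratio of the residual to the total sum of squares on the data passed to it.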

2. Hypothesis tests on the parameters of a classical multiple linear regression model

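import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn import datasets  # datasets bundled with scikit-learn

# load_boston was removed in scikit-learn 1.2; this snippet requires an earlier release
data = datasets.load_boston()  # load the Boston housing dataset
df = pd.DataFrame(data.data, columns=data.feature_names)
target = pd.DataFrame(data.target, columns=["MEDV"])
X = df[['CRIM', 'ZN', 'INDUS']]  # X usually denotes the input (independent) variables
y = target["MEDV"]  # y usually denotes the output (dependent) variable
X = sm.add_constant(X)  # add an intercept (beta_0) to the model
model = sm.OLS(y, X).fit()  # sm.OLS(output, input)
predictions = model.predict(X)
print(model.summary())  # print the fitted model's statistics, including the t-test for each coefficient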
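The summary table reports a t statistic and a p-value for each coefficient. As a rough sketch of where those numbers come from, the block below computes them directly with NumPy and SciPy on synthetic data. The data, the sample size and every variable name are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

# Hypothetical data: y depends on the first predictor only
rng = np.random.default_rng(0)
n = 200
X_raw = rng.normal(size=(n, 2))
y = 1.0 + 3.0 * X_raw[:, 0] + rng.normal(size=n)

X = np.column_stack([np.ones(n), X_raw])   # prepend the intercept column
beta = np.linalg.solve(X.T @ X, X.T @ y)   # ordinary least squares estimate
resid = y - X @ beta
dof = n - X.shape[1]                       # residual degrees of freedom
sigma2 = resid @ resid / dof               # unbiased estimate of the error variance
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))  # standard errors
t_stats = beta / se                        # tests H0: beta_j = 0
p_values = 2 * stats.t.sf(np.abs(t_stats), dof)  # two-sided p-values

for name, b, t, p in zip(["const", "x1", "x2"], beta, t_stats, p_values):
    print(f"{name}: coef={b:.3f}, t={t:.2f}, p={p:.4f}")
```

A small p-value for a coefficient is evidence against the hypothesis that the coefficient is zero; here that should be the case for `x1`, which actually drives `y`.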

3. Ridge regression model

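# df2 holds the independent variables and df1 the dependent variable (both are defined in the full code)
X_train, X_test, Y_train, Y_test = train_test_split(df2, df1, train_size=0.8)
# Fixed-alpha alternative: model = Ridge(alpha=0.5, fit_intercept=True)
# scikit-learn 1.2 removed the normalize argument; standardize the features beforehand if scaling is needed
model = RidgeCV(alphas=[0.01, 0.1, 0.2, 0.5, 1], cv=10)
model.fit(X_train, Y_train)
ridge_best_alpha = model.alpha_  # best lambda value
print(f"Ridge regression regularization parameter = {ridge_best_alpha}")
# Compute the coefficient of determination
a = model.intercept_
b = model.coef_
y_train_pred = model.predict(X_train)
score = model.score(X_test, Y_test)
print(score)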
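RidgeCV hides the selection loop. The sketch below spells out one way to do the same selection by hand: score each candidate alpha with ten-fold cross-validation and keep the one with the highest mean coefficient of determination. The synthetic data from `make_regression` and the fold settings are assumptions, not the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical stand-in data with 12 predictors
X, y = make_regression(n_samples=200, n_features=12, noise=10.0, random_state=0)
alphas = [0.01, 0.1, 0.2, 0.5, 1]
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Mean ten-fold R^2 for each candidate alpha
mean_r2 = {}
for alpha in alphas:
    fold_r2 = cross_val_score(Ridge(alpha=alpha), X, y, cv=cv, scoring="r2")
    mean_r2[alpha] = fold_r2.mean()
best_alpha = max(mean_r2, key=mean_r2.get)

# RidgeCV given the same folds and the same scoring rule
ridge_cv = RidgeCV(alphas=alphas, cv=cv, scoring="r2").fit(X, y)

print("manual choice:", best_alpha)
print("RidgeCV choice:", ridge_cv.alpha_)
```

Passing the same `cv` object to both routes keeps the folds identical, which makes the two choices directly comparable.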

4. Modeling with the best lambda value

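# scikit-learn 1.2 removed the normalize argument; standardize the features beforehand if scaling is needed
ridge = Ridge(alpha=ridge_best_alpha)
ridge.fit(X_train, Y_train)
ridge_predict = ridge.predict(X_test)
# Compute the loss (root mean squared error)
rmse = np.sqrt(mean_squared_error(Y_test, ridge_predict))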
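To make concrete what fitting a ridge model with a fixed alpha computes, here is a small sketch comparing scikit-learn's result with the closed-form solution (XᵀX + αI)⁻¹Xᵀy. It assumes no intercept and uses made-up data; the coefficient values are placeholders.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical data with three predictors and no intercept
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

alpha = 0.5
# Closed form: solve (X^T X + alpha * I) beta = X^T y
beta_closed = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)
# scikit-learn minimizes ||y - X w||^2 + alpha * ||w||^2
beta_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_

print(beta_closed)
print(beta_sklearn)
```

With `fit_intercept=True` the same formula applies after centering X and y, which is why the intercept itself is not penalized.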

5. LASSO regression model

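alphas = [0.01, 0.1, 0.2, 0.5, 1]  # assumed here to be the candidate lambda values listed in the assignment
# x_tr, x_te, y_tr, y_te are the train/test split
# scikit-learn 1.2 removed the normalize argument; standardize the features beforehand if scaling is needed
lasso_cv = LassoCV(alphas=alphas, cv=10, max_iter=10000)
lasso_cv.fit(x_tr, y_tr)
lasso_best_alpha = lasso_cv.alpha_
print(lasso_best_alpha)
lasso = Lasso(alpha=lasso_best_alpha, max_iter=10000)
lasso.fit(x_tr, y_tr)
lasso_predict = lasso.predict(x_te)  # predictions
RMSE = np.sqrt(mean_squared_error(y_te, lasso_predict))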
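Unlike ridge regression, the Lasso penalty can set coefficients to exactly zero. The sketch below, on synthetic data where only two of six predictors matter, shows how the fraction of zero coefficients can be read off for several alpha values. The data and the alpha grid are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical data: only the first two of six predictors influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

zero_fraction = {}
for alpha in [0.01, 0.1, 0.5, 1.0]:
    coef = Lasso(alpha=alpha, max_iter=10000).fit(X, y).coef_
    # Share of coefficients that the penalty has driven to exactly zero
    zero_fraction[alpha] = float(np.mean(coef == 0))

for alpha, frac in zero_fraction.items():
    print(f"alpha={alpha}: {frac:.0%} of coefficients are zero")
```

Comparing with `== 0` is appropriate here because the Lasso solver stores eliminated coefficients as exact zeros rather than small numbers.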

Additional notes for this assignment:

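seed = 7
np.random.seed(seed)
# 10-fold cross-validation; random_state is only accepted when shuffle=True
# StratifiedKFold expects class labels; for a continuous target use KFold with the same arguments
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
# With random_state fixed, every run yields the same splits, hence the same datasets and the same fitted model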
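A quick way to see the effect of a fixed random_state is to generate the folds twice and compare them. This sketch uses KFold on a placeholder array; the helper name `fold_indices` is made up for the example.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(-1, 1)  # placeholder data: 20 samples

def fold_indices(seed):
    """Return the test indices of each of the 10 folds for a given seed."""
    kf = KFold(n_splits=10, shuffle=True, random_state=seed)
    return [test_idx.tolist() for _, test_idx in kf.split(X)]

same_seed_matches = fold_indices(7) == fold_indices(7)
other_seed_matches = fold_indices(7) == fold_indices(8)

print("same seed, same folds:", same_seed_matches)
print("different seed, same folds:", other_seed_matches)
```

Each fold holds two of the 20 samples, and every sample appears in exactly one test fold.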

y denotes the model output and y_ denotes the ground-truth answer
mse = tf.reduce_mean(tf.square(y_ - y))

Assignment

3. Dataset overview

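The original data contain 506 observations of 14 variables. medv (median value of owner-occupied homes, in thousands of US dollars) is the original target variable. The other variables are: crim (per-capita crime rate by town), zn (proportion of residential land zoned for lots over 25,000 square feet), indus (proportion of non-retail business acres per town), chas (Charles River dummy variable: 1 if the tract borders the river, 0 otherwise), nox (nitric oxide concentration, in parts per 10 million), rm (average number of rooms per dwelling), age (proportion of owner-occupied units built before 1940), dis (weighted distance to five Boston employment centers), rad (index of accessibility to radial highways), tax (full-value property-tax rate per 10,000 US dollars), ptratio (pupil-teacher ratio by town), b (= 1000(B - 0.63)^2, where B is the proportion of Black residents by town), and lstat (percentage of lower-status population). The corrected dataset adds the following variables: cmedv (corrected median value of owner-occupied homes, in thousands of US dollars), town (name of the town), tract (census tract), lon (longitude of the census tract), and lat (latitude of the census tract).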

4. Using the dataset

We use cmedv (corrected median value of owner-occupied homes) as the dependent variable and the 12 variables crim, zn, indus, nox, rm, age, dis, rad, tax, ptratio, b, and lstat as the independent variables (see the file BostonHousing2.csv for the data).
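Once the dependent and independent variables are chosen, a numeric companion to a scatter plot is the correlation of each predictor with the target. The following sketch shows the idea on a hypothetical DataFrame. The column names `x1`, `x2` and `target` are placeholders, not the columns of BostonHousing2.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical table: x1 drives the target, x2 is unrelated noise
rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
df = pd.DataFrame({
    "x1": x1,
    "x2": x2,
    "target": 2.0 * x1 + rng.normal(scale=0.5, size=n),
})

X = df[["x1", "x2"]]  # independent variables
y = df["target"]      # dependent variable

# Pearson correlation of every predictor with the target, strongest first
corr_with_target = X.corrwith(y).sort_values(key=abs, ascending=False)
print(corr_with_target)
```

With the real file, the same two selection lines would list the 12 predictor columns and `cmedv` instead of the placeholders.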

5. Regression tasks

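(1) Use the 12 specified independent variables and the dependent variable cmedv to create a scatter-plot matrix. The main purpose is to inspect the correlation between each independent variable and the dependent variable.

(2) Randomly split the dataset into a training set (used to fit the model) and a test set (used to assess prediction accuracy) in a 3:1 size ratio. Repeat this step ten times and plot the ten results as a line chart, with the run number on the horizontal axis and the corresponding coefficient of determination on the vertical axis. (The plot need not match the reference figure; it mainly needs to show how the coefficient of determination varies.)

(3) Choosing the best regression equation: randomly draw n of the 12 independent variables (n = 2, ..., 12), compute the coefficient of determination of each resulting model by ten-fold cross-validation, and compare these values to decide which model predicts best. (The model that uses all independent variables is not necessarily the most accurate.)

(4) Choosing the key regularization parameter λ for ridge and Lasso regression: with λ restricted to the five candidate values 0.01, 0.1, 0.2, 0.5, and 1, use ten-fold cross-validation and the coefficient of determination to find the best λ for each of the two models.

(5) Compare the coefficients of determination (obtained by ten-fold cross-validation) of the Lasso model at its best λ, the ridge model at its best λ, and the model that uses all 12 independent variables. Which model is the most accurate on this dataset? For the Lasso model at its best λ, compute the ratio of the number of zero regression coefficients to the total number of independent variables (that is, 12).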
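Tasks (3) and (5) both ask for coefficients of determination obtained by ten-fold cross-validation rather than from a single split. A minimal sketch of that evaluation with `cross_val_score` is given below. It runs on synthetic data from `make_regression`, and the alpha value of 0.5 and the helper `cv_r2` are placeholders, so the numbers say nothing about the housing data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in: 12 predictors, of which only 6 are informative
X, y = make_regression(n_samples=300, n_features=12, n_informative=6,
                       noise=10.0, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)
rng = np.random.default_rng(0)

def cv_r2(estimator, features):
    """Mean coefficient of determination over the ten folds."""
    return cross_val_score(estimator, features, y, cv=cv, scoring="r2").mean()

# Task (3): one random subset of predictors for each n = 2, ..., 12
subset_r2 = {}
for n in range(2, 13):
    cols = rng.choice(X.shape[1], size=n, replace=False)
    subset_r2[n] = cv_r2(LinearRegression(), X[:, cols])
best_n = max(subset_r2, key=subset_r2.get)

# Task (5): compare the three models on all predictors
model_r2 = {
    "ols": cv_r2(LinearRegression(), X),
    "ridge": cv_r2(Ridge(alpha=0.5), X),
    "lasso": cv_r2(Lasso(alpha=0.5, max_iter=10000), X),
}
best_model = max(model_r2, key=model_r2.get)

print("best subset size:", best_n)
print({name: round(score, 3) for name, score in model_r2.items()})
```

Reusing one `KFold` object for every model means all candidates are scored on the same folds, so differences in the mean score reflect the models rather than the splits.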

Full code

import random

import matplotlib
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt  # "as" gives the module a shorter alias
from matplotlib.font_manager import FontProperties  # for loading a Chinese font
from pandas.plotting import scatter_matrix
from sklearn.linear_model import Lasso, LassoCV, LinearRegression, Ridge, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import StratifiedKFold, train_test_split

# font = FontProperties(fname=r"/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc")
matplotlib.rcParams['font.family'] = 'SimHei'
matplotlib.rcParams['font.sans-serif'] = ['SimHei']

pd_data = pd.read_csv(r"./BostonHousing2.csv", header=1)
# print(pd_data)
df1 = pd_data[['cmedv']]  # dependent variable
df2 = pd_data[['crim', 'zn', 'indus', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'b', 'lstat']]  # independent variables

# Task 1:
print("***************************************************************")
print("Task 1")
dff = pd_data[['crim', 'zn', 'indus', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'b', 'lstat', 'cmedv']]
scatter_matrix(dff, alpha=0.3, figsize=(14, 8))  # scatter-plot matrix of the predictors and cmedv
plt.grid(True)
plt.savefig('cmedy')
plt.show()

# Task 2:
print("***************************************************************")
print("Task 2")
li = []
for epoch in range(10):
    # 3:1 train/test split, as the task specifies
    X_train, X_test, Y_train, Y_test = train_test_split(df2, df1, train_size=0.75)
    model = RidgeCV(alphas=[0.1, 1.0, 10.0])  # RidgeCV picks the best alpha by cross-validation
    model.fit(X_train, Y_train)
    a = model.intercept_
    b = model.coef_
    y_train_pred = model.predict(X_train)
    score = model.score(X_test, Y_test)
    li.append(score)
x = list(range(1, 11, 1))
y = [round(i, 2) for i in li]
plt.plot(x, y, linewidth=2, color='g', marker='o', markerfacecolor='blue', markersize=3)  # draw the line chart
plt.ylim(0, 1)  # limit the range of the vertical axis
for a, b in zip(x, y):
    plt.text(a, b, b, ha='center', va='bottom', fontsize=20)
plt.title("House price prediction")
plt.xlabel("Run")
plt.ylabel("Coefficient of determination")
plt.show()

# Task 3 (includes Task 5)
# Fit ridge regression on randomly selected variables with ten-fold cross-validation and compute the coefficient of determination
print("Task 3 (includes Task 5)")
print("Randomly selecting variables:")
X_train, X_test, Y_train, Y_test = train_test_split(df2, df1, train_size=0.8)
for p in range(10):
    ans = random.randint(1, 12)  # number of variables to draw
    df3 = X_train.sample(n=ans, axis=1)  # random subset of the training columns
    l = list(df3.columns)
    df4 = X_test[l]  # the same columns from the test set
    # scikit-learn 1.2 removed the normalize argument; standardize the features beforehand if scaling is needed
    model = RidgeCV(alphas=[0.01, 0.1, 0.2, 0.5, 1], cv=10)
    model.fit(df3, Y_train)
    ridge_best_alpha = model.alpha_  # best lambda value
    a = model.intercept_
    b = model.coef_
    yy_train_pred = model.predict(df3)
    score = model.score(df4, Y_test)
    print(f"Round {p+1}: drew {ans} variables\nridge regularization parameter = {ridge_best_alpha}, coefficient of determination {round(score, 2)}")

# Task 4 (includes Task 5)
print("*************************************************")
print("Task 4 (includes Task 5)")
print("Choosing the regularization parameter λ for ridge regression:")
X_train, X_test, Y_train, Y_test = train_test_split(df2, df1, train_size=0.8)
# scikit-learn 1.2 removed the normalize argument; standardize the features beforehand if scaling is needed
model = RidgeCV(alphas=[0.01, 0.1, 0.2, 0.5, 1], cv=10)
model.fit(X_train, Y_train)
ridge_best_alpha = model.alpha_  # best lambda value
print(f"Ridge regularization parameter λ = {ridge_best_alpha}")
ridge = Ridge(alpha=ridge_best_alpha)
ridge.fit(X_train, Y_train)
ridge_predict = ridge.predict(X_test)
rmse = np.sqrt(mean_squared_error(Y_test, ridge_predict))  # root mean squared error
score = ridge.score(X_test, Y_test)
print(f"At the best λ: loss {round(rmse, 2)}, coefficient of determination: {round(score, 2)}")
print("*************************************************")
print("Choosing the regularization parameter λ for LASSO regression:")
X_train, X_test, Y_train, Y_test = train_test_split(df2, df1, train_size=0.8)
# scikit-learn 1.2 removed the normalize argument; standardize the features beforehand if scaling is needed
lasso_cv = LassoCV(alphas=[0.01, 0.1, 0.2, 0.5, 1], cv=10)
y_train_1d = Y_train.values.ravel()  # LassoCV expects a 1-D target, but Y_train is a single-column DataFrame
lasso_cv.fit(X_train, y_train_1d)
lasso_best_alpha = lasso_cv.alpha_  # best lambda value
print(f"LASSO regularization parameter λ = {lasso_best_alpha}")
lasso = Lasso(alpha=lasso_best_alpha)
lasso.fit(X_train, y_train_1d)
lasso_predict = lasso.predict(X_test)  # predictions
rmse = np.sqrt(mean_squared_error(Y_test, lasso_predict))  # root mean squared error
ss = lasso.score(X_test, Y_test)  # coefficient of determination of the Lasso model
print(f"At the best λ: loss {round(rmse, 2)}, coefficient of determination: {round(ss, 2)}")
print("***************************************************************")
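print("Task 5")
ll = lasso.coef_
print("Regression coefficients of each variable:")
print(ll)
ans = int(np.sum(ll == 0))  # Lasso stores eliminated coefficients as exact zeros
tmp = round(100 * ans / len(ll))  # percentage of zero coefficients among the 12 variables
print(f"Ratio of zero regression coefficients to the number of independent variables: {tmp}%.")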