机器学习常用的分类器比较-实例

article/2025/9/28 5:47:16

这篇学习文章是在上一篇博客（http://blog.csdn.net/july_sun/article/details/53088673）的基础上，从机器学习的四要素（数据，算法和模型，计算机硬件，机器学习平台）角度出发用实例将各个分类器做一比较，下面就开始这段代码的奇妙旅程吧~~

第一：计算机硬件

本例中只要普通的64位 win系统即可，使用的python是W64的Anaconda，这个python平台的好处是已安装好经常使用的ML包，如sklearn和数据处理包，如numpy和scipy.

第二：机器学习平台

首先，本例中用到了数据处理包numpy对数据进行预处理，当然也可以直接使用scipy包，如果调用scipy就可以直接使用numpy，因为scipy是基于numpy的.

其次，在机器学习模型训练和参数选择时使用万能的sklearn库，里面包含了机器学习的最广泛使用的算法

第三：数据

step1：数据收集

网上有大量的免费数据和文本可以拿来使用，也可以自己简单生成或者是用sklearn的datasets的数据集，亦或是sklearn的make_regression来生成纯数据以及带有噪声的数据，本例中使用的是中国气象数据网2015年的上海的气象数据来让计算机学习上海这一年中的气温和一些因素的关系，即监督学习，其中Y是气温.

代码如下：

import numpy as npW = ['C:\\Users\\123\\Desktop\\weather\\2015.txt',]weather = np. loadtxt ( W [0] , skiprows =1)weather[: ,7 ] = weather[: ,7 ] / 10plt . figure ()
plt . axis ([0 , 370 , -5, 40])
plt . ylabel (" Temperatures"  )
plt . xlabel (" Month"  )
plt . xticks ([30 , 58, 89, 119 , 150 , 180 ,211 , 242 , 272 , 303 , 333 , 364],['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'] )plt . title (' Temperatures trent chart of Shanghai in year 2015 ')plt . plot ( np.arange(len(weather)),weather[: , 7 ],label=2015,marker='*',linestyle='-',color="black")
plt . legend()
plt . show()

可视化如下：

step2：数据清洗

首先要看一下数据是否是完整的，如果不完整（可能缺失或者不对齐等）就需要清洗数据，这一步对后续模型的选择以及结果很重要，因为这篇文章重点在机器学习分类器的比较，所以选择了数据比较完整的气象数据~~

第四：算法和模型

step1：特征工程（feature selection）

Y是气温，X是6个与气温相关的参量，为了预测准确，要排除X之间的相关性，找到独立的与Y 相关的X，用到了相关系数来对比

代码如下：

np. random . shuffle ( weather )
z = (weather [:, 7], weather [: ,5] , weather [: ,10],weather [: ,11] , weather [: ,13] ,weather [: ,16] , weather [: ,21])
z = np. transpose (z)
cor = np. corrcoef (z, rowvar =0)
np.savetxt("C:\\Users\\123\\Desktop\\weather\\mydata.csv",cor,delimiter=",")

由相关系数找到和气温最相关的四个变量作为输入x

step2：CV交叉验证来看一下不同分类器会不会overfitting，以Random Forest分类器为例

代码如下：

from sklearn.ensemble import RandomForestRegressor
from sklearn.learning_curve import validation_curvenp. random . shuffle ( weather )
y = weatherall [:, 7]
x = np. vstack ((weatherall [:, 10], weatherall [:, 11],weatherall [:, 14], weatherall [:, 21])) 
x = np. transpose (x)
x = preprocessing . scale (x)
RF = RandomForestRegressor()
train_loss,test_loss = validation_curve(RF,x,y,,cv=10,scoring='mean_squared_error'.train_sizes=[0.1,0.25,0.5,0.75,1])
train_loss_mean = -np.mean(train_loss,axis=1)
test_loss_mean = -np.mean(test_loss,axis=1)plt.figure()
plt.plot(param_range,train_loss_mean,'o-',c='r',label='Training')
plt.plot(param_range,test_loss_mean,'o-',c='b',label='Cross_validation')
plt.xlabel('Training exapmles')
plt.ylabel('Loss')
plt.legend(loc='best')
plt.title("10 CV on Random Forest ")
plt.show()

可视化如下：

从可视化结果看，training过程中训练集合验证集的loss一直下降，说明能计算机能一直很好的学到知识，不会有overfitting的问题，换用其他几个模型（lr/DT/SVM/KNN/SGD/GB）也是同样的效果，这可能和数据有直接的关系.

step3：各个分类器的比较（残差，错误率），以GradientBoost为例，其他的LInear Regression,SGD,KNN,SVM,DT和集成算法RF,ET都是类似的方法

代码如下：

gb = GradientBoostingRegressor()
y_gb = gb . fit ( x_train , y_train).predict ( x_test )
print "Residual sum of Gradient Boosting Regression squares is", np. mean(( y_gb - y_test ) ** 2)

得到各个分类器的结果如下：