PCA Analysis in R

article/2025/8/4 17:53:01

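PCA Data Analysis

For analyzing and visualizing PCA results, the factoextra package is the first choice. It can handle PCA results computed by a range of R functions, including: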

stats::prcomp()

FactoMineR::PCA()

ade4::dudi.pca()

ExPosition::epPCA()
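As a minimal sketch of the first entry in the list above, the snippet below runs stats::prcomp() and stores the result under the name res.pca that the later calls in this article use. The USArrests data set is an assumption made purely for illustration (it ships with R and is not the data behind the output shown in this article), and the factoextra call is guarded because the package may not be installed.

```r
# Illustrative data: USArrests ships with base R and is NOT the data set
# behind the output shown in this article.
data(USArrests)

# scale. = TRUE standardizes each variable, so the PCA is based on the
# correlation matrix rather than the covariance matrix.
res.pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Standard deviations of the principal components and the variable loadings.
print(res.pca$sdev)
print(res.pca$rotation)

# If factoextra is installed, the same object can be handed to its helpers.
if (requireNamespace("factoextra", quietly = TRUE)) {
  print(factoextra::get_eigenvalue(res.pca))
}
```

Standardizing is a choice rather than a requirement; it is used here so that variables measured on different scales contribute comparably.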

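To decide how many principal components to keep, one option is to look at the eigenvalues of the components (for standardized data, the Kaiser-Harris criterion recommends keeping components with an eigenvalue greater than 1). An eigenvalue measures the amount of variation retained (the variance explained) by its principal component. The eigenvalues can be extracted with the get_eigenvalue() function, for example eig.val <- get_eigenvalue(res.pca). In the result, the first column is the eigenvalue, the second is the percentage of variance explained, and the third is the cumulative percentage of variance explained.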

> eig.val
       eigenvalue variance.percent cumulative.variance.percent
Dim.1   4.1242133        41.242133                    41.24213
Dim.2   1.8385309        18.385309                    59.62744
Dim.3   1.2391403        12.391403                    72.01885
Dim.4   0.8194402         8.194402                    80.21325
Dim.5   0.7015528         7.015528                    87.22878
Dim.6   0.4228828         4.228828                    91.45760
Dim.7   0.3025817         3.025817                    94.48342
Dim.8   0.2744700         2.744700                    97.22812
Dim.9   0.1552169         1.552169                    98.78029
Dim.10  0.1219710         1.219710                   100.00000

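Besides using an eigenvalue greater than 1 as the cutoff for the number of principal components, a threshold on the cumulative percentage of total variance can also serve as the criterion. In the table above, the eigenvalue rule keeps three components (about 72% of the variance), whereas an 80% cumulative threshold would keep four.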
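The two criteria can be sketched in base R as follows. This is an illustration under stated assumptions: it uses the built-in USArrests data rather than the article's data, the names pca, eig.tab, n.kaiser and n.cumulative are made up for the example, and the 80% cutoff is an arbitrary choice.

```r
# Illustrative data only (built into R); not the article's data set.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Eigenvalues are the variances of the principal components.
eigenvalue <- pca$sdev^2
variance.percent <- 100 * eigenvalue / sum(eigenvalue)
cumulative.variance.percent <- cumsum(variance.percent)

# Same three columns as the table shown in the article.
eig.tab <- data.frame(eigenvalue, variance.percent, cumulative.variance.percent,
                      row.names = paste0("Dim.", seq_along(eigenvalue)))
print(eig.tab)

# Criterion 1: keep components whose eigenvalue exceeds 1.
n.kaiser <- sum(eigenvalue > 1)

# Criterion 2: keep the fewest components that reach a cumulative threshold.
# The 80% cutoff is an arbitrary value picked for this sketch.
threshold <- 80
n.cumulative <- which(cumulative.variance.percent >= threshold)[1]

cat("Eigenvalue > 1:", n.kaiser, "component(s)\n")
cat("Cumulative >=", threshold, "%:", n.cumulative, "component(s)\n")
```

The two rules need not agree, so it is worth checking both before settling on a number of components.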

Besides reading the table, a scree plot gives a more direct visual impression:

fviz_eig(res.pca, addlabels = TRUE, ylim = c(0, 50))

[Figure: scree plot produced by fviz_eig()]

To extract information about the variables from the PCA result, use get_pca_var():

var <- get_pca_var(res.pca)
var

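For example, to show the relationship between the variables and the principal components, as well as the associations among the variables, inspect head(var$coord) directly or plot it: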

fviz_pca_var(res.pca, col.var = "black")

[Figure: correlation circle of the variables produced by fviz_pca_var()]

The original tutorial interprets the plot as follows:

Positively correlated variables are grouped together.

Negatively correlated variables are positioned on opposite sides of the plot origin (opposed quadrants).

The distance between variables and the origin measures the quality of the variables on the factor map. Variables that are away from the origin are well represented on the factor map.
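The coordinates drawn on a correlation circle can be reproduced by hand, which makes the interpretation above concrete. The sketch below assumes a PCA on standardized variables computed with stats::prcomp() on the built-in USArrests data (not the article's data); var.coord, var.cor and dist.origin are names invented for the example.

```r
# Illustrative data only (built into R); not the article's data set.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Coordinate of a variable on a component = loading * component std. dev.
# sweep() multiplies column j of the loading matrix by sdev[j].
var.coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# For a PCA on standardized data this equals the correlation between each
# original variable and each component score.
var.cor <- cor(USArrests, pca$x)

print(round(var.coord[, 1:2], 3))
print(all.equal(var.coord, var.cor, check.attributes = FALSE))

# Distance of each variable from the origin on the PC1-PC2 plane; it cannot
# exceed 1, the radius of the correlation circle.
dist.origin <- sqrt(rowSums(var.coord[, 1:2]^2))
print(round(dist.origin, 3))
```

Variables whose coordinates share the same sign on both axes point in similar directions, which is what "grouped together" means on the plot.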

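Besides the correlation circle above, there is the quality of representation (var$cos2), which shows how well each variable is represented on each principal component. A high cos2 means the variable has a good representation on that component, and on the correlation circle it lies close to the circumference. A low cos2 means the variable is not well represented by that component, and it lies close to the center of the circle. For any given variable, the cos2 values summed over all principal components equal 1, so the fewer components it takes for the cumulative cos2 to approach 1, the better those components represent the variable; a variable whose cos2 over the two plotted components sums to nearly 1 sits on the circumference of the correlation circle.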

library("corrplot")

corrplot(var$cos2, is.corr=FALSE)

[Figure: corrplot of var$cos2]
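The properties of cos2 described above can be checked numerically. The following is a sketch under the same assumptions as before: a standardized PCA from stats::prcomp() on the built-in USArrests data, with var.cos2 and cos2.plane as illustrative names rather than anything taken from the article.

```r
# Illustrative data only (built into R); not the article's data set.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Variable coordinates: loadings scaled by the component standard deviations.
var.coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# cos2 is the squared coordinate of a variable on a component.
var.cos2 <- var.coord^2
print(round(var.cos2, 3))

# Summed over ALL components, the cos2 of every variable is 1.
print(rowSums(var.cos2))

# Quality of representation on the PC1-PC2 factor map: values near 1 mean
# the variable lies close to the circumference of the correlation circle.
cos2.plane <- rowSums(var.cos2[, 1:2])
print(round(sort(cos2.plane, decreasing = TRUE), 3))
```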

The original tutorial summarizes the cos2 values as follows:

The cos2 values are used to estimate the quality of the representation.

The closer a variable is to the circle of correlations, the better its representation on the factor map (and the more important it is to interpret these components).

Variables that are close to the center of the plot are less important for the first components.

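Closely related to cos2 are the contributions to the principal components: a variable's contribution to a component is its cos2 on that component expressed as a percentage of the total cos2 of all variables on that component.

Put simply, variables with a high contribution to PC1 and PC2 are the ones that explain most of the variability in the data. The contributions of the variables to PC1 and PC2 can be shown graphically: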

fviz_pca_var(res.pca, col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07")
             )

[Figure: correlation circle colored by variable contribution]
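The contribution computation can be sketched in base R. As before, this assumes a standardized PCA from stats::prcomp() on the built-in USArrests data, and the names var.contrib, contrib.12 and expected are illustrative. The eigenvalue-weighted average used to combine PC1 and PC2 is one common convention and is not necessarily the exact value that factoextra maps to color.

```r
# Illustrative data only (built into R); not the article's data set.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# cos2 of each variable on each component (squared coordinates).
var.cos2 <- sweep(pca$rotation, 2, pca$sdev, `*`)^2

# Contribution (in %) of a variable to a component: its cos2 divided by
# the total cos2 of all variables on that component.
var.contrib <- 100 * sweep(var.cos2, 2, colSums(var.cos2), `/`)
print(round(var.contrib, 2))
print(colSums(var.contrib))  # every column sums to 100

# Combined contribution to PC1 and PC2, weighting each component by its
# eigenvalue (an assumed convention for this sketch).
eig <- pca$sdev^2
contrib.12 <- (var.contrib[, 1] * eig[1] + var.contrib[, 2] * eig[2]) /
  (eig[1] + eig[2])
print(round(contrib.12, 2))

# Reference level: the contribution each variable would have if all
# variables contributed equally.
expected <- 100 / nrow(var.contrib)
print(names(contrib.12)[contrib.12 > expected])
```

Variables above the equal-contribution reference level are the natural candidates to focus on when interpreting the first two components.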