一组随机样本数据需要进行分析处理时,往往需要用到假设检验,对于离散变量discrete多用卡方检验,连续变量continuous用t检验或wilcoxon秩序和检验,具体的的使用场景如下
离散变量-卡方检验-适用条件
四格表:
- 所有的理论数T≥5并且总样本量n≥40,用卡方进行检验
- 理论数T<5但T≥1并且n≥40,用连续性校正的卡方进行检验
- 有理论数T<1或n<40,用Fisher’s检验
非四格表(R×C表):
- R×C表中理论数小于5的格子不能超过1/5,且不能有小于1的理论数,用卡方进行检验
- 其他情况不能直接检验,要通过增加样本数、列合并等方法改变数据,再用合适的检验方法
注:实际建模分析中,离散变量多是有多个因子水平的,所以大多是非四格表的形式,若不满足条件1:理论数小于5的格子不能超过1/5,通常都是把数据量小的组合并到相似或相近的组内,然后再进行检验分析
连续变量-T检验/wilcoxon秩序和检验-适用条件
- 非正态分布,用非参数方法wilcoxon秩序和检验
- 正态分布,用T检验,python中stats包中ttest_ind可区分是否是同方差
离散变量代码示例:
class DiscreteTest():def __init__(self):passdef _discrete_test(self, data, var, yname):chisq_stat, chisq_p_val, df, expect = stats.chi2_contingency(pd.crosstab(data[yname], data[var]), correction=True)# fourfold tableif expect.size == 4:if (expect.sum() >= 40) and (expect.min()) >= 5:stat, p_val, df, expect_show = stats.chi2_contingency(pd.crosstab(data[yname], data[var]), correction=True)method = 'chisqtest'elif (expect.sum() >= 40) and (expect.min() < 5) and (expect.min() >= 1):stat, p_val, df, expect_show = stats.chi2_contingency(pd.crosstab(data[yname], data[var]), correction=False)method = 'adj_chisqtest'else:prior_odds_ratio, p_val = stats.fisher_exact(pd.crosstab(data[yname], data[var]))method = 'fisher'# non-fourfold tableelif expect.size > 4:if len(expect[expect < 5]) / expect.size <= 0.2:stat, p_val, df, expect_show = stats.chi2_contingency(pd.crosstab(data[yname], data[var]), correction=False)method = 'chisqtest'else:prior_odds_ratio, p_val = stats.fisher_exact(pd.crosstab(data[yname], data[var]))method = 'fisher'return {'p_value': p_val, 'method': method, 'var': var}def _discrete_stat(self, data, discrete_vars, yname):output = pd.DataFrame()for var in discrete_vars:temp = pd.DataFrame({"level": list(pd.Series(data[var]).value_counts().index),'value': list(pd.Series(data[var]).value_counts().values)})temp['var'] = varstat_t = self._discrete_test(data, var, yname)temp['p_value'] = stat_t['p_value']temp['method'] = stat_t['method']output = output.append(temp, ignore_index=True)return output
结果展示:
如图所知:
在α=5%的显著水平下,离散变量education和workYears运用卡方检验发现,education的p_value<α,显著,拒绝原假设,education在正负样本间有显著差异;同理,workYears的p_value>α,不显著,接受原假设,workYears在正负样本间没有显著差异。
连续变量代码示例:
class ContinuousTest():def __init__(self):passdef _continuous_test(self, data, var, yname):# normal testnor_stat = list()nor_p_val = list()for j in pd.unique(data[yname]):nor_stat.append(stats.normaltest(data.loc[data[yname] == j, var])[0])nor_p_val.append(stats.normaltest(data.loc[data[yname] == j, var])[1])# homogeneity of variance testf_stat, f_p_val = stats.f_oneway(data.loc[data[yname] == 0, var],data.loc[data[yname] == 1, var])# T-test or wilcoxon rank-sumif (min(nor_p_val) < 0.05):# p<α,reject null hypothesis: x comes from a normal distribution,so use wilcoxon rank-sum statisticstat, p_val = stats.ranksums(data.loc[data[yname] == 0, var], data.loc[data[yname] == 1, var])method = "wilcoxon-ranksum"elif f_p_val < 0.05:# p<α,reject null hypothesis that two or more groups have the same population meanstat, p_val = stats.ttest_ind(data.loc[data[yname] == 0, var], data.loc[data[yname] == 1, var],equal_var=False)method = "ttest, non-equal var"else:# nor_p_val>0.05 and f_p_val variance is samestat, p_val = stats.ttest_ind(data.loc[data[yname] == 0, var],data.loc[data[yname] == 1, var],equal_var=True)method = "ttest, equal var"stat_t = {'stat': stat, 'p_value': p_val, 'method': method, 'var': var}return stat_tdef _continuous_stat(self, data, continuous_vars, yname):output =pd.DataFrame()for var in continuous_vars:temp=pd.DataFrame({"level":list(pd.Series(data.loc[data[var].notnull(),var]).describe().index),'value' :list(pd.Series(data.loc[data[var].notnull(),var]).describe().values)})temp['var'] = varstat_t = self._continuous_test(data, var, yname)temp['stat'] = stat_t['stat']temp['p_value'] = stat_t['p_value']temp['method'] = stat_t['method']output = output.append(temp, ignore_index=True)return output
结果展示:
如图所知:
在α=5%的显著水平下,连续变量score_a运用T检验发现,p_value>α,不显著,不能拒绝原假设,所以score_a在正负样本间没有显著差异;同理,连续变量score_b运用wilcoxon-ranksum检验,p_value发现显著,拒绝原假设,score_b在正负样本间存在差异。