- Introduction to Logistic Regression
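Logistic regression is appropriate when the response variable is binary (0/1). The model assumes that Y follows a binomial distribution, and the linear model is fitted in the form:

$$\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \sum_{j=1}^{p} \beta_j X_j$$

where π = μ_Y is the conditional mean of Y (that is, the probability that Y = 1 given a set of X values), π/(1 − π) is the odds that Y = 1, and log(π/(1 − π)) is the log odds, or logit. Here log(π/(1 − π)) is the link function and the probability distribution is binomial. A logistic regression model can be fitted with the following code: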
glm(Y ~ X1 + X2 + X3, family = binomial(link = "logit"), data = mydata)
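To make the call concrete, here is a minimal sketch on simulated data. The data frame `mydata`, the predictors `X1` to `X3`, and the coefficient values are all made up for illustration and do not come from the example that follows. The sketch fits the model and then checks that applying the inverse logit (`plogis()`) to the linear predictor reproduces the fitted probabilities.

```r
# Simulated data: the coefficients below are arbitrary assumptions.
set.seed(42)
n <- 500
mydata <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n))
eta <- -0.5 + 1.2 * mydata$X1 - 0.8 * mydata$X2   # X3 has no true effect
mydata$Y <- rbinom(n, size = 1, prob = plogis(eta))

fit <- glm(Y ~ X1 + X2 + X3, family = binomial(link = "logit"), data = mydata)

# The linear predictor is the log odds; plogis() maps it back to a probability.
log_odds <- predict(fit, type = "link")
prob     <- predict(fit, type = "response")
all.equal(plogis(log_odds), prob)
```

Because the logit is the default link for `binomial()`, writing `family = binomial()` should give the same fit.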
Example
Logistic regression is a very useful tool for predicting a binary outcome from a set of continuous and/or categorical predictor variables.
# Use the Affairs data frame from the AER package to walk through a regression on extramarital affairs
> data(Affairs, package = "AER")   # load the dataset from the package (inside functions, require(<package>) is also used)
> summary(Affairs)                 # start with descriptive statistics to get an overview
    affairs          gender         age         yearsmarried    children  religiousness     education       occupation
 Min.   : 0.000   female:315   Min.   :17.50   Min.   : 0.125   no :171   Min.   :1.000   Min.   : 9.00   Min.   :1.000
 1st Qu.: 0.000   male  :286   1st Qu.:27.00   1st Qu.: 4.000   yes:430   1st Qu.:2.000   1st Qu.:14.00   1st Qu.:3.000
 Median : 0.000                Median :32.00   Median : 7.000             Median :3.000   Median :16.00   Median :5.000
 Mean   : 1.456                Mean   :32.49   Mean   : 8.178             Mean   :3.116   Mean   :16.17   Mean   :4.195
 3rd Qu.: 0.000                3rd Qu.:37.00   3rd Qu.:15.000             3rd Qu.:4.000   3rd Qu.:18.00   3rd Qu.:6.000
 Max.   :12.000                Max.   :57.00   Max.   :15.000             Max.   :5.000   Max.   :20.00   Max.   :7.000
     rating
 Min.   :1.000
 1st Qu.:3.000
 Median :4.000
 Mean   :3.932
 3rd Qu.:5.000
 Max.   :5.000
> table(Affairs$affairs)           # frequency table: counts how often each value occurs
  0   1   2   3   7  12
451  34  17  19  42  38

# Logistic regression models a binary outcome, so first recode the data as a factor
> Affairs$affairs[Affairs$affairs > 0] <- 1    # where Affairs$affairs > 0 is TRUE, assign 1
> Affairs$affairs[Affairs$affairs == 0] <- 0
> Affairs$ynaffair <- factor(Affairs$affairs, levels = c(0, 1), labels = c("No", "Yes"))   # convert to a factor
> table(Affairs$ynaffair)          # check the result with table() again
 No Yes
451 150

# Fit the logistic model
> fit.full <- glm(ynaffair ~ gender + age + yearsmarried + children +
+                   religiousness + education + occupation + rating,
+                 data = Affairs, family = binomial())
> summary(fit.full)

Call:
glm(formula = ynaffair ~ gender + age + yearsmarried + children +
    religiousness + education + occupation + rating, family = binomial(),
    data = Affairs)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.5713  -0.7499  -0.5690  -0.2539   2.5191

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.37726    0.88776   1.551 0.120807
gendermale     0.28029    0.23909   1.172 0.241083       # no "*" means not significant, i.e. p > 0.05
age           -0.04426    0.01825  -2.425 0.015301 *     # the more "*", the more significant
yearsmarried   0.09477    0.03221   2.942 0.003262 **
childrenyes    0.39767    0.29151   1.364 0.172508
religiousness -0.32472    0.08975  -3.618 0.000297 ***
education      0.02105    0.05051   0.417 0.676851
occupation     0.03092    0.07178   0.431 0.666630
rating        -0.46845    0.09091  -5.153 2.56e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 675.38  on 600  degrees of freedom
Residual deviance: 609.51  on 592  degrees of freedom
AIC: 627.51

Number of Fisher Scoring iterations: 4
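The output shows that gender, children, education, and occupation do not contribute significantly to the equation. We can drop them, fit a simpler model, and then compare the two models to see whether the simpler one is adequate.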
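The significance column in the summary comes from a Wald z test: each estimate is divided by its standard error and the result is compared with a standard normal distribution. A small sketch using the `age` row as printed in the `summary(fit.full)` output (the two numbers are copied from that table; the variable names are only illustrative):

```r
estimate  <- -0.04426   # age coefficient from summary(fit.full)
std_error <-  0.01825   # its standard error

z <- estimate / std_error    # Wald z statistic
p <- 2 * pnorm(-abs(z))      # two-sided p-value under the standard normal

round(c(z = z, p = p), 4)
```

The result should agree with the printed z value (-2.425) and p-value (0.0153) up to rounding, which is why `age` gets one star while `gendermale`, with a p-value of about 0.24, gets none.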
# Drop the non-significant variables and refit
> fit.reduced <- glm(ynaffair ~ age + yearsmarried + religiousness +
+                      rating, data = Affairs, family = binomial())
> summary(fit.reduced)

Call:
glm(formula = ynaffair ~ age + yearsmarried + religiousness +
    rating, family = binomial(), data = Affairs)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.6278  -0.7550  -0.5701  -0.2624   2.3998

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.93083    0.61032   3.164 0.001558 **
age           -0.03527    0.01736  -2.032 0.042127 *
yearsmarried   0.10062    0.02921   3.445 0.000571 ***
religiousness -0.32902    0.08945  -3.678 0.000235 ***
rating        -0.46136    0.08884  -5.193 2.06e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 675.38  on 600  degrees of freedom
Residual deviance: 615.36  on 596  degrees of freedom
AIC: 625.36

Number of Fisher Scoring iterations: 4

# The AIC of the simpler model (625.36) is lower than that of the full model (627.51), which suggests the simpler model is acceptable. We can also compare the two fits with anova().
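The two AIC values can be checked by hand. For ungrouped 0/1 responses the residual deviance equals minus twice the maximized log-likelihood, so AIC is the residual deviance plus 2k, where k is the number of estimated coefficients including the intercept (9 in the full model, 5 in the reduced model):

$$
\begin{aligned}
\mathrm{AIC}_{\text{full}} &= 609.51 + 2 \times 9 = 627.51 \\
\mathrm{AIC}_{\text{reduced}} &= 615.36 + 2 \times 5 = 625.36
\end{aligned}
$$

The reduced model fits slightly worse (its deviance is higher by about 5.85), but it saves four parameters, and the penalty saved (8) outweighs the fit lost.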
Because the two models are nested (fit.reduced is a subset of fit.full), they can be compared with anova(). For generalized linear models, use a chi-square test.
## Compare the two nested models with anova(); for generalized linear models use test = "Chisq" (chi-square test)
> anova(fit.full, fit.reduced, test = "Chisq")
Analysis of Deviance Table

Model 1: ynaffair ~ gender + age + yearsmarried + children + religiousness +
    education + occupation + rating
Model 2: ynaffair ~ age + yearsmarried + religiousness + rating
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       592     609.51
2       596     615.36 -4  -5.8474   0.2108

# The chi-square value is not significant (p = 0.21), which indicates that the reduced model with four predictors fits as well as the full model with eight predictors.
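The p-value in the last column can be reproduced by hand, which makes clear what anova() is testing: the increase in residual deviance is compared with a chi-square distribution whose degrees of freedom equal the number of dropped coefficients. A minimal sketch using the rounded deviances printed in the table (the variable names are illustrative):

```r
dev_full    <- 609.51; df_full    <- 592   # fit.full
dev_reduced <- 615.36; df_reduced <- 596   # fit.reduced

lr_stat <- dev_reduced - dev_full          # likelihood-ratio statistic, about 5.85
lr_df   <- df_reduced - df_full            # 4 coefficients were dropped

# Upper-tail probability of a chi-square with lr_df degrees of freedom
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
round(p_value, 2)
```

Because the inputs are rounded, the result matches the printed 0.2108 only to about two decimal places. A p-value of roughly 0.21 gives no evidence that the four dropped variables improve the fit.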