What’s wrong with logistic regression? Part I: The basics
In one of the most cited articles in the social sciences, Carina Mood (2010) points out a couple of problems with the interpretation of results from logistic regression. These problems have led many authors to move away from interpreting logistic regression coefficients and to base their conclusions on average marginal effects (AMEs) instead. Some authors even abandon logistic regression as a method to analyse binary dependent variables altogether and use linear probability models instead (e.g. Maneuvrier-Hervieu et al. 2025). In this and the following blog posts I’ll take a look at Mood’s claims and ask how serious the problems uncovered by Mood are and how helpful her suggested solutions are.
The problems uncovered by Mood are:
- Coefficients of nested logistic regression models cannot be compared, because coefficients will change when predictor variables are added to a model, even if the added predictor variables are uncorrelated with those already included. This problem does not occur with linear regression.
- Coefficients of logistic regression models fitted to samples from different populations cannot be compared, because populations may have different error variances in the dependent variable.
For these reasons, Mood explicitly recommends the use of average marginal effects (AMEs) for drawing conclusions about the influence of predictor variables on binary dependent variables. She also remarks that if the true logistic regression coefficients are small enough that the non-linearity in the link between the probability of a positive outcome and the predictor variables is (almost) inconsequential, then the AMEs are very close to the coefficients of a linear probability model fitted by OLS. So if one is interested in AMEs and the influence of the predictor variables is not “too strong”, one can avoid the computational and interpretative complications of logistic regression and use linear regression instead. However, she does not explicitly recommend the use of linear regression for binary dependent variables.
Logistic regression
Logistic regression is a statistical technique to describe the influence of one or more predictor variables on a binary response variable. Mathematically, the binary response variable can be represented by a sequence of random variables $Y_1,\ldots,Y_n$ that take a positive value of unity ($Y_i = 1$) with some probability $\pi_i$, where $0 < \pi_i < 1$, and a value of zero with probability $1-\pi_i$. In a logistic regression model, the relation between the binary response and a set of predictor variables is expressed in the following form:

$$\ln\frac{\pi_i}{1-\pi_i} = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots$$
or alternatively

$$\pi_i = \frac{\exp(\alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots)}{1 + \exp(\alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots)}.$$
Here, $x_{1i}$ is the value of the first predictor variable for observation (or individual or case) $i$, $x_{2i}$ is the value of the second predictor variable, etc.
The logarithmic ratio $\ln\bigl(\pi_i/(1-\pi_i)\bigr)$ is also known as log-odds or logit, while the function $x \mapsto \exp(x)/\bigl(1+\exp(x)\bigr)$ is also known as the logistic function.
Logistic regression also has a latent variable interpretation: This interpretation starts with a latent variable $Y_i^*$ with values

$$y_i^* = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \varepsilon_i$$

with an error term $\varepsilon_i$.
The binary response variable is created by the following dichotomization:

$$Y_i = \begin{cases} 1 & \text{if } Y_i^* > 0, \\ 0 & \text{otherwise.} \end{cases}$$
If $\varepsilon_i$ has a logistic distribution, i.e. a distribution with cumulative distribution function

$$F(x) = \Pr(\varepsilon_i \le x) = \frac{\exp(x)}{1+\exp(x)},$$

then the logistic regression model applies to $Y_i$.
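One way to see this is to note that, by the symmetry of the logistic distribution, $1 - F(-x) = F(x)$, so that

$$\Pr(Y_i = 1) = \Pr(Y_i^* > 0) = \Pr\bigl(\varepsilon_i > -(\alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots)\bigr) = F(\alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots) = \pi_i.$$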
Marginal and average marginal effects
In a (linear or non-linear) regression model, the marginal effect of a predictor variable is defined as the partial derivative of the expected value of the response with respect to the values of the predictor variable.
In a linear regression model without interaction terms, the marginal effect of a variable equals its coefficient. For a numeric dependent variable, a linear regression model can be written as

$$\mathrm{E}(Y_i) = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots$$

(i.e. instead of the “raw” response variable we have its expected value on the left-hand side and drop the error term on the right-hand side). The marginal effect of the $k$-th predictor variable with values $x_{ki}$ is then defined as:

$$\frac{\partial\,\mathrm{E}(Y_i)}{\partial x_{ki}} = \beta_k,$$
that is, the marginal effect is identical for each observation and each potential value of the independent variable.
In a logistic regression model, the marginal effect of an independent variable is also defined as the partial derivative of the expected value of the response variable, which equals $\pi_i$:

$$\frac{\partial\,\mathrm{E}(Y_i)}{\partial x_{ki}} = \frac{\partial \pi_i}{\partial x_{ki}} = \pi_i(1-\pi_i)\,\beta_k.$$
In contrast to linear regression, the marginal effect is not constant, but varies with the values of the independent variables via $\pi_i(1-\pi_i)$, which is a function of the independent variables. It should be noted that the marginal effect of the $k$-th independent variable not only varies with its own values but also with the values of the other independent variables, a fact that has led to quite some confusion.[1] It is noteworthy that $\pi_i(1-\pi_i)$ has a maximum at $\pi_i = 0.5$ (its derivative $1 - 2\pi_i$ vanishes there), so that the maximum attainable value of the marginal effect is $0.25\,\beta_k$.
The average marginal effect (AME) of the $k$-th independent variable in a logistic regression model can be defined as the (sample) average of the marginal effects:

$$\mathrm{AME}_k = \frac{1}{n}\sum_{i=1}^{n} \pi_i(1-\pi_i)\,\beta_k.$$
If the values of the predictor variables can be interpreted as independent, identically distributed random variables with joint density function $f(x_1,\ldots,x_K)$, then one could interpret an AME as the estimate of a population value

$$\beta_k \int \pi(x_1,\ldots,x_K)\bigl(1-\pi(x_1,\ldots,x_K)\bigr)\,f(x_1,\ldots,x_K)\,dx_1\cdots dx_K,$$

where $\pi(x_1,\ldots,x_K)$ denotes the probability of a positive outcome as a function of the predictor values.
In the special case of a logistic regression model with a single independent variable that in turn has a uniform distribution with density

$$f(x) = \frac{1}{b-a} \quad \text{for } a \le x \le b,$$

this “population” AME equals a difference ratio:

$$\mathrm{AME} = \frac{\pi(b) - \pi(a)}{b-a}.$$
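To see where the difference ratio comes from, note that the marginal effect at a value $x$ of the single predictor is the derivative $\pi'(x) = \beta\,\pi(x)\bigl(1-\pi(x)\bigr)$, so averaging it over the uniform density gives

$$\int_a^b \beta\,\pi(x)\bigl(1-\pi(x)\bigr)\,\frac{1}{b-a}\,dx = \frac{1}{b-a}\int_a^b \pi'(x)\,dx = \frac{\pi(b)-\pi(a)}{b-a}.$$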
While this is a very narrow special case, it may give some intuition of the meaning of average marginal effects, as well as some of their properties that emerge in the discussion below.
[1] This confusion culminates in the claim that product terms are neither necessary nor sufficient for interaction effects in logistic regression (see e.g. Berry et al. 2010), at least if interaction effects are defined as systematic changes in marginal effects.
Linear regression and the linear probability model
Linear regression involves the estimation of the coefficients of the linear model of the relation between a response variable $Y_i$ and one or several predictor variables $x_{1i}, x_{2i}, \ldots$:

$$Y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \varepsilon_i.$$
If $Y_i$ is dichotomous with values $0$ and $1$, then $\mathrm{E}(Y_i)$ equals $\pi_i = \Pr(Y_i = 1)$, at least as long as the value of the expected value is between $0$ and $1$. This leads to the linear probability model:

$$\pi_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots$$

with

$$0 \le \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots \le 1.$$
As long as $\alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots$ is between $0$ and $1$, the partial derivatives with respect to any value of the independent variables will equal their respective coefficients, but whenever it is either less than $0$ or greater than $1$, the partial derivative is zero.
An example application
The following example uses a data set from the R package
“carData”, which comes from a
survey conducted on the occasion of the 1988 Chilean plebiscite. The data set contains
respondents’ stated vote intentions in a factor variable
vote with values Y for “yes, will vote for Pinochet”, N for “no, will not
vote for Pinochet”, A for “will abstain”, and U for “undecided”. It also contains a
variable statusquo, which is based on a Likert scale for respondents’ status
quo preference and has a standard deviation approximately equal to one.
The following lines activate two packages, “carData” and “memisc”.
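library(carData) # provides the Chile survey data used below
library(memisc)  # provides %$$% and percent(), used below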
In the next few lines, we create a binary variant of the vote variable within
the data frame Chile.
# %$$% is a shorthand for `within()` defined in the "memisc" package.
Chile %$$% {
  vote2 <- factor(vote, levels = c("N", "Y"))
  vote.bin <- as.integer(vote2 == "Y")
}
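As an optional sanity check (assuming, as intended above, that %$$% adds the new variables to the Chile data frame), one can cross-tabulate the original vote variable with the binary variant; abstainers and undecided respondents should show up as missing:
# Cross-tabulation of the original and the recoded vote variable
with(Chile, table(vote, vote.bin, useNA = "ifany"))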
Since we want to enhance the data set with predictions from the models, we need
to make sure that prediction vectors are padded with NA where observations are missing. The
following option setting makes this the default behaviour for lm() and glm():
options(na.action=na.exclude)
The following lines define a function that computes the average marginal effect for each predictor variable in a logistic regression model.
# Note that this function is non-generic and is correct only for
# logistic regression or logit models
AME <- function(mod) {
  p <- predict(mod, type = "response")
  cf <- coef(mod)
  cf <- cf[names(cf) != "(Intercept)"]
  unname(mean(p * (1 - p), na.rm = TRUE)) * cf
}
A first model: Status quo preference as a “strong” predictor of the vote
The first model relates respondents’ votes in the plebiscite to their preference for the status quo. We get a pretty large coefficient for this predictor variable.
glm.statusquo <- glm(vote.bin~statusquo,
data=Chile,
family=binomial)
summary(glm.statusquo)
Call:
glm(formula = vote.bin ~ statusquo, family = binomial, data = Chile)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.21531 0.09964 2.161 0.0307 *
statusquo 3.20554 0.14310 22.401 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2431.28 on 1753 degrees of freedom
Residual deviance: 752.59 on 1752 degrees of freedom
(946 observations deleted due to missingness)
AIC: 756.59
Number of Fisher Scoring iterations: 6
Let’s compute the average marginal effect of statusquo:
AME(glm.statusquo)
statusquo
0.1946738
For comparison, we also fit a linear model:
lm.statusquo <- lm(vote.bin~statusquo,
data=Chile)
summary(lm.statusquo)
Call:
lm(formula = vote.bin ~ statusquo, data = Chile)
Residuals:
Min 1Q Median 3Q Max
-1.08835 -0.10223 -0.02417 0.04055 1.02152
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.492167 0.006203 79.35 <2e-16 ***
statusquo 0.394079 0.005721 68.89 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2598 on 1752 degrees of freedom
(946 observations deleted due to missingness)
Multiple R-squared: 0.7303, Adjusted R-squared: 0.7302
F-statistic: 4745 on 1 and 1752 DF, p-value: < 2.2e-16
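Before turning to predicted values, it may be instructive to put the two estimates side by side. The following optional lines (using the AME() helper defined above) show that the OLS coefficient of statusquo, about 0.39, is roughly twice the AME from the logistic regression, about 0.19:
# Compare the linear regression coefficient with the logistic AME
coef(lm.statusquo)["statusquo"]
AME(glm.statusquo)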
To illustrate how statusquo is related to vote.bin according to logistic
regression and linear regression, we compute predicted values:
Chile <- within(Chile, {
  pred.glm.statusquo <- predict(glm.statusquo, type = "response")
  pred.lm.statusquo <- predict(lm.statusquo, type = "response")
})
Finally, we use graphics to get an idea of the “fit” of the two models, logistic regression and linear regression, with the data. For this we use the popular “ggplot2” package.
library(ggplot2)
theme_set(theme_bw())
Attaching package: ‘ggplot2’
The following object is masked from ‘package:memisc’:
syms
The following plot compares the “fit” of logistic regression, linear regression, and a model-free scatterplot smoother. The black curve connects the predicted values from the logistic regression model. Due to its construction, it stays strictly between 0 and 1. The red line connects the predicted values from the linear regression. Obviously there are many “out-of-range” predictions with value greater than 1 or less than 0 created by the linear model. The blue dotted curve connects the predictions from the model-free scatterplot smoother. This curve also stays mostly between 0 and 1 and is similar in shape to the curve that represents the logistic regression predictions.
subset(Chile,is.finite(pred.glm.statusquo)) |>
ggplot(aes(x=statusquo)) +
geom_point(aes(y=vote.bin),shape=3) +
geom_smooth(aes(y=vote.bin),se=FALSE,linetype="dotted") +
geom_line(aes(y=pred.glm.statusquo),linetype="solid") +
geom_line(aes(y=pred.lm.statusquo),linetype="solid",color="red")
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
The proportion of “out-of-range” predictions created by linear regression is quite large: About one third of the predicted values is greater than 1 or less than 0.
with(Chile,
percent(pred.lm.statusquo > 1 | pred.lm.statusquo < 0))
Percentage N
33.75143 1754.00000
A second model: Age is a “weak” predictor with almost linear impact on the vote
The second model relates respondents’ votes in the plebiscite to their age. This gives a relatively small coefficient for the predictor variable. However, this is not very surprising, because the unit of measurement for age is years. It would be surprising if an age difference of a single year had much of an impact.
glm.age <- glm(vote.bin ~ age, data = Chile, family=binomial)
summary(glm.age)
Call:
glm(formula = vote.bin ~ age, family = binomial, data = Chile)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.81418 0.13312 -6.116 9.58e-10 ***
age 0.02078 0.00327 6.356 2.07e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2435.5 on 1756 degrees of freedom
Residual deviance: 2394.1 on 1755 degrees of freedom
(943 observations deleted due to missingness)
AIC: 2398.1
Number of Fisher Scoring iterations: 4
The value of the AME is even smaller; it is about one-fourth of the logistic regression
coefficient. Given that the maximum attainable marginal effect is $0.25\,\beta_k$, this is quite close to the possible maximum of a marginal
effect. This is quite different from the situation with statusquo as predictor
variable, where the AME is only about 0.06 times the logistic regression coefficient.
AME(glm.age)
age
0.00507336
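To make these ratios explicit, the following optional lines relate each AME to the corresponding logistic regression coefficient; from the output shown above, the ratio is roughly 0.24 for age and roughly 0.06 for statusquo:
# Ratio of AME to logistic regression coefficient (maximum possible: 0.25)
AME(glm.age) / coef(glm.age)["age"]
AME(glm.statusquo) / coef(glm.statusquo)["statusquo"]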
In a linear regression of vote.bin on age, the age variable gets a
coefficient that is very close to its AME obtained from logistic
regression. However, the $R^2$ does not indicate a particularly good fit: its
value shows that just about 2 percent of the variance of the response can be
explained, in stark contrast to the model with statusquo as predictor variable,
where $R^2$ indicates that more than 70 percent of the variance can be explained.
lm.age <- lm(vote.bin ~ age, data = Chile)
summary(lm.age)
Call:
lm(formula = vote.bin ~ age, data = Chile)
Residuals:
Min 1Q Median 3Q Max
-0.6574 -0.4682 -0.3915 0.5012 0.6085
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2994234 0.0322561 9.283 < 2e-16 ***
age 0.0051133 0.0007889 6.482 1.17e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4944 on 1755 degrees of freedom
(943 observations deleted due to missingness)
Multiple R-squared: 0.02338, Adjusted R-squared: 0.02282
F-statistic: 42.01 on 1 and 1755 DF, p-value: 1.175e-10
Again, we compute predicted values to illustrate the relation between age and
vote.bin according to logistic and linear regression.
Chile <- within(Chile, {
  pred.glm.age <- predict(glm.age, type = "response")
  pred.lm.age <- predict(lm.age)
})
The following code would, in principle, create a plot with lines for the predictions from logistic regression and linear regression. But because the predictions coincide almost everywhere, only the curve for the linear regression is visible in the resulting plot.
subset(Chile,is.finite(pred.glm.age)) |>
ggplot(aes(x=age)) +
geom_point(aes(y=vote.bin),shape=3) +
geom_line(aes(y=pred.glm.age),linetype="solid") +
geom_line(aes(y=pred.lm.age),linetype="solid",color="red")
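How close the two prediction curves are can also be checked numerically; the following optional line computes the largest absolute difference between the two sets of predictions, which should turn out to be small:
# Maximum absolute difference between logistic and linear predictions for the age models
with(Chile, max(abs(pred.glm.age - pred.lm.age), na.rm = TRUE))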
Comparing the marginal effects obtained from the two models
To get an understanding of why the AME is so much smaller relative to the
logistic regression coefficient in the model with statusquo than in the model
with age as predictor variable, a closer look at the distribution of the
marginal effects may be of interest.
The following lines of code define a function that computes the marginal effects of the predictor variable for each individual observation:
# Note that this function is non-generic and is correct only for
# logistic regression or logit models
ME_i <- function(mod) {
  p <- predict(mod, type = "response")
  cf <- coef(mod)
  cf <- cf[names(cf) != "(Intercept)"]
  (p * (1 - p)) %o% cf
}
The next few lines compute the marginal effects for the predictor variables in the two models:
Chile %$$% {
  ME_statusquo <- ME_i(glm.statusquo)
  ME_age <- ME_i(glm.age)
}
The following two plots show the distribution of the marginal effects of the two single-variable models:
library(cowplot) # for `plot_grid()` ...
plot_grid(
ggplot(Chile) + geom_histogram(aes(ME_statusquo)),
ggplot(Chile) + geom_histogram(aes(ME_age))
)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning messages:
1: Removed 946 rows containing non-finite outside the scale range (`stat_bin()`).
2: Removed 943 rows containing non-finite outside the scale range (`stat_bin()`).
The marginal effects for statusquo cluster just above zero, near their minimum. This is a consequence of most of the predicted values being close to
0 or 1, which in turn is a consequence of the strong influence of status quo
preference on voting. This explains why the AME of statusquo is so small relative to the
logistic regression coefficient, while the AME of age is not.
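This explanation can be checked with the percent() helper used earlier: the following optional line computes the share of predicted probabilities from the statusquo model that lie near the boundaries (the cut-offs 0.1 and 0.9 are of course arbitrary):
with(Chile,
     percent(pred.glm.statusquo < 0.1 | pred.glm.statusquo > 0.9))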
How the marginal effects vary with the values of the predictor variables is demonstrated in the plot created by the next lines of code:
g1 <- subset(Chile,is.finite(ME_statusquo)) |>
ggplot(aes(x=statusquo)) +
geom_line(aes(y=ME_statusquo)) +
geom_rug() + ylab("Marginal effect")
g2 <- subset(Chile,is.finite(ME_age)) |>
ggplot(aes(x=age)) +
geom_line(aes(y=ME_age)) +
geom_rug() + ylab("Marginal effect") + ylim(0,0.0053)
plot_grid(g1,g2)
The plot shows that the marginal effects of statusquo vary much more strongly with
the values of the predictor variable than the marginal effects of age.
Summary
While it provides only anecdotal evidence, the applied example points to some complications of logistic regression, marginal effects, and average marginal effects.
- If the influence of a predictor variable in a single-predictor model is relatively weak, then
  - the predicted values of a logistic regression model behave very similarly to those of a linear regression model. This may even go so far that the predicted values are nearly equal.
  - the marginal effects vary relatively little with the values of the predictor variable.
  - the average marginal effect is very close to the coefficient in a linear regression of the same variables.
- If the influence of a predictor variable in a single-predictor model is relatively strong, then
  - the predicted values of a logistic regression model and a linear regression model tend to be quite different. Most importantly, while the predictions from the logistic regression model always stay between 0 and 1, a substantial proportion of the predictions from linear regression are out of range, i.e. greater than 1 or less than 0.
  - the marginal effects vary strongly with the values of the predictor variable. Their values tend to be closer to 0 than to the potential maximum of the marginal effects.
  - the average marginal effect is quite different from, and usually much lower than, the linear regression coefficient.
Of course, these statements are not very precise: One may ask what a “very weak” or “very strong” influence of a predictor variable is. A further limitation is that the example discussed so far only concerns the strength of the influence of a predictor variable, it does not examine how the distribution of an independent variable may affect the AME. These issues will be addressed in later blog posts.
Another limitation of this empirical example is that it does not address the advantage of AMEs and linear regression coefficients over logistic regression coefficients asserted by Mood (2010): that they vary little or not at all between nested models when the added predictor variables are uncorrelated with those already included. The performance of logistic regression coefficients, AMEs, and linear regression coefficients in this respect will be examined in the next blog post in this series.