What’s wrong with logistic regression? Part I: The basics

In one of the most cited articles in the social sciences, Carina Mood (2010) points out a couple of problems in the interpretation of results from logistic regression. These problems have led many authors to move away from interpreting logistic regression coefficients and to base their conclusions on average marginal effects (AMEs) instead. Some authors even abandon logistic regression as a method to analyse binary dependent variables altogether and use linear probability models instead (e.g. Maneuvrier-Hervieu et al. 2025). In this and the following blog posts I’ll take a look at Mood’s claims and ask how serious the problems uncovered by Mood are and how helpful her suggested solutions are.

The problems uncovered by Mood are:

  1. Coefficients of nested logistic regression models cannot be compared, because coefficients will change when predictor variables are added to a model, even if these predictor variables are uncorrelated with the predictors already included. This problem does not occur with linear regression.
  2. Coefficients of logistic regression models fitted to samples from different populations cannot be compared, because populations may have different error variances in the dependent variable.

For these reasons, Mood explicitly recommends the use of average marginal effects (AMEs) for drawing conclusions about the influence of predictor variables on binary dependent variables. She also remarks that if the true logistic regression coefficients are small enough that the non-linearity in the link between the probability of a positive outcome and the predictor variables is (almost) inconsequential, then the AMEs are very close to the coefficients of a linear probability model fitted by OLS. So if one is interested in AMEs and the influence of the predictor variables is not “too strong”, one can just avoid the computational and interpretative complications of logistic regression and use linear regression instead. However, she does not explicitly recommend the use of linear regression for binary dependent variables.

Logistic regression

Logistic regression is a statistical technique to describe the influence of one or more predictor variables on a binary response variable. Mathematically, the binary response variable can be represented by a sequence of random variables $Y_i$ that take the value one ($Y_i=1$) with some probability $\Pr(Y_i=1)=\pi_i$, where $0<\pi_i<1$, and the value zero with probability $\Pr(Y_i=0)=1-\pi_i$. In a logistic regression model, the relation between the binary response and a set of $m$ predictor variables is expressed in the following form:

$$\ln\frac{\Pr(Y_i=1)}{\Pr(Y_i=0)} = \ln\frac{\pi_i}{1-\pi_i} = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_m x_{mi}$$

or alternatively

$$\Pr(Y_i=1) = \pi_i = \frac{\exp(\beta_0 + \beta_1 x_{1i} + \cdots + \beta_m x_{mi})}{1 + \exp(\beta_0 + \beta_1 x_{1i} + \cdots + \beta_m x_{mi})}.$$

Here, $x_{1i}$ is the value of the first predictor variable for observation (or individual or case) $i$, $x_{2i}$ is the value of the second predictor variable, and so on.

The logarithmic ratio $\ln\frac{\pi_i}{1-\pi_i}$ is also known as the log-odds or logit, while the function $\frac{\exp(x)}{1+\exp(x)}$ is also known as the logistic function.
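In R, the logit and the logistic function are available as qlogis() and plogis(), the quantile and distribution functions of the standard logistic distribution. The following short sketch (my addition, not part of the original exposition) illustrates the correspondence:

# The logit (log-odds) and its inverse, the logistic function;
# they coincide with qlogis() and plogis() of the standard
# logistic distribution.
logit    <- function(p) log(p/(1-p))
logistic <- function(x) exp(x)/(1+exp(x))
logit(0.75)            # 1.0986..., the same as qlogis(0.75)
logistic(logit(0.75))  # recovers 0.75, the same as plogis(qlogis(0.75))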

Logistic regression also has a latent-variable interpretation. This interpretation starts from a latent variable $Y^*_i$ with

$$Y^*_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_m x_{mi} + \epsilon_i.$$

The binary response variable $Y_i$ is created by the following dichotomization:

$$Y_i = \begin{cases} 1 & \text{if } Y_i^* > 0\\ 0 & \text{if } Y_i^* \leq 0 \end{cases}$$

If $\epsilon_i$ has a logistic distribution, i.e. a distribution with cumulative distribution function

$$\Pr(\epsilon \leq t) = \frac{\exp(t)}{1+\exp(t)}$$

then the logistic regression model applies to $\Pr(Y_i=1) = \Pr(Y_i^* > 0)$.
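A small simulation may help to see how the latent-variable formulation and the logistic regression model fit together (this is my own illustrative sketch; the true coefficient values are made up):

# Simulate a latent variable with logistic errors, dichotomize it,
# and check that logistic regression approximately recovers the
# coefficients of the latent-variable equation.
set.seed(42)
n <- 10000
x <- rnorm(n)
ystar <- -0.5 + 1.5*x + rlogis(n)  # true beta0 = -0.5, beta1 = 1.5
y <- as.integer(ystar > 0)
coef(glm(y ~ x, family = binomial))  # estimates close to -0.5 and 1.5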

Marginal and average marginal effects

In a (linear or non-linear) regression model, the marginal effect of a predictor variable is defined as the partial derivative of the expected value of the response with respect to the value of the predictor variable.

In a linear regression model without interaction terms, the marginal effect of a variable equals its coefficient. For a numeric dependent variable $Y_i$, a linear regression model can be written as

$$\textrm{E}(Y_i) = \mu_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_m x_{mi}$$

(i.e. instead of the “raw” response variable we have its expected value on the left-hand side and drop the error term on the right-hand side). The marginal effect of the $h$-th predictor variable with values $x_{h1},\ldots,x_{hn}$ is then defined as:

$$\frac{\partial\mu_i}{\partial x_{hi}} = \beta_h,$$

that is, the marginal effect is identical for each observation and each potential value of the $h$-th independent variable.

In a logistic regression model, the marginal effect of an independent variable is likewise defined as the partial derivative of the expected value of the response variable, which equals $\pi_i$. Writing $\eta_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_m x_{mi}$ for the linear predictor, the chain rule gives:

$$\frac{\partial\pi_i}{\partial x_{hi}} = \frac{\partial\pi_i}{\partial\eta_i}\frac{\partial\eta_i}{\partial x_{hi}} = \pi_i(1-\pi_i)\beta_h.$$

In contrast to linear regression, the marginal effect is not constant, but varies with the values of the independent variables via $\pi_i$, which is a function of the independent variables. It should be noted that the marginal effect of the $h$-th independent variable varies not only with its own values $x_{hi}$ but also with the values of the other independent variables, a fact that has led to quite some confusion.[1] It is noteworthy that $\pi_i(1-\pi_i)$ has a maximum at $\pi_i=\frac12$, so that the maximum attainable value of the marginal effect is $\beta_h/4$.
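The following toy computation (my addition, with an arbitrary coefficient value) shows how the marginal effect $\pi_i(1-\pi_i)\beta_h$ shrinks as $\pi_i$ moves away from $\frac12$:

# Marginal effect pi*(1-pi)*beta for an (arbitrary) coefficient beta = 2,
# evaluated at several values of pi; the maximum beta/4 = 0.5 is
# attained at pi = 0.5.
beta <- 2
p <- c(0.05, 0.25, 0.5, 0.75, 0.95)
p*(1 - p)*beta
# 0.095 0.375 0.500 0.375 0.095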

The average marginal effect (AME) of the $h$-th independent variable in a logistic regression model can be defined as the (sample) average of the marginal effects:

$$\textrm{AME}_h = \frac1n\sum_{i=1}^n \frac{\partial\pi_i}{\partial x_{hi}} = \frac1n\sum_{i=1}^n \pi_i(1-\pi_i)\beta_h$$

If the values of the predictor variables can be interpreted as independently and identically distributed random variables with joint density function $f(x_1,\ldots,x_m)$, then one can interpret an AME as the estimate of a population value

$$\tau_h = \int \frac{\partial\pi}{\partial x_h}\, f(x_1,\ldots,x_m)\, dx_1\cdots dx_m.$$

In the special case of a logistic regression model with a single independent variable that in turn has a uniform distribution

$$f(x) = \begin{cases} \frac1{b-a} & \text{if } a \leq x \leq b\\ 0 & \text{if } x < a \text{ or } x > b \end{cases}$$

this “population” AME equals a difference ratio:

$$\tau_h = \int_{-\infty}^{\infty} \frac{\partial\pi}{\partial x}\, f(x)\, dx = \frac1{b-a}\int_a^b \frac{\partial\pi}{\partial x}\, dx = \frac{\pi(b)-\pi(a)}{b-a}$$

While this is a very narrow special case, it may give some intuition of the meaning of average marginal effects, as well as some of their properties that emerge in the discussion below.
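A quick numerical check of this identity (again a sketch of mine, with made-up values for $a$, $b$ and the coefficients):

# Single predictor x ~ Uniform(a, b) and a logistic model with
# beta0 = -1, beta1 = 2: the population AME equals the difference
# ratio (pi(b) - pi(a))/(b - a).
a <- -1; b <- 2
beta0 <- -1; beta1 <- 2
set.seed(1)
x <- runif(1e6, a, b)
p <- plogis(beta0 + beta1*x)
mean(p*(1 - p)*beta1)                                        # Monte Carlo average, approx. 0.30
(plogis(beta0 + beta1*b) - plogis(beta0 + beta1*a))/(b - a)  # difference ratio, approx. 0.30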


  1. This confusion culminates in the claim that product terms are neither necessary nor sufficient for interaction effects in logistic regression (see e.g. Berry et al. 2010) – if interaction effects are defined as systematic changes in marginal effects. ↩︎

Linear regression and the linear probability model

Linear regression involves the estimation of the coefficients of a linear model of the relation between a response variable $Y_i$ and one or several predictor variables $x_{1i},\ldots,x_{mi}$:

$$\textrm{E}(Y_i) = \mu_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_m x_{mi}$$

If $Y_i$ is dichotomous with values $0$ and $1$, then $\textrm{E}(Y_i)$ equals $\Pr(Y_i=1)$, at least as long as the value of the expected value is between $0$ and $1$. This leads to the linear probability model:

$$\Pr(Y_i=1) = \pi_i = g(\beta_0 + \beta_1 x_{1i} + \cdots + \beta_m x_{mi})$$

with

$$g(x) = \begin{cases} x & \text{if } 0 < x < 1\\ 0 & \text{if } x \leq 0\\ 1 & \text{if } x \geq 1. \end{cases}$$

As long as $\mu_i$ is between $0$ and $1$, the partial derivatives with respect to any of the independent variables equal their respective coefficients, but whenever $\mu_i$ is either less than $0$ or greater than $1$, the partial derivative is zero.
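The clamping function $g$ is easily written down in R; the following small sketch (the function name is mine, for illustration only) makes the flat segments outside the unit interval explicit:

# g is the identity between 0 and 1 and constant (hence with zero
# derivative) outside that interval.
g <- function(x) pmin(pmax(x, 0), 1)
g(c(-0.3, 0.2, 0.7, 1.4))  # 0.0 0.2 0.7 1.0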

An example application

The following example uses a data set from the R package “carData”, which comes from a survey conducted on the occasion of the 1988 Chilean plebiscite. The data contains respondents’ stated vote intentions in a factor variable vote with values Y for “yes, will vote for Pinochet”, N for “no, will not vote for Pinochet”, A for “will abstain”, and U for “undecided”. It also contains a variable statusquo, which is based on a Likert scale for respondents’ status quo preference and has a standard deviation approximately equal to 1.

The following lines activate two packages, “carData” and “memisc”.

library(carData)
library(memisc)
Loading required package: lattice
Loading required package: MASS

Attaching package: ‘memisc’

The following objects are masked from ‘package:stats’:

    contr.sum, contr.treatment, contrasts

The following object is masked from ‘package:base’:

    as.array

In the next few lines, we create a binary variant of the vote variable within the data frame Chile.

# %$$% is a shorthand for `within()` defined in the "memisc" package.
Chile %$$%  {
    vote2 <- factor(vote, levels = c("N","Y"))
    vote.bin <- as.integer(vote2 == "Y")
}

Since we want to enhance the data set with predictions from the models, we need to make sure that prediction vectors are padded with NA for observations dropped due to missing values. This is made the default behaviour of lm() and glm() by the appropriate option setting:

options(na.action=na.exclude)

The following lines define a function that computes the average marginal effect for each predictor variable in a logistic regression model.

# Note that this function is non-generic and is correct only for 
# logistic regression or logit models
AME <- function(mod) {
    p <- predict(mod,type="response")
    cf <- coef(mod)
    cf <- cf[names(cf)!="(Intercept)"]
    unname(mean(p*(1-p),na.rm=TRUE))*cf
}

A first model: Status quo preference as a “strong” predictor of the vote

The first model relates respondents’ votes in the plebiscite to their preference for the status quo. We get a pretty large coefficient for this predictor variable.

glm.statusquo <- glm(vote.bin~statusquo,
                     data=Chile,
                     family=binomial)
summary(glm.statusquo)

Call:
glm(formula = vote.bin ~ statusquo, family = binomial, data = Chile)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.21531    0.09964   2.161   0.0307 *  
statusquo    3.20554    0.14310  22.401   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2431.28  on 1753  degrees of freedom
Residual deviance:  752.59  on 1752  degrees of freedom
  (946 observations deleted due to missingness)
AIC: 756.59

Number of Fisher Scoring iterations: 6

Let’s compute the average marginal effect of statusquo:

AME(glm.statusquo)
statusquo 
0.1946738 

For comparison, we also fit a linear model:

lm.statusquo <- lm(vote.bin~statusquo,
                   data=Chile)
summary(lm.statusquo)

Call:
lm(formula = vote.bin ~ statusquo, data = Chile)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.08835 -0.10223 -0.02417  0.04055  1.02152 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.492167   0.006203   79.35   <2e-16 ***
statusquo   0.394079   0.005721   68.89   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2598 on 1752 degrees of freedom
  (946 observations deleted due to missingness)
Multiple R-squared:  0.7303,	Adjusted R-squared:  0.7302 
F-statistic:  4745 on 1 and 1752 DF,  p-value: < 2.2e-16

To illustrate how statusquo is related to vote.bin according to logistic regression and linear regression, we compute predicted values:

Chile <- within(Chile,{
    pred.glm.statusquo <- predict(glm.statusquo,
                                  type="response")
    pred.lm.statusquo <- predict(lm.statusquo,
                                  type="response")
})

Finally, we use graphics to get an idea of the “fit” of the two models, logistic regression and linear regression, with the data. For this we use the popular “ggplot2” package.

library(ggplot2)
theme_set(theme_bw())

Attaching package: ‘ggplot2’

The following object is masked from ‘package:memisc’:

    syms

The following plot compares the “fit” of logistic regression, linear regression, and a model-free scatterplot smoother. The black curve connects the predicted values from the logistic regression model. Due to its construction, it stays strictly between 0 and 1. The red line connects the predicted values from the linear regression. Obviously there are many “out-of-range” predictions with value greater than 1 or less than 0 created by the linear model. The blue dotted curve connects the predictions from the model-free scatterplot smoother. This curve also stays mostly between 0 and 1 and is similar in shape to the curve that represents the logistic regression predictions.

subset(Chile,is.finite(pred.glm.statusquo)) |> 
 ggplot(aes(x=statusquo)) +
 geom_point(aes(y=vote.bin),shape=3) +
 geom_smooth(aes(y=vote.bin),se=FALSE,linetype="dotted") +
 geom_line(aes(y=pred.glm.statusquo),linetype="solid") +
 geom_line(aes(y=pred.lm.statusquo),linetype="solid",color="red")
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

The proportion of “out-of-range” predictions created by linear regression is quite large: about one third of the predicted values are greater than 1 or less than 0.

with(Chile,
   percent(pred.lm.statusquo > 1 | pred.lm.statusquo < 0))
Percentage          N 
  33.75143 1754.00000 

A second model: Age is a “weak” predictor with almost linear impact on the vote

The second model relates respondents’ votes in the plebiscite to their age. This gives a relatively small coefficient for the predictor variable. However, this is not very surprising, because the unit of measurement for age is years. It would be surprising if an age difference of a single year had much of an impact.

glm.age <- glm(vote.bin ~ age, data = Chile, family=binomial)
summary(glm.age)

Call:
glm(formula = vote.bin ~ age, family = binomial, data = Chile)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.81418    0.13312  -6.116 9.58e-10 ***
age          0.02078    0.00327   6.356 2.07e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2435.5  on 1756  degrees of freedom
Residual deviance: 2394.1  on 1755  degrees of freedom
  (943 observations deleted due to missingness)
AIC: 2398.1

Number of Fisher Scoring iterations: 4

The value of the AME is even smaller; it is about one-fourth of the logistic regression coefficient. But this is quite close to the possible maximum of a marginal effect, which is one-quarter of the coefficient. This is quite different from the situation with statusquo as predictor variable, where the AME is only about 0.06 times the logistic regression coefficient.

AME(glm.age)
       age 
0.00507336 
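A quick check of the ratios just mentioned, using only the models fitted above:

# Ratio of the AME to the logistic regression coefficient
AME(glm.statusquo)/coef(glm.statusquo)["statusquo"]  # approx. 0.06
AME(glm.age)/coef(glm.age)["age"]                    # approx. 0.24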

In a linear regression of vote.bin on age, the age variable gets a coefficient that is very close to its AME obtained from the logistic regression. However, the $R^2$ does not indicate a particularly good fit: its value indicates that just about 2 percent of the variance of the response can be explained, in stark contrast to the model with statusquo as predictor variable, where $R^2$ indicates that more than 70 percent of the variance can be explained.

lm.age <- lm(vote.bin ~ age, data = Chile)
summary(lm.age)

Call:
lm(formula = vote.bin ~ age, data = Chile)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.6574 -0.4682 -0.3915  0.5012  0.6085 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.2994234  0.0322561   9.283  < 2e-16 ***
age         0.0051133  0.0007889   6.482 1.17e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4944 on 1755 degrees of freedom
  (943 observations deleted due to missingness)
Multiple R-squared:  0.02338,	Adjusted R-squared:  0.02282 
F-statistic: 42.01 on 1 and 1755 DF,  p-value: 1.175e-10

Again, we compute predicted values to illustrate the relation between age and vote.bin according to logistic and linear regression.

Chile <- within(Chile,{
    pred.glm.age <- predict(glm.age,
                                  type="response")
    pred.lm.age <- predict(lm.age)
})

The following code would, in principle, create a plot with lines for the predictions from logistic regression and linear regression. But because the predictions coincide almost everywhere, only the curve for the linear regression (which is drawn last, in red) is visible in the resulting plot.

subset(Chile,is.finite(pred.glm.age)) |> 
 ggplot(aes(x=age)) +
 geom_point(aes(y=vote.bin),shape=3) +
 geom_line(aes(y=pred.glm.age),linetype="solid") +
 geom_line(aes(y=pred.lm.age),linetype="solid",color="red")
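As a quick numerical check (an addition of mine) of how closely the two sets of predictions agree, one can look at their maximum absolute difference:

# Maximum absolute difference between logistic and linear predictions
# based on age; a small value confirms that the curves nearly coincide.
with(Chile, max(abs(pred.glm.age - pred.lm.age), na.rm = TRUE))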

Comparing the marginal effects obtained from the two models

To get an understanding of why the AME is so much smaller relative to the logistic regression coefficient in the model with statusquo than in the model with age as predictor variable, a closer look at the distribution of the marginal effects may be of interest.

The following lines of code define a function that computes the marginal effects of the predictor variable for each individual observation:

# Note that this function is non-generic and is correct only for 
# logistic regression or logit models
ME_i <- function(mod) {
    p <- predict(mod,type="response")
    cf <- coef(mod)
    cf <- cf[names(cf)!="(Intercept)"]
    (p*(1-p)) %o% cf
}

The next few lines compute the marginal effects for the predictor variables in the two models:

Chile %$$% {
    ME_statusquo <- ME_i(glm.statusquo)
    ME_age <- ME_i(glm.age)
}

The following two plots show the distribution of the marginal effects of the two single-variable models:

#!opt: jupyter.plot.width=9, jupyter.plot.height=5
library(cowplot) # for `plot_grid()` ...
plot_grid(
 ggplot(Chile) + geom_histogram(aes(ME_statusquo)),
 ggplot(Chile) + geom_histogram(aes(ME_age))   
)

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning messages:
1: Removed 946 rows containing non-finite outside the scale range (`stat_bin()`). 
2: Removed 943 rows containing non-finite outside the scale range (`stat_bin()`). 

The marginal effects for statusquo cluster just above zero. This is a consequence of most of the predicted values being close to 0 or 1, which in turn is a consequence of the strong influence of status quo preference on voting. This explains why the AME of statusquo is so small relative to the logistic regression coefficient, while the AME of age is not.
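A quick check of this claim (my addition; the cut-offs 0.1 and 0.9 are arbitrary) shows how many predicted probabilities lie near the extremes:

# Share of predicted probabilities from the statusquo model that are
# below 0.1 or above 0.9 (cut-offs chosen arbitrarily for illustration)
with(Chile,
   percent(pred.glm.statusquo < 0.1 | pred.glm.statusquo > 0.9))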

How the marginal effects vary with the values of the predictor variables is demonstrated in the plot created by the next lines of code:

#!opt: jupyter.plot.width=9, jupyter.plot.height=5
g1 <- subset(Chile,is.finite(ME_statusquo)) |>
    ggplot(aes(x=statusquo)) + 
    geom_line(aes(y=ME_statusquo)) +
    geom_rug() + ylab("Marginal effect")
g2 <- subset(Chile,is.finite(ME_age)) |>
    ggplot(aes(x=age)) + 
    geom_line(aes(y=ME_age)) +
    geom_rug() + ylab("Marginal effect") + ylim(0,0.0053)
plot_grid(g1,g2)

The plot shows that the marginal effects of statusquo vary much more strongly with the values of the predictor variable than the marginal effects of age.

Summary

While giving only anecdotal evidence, the applied example points to some complications of logistic regression, marginal effects, and average marginal effects: with a “very strong” predictor such as statusquo, the AME is much smaller than the logistic regression coefficient and also differs noticeably from the linear regression coefficient, whereas with a “very weak” predictor such as age, the AME and the linear regression coefficient almost coincide.

Of course, these statements are not very precise: One may ask what a “very weak” or “very strong” influence of a predictor variable is. A further limitation is that the example discussed so far only concerns the strength of the influence of a predictor variable; it does not examine how the distribution of an independent variable may affect the AME. These issues will be addressed in later blog posts.

Another limitation of this empirical example is that it does not address the advantage of AMEs and linear regression coefficients over logistic regression coefficients asserted by Mood (2010): that they vary little or not at all between nested models when the predictor variables are uncorrelated. The performance of logistic regression coefficients, AMEs, and linear regression coefficients in this respect will be examined in the next blog post in this series.