Almost Surely Degenerate: Clifford Anderson-Bergman's Statistics Page

Exploring the undefined bias of logistic regression

12/28/2016

Recently, on Cross Validated, I used the example of logistic regression coefficients to demonstrate biased maximum likelihood estimates. In fact, the bias of these estimators is undefined: under the logistic regression model, there is a strictly positive (although extremely small) probability of perfect separation of the data by a hyperplane in the covariate space, leading to infinite estimates of the regression parameters. For illustration, here's an example with a single covariate.
[Figure: a simulated example of perfect separation with a single covariate]
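Here's a minimal sketch of the phenomenon (a reconstruction, not the code from the post), with data contrived so that y = 1 exactly when x > 0:

```r
# Perfectly separated toy data: y = 1 exactly when x > 0.
x <- c(-2, -1, 1, 2)
y <- c(0, 0, 1, 1)

fit <- glm(y ~ x, family = binomial)
# glm() warns that the algorithm did not converge and that fitted
# probabilities numerically 0 or 1 occurred; the slope estimate is
# enormous and would diverge given unlimited iterations.
coef(fit)
```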
Because the regression coefficients have a positive probability of being either positive or negative infinity, the expected value for the regression parameters is undefined for any finite sample size!

Does that mean logistic regression is broken? Of course not. Maximum likelihood theory tells us that the estimates converge in distribution, not that the mean converges. Side note: I think this would make a great example for a first-year graduate course in probability.

A commenter asked whether logistic regression is biased if we ignore the case of perfect separation. I decided to examine this using simulation. Another question I had was whether the estimated probabilities were biased. Using R Markdown, I walked through the exploration. Here goes!

We start with some utility functions.
[Code: utility functions for simulating data and fitting logistic regressions]
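The utility functions were posted as an image; a plausible reconstruction is sketched below. The function names and true parameter values (intercept 0, slope 1, standard normal covariate) are assumptions, not necessarily what the original code used.

```r
# Simulate a logistic regression data set. The true parameters
# (beta0 = 0, beta1 = 1) and standard normal covariate are assumed
# for illustration.
simData <- function(n, beta0 = 0, beta1 = 1) {
  x <- rnorm(n)
  p <- 1 / (1 + exp(-(beta0 + beta1 * x)))
  data.frame(x = x, y = rbinom(n, size = 1, prob = p))
}

# Fit a logistic regression and extract what we need.
fitLogistic <- function(dat) {
  fit <- glm(y ~ x, family = binomial, data = dat)
  list(coef = coef(fit), probs = fitted(fit))
}
```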
Question 1: Ignoring perfect separation, is there bias in the estimated logistic regression parameters?

To examine the small sample bias, we will simulate a large number of data sets, fit a logistic regression to each, and extract the coefficients and estimated probabilities.

The probability of getting perfect separation in these simulations is so small that even in this large simulation (10,000 fitted models), we don't observe a single perfectly separated data set. So this essentially ignores the issue of perfect separation.
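The simulation loop looks something like the sketch below, reusing the utility functions above. The per-data-set sample size (n = 50) is an assumption; the original value wasn't recoverable from the post.

```r
set.seed(1)
nSims <- 10000   # number of simulated data sets
n     <- 50      # observations per data set (assumed)

coefs <- matrix(NA, nrow = nSims, ncol = 2)
for (i in seq_len(nSims)) {
  coefs[i, ] <- fitLogistic(simData(n))$coef
}

# Histogram of the slope estimates; the true value is 1.
hist(coefs[, 2], breaks = 50, xlab = expression(hat(beta)[1]),
     main = "Sampling distribution of the slope estimate")
abline(v = 1, col = "red", lwd = 2)
```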
[Figure: histogram of the 10,000 simulated slope estimates]
It may be a little hard to see from the plot, but the estimator is right-skewed! It's also worth noting that perfect separation was never observed. We can examine this a little more formally by looking at the mean, median, and corresponding standard errors.
[Table: mean and median of the simulated slope estimates, with standard errors]
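A sketch of the summary computation; bootstrapping the standard error of the median is my choice here, not necessarily the original approach.

```r
slopes <- coefs[, 2]

# Standard error of the mean is analytic; for the median we can
# bootstrap (an assumed approach).
bootMedians <- replicate(1000, median(sample(slopes, replace = TRUE)))
c(mean     = mean(slopes),
  meanSE   = sd(slopes) / sqrt(length(slopes)),
  median   = median(slopes),
  medianSE = sd(bootMedians))
```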
We see a distinct upward bias in both the mean and median. What about the fitted probabilities?

Question 2: Are the fitted probabilities biased? 
[Code and figure: average estimated probabilities versus true probabilities across the covariate range]
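Here's a sketch of how the comparison can be made, reusing the coefficients from the simulation above; the grid of covariate values is an assumption.

```r
# Evaluate each fitted curve on a grid and compare the average
# estimated probability to the truth (beta0 = 0, beta1 = 1).
xGrid    <- seq(-3, 3, length.out = 25)
trueProb <- 1 / (1 + exp(-xGrid))

estProb <- sapply(seq_len(nSims), function(i)
  1 / (1 + exp(-(coefs[i, 1] + coefs[i, 2] * xGrid))))

plot(xGrid, rowMeans(estProb) - trueProb, type = "l",
     xlab = "x", ylab = "mean estimated prob. - true prob.")
abline(h = 0, lty = 2)
```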
There appears to be no conclusive bias in the estimated probabilities! If anything, there is a bias toward the middle in the tails. This is kind of interesting: remember that there was an upward bias in the regression parameters. If your regression parameter is overestimated, your estimated probabilities will be further from the middle. Yet despite the upward bias in the regression parameter, we see mild evidence of bias toward the middle in the tails! Why? Jensen's inequality! Because the logistic curve flattens out near 0 and 1, overestimating the regression parameters pushes the tail probabilities toward the boundaries, but not as much as underestimating the regression parameters pulls them back toward the middle.
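A tiny numeric illustration of that asymmetry (numbers chosen purely for illustration): at x = 2 with true slope 1, overshooting the slope by 0.5 moves the probability up much less than undershooting by 0.5 moves it down.

```r
expit <- function(z) 1 / (1 + exp(-z))
expit(2)                    # true probability: ~0.881
expit(2 * 1.5) - expit(2)   # slope overestimated by 0.5:  ~ +0.072
expit(2 * 0.5) - expit(2)   # slope underestimated by 0.5: ~ -0.150
```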