Logistic Regression

Logistic regression is part of a category of
statistical models called generalized linear models. This broad class of models includes
ordinary regression and ANOVA, as well as multivariate statistics such as ANCOVA and
loglinear regression. An excellent treatment of generalized linear models is presented in
Agresti (1996).

Logistic regression allows one to
predict a discrete outcome, such as group membership, from a set of variables that may be
continuous, discrete, dichotomous, or a mix of any of these. Generally, the dependent or
response variable is dichotomous, such as presence/absence or success/failure.
Discriminant analysis is also used to predict group membership when there are only two
groups. However, discriminant analysis can only be used with continuous independent
variables. Thus, in instances where the independent variables are categorical, or a mix of
continuous and categorical, logistic regression is preferred.

The Model:

The dependent variable in logistic regression is usually dichotomous; that is, the
dependent variable can take the value 1 with a probability of success q, or the value 0
with a probability of failure 1-q. This type of variable is called a Bernoulli (or binary)
variable. Although not as common and not discussed in this treatment, applications of
logistic regression have also been extended to cases where the dependent variable has
more than two categories, known as multinomial or polytomous [Tabachnick and Fidell (1996)
use the term polychotomous].

As mentioned previously, the independent or predictor variables in logistic regression can
take any form. That is, logistic regression makes no assumption about the distribution of
the independent variables. They do not have to be normally distributed, linearly related
or of equal variance within each group. The relationship between the predictor and
response variables is not a linear function in logistic regression; instead, the logistic
regression function is used, which gives q as:

$$q = \frac{e^{a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k}}{1 + e^{a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k}}$$

where a = the constant of the equation and b_1, ..., b_k = the coefficients of the
predictor variables x_1, ..., x_k.

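An alternative form of the logistic regression equation, written on the logit (log odds)
scale for predictor variables x_1, ..., x_k, is:

$$\text{logit}(q) = \log\left(\frac{q}{1 - q}\right) = a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$$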
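As a quick numerical check that the probability form and the logit form describe the same model, the following Python sketch computes q from a linear predictor and then confirms that the log odds of q recover that predictor. The values of a, b and x are made-up numbers chosen only for illustration, not estimates from any data set.

```python
import math

def logistic(eta):
    """Probability q for a linear predictor eta = a + b1*x1 + ... + bk*xk.

    1 / (1 + e^-eta) is algebraically the same as e^eta / (1 + e^eta).
    """
    return 1.0 / (1.0 + math.exp(-eta))

def logit(q):
    """Log odds of q; the inverse of logistic()."""
    return math.log(q / (1.0 - q))

# Hypothetical values, chosen only for illustration.
a, b, x = -1.5, 0.8, 2.0
eta = a + b * x        # linear predictor
q = logistic(eta)      # probability of success
print(f"q = {q:.4f}, logit(q) = {logit(q):.4f}, linear predictor = {eta:.4f}")
```

The logit scale is the one on which the model is linear in the predictors, which is why coefficients are usually reported and interpreted there.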

The goal of logistic regression is to
correctly predict the category of outcome for individual cases using the most parsimonious
model. To accomplish this goal, a model is created that includes all predictor variables
that are useful in predicting the response variable. Several different options are
available during model creation. Variables can be entered into the model in the order
specified by the researcher, or logistic regression can test the fit of the model after
each coefficient is added or deleted, a procedure called stepwise regression.

Stepwise regression is used in the exploratory phase of research, but it is not
recommended for theory testing (Menard 1995). Theory testing is the testing of a priori
theories or hypotheses about the relationships between variables. Exploratory testing
makes no a priori assumptions regarding the relationships between the variables; its goal
is to discover relationships.

Backward stepwise regression appears to be the preferred method for exploratory analyses.
The analysis begins with a full or saturated model, and variables are eliminated from the
model in an iterative process. The fit of the model is tested after the elimination of
each variable to ensure that the model still adequately fits the data. When no more
variables can be eliminated from the model, the analysis has been completed.
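The elimination loop can be sketched in Python as follows. This is only an illustration under stated assumptions: each variable is taken to contribute a single coefficient, the drop decision uses a likelihood-ratio chi-square with one degree of freedom at alpha = 0.05, and `fit_loglik` is a hypothetical callable that returns the maximized log-likelihood for a given set of variables. The variable names and the log-likelihood values in `LOGLIK` are invented for the demonstration; in practice they would come from fitting each model.

```python
import math

def drop_p_value(loglik_with, loglik_without):
    """Likelihood-ratio p-value for dropping one coefficient (chi-square, 1 df)."""
    g2 = max(2.0 * (loglik_with - loglik_without), 0.0)
    # For 1 df, the chi-square upper tail equals the two-sided normal tail.
    return math.erfc(math.sqrt(g2 / 2.0))

def backward_eliminate(variables, fit_loglik, alpha=0.05):
    """Start from the full model and drop the weakest variable until none can go."""
    current = list(variables)
    while current:
        ll_current = fit_loglik(frozenset(current))
        # p-value for removing each variable in turn from the current model
        p_values = {
            v: drop_p_value(ll_current, fit_loglik(frozenset(current) - {v}))
            for v in current
        }
        weakest = max(p_values, key=p_values.get)
        if p_values[weakest] <= alpha:
            break                  # every remaining variable is needed
        current.remove(weakest)    # fit is not significantly worse without it
    return current

# Hypothetical maximized log-likelihoods, invented for this example only.
LOGLIK = {
    frozenset(): -70.0,
    frozenset({"age"}): -62.0,
    frozenset({"smoker"}): -60.0,
    frozenset({"noise"}): -69.8,
    frozenset({"age", "smoker"}): -52.0,
    frozenset({"age", "noise"}): -61.9,
    frozenset({"smoker", "noise"}): -59.9,
    frozenset({"age", "smoker", "noise"}): -51.9,
}

kept = backward_eliminate(["age", "smoker", "noise"], LOGLIK.__getitem__)
print(kept)  # ['age', 'smoker']
```

With these invented numbers, removing "noise" changes the log-likelihood by only 0.1, so it is dropped first; removing either of the other two variables worsens the fit substantially, so the loop stops.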

There are two main uses of logistic regression. The first is the prediction of group
membership. Since logistic regression calculates the probability of success over the
probability of failure, the results of the analysis are expressed as odds, and the effects
of the predictors as odds ratios. For example, logistic regression is often used in
epidemiological studies where the result of the analysis is the probability of developing
cancer after controlling for other associated risks. The second is that logistic
regression provides knowledge of the relationships and strengths among the variables
(e.g., smoking 10 packs a day puts you at a higher risk for developing cancer than working
in an asbestos mine).
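To make the connection between coefficients and odds ratios concrete: on the log odds scale, raising a predictor by one unit multiplies the odds q/(1-q) by e raised to that predictor's coefficient. The short Python sketch below checks this numerically for a single predictor. The intercept and coefficient are hypothetical values, not results from any study.

```python
import math

def predicted_probability(a, b, x):
    """Probability of success for intercept a, coefficient b and predictor value x."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

def odds(q):
    """Probability of success over probability of failure."""
    return q / (1.0 - q)

# Hypothetical intercept and coefficient, for illustration only.
a, b = -2.0, 0.7
q_low = predicted_probability(a, b, 1.0)
q_high = predicted_probability(a, b, 2.0)   # predictor raised by one unit

odds_ratio = odds(q_high) / odds(q_low)
print(f"odds ratio = {odds_ratio:.4f}, exp(b) = {math.exp(b):.4f}")
```

The two printed numbers agree, which is why exponentiated coefficients are commonly reported as odds ratios.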

The process by which coefficients are
tested for significance for inclusion or elimination from the model involves several
different techniques. Each of these will be discussed below.

Wald Test:

A Wald test is used to test the statistical significance of each coefficient (b) in the
model. A Wald test calculates a z statistic, which is:

$$z = \frac{b}{SE(b)}$$

where SE(b) is the estimated standard error of the coefficient.

This z value is then squared,
yielding a Wald statistic with a chi-square distribution. However, several authors have
identified problems with the use of the Wald statistic. Menard (1995) warns that for large
coefficients, standard error is inflated, lowering the Wald statistic (chi-square) value.
Agresti (1996) states that the likelihood-ratio test is more reliable for small sample
sizes than the Wald test.
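The calculation can be sketched in a few lines of Python. The coefficient and standard error below are hypothetical inputs; the p-value assumes the squared z statistic is referred to a chi-square distribution with one degree of freedom, which is equivalent to a two-sided test of z against the standard normal.

```python
import math

def wald_test(b, se):
    """Return (z, Wald chi-square, p-value) for a coefficient and its standard error."""
    z = b / se
    wald = z * z
    # Chi-square upper tail with 1 df = two-sided standard normal tail.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, wald, p_value

# Hypothetical coefficient and standard error, for illustration only.
z, wald, p_value = wald_test(b=0.9, se=0.3)
print(f"z = {z:.2f}, Wald = {wald:.2f}, p = {p_value:.4f}")
```

Because the standard error sits in the denominator, an inflated standard error shrinks both z and the Wald statistic, which is the problem Menard (1995) describes for large coefficients.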

Likelihood-Ratio Test:

The likelihood-ratio test uses the ratio of the maximized value of the likelihood function
for the full model (L₁) over the maximized value of the likelihood function for the
simpler model (L₀). The likelihood-ratio test statistic equals:

$$2 \log\left(\frac{L_1}{L_0}\right) = -2\left[\log(L_0) - \log(L_1)\right]$$

This log
transformation of the likelihood functions yields a chi-squared statistic. This is the
recommended test statistic to use when building a model through backward stepwise
elimination.
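A minimal Python sketch of the computation follows. It assumes a 0/1 response, so each model's log-likelihood is the Bernoulli log-likelihood of its predicted probabilities, and it assumes the two models differ by a single coefficient, so the statistic is compared with a chi-square distribution on one degree of freedom. The outcomes and the probabilities in `q_full` are invented stand-ins; in a real analysis both sets of probabilities would come from maximum-likelihood fits.

```python
import math

def log_likelihood(y, q):
    """Bernoulli log-likelihood of 0/1 outcomes y given predicted probabilities q."""
    return sum(math.log(qi) if yi == 1 else math.log(1.0 - qi)
               for yi, qi in zip(y, q))

def likelihood_ratio_test(loglik_simple, loglik_full):
    """Return (statistic, p-value), assuming the models differ by ONE coefficient."""
    g2 = -2.0 * (loglik_simple - loglik_full)
    # Chi-square upper tail with 1 df, written with the complementary error function.
    p_value = math.erfc(math.sqrt(max(g2, 0.0) / 2.0))
    return g2, p_value

# Invented data for illustration only.
y = [0, 0, 0, 1, 0, 1, 1, 1]
q_simple = [0.5] * len(y)                           # intercept only: observed rate 4/8
q_full = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]   # stand-in for a fitted richer model

g2, p_value = likelihood_ratio_test(log_likelihood(y, q_simple),
                                    log_likelihood(y, q_full))
print(f"statistic = {g2:.3f}, p = {p_value:.4f}")
```

Working with log-likelihoods rather than the likelihoods themselves avoids numerical underflow, since a likelihood is a product of many probabilities.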

Hosmer-Lemeshow Goodness of Fit Test:

The Hosmer-Lemeshow statistic evaluates the goodness-of-fit by creating 10 ordered groups
of subjects and then comparing the number actually in each group (observed) to the number
predicted by the logistic regression model (predicted). Thus, the test statistic is a
chi-square statistic with a desirable outcome of non-significance, indicating that the
model prediction does not significantly differ from the observed.

The 10 ordered groups are created based on the subjects' estimated probability; those with
estimated probability below 0.1 form one group, and so on, up to those with probability
0.9 to 1.0. Each of these categories is further divided into two groups based on the
actual observed outcome variable (success, failure). The expected frequencies for each of
the cells are obtained from the model. If the model is good, then most of the subjects
with success are classified in the higher deciles of risk and those with failure in the
lower deciles of risk.
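The grouping and the statistic can be sketched in Python as below. The sketch uses the fixed probability cutpoints described above (0 to 0.1, 0.1 to 0.2, and so on) and sums (observed - expected)² / expected over the success and failure cells of each group. Two things are assumptions not stated in the text: the statistic is referred to a chi-square distribution with (number of groups - 2) degrees of freedom, and the helper for the p-value uses a closed form that is valid only for even degrees of freedom. The data are simulated purely for illustration.

```python
import math
import random

def hosmer_lemeshow(y, p, n_groups=10):
    """Return (statistic, df) using equal-width probability bins.

    y: observed 0/1 outcomes; p: model-estimated probabilities of success.
    """
    observed = [0.0] * n_groups   # observed successes per bin
    expected = [0.0] * n_groups   # expected successes per bin (sum of p)
    counts = [0] * n_groups       # subjects per bin
    for yi, pi in zip(y, p):
        g = min(int(pi * n_groups), n_groups - 1)   # p = 1.0 goes in the top bin
        observed[g] += yi
        expected[g] += pi
        counts[g] += 1
    statistic, used = 0.0, 0
    for o1, e1, n in zip(observed, expected, counts):
        if n == 0:
            continue                                 # skip empty bins
        used += 1
        for o, e in ((o1, e1), (n - o1, n - e1)):    # success cell, failure cell
            if e > 0:
                statistic += (o - e) ** 2 / e
    return statistic, used - 2

def chi2_upper_tail_even_df(x, df):
    """P(chi-square with df > x); closed form valid only for even, positive df."""
    if df <= 0 or df % 2:
        raise ValueError("closed form needs an even, positive df")
    half, term, total = x / 2.0, 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

# Simulated, purely illustrative data: 100 probabilities spread over (0, 1),
# with outcomes drawn from those same probabilities (a well-calibrated model).
rng = random.Random(0)
p = [(i + 0.5) / 100 for i in range(100)]
y = [1 if rng.random() < pi else 0 for pi in p]

statistic, df = hosmer_lemeshow(y, p)
p_value = chi2_upper_tail_even_df(statistic, df)
print(f"HL statistic = {statistic:.3f}, df = {df}, p = {p_value:.3f}")
```

A large p-value here is the desirable outcome, since it indicates no significant difference between observed and predicted counts. Many software packages instead form the groups from deciles of the sorted predicted probabilities, so that each group holds roughly the same number of subjects.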

References:

Agresti, Alan. 1996. An Introduction
to Categorical Data Analysis. John Wiley and Sons, Inc.

Hosmer, David and Stanley Lemeshow. 1989. Applied Logistic Regression. John Wiley and
Sons, Inc.

Menard, Scott. 1995. Applied Logistic Regression Analysis. Sage Publications. Series:
Quantitative Applications in the Social Sciences, No. 106.

Tabachnick, Barbara and Linda Fidell. 1996. Using Multivariate Statistics, Third edition.
Harper Collins.

Applications in Ecological
Literature:

Trexler, J.C., and J. Travis. 1993. Nontraditional Regression Analyses. Ecology
74:1629-1637.

Connor, E.F., Adams-Manson, R.H., Carr, T.G., and M.W. Beck. 1994. The effects of host
plant phenology on the demography and population dynamics of the leaf-mining moth,
Cameraria hamadryadella (Lepidoptera: Gracillariidae). Ecological Entomology 19:111-120.

Useful Websites:

Alan Agresti’s website with all
the data from the worked examples in his book:

http://lib.stat.cmu.edu/datasets/agresti

Good notes on logistic regression and
interpreting the SPSS output:

http://www2.chass.ncsu.edu/garson/pa765/logistic.htm
