Logistic regression is part of a category of statistical models called generalized linear models. This broad class of models includes ordinary regression and ANOVA, as well as multivariate statistics such as ANCOVA and loglinear regression. An excellent treatment of generalized linear models is presented in Agresti (1996).

Logistic regression allows one to predict a discrete outcome, such as group membership, from a set of variables that may be continuous, discrete, dichotomous, or a mix of any of these. Generally, the dependent or response variable is dichotomous, such as presence/absence or success/failure. Discriminant analysis is also used to predict group membership with only two groups. However, discriminant analysis can only be used with continuous independent variables. Thus, in instances where the independent variables are categorical, or a mix of continuous and categorical, logistic regression is preferred.

**The Model:**

The dependent variable in logistic regression is usually dichotomous; that is, the dependent variable can take the value 1 with a probability of success q, or the value 0 with a probability of failure 1-q. This type of variable is called a Bernoulli (or binary) variable. Although not as common and not discussed in this treatment, applications of logistic regression have also been extended to cases where the dependent variable has more than two categories, known as multinomial or polytomous [Tabachnick and Fidell (1996) use the term polychotomous].

As mentioned previously, the independent or predictor variables in logistic regression can take any form. That is, logistic regression makes no assumption about the distribution of the independent variables. They do not have to be normally distributed, linearly related, or of equal variance within each group. The relationship between the predictor and response variables is not a linear function in logistic regression. Instead, the logistic regression function is used, which is the logit transformation of q:

$$\operatorname{logit}(q) = \ln\left(\frac{q}{1-q}\right) = a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$$

where a = the constant of the equation and b_1, ..., b_k = the coefficients of the predictor variables x_1, ..., x_k.
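As a minimal sketch, the logit transformation and its inverse (which converts the linear predictor back into a probability) can be written in a few lines of Python. The function names and the coefficient values below are assumptions chosen for illustration; they do not come from any fitted model.

```python
import math


def logit(q):
    """Log-odds of a probability q, with 0 < q < 1."""
    return math.log(q / (1.0 - q))


def inverse_logit(eta):
    """Probability corresponding to a linear predictor eta."""
    return 1.0 / (1.0 + math.exp(-eta))


def predicted_probability(a, b, x):
    """q for one case: a is the constant, b the coefficients, x the predictor values."""
    eta = a + sum(bj * xj for bj, xj in zip(b, x))
    return inverse_logit(eta)


# Hypothetical constant and coefficients, for illustration only.
q = predicted_probability(a=-1.0, b=[0.5, 2.0], x=[2.0, 0.0])
print(q)         # 0.5, because a + b.x = -1 + 1 + 0 = 0
print(logit(q))  # 0.0
```

Because the logit is unbounded while q stays between 0 and 1, any value of the linear predictor maps to a valid probability.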

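An alternative form of the logistic regression equation, obtained by solving the logit for q, is:

$$q = \frac{e^{a + b_1 x_1 + \cdots + b_k x_k}}{1 + e^{a + b_1 x_1 + \cdots + b_k x_k}}$$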

The goal of logistic regression is to correctly predict the category of outcome for individual cases using the most parsimonious model. To accomplish this goal, a model is created that includes all predictor variables that are useful in predicting the response variable. Several different options are available during model creation. Variables can be entered into the model in the order specified by the researcher, or the procedure can test the fit of the model after each coefficient is added or deleted, an approach called stepwise regression.

Stepwise regression is used in the exploratory phase of research, but it is not recommended for theory testing (Menard 1995). Theory testing is the testing of a priori theories or hypotheses about the relationships between variables. Exploratory testing makes no a priori assumptions regarding the relationships between the variables; the goal is to discover relationships.

Backward stepwise regression appears to be the preferred method for exploratory analyses. The analysis begins with a full or saturated model, and variables are eliminated from the model in an iterative process. The fit of the model is tested after the elimination of each variable to ensure that the model still adequately fits the data. When no more variables can be eliminated from the model, the analysis is complete.

There are two main uses of logistic regression. The first is the prediction of group membership. Since logistic regression calculates the probability of success over the probability of failure, the results of the analysis are in the form of an odds ratio. For example, logistic regression is often used in epidemiological studies where the result of the analysis is the probability of developing cancer after controlling for other associated risks. The second is that logistic regression provides knowledge of the relationships and strengths among the variables (e.g., smoking 10 packs a day puts you at a higher risk for developing cancer than working in an asbestos mine).
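The link between a coefficient and an odds ratio can be illustrated with a short Python sketch. It assumes a single predictor with a hypothetical constant and coefficient (the numbers are invented for illustration), and it checks that the ratio of the odds for two cases one unit apart equals the exponentiated coefficient.

```python
import math


def inverse_logit(eta):
    """Probability corresponding to a linear predictor eta."""
    return 1.0 / (1.0 + math.exp(-eta))


def odds(q):
    """Probability of success over probability of failure."""
    return q / (1.0 - q)


def odds_ratio(b, delta=1.0):
    """Multiplicative change in the odds for a delta-unit increase in a predictor."""
    return math.exp(b * delta)


# Hypothetical constant and coefficient, for illustration only.
a, b = -2.0, 0.7
q_unexposed = inverse_logit(a)      # predictor x = 0
q_exposed = inverse_logit(a + b)    # predictor x = 1

# Both lines print the same value (up to rounding): the odds ratio for x.
print(odds(q_exposed) / odds(q_unexposed))
print(odds_ratio(b))
```

An odds ratio above 1 means the odds of success rise as the predictor increases, and a value below 1 means they fall.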

The process by which coefficients are tested for significance for inclusion or elimination from the model involves several different techniques. Each of these will be discussed below.

**Wald Test:**

A Wald test is used to test the statistical significance of each coefficient (b) in the model. A Wald test calculates a *z* statistic, which is:

$$z = \frac{b}{SE_b}$$

where SE_b is the standard error of the coefficient. This z value is then squared, yielding a Wald statistic with a chi-square distribution. However, several authors have identified problems with the use of the Wald statistic. Menard (1995) warns that for large coefficients, the standard error is inflated, lowering the Wald statistic (chi-square) value. Agresti (1996) states that the likelihood-ratio test is more reliable for small sample sizes than the Wald test.
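A Wald test for a single coefficient can be sketched as follows, assuming the z statistic is the coefficient divided by its standard error and that the squared value is referred to a chi-square distribution with 1 degree of freedom. The function name and the input values are hypothetical; the p-value uses the identity that the upper tail of a chi-square with 1 degree of freedom at z² equals the two-sided normal tail at |z|.

```python
import math


def wald_test(b, se):
    """Return (z, Wald chi-square, p-value) for one coefficient b with standard error se."""
    z = b / se
    wald = z * z
    # P(chi-square with 1 df > z^2) = P(|Z| > |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, wald, p_value


# Hypothetical coefficient and standard error, for illustration only.
z, wald, p = wald_test(b=0.98, se=0.5)
print(z, wald, p)  # z = 1.96, Wald is about 3.84, p is about 0.05
```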

**Likelihood-Ratio Test:**

The likelihood-ratio test uses the ratio of the maximized value of the likelihood function for the full model ($L_1$) over the maximized value of the likelihood function for the simpler model ($L_0$). The likelihood-ratio test statistic equals:

$$-2\ln\left(\frac{L_0}{L_1}\right) = -2\left[\ln(L_0) - \ln(L_1)\right] = 2\ln\left(\frac{L_1}{L_0}\right)$$

This log transformation of the likelihood functions yields a chi-squared statistic. This is the recommended test statistic to use when building a model through backward stepwise elimination.
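The sketch below works through a likelihood-ratio test on invented data, assuming the statistic is -2 times the difference between the maximized log-likelihoods of the simpler and the full model. To avoid an iterative fit, it uses a single binary predictor, for which the maximum-likelihood fitted probabilities are simply the observed proportions (overall for the simpler model, per group for the full model). The counts, names, and the restriction to 1 degree of freedom are all assumptions of this example.

```python
import math


def bernoulli_loglik(successes, n, q):
    """Log-likelihood of `successes` out of `n` Bernoulli trials with success probability q."""
    ll = 0.0
    if successes > 0:
        ll += successes * math.log(q)
    if n - successes > 0:
        ll += (n - successes) * math.log(1.0 - q)
    return ll


def likelihood_ratio_test(log_l0, log_l1):
    """Return (statistic, p-value) when the models differ by one parameter (1 df)."""
    stat = -2.0 * (log_l0 - log_l1)
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))  # chi-square upper tail, 1 df only
    return stat, p_value


# Hypothetical counts: (successes, cases) for unexposed and exposed subjects.
groups = [(10, 50), (25, 50)]
total_successes = sum(s for s, _ in groups)
total_n = sum(n for _, n in groups)

# Simpler model (constant only): one common probability for everyone.
log_l0 = bernoulli_loglik(total_successes, total_n, total_successes / total_n)
# Full model (constant + binary predictor): one probability per group.
log_l1 = sum(bernoulli_loglik(s, n, s / n) for s, n in groups)

stat, p = likelihood_ratio_test(log_l0, log_l1)
print(stat, p)  # statistic is roughly 10.1, so the predictor improves the fit
```

For models that differ by more than one parameter, the statistic is compared with a chi-square distribution whose degrees of freedom equal the number of parameters dropped, which this 1-df shortcut does not cover.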

**Hosmer-Lemeshow Goodness-of-Fit Test:**

The Hosmer-Lemeshow statistic evaluates the goodness of fit by creating 10 ordered groups of subjects and then comparing the number actually in each group (observed) to the number predicted by the logistic regression model (predicted). Thus, the test statistic is a chi-square statistic with a desirable outcome of non-significance, indicating that the model prediction does not significantly differ from the observed.

The 10 ordered groups are created based on their estimated probability; those with estimated probability below 0.1 form one group, and so on, up to those with probability 0.9 to 1.0. Each of these categories is further divided into two groups based on the actual observed outcome variable (success, failure). The expected frequencies for each of the cells are obtained from the model. If the model is good, then most of the subjects with success are classified in the higher deciles of risk and those with failure in the lower deciles of risk.
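The grouping described above can be sketched in Python as follows. This is a simplified illustration, not a reference implementation: it uses the fixed probability cutpoints described in the text (0 to 0.1, 0.1 to 0.2, and so on), skips empty groups, and returns only the chi-square statistic and the number of groups used. The function name and the toy data are hypothetical. The statistic is commonly compared with a chi-square distribution on (number of groups - 2) degrees of freedom.

```python
def hosmer_lemeshow(y, p, n_groups=10):
    """Hosmer-Lemeshow-style statistic from 0/1 outcomes y and estimated probabilities p."""
    count = [0] * n_groups          # subjects per group
    observed = [0] * n_groups       # observed successes per group
    expected = [0.0] * n_groups     # expected successes per group (sum of probabilities)
    for yi, pi in zip(y, p):
        g = min(int(pi * n_groups), n_groups - 1)  # a probability of 1.0 goes in the top group
        count[g] += 1
        observed[g] += yi
        expected[g] += pi

    stat = 0.0
    groups_used = 0
    for g in range(n_groups):
        if count[g] == 0:
            continue  # nothing to compare in an empty group
        groups_used += 1
        observed_fail = count[g] - observed[g]
        expected_fail = count[g] - expected[g]
        if expected[g] > 0:
            stat += (observed[g] - expected[g]) ** 2 / expected[g]
        if expected_fail > 0:
            stat += (observed_fail - expected_fail) ** 2 / expected_fail
    return stat, groups_used


# Toy data: estimated probabilities that match the observed proportions exactly.
y = [1, 0, 0, 0, 1, 0, 1, 0]
p = [0.25, 0.25, 0.25, 0.25, 0.5, 0.5, 0.5, 0.5]
print(hosmer_lemeshow(y, p))  # (0.0, 2): observed and expected counts agree
```

A small statistic (non-significant result) indicates that the predicted counts track the observed counts within each group.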

**References:**

Agresti, Alan. 1996. An Introduction to Categorical Data Analysis. John Wiley and Sons, Inc.

Hosmer, David and Stanley Lemeshow. 1989. Applied Logistic Regression. John Wiley and Sons, Inc.

Menard, Scott. 1995. Applied Logistic Regression Analysis. Sage Publications. Series: Quantitative Applications in the Social Sciences, No. 106.

Tabachnick, Barbara and Linda Fidell. 1996. Using Multivariate Statistics, Third edition. Harper Collins.

**Applications in Ecological Literature:**

Trexler, J.C., and J. Travis. 1993. Nontraditional Regression Analyses. Ecology 74:1629-1637.

Connor, E.F., Adams-Manson, R.H., Carr, T.G., and M.W. Beck. 1994. The effects of host plant phenology on the demography and population dynamics of the leaf-mining moth, *Cameraria hamadryadella* (Lepidoptera: Gracillariidae). Ecological Entomology 19:111-120.

**Useful Websites:**

Alan Agresti’s website with all the data from the worked examples in his book:
http://lib.stat.cmu.edu/datasets/agresti

Good notes on logistic regression and interpreting the SPSS output:
http://www2.chass.ncsu.edu/garson/pa765/logistic.htm
