What should the first step be in regression analysis?

Stepwise regression is the step-by-step iterative construction of a regression model that involves the selection of independent variables to be used in a final model. It involves adding or removing potential explanatory variables in succession and testing for statistical significance after each iteration.

The availability of statistical software packages makes stepwise regression possible, even in models with hundreds of variables.

  • Stepwise regression is a method that iteratively examines the statistical significance of each independent variable in a linear regression model.
  • The forward selection approach starts with nothing and adds each new variable incrementally, testing for statistical significance.
  • The backward elimination method begins with a full model loaded with several variables and then removes one variable to test its importance relative to overall results.
  • Stepwise regression has its downsides, however, as it is an approach that fits data into a model to achieve the desired result.

The underlying goal of stepwise regression is to find, through a series of tests (e.g., F-tests, t-tests), a set of independent variables that significantly influence the dependent variable. This is done with computers through iteration, which is the process of arriving at results or decisions by going through repeated rounds or cycles of analysis. Conducting tests automatically with help from statistical software packages has the advantage of saving time and limiting mistakes.
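For reference, one common form of the F-test used at each step compares a smaller model against a larger one that adds q candidate variables. This is a standard textbook statistic rather than something spelled out above, and the symbols are notation chosen here for illustration: RSS is the residual sum of squares, n is the number of observations, and p is the number of predictors in the larger model, not counting the intercept.

```latex
F = \frac{\left(\mathrm{RSS}_{\text{small}} - \mathrm{RSS}_{\text{large}}\right) / q}
         {\mathrm{RSS}_{\text{large}} / (n - p - 1)}
```

A large F suggests the added variables explain more than chance alone would. When q = 1, F equals the square of the t-statistic for the single added variable.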

Stepwise regression can be achieved either by trying out one independent variable at a time and including it in the regression model if it is statistically significant or by including all potential independent variables in the model and eliminating those that are not statistically significant. Some use a combination of both methods and therefore there are three approaches to stepwise regression:

  1. Forward selection begins with no variables in the model, tests each variable as it is added to the model, then keeps those that are deemed most statistically significant—repeating the process until the results are optimal.
  2. Backward elimination starts with a set of independent variables, deleting one at a time, then testing to see if the removed variable is statistically significant.
  3. Bidirectional elimination is a combination of the first two methods that tests which variables should be included or excluded.
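Forward selection, the first of these approaches, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not how any particular statistics package does it: the function names (`solve`, `rss`, `forward_select`), the synthetic data, and the F-to-enter cutoff of 4.0 are all hypothetical choices made for this example. Real software typically uses p-values or information criteria instead of a fixed F cutoff.

```python
import random

def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def rss(columns, y):
    """Residual sum of squares for OLS of y on an intercept plus `columns`."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(r[p] * r[q] for r in design) for q in range(k)] for p in range(k)]
    xty = [sum(r[p] * yi for r, yi in zip(design, y)) for p in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, r)) for r in design]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def forward_select(data, y, f_to_enter=4.0):
    """Add one variable per round; stop when no candidate's F reaches the cutoff."""
    selected, remaining = [], list(data)
    current_rss = rss([], y)  # intercept-only model
    while remaining:
        # Residual degrees of freedom of the model with one more variable.
        df_resid = len(y) - (len(selected) + 1) - 1
        scored = []
        for name in remaining:
            new_rss = rss([data[v] for v in selected + [name]], y)
            f_stat = (current_rss - new_rss) / (new_rss / df_resid)
            scored.append((f_stat, name, new_rss))
        best_f, best_name, best_rss = max(scored)
        if best_f < f_to_enter:
            break
        selected.append(best_name)
        remaining.remove(best_name)
        current_rss = best_rss
    return selected

# Synthetic data (an assumption for illustration): y depends on x1 and x2 only.
rng = random.Random(0)
n = 200
data = {name: [rng.gauss(0, 1) for _ in range(n)] for name in ("x1", "x2", "x_noise")}
y = [3 * a + 2 * b + rng.gauss(0, 1) for a, b in zip(data["x1"], data["x2"])]

selected = forward_select(data, y)
print(selected)
```

With data built this way, `x1` and `x2` should enter first. The pure-noise variable will usually be left out, but not always, which is one small illustration of why the method has critics.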

An example of a stepwise regression using the backward elimination method would be an attempt to understand energy usage at a factory using variables such as equipment run time, equipment age, staff size, temperatures outside, and time of year. The model includes all of the variables—then each is removed, one at a time, to determine which is least statistically significant. In the end, the model might show that time of year and temperatures are most significant, possibly suggesting the peak energy consumption at the factory is when air conditioner usage is at its highest. 

Regression analysis, both linear and multivariate, is widely used in the economics and investment world today. The idea is often to find patterns that existed in the past that might also recur in the future. A simple linear regression, for example, might look at the price-to-earnings ratios and stock returns over many years to determine if stocks with low P/E ratios (independent variable) offer higher returns (dependent variable). The problem with this approach is that market conditions often change and relationships that have held in the past do not necessarily hold true in the present or future.

Meanwhile, the stepwise regression process has many critics and there are even calls to stop using the method altogether. Statisticians note several drawbacks to the approach, including incorrect results, an inherent bias in the process itself, and the necessity for significant computing power to develop complex regression models through iteration.

No matter what statistical model you’re running, you need to go through the same steps.  The order and the specifics of how you do each step will differ depending on the data and the type of model you use.

These steps are in 4 phases.  Most people think of only the third as modeling.  But the phases before this one are fundamental to making the modeling go well. It will be much, much easier, more accurate, and more efficient if you don’t skip them.

And there is no point in running the model if you skip phase 4.

If you think of them all as part of the analysis, the modeling process will be faster, easier, and make more sense.

Phase 1: Define and Design

In the first 5 steps, the object is clarity. You want to make everything as clear as possible to yourself. The more clear things are at this point, the smoother everything will be.

1. Write out research questions in theoretical and operational terms

A lot of times, when researchers are confused about the right statistical method to use, the real problem is they haven’t defined their research questions.  They have a general idea of the relationship they want to test, but it’s a bit vague.  You need to be very specific.

For each research question, write it down in both theoretical and operational terms.

2. Design the study or define the design

Depending on whether you are collecting your own data or doing secondary data analysis, you need a clear idea of the design.  Design issues are about randomization and sampling. Some examples:

•    Nested and Crossed Factors
•    Potential confounders and control variables
•    Longitudinal or repeated measurements on a study unit
•    Sampling: simple random sample or stratification or clustering

3. Choose the variables for answering research questions and determine their level of measurement

Every model has to take into account both the design and the level of measurement of the variables.

Level of measurement, remember, is whether a variable is nominal, ordinal, or numerical.  But there’s nuance here for choosing an analysis. For example, you also need to know if variables are discrete counts, continuous, proportions, time to event, etc.

It’s absolutely vital that you know the level of measurement of each response and predictor variable, because they determine both the type of information you can get from your model and the family of models that is appropriate.
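To make the link between the type of response variable and the family of models concrete, here is a rough lookup table in Python. The pairings are common starting points, not rules, and they are my own illustration rather than a list from this article; the names `MODEL_FAMILY` and `suggest_family` are hypothetical.

```python
# Illustrative only: a typical first guess for each kind of response variable.
MODEL_FAMILY = {
    "continuous": "linear regression",
    "binary": "logistic regression",
    "ordinal": "ordinal logistic regression",
    "nominal": "multinomial logistic regression",
    "count": "Poisson or negative binomial regression",
    "proportion": "binomial logistic or beta regression",
    "time_to_event": "survival model (e.g. Cox regression)",
}

def suggest_family(response_type):
    """Return a candidate model family for the given response type."""
    try:
        return MODEL_FAMILY[response_type]
    except KeyError:
        known = ", ".join(sorted(MODEL_FAMILY))
        raise ValueError(f"unknown response type {response_type!r}; expected one of: {known}")

print(suggest_family("count"))
```

The design still matters: clustering or repeated measures, for example, can push any of these toward a mixed or multilevel version of the same family.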

4. Write an analysis plan

Write your best guess for the statistical method that will answer the research question, taking into account the design and the type of data.

It does not have to be final at this point—it just needs to be a reasonable approximation.

5. Calculate sample size estimations

This is the point at which you should calculate your sample sizes—before you collect data and after you have an analysis plan.  You need to know which statistical tests you will use as a basis for the estimates.

And there really is no point in running post-hoc power analyses—it doesn’t tell you anything.
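As a small illustration of the a priori calculation in step 5, here is a sketch for the simplest case: comparing the means of two equal-sized groups with a two-sided test. It uses the normal approximation, so it is an assumption-laden shortcut; a t-based calculation in dedicated power software gives a slightly larger number. The function name `n_per_group` and the default values are hypothetical choices for this example.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample comparison.

    effect_size is the standardized mean difference (Cohen's d).
    Uses n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2, rounded up.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

print(n_per_group(0.5))  # medium effect
print(n_per_group(0.2))  # small effect
```

Notice how quickly the required sample grows as the expected effect shrinks. That is exactly why the estimate needs to be tied to the specific test in your analysis plan.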

Phase 2: Prepare and explore

6. Collect, code, enter, and clean data 

The parts that are most directly applicable to modeling are entering data and creating new variables.

For data entry, the analysis plan you wrote will determine how to set up the data set. For example, if you will be doing a linear mixed model, you will want the data in long format.
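If the data were entered in wide format (one row per subject, one column per time point), reshaping to long format is mechanical. Here is a minimal sketch in plain Python; the column names (`subject`, `score_t1`, and so on) and the helper `wide_to_long` are made up for illustration, and statistical packages have their own reshape tools that do the same job.

```python
# Wide format: one row per subject, one column per measurement occasion.
wide = [
    {"subject": 1, "score_t1": 10, "score_t2": 12, "score_t3": 15},
    {"subject": 2, "score_t1": 8, "score_t2": 9, "score_t3": 11},
]

def wide_to_long(rows, id_key, prefix, value_name="score"):
    """Turn each prefixed column into its own row, keyed by id and time."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key.startswith(prefix):
                time = int(key[len(prefix):])
                long_rows.append({id_key: row[id_key], "time": time, value_name: value})
    return long_rows

# Long format: one row per subject per occasion.
long_rows = wide_to_long(wide, id_key="subject", prefix="score_t")
for r in long_rows:
    print(r)
```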

7. Create new variables

This step may take longer than you think; it can be quite time-consuming.  It’s pretty rare for every variable you’ll need for analysis to be collected in exactly the right form.  Create indices, categorize, reverse code, whatever you need to do to get variables in their final form, including running principal components or factor analysis.
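Two of the most common tasks here, reverse coding an item and averaging items into an index, look like this in plain Python. This is a sketch that assumes a 1-to-5 response scale; the item names (`q1`, `q2`, `q3`) and the helper functions are hypothetical.

```python
def reverse_code(value, low=1, high=5):
    """Flip a response on a low..high scale (on 1-5: 1 <-> 5, 2 <-> 4, 3 stays 3)."""
    return low + high - value

def scale_index(responses, reversed_items, low=1, high=5):
    """Average the items after reverse coding the negatively worded ones."""
    scored = [
        reverse_code(value, low, high) if name in reversed_items else value
        for name, value in responses.items()
    ]
    return sum(scored) / len(scored)

# One respondent's answers; q2 is assumed to be negatively worded.
respondent = {"q1": 4, "q2": 2, "q3": 5}
index = scale_index(respondent, reversed_items={"q2"})
print(round(index, 2))
```

Keeping the original items alongside the new variables makes it much easier to trace a problem back to its source later.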

8. Run Univariate and Bivariate Descriptives

You need to know what you’re working with.  Check the distributions of the variables you intend to use, as well as bivariate relationships among all variables that might go into the model.

You may find something here that leads you back to step 7 or even step 4.   You might have to do some data manipulation or deal with missing data.

More commonly, it will alert you to issues that will become clear in later steps.  The earlier you are aware of issues, the better you can deal with them.  But even if you don’t discover the issue until later, it won’t throw you for a loop if you have a good understanding of your variables.
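For a sense of what the descriptives in step 8 involve, here is a minimal sketch using only Python's standard library. The two variables (`hours`, `score`) are tiny made-up lists, and the helpers `describe` and `pearson_r` are hypothetical; in practice you would run this in your statistical software and add plots.

```python
import statistics

def describe(values):
    """Univariate summary: center, spread, and range."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }

def pearson_r(x, y):
    """Bivariate summary: Pearson correlation between two numeric variables."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Made-up data for illustration only.
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]

print(describe(hours))
print(describe(score))
r = pearson_r(hours, score)
print(round(r, 3))
```

A correlation is only a summary of a linear relationship, so it is worth looking at a scatterplot as well before deciding how a predictor should enter the model.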

9. Run an initial model

Once you know what you’re working with, run the model listed in your analysis plan.  In all likelihood, this will not be the final model.

But it should be in the right family of models for the types of variables, the design, and to answer the research questions.  You need to have this model to have something to explore and refine.

Phase 3: Refine the model

10. Refine predictors and check model fit

If you are doing a truly exploratory analysis, or if the point of the model is pure prediction, you can use some sort of stepwise approach to determine the best predictors.

If the analysis is to test hypotheses or answer theoretical research questions, this part will be more about refinement.  You can

• Test, and possibly drop, interactions and quadratic terms, or explore other types of non-linearity
• Drop nonsignificant control variables
• Do hierarchical modeling to see the effects of predictors added alone or in blocks
• Test the best specification of random effects

11. Test assumptions

Because you already investigated the right family of models in Phase 1, thoroughly investigated your variables in Step 8, and correctly specified your model in Step 10, you should not have big surprises here.  Rather, this step will be about confirming, checking, and refining.  But what you learn here can send you back to any of those steps for further refinement.

12. Check for and resolve data issues

Steps 11 and 12 are often done together, or perhaps back and forth.  This is where you check for data issues that can affect the model, but are not exactly assumptions.

Data issues are about the data, not the model, but occur within the context of the model. These include:

Once again, data issues don’t appear until you have chosen variables and put them in the model.

Phase 4: Answer the Research Question

13. Interpret Results

Now, finally, interpret the results.

You may not notice data issues or misspecified predictors until you interpret the coefficients.  Then you find something like a super high standard error or a coefficient with a sign opposite what you expected, sending you back to previous steps.

And now that you understand what you found, you can share it.

14. Write up Results

This might be the hardest and most important step of all.

It includes creating graphs and tables that are ready for your reader (not just the ones you’ve created early on to help you understand what’s going on in the data).

It also includes the write-up of results, whether that’s for a journal article, thesis, or report for management. Or a conference paper or poster.

This step can take weeks, even if there is nothing that comes up during your write up that makes you go back and refine something earlier in the analysis—for example, realizing that you need more descriptive stats to complete a table.
