Forward Selection: Achieving Accurate Insights

What is Forward Selection?

In the dynamic world of data analysis and machine learning, selecting the right variables for a model is a critical step that affects the model’s performance, its simplicity, and the computational resources it requires. Modern datasets may include dozens or even hundreds of variables, but not all of them contribute to predictive accuracy. Forward Selection is a classic method of variable selection that builds a simple, efficient model by focusing on the most significant variables. In this article, we’ll review the method, weigh its advantages and disadvantages, and walk through a practical implementation in a regression setting.

Stepwise Regression

Stepwise Regression is a family of methods in which a statistical algorithm, rather than the analyst, determines which explanatory variables (predictors) to include in the model. It has three basic variations: Forward Selection, Backward Elimination, and Bidirectional Elimination.

The simplest approach to building a data-based model is called Forward Selection. The model starts with no explanatory variables and adds them one at a time, based on their statistical significance, until a stopping criterion is reached. At each step, every variable not yet in the model is evaluated for inclusion, and the most significant one (the one with the lowest p-value) is added to the model, provided its p-value is below a predetermined threshold. Because the method is exploratory, this threshold is often set above 0.05, for example at 0.10 or 0.15.

However, this method has drawbacks: adding a new variable to the model may render previously selected variables less significant. In some cases it is therefore more appropriate to use the Backward Elimination method, which starts with a model containing all candidate variables and gradually removes the insignificant ones until a predetermined stopping criterion is met. At each step, the least significant variable is removed, as long as its p-value exceeds the chosen critical level.
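
For intuition, here is a minimal sketch of a backward pass under these rules, assuming a DataFrame X of predictors and a Series y; the backward_eliminate helper and the 0.05 threshold are illustrative choices, not a standard library API:

import pandas as pd
import statsmodels.api as sm

def backward_eliminate(X, y, remove_level=0.05):
    # Sketch of Backward Elimination: start full, drop the worst variable each round
    selected = list(X.columns)           # start with all predictors in the model
    while selected:
        p_values = sm.OLS(y, sm.add_constant(X[selected])).fit().pvalues.drop("const")
        worst = p_values.idxmax()        # least significant remaining variable
        if p_values[worst] > remove_level:
            selected.remove(worst)       # drop it and refit on the next pass
        else:
            break                        # everything left is significant
    return selected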

The third method, Bidirectional Elimination (often simply called Stepwise Regression), combines both approaches: variables can be both added and removed at different stages of the process. At each stage, the least significant included variable is considered for removal, and previously excluded variables are re-evaluated to check whether they should be (re)introduced into the model. Two separate significance levels must be set for this process: a higher level for removal and a lower level for addition.
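
To make the two-threshold idea concrete, here is a minimal sketch of bidirectional selection, again assuming a DataFrame X and a Series y; the stepwise_select helper and the 0.05/0.10 thresholds are illustrative assumptions, not a standard API:

import pandas as pd
import statsmodels.api as sm

def stepwise_select(X, y, add_level=0.05, remove_level=0.10):
    # Sketch of Bidirectional Elimination: alternate add and remove steps
    selected = []
    while True:
        changed = False
        # Forward step: try to add the most significant excluded variable
        candidates = [c for c in X.columns if c not in selected]
        if candidates:
            p_values = {c: sm.OLS(y, sm.add_constant(X[selected + [c]])).fit().pvalues[c]
                        for c in candidates}
            best = min(p_values, key=p_values.get)
            if p_values[best] < add_level:
                selected.append(best)
                changed = True
        # Backward step: try to remove the least significant included variable
        if selected:
            p_values = sm.OLS(y, sm.add_constant(X[selected])).fit().pvalues.drop("const")
            worst = p_values.idxmax()
            if p_values[worst] > remove_level:
                selected.remove(worst)
                changed = True
        if not changed:
            return selected

Because the removal level (0.10) is looser than the addition level (0.05), a variable that has just entered the model is not immediately dropped again on the following backward step.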

In this article, I will focus on the Forward Selection approach.

How is it calculated?

As previously mentioned, Forward Selection is a method for building a regression model gradually by adding variables based on their statistical contribution. In this process, each step focuses on evaluating variables and selecting the most significant one to improve the model. Here are the main steps to perform the process, with an emphasis on linear regression:

  1. Define the initial model:
    Start with a basic model without explanatory variables.
  2. Evaluate potential variables:
    For each variable not yet in the model (all of them, when the model is still empty), fit the model with this variable included; a single evaluation of this kind is sketched just after these steps.
    Calculate a statistical criterion for the candidate model; this criterion can be the variable’s p-value or the model’s R².
  3. Select the most significant variable:
    Now, choose the variable that brings the greatest improvement to the model, for example:
    Select the variable that yields the largest improvement in R², or the variable with the lowest p-value, provided it is below the threshold (e.g., 0.05).
  4. Repeat the process:
    Now, repeat steps 2 and 3, adding one variable at a time.
    Continue until no additional variable significantly improves the model.
  5. Stop:
    Stop iterating when no remaining variable is statistically significant enough for inclusion in the model,
    or when the improvement in the selected metric becomes negligible.
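
Before building the full program, here is what a single evaluation from step 2 looks like in statsmodels, on toy data; the column names x1 and x2 and the generated values are purely illustrative:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Toy data for illustration: y depends on x1 but not on x2
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
y = 3 * df["x1"] + rng.normal(size=100)

# Step 2 for one candidate: fit a model containing only x1
# and read off its p-value, the criterion used in step 3
X_candidate = sm.add_constant(df[["x1"]])
model = sm.OLS(y, X_candidate).fit()
print(model.pvalues["x1"])  # effectively zero here, so x1 would be added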

Time to Code

I chose to write a program that performs the Forward Selection process to identify the variables that affect a linear regression. The program prompts the user for the path and name of a CSV file, reads the data from the file, and identifies the most significant variables for the model based on a significance level of 0.05 (p-value).
Finally, the program prints the indices of the selected variables.

Assumptions

  • The number of columns in the file is not known in advance.
  • All values in the columns are continuous numerical values (no categorical columns).
  • The last column in the file is the dependent variable (response variable), and all other columns are independent variables (predictors).

Requirements

  • Implement the Forward Selection algorithm to gradually add variables to the regression model.
  • The decision to include a variable in the model will be based on a p-value of 0.05: a variable will be included in the model if its p-value is below 0.05.
  • A variable with a p-value above 0.05 will not be included in the model.
  • After completing the process, the program will print the indices of the selected columns.

Implementation

First, import the necessary libraries:

import pandas as pd
import statsmodels.api as sm

Next, load the data file and, if everything is correct, print its first five rows:

input_path = input("Please specify file and path for CSV file: \n")
try:
    dataset = pd.read_csv(input_path)
    print(dataset.head(5))  # quick sanity check on the loaded data
except Exception as e:
    print(f"Error while reading the file: {e}")
    raise SystemExit(1)  # stop here; the rest of the program needs the data

Now, define the predictors and the response:

target = dataset.columns[-1]   # the last column is the response
X = dataset.iloc[:, :-1]       # all other columns are predictors
y = dataset[target]

Please note that the dependent variable (y) is taken from the last column.
All the features (predictors) span index 0 (the first column) up to the one before last.

Now we’ll perform the selection process itself.
To do so, we’ll run a while loop that repeatedly evaluates the remaining columns and adds only variables with a p-value lower than 0.05.

selected_features = []                  # features accepted into the model so far
remaining_features = list(X.columns)    # candidates not yet in the model
significance_level = 0.05

while remaining_features:
    p_values = []
    # Fit one candidate model per remaining feature and record that feature's p-value
    for feature in remaining_features:
        current_features = selected_features + [feature]
        X_current = sm.add_constant(X[current_features])  # include an intercept
        model = sm.OLS(y, X_current).fit()
        p_values.append((feature, model.pvalues[feature]))

    # Pick the candidate with the lowest p-value
    best_feature, best_p_value = min(p_values, key=lambda x: x[1])

    if best_p_value < significance_level:
        selected_features.append(best_feature)
        remaining_features.remove(best_feature)
    else:
        break  # no remaining variable is significant enough; stop

selected_feature_indices = [X.columns.get_loc(col) for col in selected_features]
print("After forward selection the most important independent variables are in index:", selected_feature_indices)

Example of output:

After forward selection the most important independent variables are in index: [0, 2]
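
If you’d like to try the program without a dataset at hand, here is one way to generate a toy CSV whose structure matches the assumptions above; the file name toy.csv, the column names, and the coefficients are arbitrary choices for illustration:

import numpy as np
import pandas as pd

# Columns 0 and 2 drive the response; column 1 is pure noise
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 3))
y_toy = 2 * X_toy[:, 0] - 3 * X_toy[:, 2] + rng.normal(size=200)
pd.DataFrame(
    np.column_stack([X_toy, y_toy]),
    columns=["x0", "x1", "x2", "y"],  # response last, per the assumptions
).to_csv("toy.csv", index=False)

Running the program on toy.csv should then report indices [0, 2], matching the sample output above.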

In summary

In this article, we explored the importance of selecting variables using Forward Selection in the model-building process.
We discussed the advantages of this method, such as simplifying the model, improving prediction accuracy, and saving computational resources.

Additionally, we presented the theoretical framework behind the method, and concluded with a technical demonstration that illustrates the selection process step by step.

This approach is particularly well-suited for data analysis and machine learning projects, offering valuable tools for better understanding the critical variables that influence model outcomes.
