Linear Regression

Linear regression is a statistical model which estimates the linear relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables).

The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression. This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable. If the explanatory variables are measured with error, then errors-in-variables models (also known as measurement error models) are required.

Formulation

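Given a dataset $\{y_i, x_{i1}, \ldots, x_{ip}\}_{i=1}^{n}$ of n observations, a linear regression model assumes that the relationship between the response $y_i$ and the vector of regressors $\mathbf{x}_i = (1, x_{i1}, \ldots, x_{ip})^{\mathsf{T}}$ is linear, up to an error term $\varepsilon_i$:

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i = \mathbf{x}_i^{\mathsf{T}} \boldsymbol{\beta} + \varepsilon_i, \qquad i = 1, \ldots, n,$$

where T denotes the transpose, so that $\mathbf{x}_i^{\mathsf{T}} \boldsymbol{\beta}$ is the inner product between vectors $\mathbf{x}_i$ and $\boldsymbol{\beta}$.

Often these n equations are stacked together and written in matrix notation as

$$\mathbf{y} = X \boldsymbol{\beta} + \boldsymbol{\varepsilon},$$

where $\mathbf{y} = (y_1, \ldots, y_n)^{\mathsf{T}}$ is the vector of responses, $X$ is the $n \times (p+1)$ design matrix whose i-th row is $\mathbf{x}_i^{\mathsf{T}}$, $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)^{\mathsf{T}}$ is the parameter vector, and $\boldsymbol{\varepsilon} = (\varepsilon_1, \ldots, \varepsilon_n)^{\mathsf{T}}$ is the vector of error terms.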
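
As a minimal sketch of the stacked form, the NumPy code below builds a design matrix with a leading column of ones for the intercept and checks that the row-by-row inner products agree with the single matrix product. The toy data, the variable names, and the use of ordinary least squares (`np.linalg.lstsq`) to estimate β are assumptions made for illustration; the formulation itself does not prescribe an estimator.

```python
import numpy as np

# Toy data (illustrative only): n = 50 observations, p = 2 regressors.
rng = np.random.default_rng(0)
n, p = 50, 2
features = rng.normal(size=(n, p))

# Design matrix X: a leading column of ones carries the intercept beta_0.
X = np.column_stack([np.ones(n), features])

# A known parameter vector (beta_0, beta_1, beta_2) and small error terms.
beta_true = np.array([1.0, 2.0, -0.5])
eps = rng.normal(scale=0.01, size=n)

# Row by row: y_i = x_i^T beta + eps_i
y_rowwise = np.array([X[i] @ beta_true + eps[i] for i in range(n)])

# Stacked: y = X beta + eps
y = X @ beta_true + eps

# Ordinary least squares estimate of beta (one common choice of estimator).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("row-wise and stacked forms agree:", np.allclose(y_rowwise, y))
print("estimated beta:", beta_hat)
```

Writing the model as one matrix equation is what lets a single linear-algebra call replace a loop over the n observations.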

Notation and terminology

    • Sometimes one of the regressors can be a non-linear function of another regressor or of the data values, as in polynomial regression and segmented regression. The model remains linear as long as it is linear in the parameter vector β.

    • The values $x_{ij}$ may be viewed either as observed values of random variables $X_j$ or as fixed values chosen prior to observing the dependent variable. Both interpretations may be appropriate in different cases, and they generally lead to the same estimation procedures; however, different approaches to asymptotic analysis are used in these two situations.
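
The point about polynomial regression can be illustrated with a short sketch: the regressors below are non-linear functions of x, yet the model is still linear in β and is fitted with the same least-squares solve as any other linear model. The quadratic toy data and the choice of `np.linalg.lstsq` as the solver are assumptions for illustration.

```python
import numpy as np

# Illustrative noise-free data from a quadratic curve: y = 1 + 2x + 3x^2.
x = np.linspace(-2.0, 2.0, 21)
y = 1.0 + 2.0 * x + 3.0 * x**2

# The regressors 1, x, x^2 are non-linear in x, but the model
#   y = beta_0 * 1 + beta_1 * x + beta_2 * x^2
# is linear in the parameter vector beta.
X_poly = np.column_stack([np.ones_like(x), x, x**2])

# Least-squares solve for beta, exactly as for a plain linear model.
beta_hat, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
print("estimated beta:", beta_hat)
```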

Applications

Linear regression is widely used in various fields for prediction, forecasting, and understanding the relationships between variables. Some common applications include:

  • Predicting house prices based on features such as size, number of bedrooms, and location.

  • Forecasting sales from advertising spending and economic indicators.

  • Analyzing the impact of independent variables on a dependent variable in scientific research.

Advantages and Disadvantages

  • Advantages:

    • Simple and easy to understand.

    • Provides interpretable coefficients for each independent variable.

    • Can be applied to both numerical and categorical independent variables, provided the categorical ones are first encoded numerically (for example, as dummy variables).

  • Disadvantages:

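    • Assumes the response is a linear function of the parameters, which may not hold for the data at hand.

    • Least-squares estimates are sensitive to outliers, and multicollinearity among the regressors inflates the variance of the coefficient estimates.

    • Cannot capture non-linear patterns or interactions unless they are added explicitly as transformed regressors.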
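
To make the sensitivity to outliers concrete, the sketch below fits a least-squares line to exactly linear toy data, then corrupts a single observation and refits. The data, the helper name `fit_line`, and the size of the corruption are hypothetical and chosen only for illustration.

```python
import numpy as np

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[0], coef[1]

# Clean, exactly linear toy data: y = 1 + 2x.
x = np.arange(10, dtype=float)
y_clean = 1.0 + 2.0 * x

# The same data with one corrupted observation at the last point.
y_outlier = y_clean.copy()
y_outlier[-1] = 100.0

_, slope_clean = fit_line(x, y_clean)
_, slope_outlier = fit_line(x, y_outlier)

print(f"slope without outlier:  {slope_clean:.3f}")
print(f"slope with one outlier: {slope_outlier:.3f}")
```

A single corrupted point out of ten is enough to move the fitted slope far from the true value of 2, because squared-error loss weights large residuals heavily.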

Implementation

In Python, linear regression can be implemented using libraries such as scikit-learn or statsmodels. Here is a basic example using scikit-learn:

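import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only: y = 3x + 2 plus Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1, size=100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Create a linear regression model
model = LinearRegression()

# Fit the model to the training data
model.fit(X_train, y_train)

# Make predictions on the test data
predictions = model.predict(X_test)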
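
After fitting, it is usually worth inspecting the estimated parameters and checking the predictions against held-out data. The following is a self-contained sketch under stated assumptions: the two-feature synthetic data, the 75/25 split, and the choice of mean squared error and R² as metrics are illustrative, not part of the example above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (illustrative only): y = 2 + 3*x1 - 1*x2 plus noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

# Fitted parameters: intercept_ is beta_0, coef_ holds one slope per column.
print("intercept:", model.intercept_)
print("coefficients:", model.coef_)

# Error metrics on the held-out test set.
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print("test MSE:", mse)
print("test R^2:", r2)
```

Each entry of `coef_` is read as the expected change in the response for a one-unit change in that feature with the other features held fixed, which is the interpretability advantage noted earlier.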
