Regression in Python: How to Find Relationships in Your Data

The simplest form of regression in Python is, well, simple linear regression. With simple linear regression, you’re trying to see if there’s a relationship between two variables, with the first known as the “independent variable” and the latter the “dependent variable.” The independent variable is typically plotted on the x-axis and the dependent variable on the y-axis. This creates the classic scatterplot of data points. The objective is to plot a line that best fits this scatterplot.

We’ll start by using the example data of tips in New York City restaurants. We want to see if there’s a relationship between the total bill and the tip. This dataset is included in the Seaborn statistical plotting package, my favorite data visualization tool. I’ve set these tools up in a Mamba environment for easy access.

First, we’ll import Seaborn, a statistical plotting database.

import seaborn as sns

Then we’ll load in the dataset:

tips = sns.load_dataset('tips')

If you’re using a Jupyter notebook, like the one I’m linking to on my own GitHub page, be sure to include this line to have images display in the notebook instead of an external window:

%matplotlib inline

Now we’ll look at the scatterplot with the relplot method:

sns.relplot(x='total_bill',y='tip',data=tips)

[Figure: scatter plot of tips vs. total bill, drawn with Seaborn]

The points appear to follow a roughly linear trend, suggesting a positive linear relationship between the amount on the bill and the tip. We can plot the regression line with the regplot method:

sns.regplot(x='total_bill',y='tip',data=tips)
[Figure: regression line of tips vs. total bill fitted over the scatterplot]

The line does seem to fit pretty well.

We can use another library, Pingouin, for more formal analysis. Pingouin’s linear_regression method calculates the coefficients of the regression line and measures how well it fits the data points.

import pingouin as pg
pg.linear_regression(tips['total_bill'],tips['tip']).round(2)
[Figure: Pingouin simple linear regression results for tips vs. total bill in a Jupyter notebook]

The rounding makes the results easier to read. The key number in linear regression is the correlation coefficient. In the resulting table it’s listed as “r2”, since it’s the square of the correlation coefficient. Here it’s 0.46, meaning the total bill explains about 46% of the variance in the tips, a reasonably good fit. Taking the square root gives a correlation of about 0.68, a moderately strong positive linear relationship, confirming more formally what we saw in the plot.
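Since Pingouin returns its results as a pandas DataFrame, we can also pull r² out programmatically instead of reading it off the table. Here’s a minimal sketch, assuming the “r2” column name used by current Pingouin versions:

import numpy as np

model = pg.linear_regression(tips['total_bill'], tips['tip'])
r_squared = model['r2'].iloc[0]       # r² is repeated on every row of the results table
print(round(np.sqrt(r_squared), 2))   # the correlation coefficient r, about 0.68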

We can also build a model from the values in the table. You might remember the equation of a line: y = mx + b. Here, y is the dependent variable (the tip), m is the coefficient of x (the total bill), which is 0.11 and determines how steep the line is, and b is the y-intercept, 0.92.

The resulting equation based on this model is

tip = 0.11(total bill) + 0.92

Regression equations are conventionally written with the intercept first, so this becomes:

tip = 0.92 + 0.11(total_bill)

We can write a short Python function that predicts the tip based on the amount of the bill.

def tip(total_bill):
    return 0.92 + 0.11 * total_bill

Let’s predict the tip from a $100 restaurant bill:

tip(100)

The expected tip is around $12.
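If you’re curious how far off these predictions run on the dataset itself, here’s a quick sketch comparing them against the actual tips, using the coefficients above:

predictions = 0.92 + 0.11 * tips['total_bill']            # model prediction for every bill
print((tips['tip'] - predictions).abs().mean().round(2))  # average error in dollars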

Multiple Linear Regression: Taking Regression into the Third Dimension, and Beyond

Linear regression isn’t limited to two variables; you can include multiple independent variables. Instead of fitting a line over data points on a plane, you’re fitting a plane (or, beyond three dimensions, a hyperplane) over the scatterplot. Unfortunately, this is harder to visualize than a 2D regression. I once used multiple regression to build a model of laptop prices based on their specs.
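That said, with two independent variables you can still get a rough look at the point cloud. Here’s a sketch using Matplotlib (the library Seaborn is built on) to draw a 3D scatterplot of the columns we’re about to use:

import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # a 3D set of axes
ax.scatter(tips['total_bill'], tips['size'], tips['tip'])
ax.set_xlabel('total_bill')
ax.set_ylabel('size')
ax.set_zlabel('tip')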

We’ll use the tips dataset. This time, we’ll look at the size of the party with the “size” column. It’s easy to do in Pingouin.

pg.linear_regression(tips[['total_bill','size']],tips['tip']).round(2)

Note the double brackets in the first argument, which select both the total bill and the party size columns. Notice that the r² is essentially unchanged. This again means that there’s a good fit, with the total bill and the table size together serving as good predictors of the tip.

We can rewrite our earlier model to account for size, using the coefficient of the table size:

def tip(total_bill, size):
    return 0.67 + 0.09 * total_bill + 0.19 * size
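
As before, we can plug in values. For a $100 bill and a party of two:

tip(100, 2)

The model predicts a tip of about $10.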

Nonlinear Regression: Fitting Curves

Regression isn’t limited to straight lines; you can also fit nonlinear curves. I’ll demonstrate this using NumPy to generate data points that follow a quadratic curve.

First, I’ll generate a large array of datapoints in NumPy for the x-axis:

x = np.linspace(-100,100,1000)

Now I’ll create a quadratic plot for the y-axis.

y = 4*x**2 + 2*x + 3

To run the regression, I’ll build a Pandas DataFrame, a data structure similar to a table in a relational database, to hold the x and y values. We pass Pandas a dictionary describing the DataFrame we want, which creates columns named “x” and “y”. We’ll call our DataFrame “df”:

import pandas as pd
df = pd.DataFrame({'x':x,'y':y})

We can examine our DataFrame with the head method:

df.head()
[Figure: the first rows of x and y values in the Pandas DataFrame]

We can create a scatterplot with Seaborn as we did earlier with the linear data:

sns.relplot(x='x',y='y',data=df)
[Figure: Seaborn scatterplot suggesting a parabolic curve]

It looks like the classic parabola you might remember plotting on a graphing calculator in math class (a job Python can take over). Let’s see if we can fit a parabola over it. Seaborn’s regplot method has an order option, which specifies the degree of the polynomial to fit. Since we’re fitting a quadratic curve, we’ll set the order to 2:

sns.regplot(x='x',y='y',order=2,data=df)
[Figure: quadratic regression curve drawn with Seaborn]

It does indeed seem to fit a classic quadratic parabola.

To get the formal regression results, we can do the same thing in Pingouin by giving the linear regression a squared term. We’ll just add another column to our DataFrame with the squared x values:

df['x2'] = df['x']**2

We can then use linear regression to fit the quadratic curve:

pg.linear_regression(df[['x','x2']],df['y']).round(2)
[Figure: quadratic regression results from Pingouin]

Because this data was generated artificially, the r² is 1, a perfect fit, something you’d almost never see with real-life data.
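To see a more realistic fit, you could add some random noise to the y values before regressing. A sketch, with an arbitrary noise scale:

rng = np.random.default_rng(42)  # seeded for reproducibility
df['y_noisy'] = df['y'] + rng.normal(0, 5000, size=len(df))  # arbitrary noise scale
pg.linear_regression(df[['x','x2']],df['y_noisy']).round(2)

With the noise added, the r² drops below 1.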

We can also build a predictive model using a function:

def quad(x):
    return 3 + 2*x + 4*x**2
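
For example, predicting the y value at x = 10:

quad(10)

This returns 423, matching 4(10²) + 2(10) + 3.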

You can also extend this method to polynomials with degrees higher than 2.
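For instance, a degree-3 fit follows the same pattern. Here’s a sketch that adds a hypothetical “x3” column for the cubed term:

df['x3'] = df['x']**3  # cube the x values for the degree-3 term
pg.linear_regression(df[['x','x2','x3']],df['y']).round(2)

On this purely quadratic data, the coefficient on x3 should come out as essentially zero.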

Logistic Regression: Fitting Binary Categories

If you want to model a relationship involving a binary category, such as a risk factor like whether a person smokes or not, you can use logistic regression.

The easiest way to visualize this is once again with the Seaborn library. We’ll load the dataset of passengers on the Titanic. We want to see whether the ticket price was a predictor of who survived the ill-fated voyage.

titanic = sns.load_dataset('titanic')

We can examine the data the way we did with the quadratic DataFrame:

titanic.head()
[Figure: columns of the Seaborn Titanic passenger data in a Jupyter notebook]

We’ll use the lmplot method, since it can fit a logistic curve:

sns.lmplot(x='fare',y='survived',logistic=True,data=titanic)
[Figure: logistic regression curve of Titanic fares vs. survival]

We see the logistic curve fitted over the passenger data, with the points split by whether they survived or not. The “survived” column is already encoded as 0 for “didn’t survive” and 1 for “survived.”
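You can confirm this encoding by counting the values in the column:

titanic['survived'].value_counts()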

We can formally determine whether the fare was a predictor of survival on the Titanic with Pingouin, which, among its many statistical tests, offers logistic regression:

pg.logistic_regression(titanic['fare'],titanic['survived']).round(2)
[Figure: Pingouin logistic regression results in a Jupyter notebook]
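The output table gives the intercept and the coefficient on the fare. As a sketch of how you might turn those into a predicted survival probability, assuming the “coef” column Pingouin uses in its results:

import numpy as np

res = pg.logistic_regression(titanic['fare'], titanic['survived'])
b0, b1 = res['coef'].values  # intercept and fare coefficient

def survival_probability(fare):
    # the logistic function squashes the linear predictor into a 0-1 probability
    return 1 / (1 + np.exp(-(b0 + b1 * fare)))

survival_probability(100)  # predicted probability of survival for a $100 fare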
