Session 12: Simple Linear Regression and Forecasting

Simple Linear Regression

Overview

This module introduces one of the most important concepts in data analytics: understanding relationships and forecasting trends.

We will go from:

Intuition $\rightarrow$ how relationships work
Manual calculations $\rightarrow$ how models are built
Python implementation $\rightarrow$ how analysts actually use it
Business applications $\rightarrow$ how decisions are made
Time series forecasting $\rightarrow$ how to predict the future

Learning Objectives

After this session, you will be able to:

Understand what linear regression is conceptually
Manually compute regression coefficients
Interpret regression outputs in a business context
Build regression models in Python
Apply regression for trend analysis
Understand time series structure
Perform basic forecasting using ARIMA and Prophet

Intuition of Relationships

In analytics, we often ask: “If X changes $\rightarrow$ what happens to Y?”

Examples:

Marketing spend $\rightarrow$ Sales
Price $\rightarrow$ Demand
Discount $\rightarrow$ Conversion rate
Hours studied $\rightarrow$ Exam score

Visual Thinking

We will use the hours studied vs grade example from the picture you provided.

Hours Studied	Grade on Exam
2	69
9	98
5	82
5	77
3	71
7	84
1	55
8	94
6	84
2	64

import numpy as np
import matplotlib.pyplot as plt

hours = np.array([2, 9, 5, 5, 3, 7, 1, 8, 6, 2], dtype=float)
grades = np.array([69, 98, 82, 77, 71, 84, 55, 94, 84, 64], dtype=float)

plt.scatter(hours, grades)
plt.xlabel("Hours Studied")
plt.ylabel("Grade on Exam")
plt.title("Hours Studied vs Grade on Exam")
plt.show()

We observe:

As hours studied increase, exam grades also tend to increase
The relationship is positive
The data does not fall perfectly on one line, which is normal in real life
This makes the example ideal for introducing both intuition and model evaluation

Suggested use of the picture you shared:

Keep the screenshot as the raw real-world input at the start of this section, then place the clean table and the scatter plot right after it. It works especially well here because students can visually understand the pattern before seeing any formulas.

Linear Model

The prediction equation for a simple linear regression is:

\[ \hat{Y} = \beta_0 + \beta_1 X \]

A more complete statistical view is:

\[ Y = \beta_0 + \beta_1 X + \varepsilon \]

Where:

$Y$ is the actual observed outcome
$\hat{Y}$ is the predicted outcome
$X$ is the input variable
$\beta_0$ is the intercept
$\beta_1$ is the slope
$\varepsilon$ is the part of the outcome the model does not explain

In this lesson:

$X$ = hours studied
$Y$ = grade on exam

Interpretation

$\beta_1$ tells us how much $Y$ changes when $X$ increases by 1 unit
$\beta_0$ tells us the predicted value of $Y$ when $X = 0$

Example:

If $\beta_1 = 5$, then each extra hour studied is associated with about 5 extra grade points
If $\beta_0 = 55$, then the model predicts a grade of 55 for a student who studied 0 hours

Manual Linear Regression

Our goal Find the best line that fits the data.

Optimization Problem

We minimize:

\[ \sum (y_i - \hat{y}_i)^2 \]

This is called Least Squares.

To understand it:

$y_i$ is the actual value
$\hat{y}_i$ is the predicted value
$(y_i - \hat{y}_i)$ is the residual, or prediction error
Squaring makes all errors positive and penalizes larger errors more strongly

So the model is searching for the line that leaves the smallest total squared error.

Coefficient Formula

Slope:

\[ \beta_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \]

Intercept:

\[ \beta_0 = \bar{y} - \beta_1 \bar{x} \]

Step-by-Step Example

We will use the same hours-studied dataset from the picture.

Step 1: Compute the means

\[ \bar{x} = 4.8 \qquad \bar{y} = 77.8 \]

Step 2: Compute the numerator and denominator for the slope

\[ \sum (x_i - \bar{x})(y_i - \bar{y}) = 320.6 \]

\[ \sum (x_i - \bar{x})^2 = 67.6 \]

Step 3: Compute the slope

\[ \beta_1 = \frac{320.6}{67.6} \approx 4.7426 \]

Step 4: Compute the intercept

\[ \beta_0 = 77.8 - (4.7426 \times 4.8) \approx 55.0355 \]

So the fitted line is:

\[ \hat{Y} = 55.04 + 4.74X \]

This means:

the baseline predicted grade is about 55.04
each extra study hour adds about 4.74 grade points

Python | Manual Calculation

x = hours
y = grades

x_mean = x.mean()
y_mean = y.mean()

beta1 = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
beta0 = y_mean - beta1 * x_mean

print(f"Mean of X: {x_mean:.2f}")
print(f"Mean of Y: {y_mean:.2f}")
print(f"Intercept (β0): {beta0:.4f}")
print(f"Slope (β1): {beta1:.4f}")
print(f"Fitted line: y_hat = {beta0:.2f} + {beta1:.2f}x")

Mean of X: 4.80
Mean of Y: 77.80
Intercept (β0): 55.0355
Slope (β1): 4.7426
Fitted line: y_hat = 55.04 + 4.74x

Expected interpretation:

$\beta_0 \approx 55.04$
$\beta_1 \approx 4.74$

Predicting next step

Now that we have the fitted line, we can predict the grade for a new value of $X$.

For example, if a student studies for 4 hours:

new_hours = 4
predicted_grade = beta0 + beta1 * new_hours

print(f"Predicted grade for {new_hours} hours: {predicted_grade:.2f}")

Predicted grade for 4 hours: 74.01

Interpretation:

For a student who studied 4 hours, the model predicts a grade of about 74.01.

Regression in Python `scikit-learn`

Now let us build the same model using scikit-learn.

Installing Scikit Learn package

pip install -U scikit-learn

Important

Make sure to be in the right environment. In our case it is aca

Importing `LinearRegression`

from sklearn.linear_model import LinearRegression

X = hours.reshape(-1, 1)
y = grades

hours_model = LinearRegression()
hours_model.fit(X, y)

print(f"Intercept: {hours_model.intercept_:.4f}")
print(f"Slope: {hours_model.coef_[0]:.4f}")

Intercept: 55.0355
Slope: 4.7426

You should see values very close to the manual calculation above.

Visualization

x_line = np.linspace(hours.min(), hours.max(), 100).reshape(-1, 1)
y_line = hours_model.predict(x_line)

plt.scatter(hours, grades)
plt.plot(x_line, y_line)
plt.title("Linear Regression Fit")
plt.xlabel("Hours Studied")
plt.ylabel("Grade on Exam")
plt.show()

The scatter plot shows the observed data, and the line shows the model’s predicted trend.

Model Evaluation

Under the model evaluation section for the simple regression example.

In practice, we usually want to answer two questions:

How much variation does the model explain?
How large are the prediction errors?

For that reason, we will compute both:

$R^2$
RMSE

Manual Calculation of $R^2$ and RMSE

First, compute the predictions and residuals.

y_pred = hours_model.predict(X)
residuals = y - y_pred

ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)

r2 = 1 - (ss_res / ss_tot)
rmse = np.sqrt(np.mean(residuals ** 2))

print(f"SS_res: {ss_res:.4f}")
print(f"SS_tot: {ss_tot:.4f}")
print(f"R-squared: {r2:.4f}")
print(f"RMSE: {rmse:.4f}")

SS_res: 79.1213
SS_tot: 1599.6000
R-squared: 0.9505
RMSE: 2.8129

Manual formula for $R^2$:

\[ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} \]

Where:

$SS_{res} = \sum (y_i - \hat{y}_i)^2$ → unexplained variation left by the model
$SS_{tot} = \sum (y_i - \bar{y})^2$ → total variation in the target

Manual formula for RMSE:

\[ RMSE = \sqrt{\frac{1}{n}\sum (y_i - \hat{y}_i)^2} \]

RMSE is simply the square root of the average squared residual.

Interpretation

R-Squared

$R^2$ tells us how much of the variation in the target is explained by the model:

That means:

about 95.05% of the variation in exam grades is explained by hours studied
about 4.95% is due to other factors the model does not include

Those other factors might include:

sleep
stress
prior knowledge
exam difficulty
concentration level

RMSE

RMSE tells us the typical prediction error size
It uses the same unit as the target
Here the target is grade points

RMSE $\approx$ 2.81, which means the model’s predictions are off by about 2.81 grade points on average**

Why both metrics matter

$R^2$ is useful for understanding explanatory strength
RMSE is useful for understanding practical error size
A model can have a high $R^2$ and still have prediction errors that matter in practice

x_line = np.linspace(hours.min(), hours.max(), 100).reshape(-1, 1)
y_line = hours_model.predict(x_line)

plt.scatter(hours, grades)
plt.plot(x_line, y_line)
plt.title("Linear Regression Fit")
plt.xlabel("Hours Studied")
plt.ylabel("Grade on Exam")
plt.show()

Calculation with `Scikit-learn`

We can also compute $R^2$ and RMSE using scikit-learn functions.

from sklearn.metrics import r2_score, mean_squared_error, root_mean_squared_error
r2_sklearn = r2_score(y, y_pred)
rmse_sklearn = np.sqrt(mean_squared_error(y, y_pred))

print(f"R-squared (sklearn): {r2_sklearn:.4f}")
print(f"RMSE (sklearn): {rmse_sklearn:.4f}")

R-squared (sklearn): 0.9505
RMSE (sklearn): 2.8129

Business Applications

Example 1 | Marketing Spend → Sales

Now we move from a classroom-style example to a more business-oriented dataset.

Here we have advertising data with:

tv
radio
social_media
influencer
sales

import pandas as pd

ads = pd.read_csv('../data/regression/advertising_and_sales.csv')
ads.head()

	tv	radio	social_media	influencer	sales
0	16000.0	6566.23	2907.98	Mega	54732.76
1	13000.0	9237.76	2409.57	Mega	46677.90
2	41000.0	15886.45	2913.41	Mega	150177.83
3	83000.0	30020.03	6922.30	Mega	298246.34
4	15000.0	8437.41	1406.00	Micro	56594.18

Use case

Forecast sales
Allocate marketing budget
Compare channels
Introduce multiple regression with a categorical feature

Try each channel separately

We begin with a separate simple regression for each numeric channel.

channel_results = []

for channel in ["tv", "radio", "social_media"]:
    X_channel = ads[[channel]]
    y_sales = ads["sales"]

    channel_model = LinearRegression()
    channel_model.fit(X_channel, y_sales)

    y_hat = channel_model.predict(X_channel)

    r2_channel = r2_score(y_sales, y_hat) 
    rmse_channel = root_mean_squared_error(y_sales, y_hat)

    channel_results.append({
        "channel": channel,
        "intercept": channel_model.intercept_,
        "slope": channel_model.coef_[0],
        "r_squared": r2_channel,
        "rmse": rmse_channel
    })

pd.DataFrame(channel_results).round(4)

	channel	intercept	slope	r_squared	rmse
0	tv	-132.4925	3.5615	0.9990	2948.5897
1	radio	40586.8007	8.3616	0.7545	46081.4060
2	social_media	118672.5717	22.1879	0.2782	79019.9030

Interpretation of the channel-by-channel models:

TV $\rightarrow$ Sales: TV is the strongest single-channel predictor in this sample
Radio $\rightarrow$ Sales: Radio alone still shows a positive relationship with sales
Social Media $\rightarrow$ Sales: Social media alone is also positively associated with sales

Important

Do not compare slope sizes blindly across channels, because the channels are on different scales. A slope of 40 for social media does not automatically mean social media is “better” than TV. The unit sizes differ, and the variables also move together.

Check how strongly the channels move together

ads[["tv", "radio", "social_media", "sales"]].corr().round(3)

	tv	radio	social_media	sales
tv	1.000	0.869	0.528	0.999
radio	0.869	1.000	0.606	0.869
social_media	0.528	0.606	1.000	0.527
sales	0.999	0.869	0.527	1.000

You should notice that the spend variables are strongly correlated with one another.

import plotly.express as px

# Compute correlation
corr = ads[["tv", "radio", "social_media", "sales"]].corr().round(3)

# Plot
fig = px.imshow(
    corr,
    text_auto=True,
    aspect="auto",
)

# Improve layout
fig.update_layout(
    title="Correlation Matrix",
    xaxis_title="",
    yaxis_title="",
)

fig.show()

Multiple regression using all numeric channels

Now let us include all three spend channels in the same model.

X_multi_num = ads[["tv", "radio", "social_media"]]
y_sales = ads["sales"]

multi_num_model = LinearRegression()
multi_num_model.fit(X_multi_num, y_sales)

y_hat_multi_num = multi_num_model.predict(X_multi_num)

multi_num_results = pd.DataFrame({
    "feature": ["intercept"] + list(X_multi_num.columns),
    "coefficient": [multi_num_model.intercept_] + list(multi_num_model.coef_)
}).round(4)

multi_num_r2 = r2_score(y_sales, y_hat_multi_num)
multi_num_rmse = root_mean_squared_error(y_sales, y_hat_multi_num)


print(f"R-squared: {multi_num_r2:.4f}")
print(f"RMSE: {multi_num_rmse:.2f}")

R-squared: 0.9990
RMSE: 2948.54

multi_num_results

	feature	coefficient
0	intercept	-133.9630
1	tv	3.5626
2	radio	-0.0040
3	social_media	0.0050

Interpretation:

The fitted numeric-only multiple regression is approximately:

multi_num_model.intercept_

np.float64(-133.96296784214792)

\[ \hat{Sales} = \beta_0 + \beta_1(TV) + \beta_2(Radio) + \beta_3(SocialMedia) \]

\[\downarrow\]

Sales = -133.96 + 3.56(TV) + -0.00(Radio) + 0.00(SocialMedia)

What does this mean?

Holding the other channels constant, TV remains strongly positive
Radio and social media become slightly negative in this sample
That does not automatically mean radio or social media are harmful
More likely, the predictors are overlapping so strongly that the model struggles to separate their unique contributions cleanly

This is a classic multiple-regression interpretation issue: a variable can look positive in a simple regression and unstable in a multiple regression when predictors are highly correlated.

Multiple regression with the `influencer` variable

influencer is categorical, so we need dummy variables.

We will make Mega the reference category.

ads["influencer"] = pd.Categorical(
    ads["influencer"],
    categories=["Mega", "Macro", "Micro", "Nano"]
)

X_multi = pd.get_dummies(
    ads[["tv", "radio", "social_media", "influencer"]],
    drop_first=True,
    dtype=int
)

multi_model = LinearRegression()
multi_model.fit(X_multi, y_sales)

y_hat_multi = multi_model.predict(X_multi)

multi_results = pd.DataFrame({
    "feature": ["intercept"] + list(X_multi.columns),
    "coefficient": [multi_model.intercept_] + list(multi_model.coef_)
}).round(4)

multi_r2 = r2_score(y_sales, y_hat_multi)
multi_rmse = root_mean_squared_error(y_sales, y_hat_multi)

multi_results
print(f"R-squared: {multi_r2:.4f}")
print(f"RMSE: {multi_rmse:.2f}")

R-squared: 0.9990
RMSE: 2948.31

Interpretation:

The fitted model is approximately:

\[ \hat{Sales} = {intercept} + {tv_coef}(TV) - {radio_coef}(Radio) - {social_media_coef}(SocialMedia) - 3562.83(Macro) + 3596.86(Micro) + 1501.38(Nano) \]

Where:

Mega is the baseline influencer category
Macro, Micro, and Nano are dummy adjustments relative to Mega

Model fit:

$R^2 \approx 0.9994$
RMSE $\approx 2083.33$

How to interpret the coefficients:

TV = 3.77
Holding the other variables constant, an extra $1 in TV spend is associated with about $3.77 more sales
Radio = -0.42 and Social Media = -0.38
These slightly negative coefficients should be interpreted very carefully. In this tiny sample, the predictors are highly correlated, so these signs are more likely reflecting multicollinearity and overlap than a real negative business effect
Influencer dummies
These are shifts relative to Mega
- Macro = -3562.83
- Micro = +3596.86
- Nano = +1501.38

Important caution:

Do not over-interpret the influencer coefficients here:

Macro appears only once
Micro appears twice
Nano appears twice
Mega appears seven times

So this is a good tutorial example for how to include a categorical variable, but not a strong basis for making real strategic claims.

Example 2 | Price → Demand

For this example, we generate realistic synthetic data with a negative relationship.

np.random.seed(42)

price = np.linspace(10, 100, 40)
demand = 1200 - 8 * price + np.random.normal(0, 35, size=40)

X_price = price.reshape(-1, 1)

price_model = LinearRegression()
price_model.fit(X_price, demand)

demand_pred = price_model.predict(X_price)

price_r2 = r2_score(demand, demand_pred)
price_rmse = root_mean_squared_error(demand, demand_pred)

print(f"Intercept: {price_model.intercept_:.4f}")
print(f"Slope: {price_model.coef_[0]:.4f}")
print(f"R-squared: {price_r2:.4f}")
print(f"RMSE: {price_rmse:.4f}")

Intercept: 1209.3747
Slope: -8.3096
R-squared: 0.9797
RMSE: 31.8794

Business meaning:

each $1 increase in price reduces demand by about 8.31 units
the relationship is strong
this is exactly the kind of model used in pricing strategy and elasticity-style discussions

Example 3 | Customer Age → Revenue

For this example, we generate realistic synthetic data with a positive relationship.

np.random.seed(7)

age = np.random.randint(18, 66, size=50)
revenue = 18 + 2.4 * age + np.random.normal(0, 12, size=50)

X_age = age.reshape(-1, 1)

age_model = LinearRegression()
age_model.fit(X_age, revenue)

revenue_pred = age_model.predict(X_age)

age_r2 = r2_score(revenue, revenue_pred)
age_rmse = root_mean_squared_error(revenue, revenue_pred)

print(f"Intercept: {age_model.intercept_:.4f}")
print(f"Slope: {age_model.coef_[0]:.4f}")
print(f"R-squared: {age_r2:.4f}")
print(f"RMSE: {age_rmse:.4f}")

Intercept: 18.6013
Slope: 2.3835
R-squared: 0.8834
RMSE: 12.8626

Business meaning:

each additional year of age is associated with about 2.38 more revenue units
this kind of model is useful for segmentation and value analysis
it may suggest that different age groups contribute differently to revenue

From Regression to Time Series

Key Difference

Regression	Time Series
Observations are treated as independent	Observations are ordered in time
Order is not the main issue	Order matters
Often explains relationships	Often explains and forecasts sequences

In time series analysis, yesterday can influence today, and the sequence itself matters.

Time Series Components

The main components of a time series are:

Trend → long-term upward or downward movement
Seasonality → repeating pattern
Noise → irregular variation

Not every time series contains all three clearly, but this framework helps us reason about the data.

Gold Dataset

Important

Remember to download the dataset from the Github repository and place it in the correct folder before running the code.

Dataset: gold_prices_2022_2023.csv URL
Dataset: gold_prices_2023_2024.csv URL
Dataset: gold_prices_2024_2025.csv URL
Dataset: gold_prices_2025_2026.csv URL

Source: Market Watch

Data cleaning note:

Gold price data often has missing values on weekends and holidays. The dataset we are using has been cleaned to include only business days, so there are no missing dates. This allows us to focus on the modeling aspect without needing to handle missing data in this example.

import pandas as pd
from os.path import join
PATH = '../data/regression/time_series'

(250, 5)

	Date	Open	High	Low	Close
0	2023-04-21	2,016.10	2,016.80	1,982.30	1,990.50
1	2023-04-20	2,007.70	2,024.20	2,002.20	2,019.10
2	2023-04-19	2,017.90	2,020.30	1,980.90	2,007.30
3	2023-04-18	2,007.90	2,024.60	2,003.30	2,019.70
4	2023-04-17	2,014.10	2,028.00	1,993.40	2,007.00

ts = pd.read_csv(join(PATH, 'gold_prices_2022_2023.csv'), parse_dates=['Date'])
print(ts.shape)
ts.head()

Reading and Concatenating Multiple Files

The website provides data in 1-year chunks, so we will read and concatenate multiple files to get a longer time series.

Using glob and os, we can read all the CSV files in the folder and combine them into one DataFrame.

import glob
import os
file_paths = glob.glob(join(PATH, "*.csv"))
file_paths

['../../lab/python/data/regression/time_series/gold_prices_2023_2024.csv',
 '../../lab/python/data/regression/time_series/gold_prices_2024_2025.csv',
 '../../lab/python/data/regression/time_series/gold_prices_2025_2026.csv',
 '../../lab/python/data/regression/time_series/gold_prices_2022_2023.csv']

If we are interested in just the file names without the full paths, we can extract them using os.path.basename().

file_names = [os.path.basename(file) for file in file_paths]
file_names

['gold_prices_2023_2024.csv',
 'gold_prices_2024_2025.csv',
 'gold_prices_2025_2026.csv',
 'gold_prices_2022_2023.csv']

Combining the files into a single DataFrame can be done using pd.concat().

dataframes = [pd.read_csv(file, parse_dates=['Date']) for file in file_paths]
ts_combined = pd.concat(dataframes, ignore_index=True)
print(ts_combined.shape)
ts_combined.head()

(1008, 5)

	Date	Open	High	Low	Close
0	2024-04-23	2,342.30	2,347.90	2,304.60	2,342.10
1	2024-04-22	2,403.70	2,404.30	2,338.20	2,346.40
2	2024-04-19	2,394.00	2,433.30	2,386.80	2,413.80
3	2024-04-18	2,377.90	2,408.00	2,377.20	2,398.00
4	2024-04-17	2,398.00	2,412.00	2,370.70	2,388.40

ts_combined.dtypes

Date     datetime64[us]
Open                str
High                str
Low                 str
Close               str
dtype: object

ts_combined["Close"] = (
    ts_combined["Close"]
    .astype(str)
    .str.replace(",", "", regex=False)
    .str.strip()
)

ts_combined["Close"] = pd.to_numeric(ts_combined["Close"], errors="coerce")

Visualization

Befor visualizing, we should sort the combined DataFrame by date to ensure the time series is in the correct order.

ts_combined = ts_combined.sort_values(by='Date')

import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from matplotlib.ticker import MaxNLocator


plt.plot(
    ts_combined["Date"],
    ts_combined["Close"],
    linewidth=2,
    marker="o",
    markersize=3,
    alpha=0.85
)

plt.title("Close Price Over Time", fontsize=18, pad=15)
plt.xlabel("Date", fontsize=13)
plt.ylabel("Close", fontsize=13)

# Improve x-axis date formatting
plt.gca().xaxis.set_major_locator(mdates.MonthLocator(interval=3))
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m"))

# Reduce number of y-axis labels
plt.gca().yaxis.set_major_locator(MaxNLocator(nbins=8))

plt.xticks(rotation=45)
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

As you can see, the gold price data shows a clear upward trend over the two-year period, with some fluctuations along the way. This kind of pattern is common in financial time series, where we often see a combination of trend and noise. In the next sections, we will explore how to model this time series and make forecasts based on it.

Trend Analysis (Regression on Time)

Idea

Treat time as X:

\[ Close = \beta_0 + \beta_1 \cdot Time \]

This is not yet a full time-series model, but it is a very useful bridge from regression to forecasting.

Python Example

Create a time index variable that counts the business days from the start of the series
Fit a linear regression model using this time index as the predictor
Evaluate the model fit using $R^2$ and RMSE

ts_combined["time_index"] = np.arange(1, len(ts_combined) + 1)

X_time = ts_combined[["time_index"]]
y_time = ts_combined["Close"]

trend_model = LinearRegression()
trend_model.fit(X_time, y_time)

trend_pred = trend_model.predict(X_time)

trend_r2 = r2_score(y_time, trend_pred)
trend_rmse = np.sqrt(mean_squared_error(y_time, trend_pred))

Interpretation:

For this short sample:

print(f"Intercept: {trend_model.intercept_:.2f}")
print(f"Slope: {trend_model.coef_[0]:.4f}")
print(f"R-squared: {trend_r2:.4f}")
print(f"RMSE: {trend_rmse:.4f}")

Intercept: 1152.71
Slope: 2.9477
R-squared: 0.8247
RMSE: 395.4914

Meaning:

on average, the adjusted close increases by about 2.9477 per business-day step**
the upward trend is present, but not all variation is explained by a straight line
this is expected, because market prices are noisy and time dependence matters

ARIMA

ARIMA is a classic forecasting model for ordered data.

AR → autoregressive part, which uses past values
I → integrated part, which applies differencing
MA → moving-average part, which uses past forecast errors

In simple terms:

ARIMA helps us model time dependence
differencing helps stabilize a series when the level changes over time
the model can then be used for short-run forecasting

The model is written as:

\[ ARIMA(p, d, q) \]

where:

$p$ controls how many past values are used.
$d$ controls how many times the series is differenced.
$q$ controls how many past forecast errors are used.

So, an ARIMA model combines three ideas:

Use past values of the series.
Transform the series into a more stable form.
Use past forecast errors to improve future predictions.

The $p$ Parameter: AutoRegressive Lags

The first parameter, $p$, represents the number of autoregressive lags.

This means the model uses previous values of the time series to predict the current value.

For example, if we use an AR model with one lag, the model assumes that today’s value depends partly on yesterday’s value:

\[ y_t = c + \phi_1 y_{t-1} + \varepsilon_t \]

Here:

$y_t$ is the value at time $t$.
$y_{t-1}$ is the value one period before.
$\phi_1$ measures how strongly the previous value influences the current value.
$\varepsilon_t$ is the remaining unexplained error.

If $p = 1$, the model uses one previous value.

If $p = 2$, the model uses two previous values:

\[ y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \varepsilon_t \]

In simple terms, $p$ answers the question:

How many previous observations should the model remember?

The $d$ Parameter: Differencing

The second parameter, $d$, represents the number of times the series is differenced.

Many real-world time series are not stationary. A stationary series has a relatively stable mean and variance over time. However, stock prices, sales, revenue, population, and website traffic often have trends.

ARIMA works best when the series is stationary, so we often transform the original series before modeling it.

Differencing means subtracting the previous value from the current value:

\[ \Delta y_t = y_t - y_{t-1} \]

Instead of modeling the original values, the model uses the change from one period to the next.

For example, suppose we have this series:

Time	Value
1	100
2	110
3	125
4	140

The first difference is:

Time	Difference
2	10
3	15
4	15

So instead of asking:

What will the next value be?

the differenced model asks:

What will the next change be?

If $d = 0$, the model uses the original series.

If $d = 1$, the model uses first differences.

If $d = 2$, the model differences the series twice, meaning it models the change in the changes.

In simple terms, $d$ answers the question:

How many times do we need to remove trend from the data?

The $q$ Parameter: Moving-Average Lags

The third parameter, $q$, represents the number of moving-average lags.

This part of the model uses past forecast errors.

A forecast error is the difference between the actual value and the predicted value:

\[ \varepsilon_t = y_t - \hat{y}_t \]

If the model made a mistake yesterday, that mistake may contain useful information for today’s forecast.

For example, an MA(1) model can be written as:

\[ y_t = c + \varepsilon_t + \theta_1 \varepsilon_{t-1} \]

This means today’s value depends partly on today’s random shock and yesterday’s forecast error.

If $q = 1$, the model uses one previous forecast error.

If $q = 2$, the model uses two previous forecast errors.

In simple terms, $q$ answers the question:

How many previous forecast mistakes should the model use to correct itself?

Interpreting ARIMA(1, 1, 1)

In this lesson, we use:

\[ ARIMA(1, 1, 1) \]

This means:

Component	Meaning	Interpretation
$p = 1$	One autoregressive lag	The model uses the previous value.
$d = 1$	One difference	The model models changes rather than raw values.
$q = 1$	One moving-average lag	The model uses the previous forecast error.

So ARIMA(1, 1, 1) means:

First, difference the series once to remove trend. Then model the differenced series using one lag of the series and one lag of the forecast error.

Conceptually, the model says:

The next change in the series depends on the previous change and the previous forecasting mistake.

This is why ARIMA(1, 1, 1) is often a useful practical demonstration model. It is simple enough to explain, but it already contains all three main ARIMA ideas: memory, differencing, and error correction.

Python Example

from statsmodels.tsa.arima.model import ARIMA

ts_arima = ts_combined.set_index("Date").asfreq("B")

arima_model = ARIMA(ts_arima["Close"], order=(1, 1, 1))
arima_fit = arima_model.fit()

arima_forecast = arima_fit.forecast(steps=3)

For this short dataset, the next 3 business-day forecasts are approximately:

arima_forecast

2026-04-27    4742.197622
2026-04-28    4742.974111
2026-04-29    4743.438757
Freq: B, Name: predicted_mean, dtype: float64

Note

The statsmodels package is often preinstalled in many environments, but if you get ModuleNotFoundError, install it with:

pip install statsmodels

Prophet

Prophet is a forecasting package for Python and R developed by Facebook.

It is popular because:

it is easy to use
it works naturally with date-based data
it can handle trend changes
it can model seasonality and holiday effects when enough history is available
it provides an intuitive forecasting workflow

For this very short dataset, we use Prophet mainly to demonstrate the workflow correctly.

Python Example

from prophet import Prophet

df_prophet = ts_combined.rename(columns={"Date": "ds", "Close": "y"})

prophet_model = Prophet()
prophet_model.fit(df_prophet)

future = prophet_model.make_future_dataframe(periods=3, freq="B")

forecast = prophet_model.predict(future)

forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail(3)

	ds	yhat	yhat_lower	yhat_upper
1008	2026-04-27	5146.627474	5008.323560	5282.824426
1009	2026-04-28	5153.177394	5016.647241	5286.196621
1010	2026-04-29	5161.716647	5036.371597	5294.723632

Note

If you get ModuleNotFoundError, install Prophet with:

pip install prophet

Visualization

prophet_model.plot(forecast)
plt.title("Gold Futures (GC00) - Close Price Forecast")
plt.xlabel("Date")
plt.ylabel("Close Price")
plt.show()

prophet_model.plot_components(forecast)
plt.show()

How to read the Prophet output:

the main forecast plot shows the historical data, the fitted forecast line, and the uncertainty interval
the component plots break the forecast into interpretable pieces such as trend and seasonality
with this short business-day sample, the trend is the most useful part to discuss
with a longer history, Prophet becomes much more useful for storytelling around trend changes and seasonal patterns

Extra

Tip

Whatch this video for regularization
Read this for regularization

Overview

Learning Objectives

Intuition of Relationships

Visual Thinking

Linear Model

Interpretation

Manual Linear Regression

Optimization Problem

Coefficient Formula

Step-by-Step Example

Python | Manual Calculation

Predicting next step

Regression in Python scikit-learn

Installing Scikit Learn package

Importing LinearRegression

Visualization

Model Evaluation

Manual Calculation of \(R^2\) and RMSE

Interpretation

Calculation with Scikit-learn

Business Applications

Example 1 | Marketing Spend → Sales

Try each channel separately

Check how strongly the channels move together

Multiple regression using all numeric channels

Multiple regression with the influencer variable

Example 2 | Price → Demand

Example 3 | Customer Age → Revenue

From Regression to Time Series

Key Difference

Time Series Components

Gold Dataset

Reading and Concatenating Multiple Files

Visualization

Trend Analysis (Regression on Time)

Idea

Python Example

ARIMA

The \(p\) Parameter: AutoRegressive Lags

The \(d\) Parameter: Differencing

The \(q\) Parameter: Moving-Average Lags

Interpreting ARIMA(1, 1, 1)

Python Example

Prophet

Python Example

Visualization

Extra

Regression in Python `scikit-learn`

Importing `LinearRegression`

Calculation with `Scikit-learn`

Multiple regression with the `influencer` variable