Data Analytics Bootcamp
  • Syllabus
  • Statistical Thinking
  • SQL
  • Python
  • Tableau
  • Lab
  • Capstone
  1. Python
  2. Python
  3. Session 12: Simple Linear Regression and Forecasting
  • Syllabus
  • Statistical Thinking
    • Statistics
      • Statistics Session 01: Data Layers and Bias in Data
      • Statistics Session 02: Data Types
      • Statistics Session 03: Probabilistic Distributions
      • Statistics Session 04: Probabilistic Distributions
      • Statistics Session 05: Sampling
      • Statistics Session 06: Inferential Statistics
      • Slides
        • Course Intro
        • Descriptive Stats
        • Data Types
        • Continuous Distributions
        • Discrete Distributions
        • Sampling
        • Hypothesis Testing
  • SQL
    • SQL
      • Session 01: Intro to Relational Databases
      • Session 02: Intro to PostgreSQL
      • Session 03: DA with SQL | Data Types & Constraints
      • Session 04: DA with SQL | Filtering
      • Session 05: DA with SQL | Numeric Functions
      • Session 06: DA with SQL | String Functions
      • Session 07: DA with SQL | Date Functions
      • Session 08: DA with SQL | JOINs
      • Session 09: DA with SQL | Advanced SQL
      • Session 10: DA with SQL | Advanced SQL Functions
      • Session 11: DA with SQL | UDFs, Stored Procedures
      • Session 12: DA with SQL | Advanced Aggregations
      • Session 13: DA with SQL | Final Project
      • Slides
        • Intro to Relational Databases
        • Intro to PostgreSQL
        • Basic Queries: DDL DLM
        • Filtering
        • Numeric Functions
        • String Functions
        • Date Functions
        • Normalization and JOINs
        • Temporary Tables
        • Advanced SQL Functions
        • Reporting and Analysis with SQL
        • Advanced Aggregations
  • Python
    • Python
      • Session 01: Programming for Data Analysts
      • Session 02: Python basic Syntax, Data Structures
      • Session 03: Introduction to Pandas
      • Session 04: Advanced Pandas
      • Session 05: Intro to Data Visualization
      • Session 06: Data Visualization
      • Session 07: Working with Dates
      • Session 08: Data Visualization | Plotly
      • Session 09: Customer Segmentation | RFM
      • Session 10: A/B Testing
      • Session 11: Cohort Analysis
      • Session 12: Simple Linear Regression and Forecasting
      • Session 13: Logistic Regression
      • Session 14: Clustering
      • Session 15: Geoanalytics
      • Session 16: SQL Alchemy
      • Slides
        • Grammar of Graphics
        • Data Analyst
  • Tableau
    • Tableau
      • Session 01: Introduction to Tableau
      • Session 02: Intermediate Visual Analytics
      • Session 03: Advanced Analytics
      • Session 04: Dashboard Design & Performance
      • Session 05: Sales Analysis Dashboard
      • Session 06: Customer Analysis Dashboard
      • Session 07: Spatial Analytics
      • Slides
        • Data Analyst
        • Data Analyst
        • Data Analyst
        • Data Analyst

On this page

  • Overview
  • Learning Objectives
  • Intuition of Relationships
    • Visual Thinking
    • Linear Model
    • Interpretation
  • Manual Linear Regression
    • Optimization Problem
    • Coefficient Formula
    • Step-by-Step Example
    • Python | Manual Calculation
    • Predicting next step
  • Regression in Python scikit-learn
    • Installing Scikit Learn package
    • Importing LinearRegression
    • Visualization
  • Model Evaluation
    • Manual Calculation of \(R^2\) and RMSE
    • Interpretation
    • Calculation with Scikit-learn
  • Business Applications
    • Example 1 | Marketing Spend → Sales
    • Example 2 | Price → Demand
    • Example 3 | Customer Age → Revenue
  • From Regression to Time Series
    • Key Difference
    • Time Series Components
    • Gold Dataset
    • Reading and Concatenating Multiple Files
    • Visualization
    • Trend Analysis (Regression on Time)
    • Python Example
  • ARIMA
    • Interpreting ARIMA(1, 1, 1)
    • Python Example
  • Prophet
    • Python Example
    • Visualization
  • Extra
  1. Python
  2. Python
  3. Session 12: Simple Linear Regression and Forecasting

Session 12: Simple Linear Regression and Forecasting

Simple Linear Regression

Overview

This module introduces one of the most important concepts in data analytics: understanding relationships and forecasting trends.

We will go from:

  • Intuition \(\rightarrow\) how relationships work
  • Manual calculations \(\rightarrow\) how models are built
  • Python implementation \(\rightarrow\) how analysts actually use it
  • Business applications \(\rightarrow\) how decisions are made
  • Time series forecasting \(\rightarrow\) how to predict the future

Learning Objectives

After this session, you will be able to:

  • Understand what linear regression is conceptually
  • Manually compute regression coefficients
  • Interpret regression outputs in a business context
  • Build regression models in Python
  • Apply regression for trend analysis
  • Understand time series structure
  • Perform basic forecasting using ARIMA and Prophet

Intuition of Relationships

In analytics, we often ask: “If X changes \(\rightarrow\) what happens to Y?”

Examples:

  • Marketing spend \(\rightarrow\) Sales
  • Price \(\rightarrow\) Demand
  • Discount \(\rightarrow\) Conversion rate
  • Hours studied \(\rightarrow\) Exam score

Visual Thinking

We will use the hours studied vs grade example from the picture you provided.

Hours Studied Grade on Exam
2 69
9 98
5 82
5 77
3 71
7 84
1 55
8 94
6 84
2 64
import numpy as np
import matplotlib.pyplot as plt

hours = np.array([2, 9, 5, 5, 3, 7, 1, 8, 6, 2], dtype=float)
grades = np.array([69, 98, 82, 77, 71, 84, 55, 94, 84, 64], dtype=float)

plt.scatter(hours, grades)
plt.xlabel("Hours Studied")
plt.ylabel("Grade on Exam")
plt.title("Hours Studied vs Grade on Exam")
plt.show()

We observe:

  • As hours studied increase, exam grades also tend to increase
  • The relationship is positive
  • The data does not fall perfectly on one line, which is normal in real life
  • This makes the example ideal for introducing both intuition and model evaluation

Suggested use of the picture you shared:

Keep the screenshot as the raw real-world input at the start of this section, then place the clean table and the scatter plot right after it. It works especially well here because students can visually understand the pattern before seeing any formulas.

Linear Model

The prediction equation for a simple linear regression is:

\[ \hat{Y} = \beta_0 + \beta_1 X \]

A more complete statistical view is:

\[ Y = \beta_0 + \beta_1 X + \varepsilon \]

Where:

  • \(Y\) is the actual observed outcome
  • \(\hat{Y}\) is the predicted outcome
  • \(X\) is the input variable
  • \(\beta_0\) is the intercept
  • \(\beta_1\) is the slope
  • \(\varepsilon\) is the part of the outcome the model does not explain

In this lesson:

  • \(X\) = hours studied
  • \(Y\) = grade on exam

Interpretation

  • \(\beta_1\) tells us how much \(Y\) changes when \(X\) increases by 1 unit
  • \(\beta_0\) tells us the predicted value of \(Y\) when \(X = 0\)

Example:

  • If \(\beta_1 = 5\), then each extra hour studied is associated with about 5 extra grade points
  • If \(\beta_0 = 55\), then the model predicts a grade of 55 for a student who studied 0 hours

Manual Linear Regression

Our goal Find the best line that fits the data.

Optimization Problem

We minimize:

\[ \sum (y_i - \hat{y}_i)^2 \]

This is called Least Squares.

To understand it:

  • \(y_i\) is the actual value
  • \(\hat{y}_i\) is the predicted value
  • \((y_i - \hat{y}_i)\) is the residual, or prediction error
  • Squaring makes all errors positive and penalizes larger errors more strongly

So the model is searching for the line that leaves the smallest total squared error.

Coefficient Formula

Slope:

\[ \beta_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \]

Intercept:

\[ \beta_0 = \bar{y} - \beta_1 \bar{x} \]

Step-by-Step Example

We will use the same hours-studied dataset from the picture.

Step 1: Compute the means

\[ \bar{x} = 4.8 \qquad \bar{y} = 77.8 \]

Step 2: Compute the numerator and denominator for the slope

\[ \sum (x_i - \bar{x})(y_i - \bar{y}) = 320.6 \]

\[ \sum (x_i - \bar{x})^2 = 67.6 \]

Step 3: Compute the slope

\[ \beta_1 = \frac{320.6}{67.6} \approx 4.7426 \]

Step 4: Compute the intercept

\[ \beta_0 = 77.8 - (4.7426 \times 4.8) \approx 55.0355 \]

So the fitted line is:

\[ \hat{Y} = 55.04 + 4.74X \]

This means:

  • the baseline predicted grade is about 55.04
  • each extra study hour adds about 4.74 grade points

Python | Manual Calculation

x = hours
y = grades

x_mean = x.mean()
y_mean = y.mean()

beta1 = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
beta0 = y_mean - beta1 * x_mean

print(f"Mean of X: {x_mean:.2f}")
print(f"Mean of Y: {y_mean:.2f}")
print(f"Intercept (β0): {beta0:.4f}")
print(f"Slope (β1): {beta1:.4f}")
print(f"Fitted line: y_hat = {beta0:.2f} + {beta1:.2f}x")
Mean of X: 4.80
Mean of Y: 77.80
Intercept (β0): 55.0355
Slope (β1): 4.7426
Fitted line: y_hat = 55.04 + 4.74x

Expected interpretation:

  • \(\beta_0 \approx 55.04\)
  • \(\beta_1 \approx 4.74\)

Predicting next step

Now that we have the fitted line, we can predict the grade for a new value of \(X\).

For example, if a student studies for 4 hours:

new_hours = 4
predicted_grade = beta0 + beta1 * new_hours

print(f"Predicted grade for {new_hours} hours: {predicted_grade:.2f}")
Predicted grade for 4 hours: 74.01

Interpretation:

For a student who studied 4 hours, the model predicts a grade of about 74.01.

Regression in Python scikit-learn

Now let us build the same model using scikit-learn.

Installing Scikit Learn package

pip install -U scikit-learn
Important

Make sure to be in the right environment. In our case it is aca

Importing LinearRegression

from sklearn.linear_model import LinearRegression
X = hours.reshape(-1, 1)
y = grades

hours_model = LinearRegression()
hours_model.fit(X, y)

print(f"Intercept: {hours_model.intercept_:.4f}")
print(f"Slope: {hours_model.coef_[0]:.4f}")
Intercept: 55.0355
Slope: 4.7426

You should see values very close to the manual calculation above.

Visualization

x_line = np.linspace(hours.min(), hours.max(), 100).reshape(-1, 1)
y_line = hours_model.predict(x_line)

plt.scatter(hours, grades)
plt.plot(x_line, y_line)
plt.title("Linear Regression Fit")
plt.xlabel("Hours Studied")
plt.ylabel("Grade on Exam")
plt.show()

The scatter plot shows the observed data, and the line shows the model’s predicted trend.

Model Evaluation

Under the model evaluation section for the simple regression example.

In practice, we usually want to answer two questions:

  • How much variation does the model explain?
  • How large are the prediction errors?

For that reason, we will compute both:

  • \(R^2\)
  • RMSE

Manual Calculation of \(R^2\) and RMSE

First, compute the predictions and residuals.

y_pred = hours_model.predict(X)
residuals = y - y_pred

ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)

r2 = 1 - (ss_res / ss_tot)
rmse = np.sqrt(np.mean(residuals ** 2))

print(f"SS_res: {ss_res:.4f}")
print(f"SS_tot: {ss_tot:.4f}")
print(f"R-squared: {r2:.4f}")
print(f"RMSE: {rmse:.4f}")
SS_res: 79.1213
SS_tot: 1599.6000
R-squared: 0.9505
RMSE: 2.8129

Manual formula for \(R^2\):

\[ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} \]

Where:

  • \(SS_{res} = \sum (y_i - \hat{y}_i)^2\) → unexplained variation left by the model
  • \(SS_{tot} = \sum (y_i - \bar{y})^2\) → total variation in the target

Manual formula for RMSE:

\[ RMSE = \sqrt{\frac{1}{n}\sum (y_i - \hat{y}_i)^2} \]

RMSE is simply the square root of the average squared residual.

Interpretation

R-Squared

\(R^2\) tells us how much of the variation in the target is explained by the model:

That means:

  • about 95.05% of the variation in exam grades is explained by hours studied
  • about 4.95% is due to other factors the model does not include

Those other factors might include:

  • sleep
  • stress
  • prior knowledge
  • exam difficulty
  • concentration level

RMSE

  • RMSE tells us the typical prediction error size
  • It uses the same unit as the target
  • Here the target is grade points

RMSE \(\approx\) 2.81, which means the model’s predictions are off by about 2.81 grade points on average**

Why both metrics matter

  • \(R^2\) is useful for understanding explanatory strength
  • RMSE is useful for understanding practical error size
  • A model can have a high \(R^2\) and still have prediction errors that matter in practice
x_line = np.linspace(hours.min(), hours.max(), 100).reshape(-1, 1)
y_line = hours_model.predict(x_line)

plt.scatter(hours, grades)
plt.plot(x_line, y_line)
plt.title("Linear Regression Fit")
plt.xlabel("Hours Studied")
plt.ylabel("Grade on Exam")
plt.show()

Calculation with Scikit-learn

We can also compute \(R^2\) and RMSE using scikit-learn functions.

from sklearn.metrics import r2_score, mean_squared_error, root_mean_squared_error
r2_sklearn = r2_score(y, y_pred)
rmse_sklearn = np.sqrt(mean_squared_error(y, y_pred))
print(f"R-squared (sklearn): {r2_sklearn:.4f}")
print(f"RMSE (sklearn): {rmse_sklearn:.4f}")
R-squared (sklearn): 0.9505
RMSE (sklearn): 2.8129

Business Applications

Example 1 | Marketing Spend → Sales

Now we move from a classroom-style example to a more business-oriented dataset.

Here we have advertising data with:

  • tv
  • radio
  • social_media
  • influencer
  • sales
import pandas as pd

ads = pd.read_csv('../data/regression/advertising_and_sales.csv')
ads.head()
tv radio social_media influencer sales
0 16000.0 6566.23 2907.98 Mega 54732.76
1 13000.0 9237.76 2409.57 Mega 46677.90
2 41000.0 15886.45 2913.41 Mega 150177.83
3 83000.0 30020.03 6922.30 Mega 298246.34
4 15000.0 8437.41 1406.00 Micro 56594.18
ImportantUse case
  • Forecast sales
  • Allocate marketing budget
  • Compare channels
  • Introduce multiple regression with a categorical feature

Try each channel separately

We begin with a separate simple regression for each numeric channel.

channel_results = []

for channel in ["tv", "radio", "social_media"]:
    X_channel = ads[[channel]]
    y_sales = ads["sales"]

    channel_model = LinearRegression()
    channel_model.fit(X_channel, y_sales)

    y_hat = channel_model.predict(X_channel)

    r2_channel = r2_score(y_sales, y_hat) 
    rmse_channel = root_mean_squared_error(y_sales, y_hat)

    channel_results.append({
        "channel": channel,
        "intercept": channel_model.intercept_,
        "slope": channel_model.coef_[0],
        "r_squared": r2_channel,
        "rmse": rmse_channel
    })

pd.DataFrame(channel_results).round(4)
channel intercept slope r_squared rmse
0 tv -132.4925 3.5615 0.9990 2948.5897
1 radio 40586.8007 8.3616 0.7545 46081.4060
2 social_media 118672.5717 22.1879 0.2782 79019.9030

Interpretation of the channel-by-channel models:

  • TV \(\rightarrow\) Sales: TV is the strongest single-channel predictor in this sample
  • Radio \(\rightarrow\) Sales: Radio alone still shows a positive relationship with sales
  • Social Media \(\rightarrow\) Sales: Social media alone is also positively associated with sales
Important

Do not compare slope sizes blindly across channels, because the channels are on different scales. A slope of 40 for social media does not automatically mean social media is “better” than TV. The unit sizes differ, and the variables also move together.

Check how strongly the channels move together

ads[["tv", "radio", "social_media", "sales"]].corr().round(3)
tv radio social_media sales
tv 1.000 0.869 0.528 0.999
radio 0.869 1.000 0.606 0.869
social_media 0.528 0.606 1.000 0.527
sales 0.999 0.869 0.527 1.000

You should notice that the spend variables are strongly correlated with one another.

import plotly.express as px

# Compute correlation
corr = ads[["tv", "radio", "social_media", "sales"]].corr().round(3)

# Plot
fig = px.imshow(
    corr,
    text_auto=True,
    aspect="auto",
)

# Improve layout
fig.update_layout(
    title="Correlation Matrix",
    xaxis_title="",
    yaxis_title="",
)

fig.show()

Multiple regression using all numeric channels

Now let us include all three spend channels in the same model.

X_multi_num = ads[["tv", "radio", "social_media"]]
y_sales = ads["sales"]

multi_num_model = LinearRegression()
multi_num_model.fit(X_multi_num, y_sales)

y_hat_multi_num = multi_num_model.predict(X_multi_num)

multi_num_results = pd.DataFrame({
    "feature": ["intercept"] + list(X_multi_num.columns),
    "coefficient": [multi_num_model.intercept_] + list(multi_num_model.coef_)
}).round(4)

multi_num_r2 = r2_score(y_sales, y_hat_multi_num)
multi_num_rmse = root_mean_squared_error(y_sales, y_hat_multi_num)


print(f"R-squared: {multi_num_r2:.4f}")
print(f"RMSE: {multi_num_rmse:.2f}")
R-squared: 0.9990
RMSE: 2948.54
multi_num_results
feature coefficient
0 intercept -133.9630
1 tv 3.5626
2 radio -0.0040
3 social_media 0.0050

Interpretation:

The fitted numeric-only multiple regression is approximately:

multi_num_model.intercept_
np.float64(-133.96296784214792)

\[ \hat{Sales} = \beta_0 + \beta_1(TV) + \beta_2(Radio) + \beta_3(SocialMedia) \]

\[\downarrow\]

Sales = -133.96 + 3.56(TV) + -0.00(Radio) + 0.00(SocialMedia)

What does this mean?

  • Holding the other channels constant, TV remains strongly positive
  • Radio and social media become slightly negative in this sample
  • That does not automatically mean radio or social media are harmful
  • More likely, the predictors are overlapping so strongly that the model struggles to separate their unique contributions cleanly

This is a classic multiple-regression interpretation issue: a variable can look positive in a simple regression and unstable in a multiple regression when predictors are highly correlated.

Multiple regression with the influencer variable

influencer is categorical, so we need dummy variables.

We will make Mega the reference category.

ads["influencer"] = pd.Categorical(
    ads["influencer"],
    categories=["Mega", "Macro", "Micro", "Nano"]
)

X_multi = pd.get_dummies(
    ads[["tv", "radio", "social_media", "influencer"]],
    drop_first=True,
    dtype=int
)

multi_model = LinearRegression()
multi_model.fit(X_multi, y_sales)

y_hat_multi = multi_model.predict(X_multi)

multi_results = pd.DataFrame({
    "feature": ["intercept"] + list(X_multi.columns),
    "coefficient": [multi_model.intercept_] + list(multi_model.coef_)
}).round(4)

multi_r2 = r2_score(y_sales, y_hat_multi)
multi_rmse = root_mean_squared_error(y_sales, y_hat_multi)

multi_results
print(f"R-squared: {multi_r2:.4f}")
print(f"RMSE: {multi_rmse:.2f}")
R-squared: 0.9990
RMSE: 2948.31

Interpretation:

The fitted model is approximately:

\[ \hat{Sales} = {intercept} + {tv_coef}(TV) - {radio_coef}(Radio) - {social_media_coef}(SocialMedia) - 3562.83(Macro) + 3596.86(Micro) + 1501.38(Nano) \]

Where:

  • Mega is the baseline influencer category
  • Macro, Micro, and Nano are dummy adjustments relative to Mega

Model fit:

  • \(R^2 \approx 0.9994\)
  • RMSE \(\approx 2083.33\)

How to interpret the coefficients:

  • TV = 3.77
    Holding the other variables constant, an extra $1 in TV spend is associated with about $3.77 more sales

  • Radio = -0.42 and Social Media = -0.38
    These slightly negative coefficients should be interpreted very carefully. In this tiny sample, the predictors are highly correlated, so these signs are more likely reflecting multicollinearity and overlap than a real negative business effect

  • Influencer dummies
    These are shifts relative to Mega

    • Macro = -3562.83
    • Micro = +3596.86
    • Nano = +1501.38

Important caution:

Do not over-interpret the influencer coefficients here:

  • Macro appears only once
  • Micro appears twice
  • Nano appears twice
  • Mega appears seven times

So this is a good tutorial example for how to include a categorical variable, but not a strong basis for making real strategic claims.

Example 2 | Price → Demand

For this example, we generate realistic synthetic data with a negative relationship.

np.random.seed(42)

price = np.linspace(10, 100, 40)
demand = 1200 - 8 * price + np.random.normal(0, 35, size=40)

X_price = price.reshape(-1, 1)

price_model = LinearRegression()
price_model.fit(X_price, demand)

demand_pred = price_model.predict(X_price)

price_r2 = r2_score(demand, demand_pred)
price_rmse = root_mean_squared_error(demand, demand_pred)

print(f"Intercept: {price_model.intercept_:.4f}")
print(f"Slope: {price_model.coef_[0]:.4f}")
print(f"R-squared: {price_r2:.4f}")
print(f"RMSE: {price_rmse:.4f}")
Intercept: 1209.3747
Slope: -8.3096
R-squared: 0.9797
RMSE: 31.8794

Business meaning:

  • each $1 increase in price reduces demand by about 8.31 units
  • the relationship is strong
  • this is exactly the kind of model used in pricing strategy and elasticity-style discussions

Example 3 | Customer Age → Revenue

For this example, we generate realistic synthetic data with a positive relationship.

np.random.seed(7)

age = np.random.randint(18, 66, size=50)
revenue = 18 + 2.4 * age + np.random.normal(0, 12, size=50)

X_age = age.reshape(-1, 1)

age_model = LinearRegression()
age_model.fit(X_age, revenue)

revenue_pred = age_model.predict(X_age)

age_r2 = r2_score(revenue, revenue_pred)
age_rmse = root_mean_squared_error(revenue, revenue_pred)

print(f"Intercept: {age_model.intercept_:.4f}")
print(f"Slope: {age_model.coef_[0]:.4f}")
print(f"R-squared: {age_r2:.4f}")
print(f"RMSE: {age_rmse:.4f}")
Intercept: 18.6013
Slope: 2.3835
R-squared: 0.8834
RMSE: 12.8626

Business meaning:

  • each additional year of age is associated with about 2.38 more revenue units
  • this kind of model is useful for segmentation and value analysis
  • it may suggest that different age groups contribute differently to revenue

From Regression to Time Series

Key Difference

Regression Time Series
Observations are treated as independent Observations are ordered in time
Order is not the main issue Order matters
Often explains relationships Often explains and forecasts sequences

In time series analysis, yesterday can influence today, and the sequence itself matters.

Time Series Components

The main components of a time series are:

  • Trend → long-term upward or downward movement
  • Seasonality → repeating pattern
  • Noise → irregular variation

Not every time series contains all three clearly, but this framework helps us reason about the data.

Gold Dataset

Important

Remember to download the dataset from the Github repository and place it in the correct folder before running the code.

  • Dataset: gold_prices_2022_2023.csv URL
  • Dataset: gold_prices_2023_2024.csv URL
  • Dataset: gold_prices_2024_2025.csv URL
  • Dataset: gold_prices_2025_2026.csv URL

Source: Market Watch

Data cleaning note:

Gold price data often has missing values on weekends and holidays. The dataset we are using has been cleaned to include only business days, so there are no missing dates. This allows us to focus on the modeling aspect without needing to handle missing data in this example.

import pandas as pd
from os.path import join
PATH = '../data/regression/time_series'
(250, 5)
Date Open High Low Close
0 2023-04-21 2,016.10 2,016.80 1,982.30 1,990.50
1 2023-04-20 2,007.70 2,024.20 2,002.20 2,019.10
2 2023-04-19 2,017.90 2,020.30 1,980.90 2,007.30
3 2023-04-18 2,007.90 2,024.60 2,003.30 2,019.70
4 2023-04-17 2,014.10 2,028.00 1,993.40 2,007.00
ts = pd.read_csv(join(PATH, 'gold_prices_2022_2023.csv'), parse_dates=['Date'])
print(ts.shape)
ts.head()

Reading and Concatenating Multiple Files

The website provides data in 1-year chunks, so we will read and concatenate multiple files to get a longer time series.

Using glob and os, we can read all the CSV files in the folder and combine them into one DataFrame.

import glob
import os
file_paths = glob.glob(join(PATH, "*.csv"))
file_paths
['../../lab/python/data/regression/time_series/gold_prices_2023_2024.csv',
 '../../lab/python/data/regression/time_series/gold_prices_2024_2025.csv',
 '../../lab/python/data/regression/time_series/gold_prices_2025_2026.csv',
 '../../lab/python/data/regression/time_series/gold_prices_2022_2023.csv']

If we are interested in just the file names without the full paths, we can extract them using os.path.basename().

file_names = [os.path.basename(file) for file in file_paths]
file_names
['gold_prices_2023_2024.csv',
 'gold_prices_2024_2025.csv',
 'gold_prices_2025_2026.csv',
 'gold_prices_2022_2023.csv']

Combining the files into a single DataFrame can be done using pd.concat().

dataframes = [pd.read_csv(file, parse_dates=['Date']) for file in file_paths]
ts_combined = pd.concat(dataframes, ignore_index=True)
print(ts_combined.shape)
ts_combined.head()
(1008, 5)
Date Open High Low Close
0 2024-04-23 2,342.30 2,347.90 2,304.60 2,342.10
1 2024-04-22 2,403.70 2,404.30 2,338.20 2,346.40
2 2024-04-19 2,394.00 2,433.30 2,386.80 2,413.80
3 2024-04-18 2,377.90 2,408.00 2,377.20 2,398.00
4 2024-04-17 2,398.00 2,412.00 2,370.70 2,388.40
ts_combined.dtypes
Date     datetime64[us]
Open                str
High                str
Low                 str
Close               str
dtype: object
ts_combined["Close"] = (
    ts_combined["Close"]
    .astype(str)
    .str.replace(",", "", regex=False)
    .str.strip()
)

ts_combined["Close"] = pd.to_numeric(ts_combined["Close"], errors="coerce")

Visualization

Befor visualizing, we should sort the combined DataFrame by date to ensure the time series is in the correct order.

ts_combined = ts_combined.sort_values(by='Date')
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from matplotlib.ticker import MaxNLocator


plt.plot(
    ts_combined["Date"],
    ts_combined["Close"],
    linewidth=2,
    marker="o",
    markersize=3,
    alpha=0.85
)

plt.title("Close Price Over Time", fontsize=18, pad=15)
plt.xlabel("Date", fontsize=13)
plt.ylabel("Close", fontsize=13)

# Improve x-axis date formatting
plt.gca().xaxis.set_major_locator(mdates.MonthLocator(interval=3))
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m"))

# Reduce number of y-axis labels
plt.gca().yaxis.set_major_locator(MaxNLocator(nbins=8))

plt.xticks(rotation=45)
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

As you can see, the gold price data shows a clear upward trend over the two-year period, with some fluctuations along the way. This kind of pattern is common in financial time series, where we often see a combination of trend and noise. In the next sections, we will explore how to model this time series and make forecasts based on it.

Trend Analysis (Regression on Time)

Idea

Treat time as X:

\[ Close = \beta_0 + \beta_1 \cdot Time \]

This is not yet a full time-series model, but it is a very useful bridge from regression to forecasting.

Python Example

  1. Create a time index variable that counts the business days from the start of the series
  2. Fit a linear regression model using this time index as the predictor
  3. Evaluate the model fit using \(R^2\) and RMSE
ts_combined["time_index"] = np.arange(1, len(ts_combined) + 1)

X_time = ts_combined[["time_index"]]
y_time = ts_combined["Close"]

trend_model = LinearRegression()
trend_model.fit(X_time, y_time)

trend_pred = trend_model.predict(X_time)

trend_r2 = r2_score(y_time, trend_pred)
trend_rmse = np.sqrt(mean_squared_error(y_time, trend_pred))

Interpretation:

For this short sample:

print(f"Intercept: {trend_model.intercept_:.2f}")
print(f"Slope: {trend_model.coef_[0]:.4f}")
print(f"R-squared: {trend_r2:.4f}")
print(f"RMSE: {trend_rmse:.4f}")
Intercept: 1152.71
Slope: 2.9477
R-squared: 0.8247
RMSE: 395.4914

Meaning:

  • on average, the adjusted close increases by about 2.9477 per business-day step**
  • the upward trend is present, but not all variation is explained by a straight line
  • this is expected, because market prices are noisy and time dependence matters

ARIMA

ARIMA is a classic forecasting model for ordered data.

  • AR → autoregressive part, which uses past values
  • I → integrated part, which applies differencing
  • MA → moving-average part, which uses past forecast errors

In simple terms:

  • ARIMA helps us model time dependence
  • differencing helps stabilize a series when the level changes over time
  • the model can then be used for short-run forecasting

The model is written as:

\[ ARIMA(p, d, q) \]

where:

  • \(p\) controls how many past values are used.
  • \(d\) controls how many times the series is differenced.
  • \(q\) controls how many past forecast errors are used.

So, an ARIMA model combines three ideas:

  1. Use past values of the series.
  2. Transform the series into a more stable form.
  3. Use past forecast errors to improve future predictions.

The \(p\) Parameter: AutoRegressive Lags

The first parameter, \(p\), represents the number of autoregressive lags.

This means the model uses previous values of the time series to predict the current value.

For example, if we use an AR model with one lag, the model assumes that today’s value depends partly on yesterday’s value:

\[ y_t = c + \phi_1 y_{t-1} + \varepsilon_t \]

Here:

  • \(y_t\) is the value at time \(t\).
  • \(y_{t-1}\) is the value one period before.
  • \(\phi_1\) measures how strongly the previous value influences the current value.
  • \(\varepsilon_t\) is the remaining unexplained error.

If \(p = 1\), the model uses one previous value.

If \(p = 2\), the model uses two previous values:

\[ y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \varepsilon_t \]

In simple terms, \(p\) answers the question:

How many previous observations should the model remember?

The \(d\) Parameter: Differencing

The second parameter, \(d\), represents the number of times the series is differenced.

Many real-world time series are not stationary. A stationary series has a relatively stable mean and variance over time. However, stock prices, sales, revenue, population, and website traffic often have trends.

ARIMA works best when the series is stationary, so we often transform the original series before modeling it.

Differencing means subtracting the previous value from the current value:

\[ \Delta y_t = y_t - y_{t-1} \]

Instead of modeling the original values, the model uses the change from one period to the next.

For example, suppose we have this series:

Time Value
1 100
2 110
3 125
4 140

The first difference is:

Time Difference
2 10
3 15
4 15

So instead of asking:

What will the next value be?

the differenced model asks:

What will the next change be?

If \(d = 0\), the model uses the original series.

If \(d = 1\), the model uses first differences.

If \(d = 2\), the model differences the series twice, meaning it models the change in the changes.

In simple terms, \(d\) answers the question:

How many times do we need to remove trend from the data?

The \(q\) Parameter: Moving-Average Lags

The third parameter, \(q\), represents the number of moving-average lags.

This part of the model uses past forecast errors.

A forecast error is the difference between the actual value and the predicted value:

\[ \varepsilon_t = y_t - \hat{y}_t \]

If the model made a mistake yesterday, that mistake may contain useful information for today’s forecast.

For example, an MA(1) model can be written as:

\[ y_t = c + \varepsilon_t + \theta_1 \varepsilon_{t-1} \]

This means today’s value depends partly on today’s random shock and yesterday’s forecast error.

If \(q = 1\), the model uses one previous forecast error.

If \(q = 2\), the model uses two previous forecast errors.

In simple terms, \(q\) answers the question:

How many previous forecast mistakes should the model use to correct itself?

Interpreting ARIMA(1, 1, 1)

In this lesson, we use:

\[ ARIMA(1, 1, 1) \]

This means:

Component Meaning Interpretation
\(p = 1\) One autoregressive lag The model uses the previous value.
\(d = 1\) One difference The model models changes rather than raw values.
\(q = 1\) One moving-average lag The model uses the previous forecast error.

So ARIMA(1, 1, 1) means:

First, difference the series once to remove trend. Then model the differenced series using one lag of the series and one lag of the forecast error.

Conceptually, the model says:

The next change in the series depends on the previous change and the previous forecasting mistake.

This is why ARIMA(1, 1, 1) is often a useful practical demonstration model. It is simple enough to explain, but it already contains all three main ARIMA ideas: memory, differencing, and error correction.

Python Example

from statsmodels.tsa.arima.model import ARIMA

ts_arima = ts_combined.set_index("Date").asfreq("B")

arima_model = ARIMA(ts_arima["Close"], order=(1, 1, 1))
arima_fit = arima_model.fit()

arima_forecast = arima_fit.forecast(steps=3)

For this short dataset, the next 3 business-day forecasts are approximately:

arima_forecast
2026-04-27    4742.197622
2026-04-28    4742.974111
2026-04-29    4743.438757
Freq: B, Name: predicted_mean, dtype: float64
Note

The statsmodels package is often preinstalled in many environments, but if you get ModuleNotFoundError, install it with:

pip install statsmodels

Prophet

Prophet is a forecasting package for Python and R developed by Facebook.

It is popular because:

  • it is easy to use
  • it works naturally with date-based data
  • it can handle trend changes
  • it can model seasonality and holiday effects when enough history is available
  • it provides an intuitive forecasting workflow

For this very short dataset, we use Prophet mainly to demonstrate the workflow correctly.

Python Example

from prophet import Prophet

df_prophet = ts_combined.rename(columns={"Date": "ds", "Close": "y"})

prophet_model = Prophet()
prophet_model.fit(df_prophet)

future = prophet_model.make_future_dataframe(periods=3, freq="B")

forecast = prophet_model.predict(future)

forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail(3)
ds yhat yhat_lower yhat_upper
1008 2026-04-27 5146.627474 5008.323560 5282.824426
1009 2026-04-28 5153.177394 5016.647241 5286.196621
1010 2026-04-29 5161.716647 5036.371597 5294.723632
Note

If you get ModuleNotFoundError, install Prophet with:

pip install prophet

Visualization

prophet_model.plot(forecast)
plt.title("Gold Futures (GC00) - Close Price Forecast")
plt.xlabel("Date")
plt.ylabel("Close Price")
plt.show()

prophet_model.plot_components(forecast)
plt.show()

How to read the Prophet output:

  • the main forecast plot shows the historical data, the fitted forecast line, and the uncertainty interval
  • the component plots break the forecast into interpretable pieces such as trend and seasonality
  • with this short business-day sample, the trend is the most useful part to discuss
  • with a longer history, Prophet becomes much more useful for storytelling around trend changes and seasonal patterns

Extra

Tip
  • Whatch this video for regularization
  • Read this for regularization