Data Analytics Bootcamp


Session 13: Logistic Regression

Classification
Logistic Regression

Session Goal

In this session, we develop a complete understanding of logistic regression, starting from intuition and building up to implementation.

The focus is not only on the model itself, but on how it is used in real-world decision-making.

The key idea is:

Logistic regression connects data → probability → decision

Think of it as a pipeline:

  • We start with data about customers, users, or transactions
  • The model converts that data into a probability
  • That probability is then used to make a decision

For example:

  • A customer has a 0.82 probability of purchasing
  • Do we contact them or not?

This is where analytics meets business.

Introduction

Regression vs Classification

Before introducing logistic regression, we must clearly distinguish between two types of problems.

As we already know, regression is used when the output is a continuous number:

  • predicting revenue
  • estimating house prices
  • forecasting demand

Example:

A customer is expected to spend $120 next month


Classification is used when the output is a category:

  • purchase vs not purchase
  • churn vs not churn
  • approve vs reject

Example:

Will the customer purchase? → Yes or No


This distinction is critical.

Important: Why the Name “Regression” Is Misleading

Logistic regression is called “regression”, but it is actually used for classification problems.

\[\downarrow\]

It keeps the name because it predicts probabilities (continuous numbers), not categories directly.

Real Business Problems

Marketing Campaign

You have 1,000 customers \(\rightarrow\) You want to send an offer \(\rightarrow\) Each contact costs money

Question/Problem:

Who should we contact?


Telecom Churn

  • Customers may leave the service
  • Retention campaigns are expensive

Question/Problem:

Which customers are at risk of churning?


Loan Approval

  • Bank gives loans to customers
  • Some customers default

Question/Problem:

Should we approve this loan?


Important

None of these problems is just a prediction: each one is a decision with consequences.

Binary Target Variable

In all these examples, the outcome has only two possible values.

\[ y \in \{0, 1\} \]

We encode outcomes as:

  • \(1\) → event happens (purchase, churn, default)
  • \(0\) → event does not happen

Example dataset:

Customer | Purchase
A | 1
B | 0
C | 1

This encoding is essential because it allows us to work mathematically with classification.


Why Not Linear Regression?

The Problem

At first glance, you might think:

Why not just use linear regression?

We can try:

\[ \hat{y} = \beta_0 + \beta_1x \]

But here is the issue:

Linear regression can produce any value:

\[ -\infty < \hat{y} < +\infty \]

Why This Breaks for Classification

Suppose we predict:

  • \(-0.3\) → does this make sense as probability?
  • \(1.4\) → can probability be greater than 1?

\(\downarrow\)

No!

Probabilities must always satisfy:

\[ 0 \leq p \leq 1 \]


Imagine a marketing model predicts:

  • Customer A → 1.25 probability
  • Customer B → -0.15 probability

\[\downarrow\]

These outputs are meaningless.

This is why linear regression is not suitable for classification.
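To see the problem concretely, here is a minimal sketch that fits an ordinary least-squares line to a tiny binary dataset and inspects the fitted values. The eight `x` and `y` values are made up for illustration and are not course data.

```python
import numpy as np

# Hypothetical toy data: one feature and a binary outcome (0 = no, 1 = yes)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

# Fit y_hat = b0 + b1 * x by ordinary least squares
# (np.polyfit returns the highest-degree coefficient first)
b1, b0 = np.polyfit(x, y, deg=1)

# Fitted values for the same observations
y_hat = b0 + b1 * x

print("intercept:", round(b0, 3), "slope:", round(b1, 3))
print("fitted values:", np.round(y_hat, 3))
print("below 0:", (y_hat < 0).sum(), "| above 1:", (y_hat > 1).sum())
```

Even on this small example, the straight line produces at least one fitted value below 0 and one above 1, so the outputs cannot be read as probabilities.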

The Need for Transformation

Thus, we need a function that:

  1. accepts any number from \(-\infty\) to \(+\infty\)
  2. always outputs a value between 0 and 1

This is exactly what the sigmoid function does.

From Probability to Decision

Predicting Probability

Logistic regression does not directly say:

“This customer will purchase”

Instead, it says:

\[ P(y = 1 \mid X) \]

Customer A → 0.82 probability of purchase


The probability gives us flexibility.

We can:

  • rank customers
  • prioritize actions
  • control risk

Threshold-Based Decision

To make a final decision, we use a threshold.

\[ \hat{y} = \begin{cases} 1, & p \geq 0.5 \\ 0, & p < 0.5 \end{cases} \]


Example table

Customer | Probability | Decision
A | 0.82 | 1
B | 0.63 | 1
C | 0.47 | 0
D | 0.12 | 0

Important: Important to know

The threshold does not have to be 0.5.

  • In marketing → lower threshold (more aggressive)
  • In banking → higher threshold (more conservative)

Business translation:

  • Probability → level of confidence
  • Threshold → business strategy

This is where the model becomes actionable.
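The effect of the threshold can be sketched with the four probabilities from the example table. The helper name `decide` and the alternative cut-offs 0.4 and 0.7 are assumptions chosen for illustration, not recommended values.

```python
import numpy as np

# Probabilities from the example table (customers A to D)
customers = ["A", "B", "C", "D"]
probabilities = np.array([0.82, 0.63, 0.47, 0.12])

def decide(p, threshold):
    """Return 1 where the probability reaches the threshold, else 0."""
    return (p >= threshold).astype(int)

# The 0.4 and 0.7 cut-offs are illustrative choices, not recommendations
for threshold in (0.4, 0.5, 0.7):
    decisions = decide(probabilities, threshold)
    print(f"threshold={threshold}: {dict(zip(customers, decisions.tolist()))}")
```

The probabilities never change; only the business rule applied to them does. A lower threshold flags customer C as well, while a higher one keeps only customer A.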

Logistic Regression Core Idea

Linear Score

Logistic regression starts similarly to linear regression.

We compute a score:

\[ z = \beta_0 + \beta_1x_1 + \cdots + \beta_kx_k \]

This is a weighted combination of features.

Example

Suppose:

\[ z = -2 + 0.6 \cdot \text{support calls} - 0.05 \cdot \text{tenure} \]

Interpretation (pay attention to signs):

  • more support calls → increases score
  • longer tenure → decreases score

Important

This score \(z\) is not yet a probability.

It is just a position on a scale:

  • very negative → unlikely
  • around zero → uncertain
  • very positive → likely
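As a quick arithmetic check of the example score, the sketch below plugs two customers into the equation. The function name `churn_score` and both customer profiles are hypothetical and serve only to show how the signs of the coefficients move the score.

```python
def churn_score(support_calls, tenure):
    """Linear score z from the example: z = -2 + 0.6 * calls - 0.05 * tenure."""
    return -2 + 0.6 * support_calls - 0.05 * tenure

# Two hypothetical customers (values invented for illustration)
z_new = churn_score(support_calls=5, tenure=2)     # many calls, short tenure
z_loyal = churn_score(support_calls=1, tenure=48)  # few calls, long tenure

print("new customer score:", round(z_new, 2))
print("loyal customer score:", round(z_loyal, 2))
```

The first customer ends up on the positive side of the scale and the second far on the negative side, which matches the direction of the two coefficients.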

Transition to Probability

To convert this score into a probability, we apply the sigmoid function:

\[ p = \frac{1}{1 + e^{-z}} \]

Intuition

Think of it like a decision system:

  • raw score → internal evaluation
  • sigmoid → converts into probability
  • threshold → converts into decision

     customer  raw_score  probability  decision
0  Customer A       -4.0     0.017986  No Churn
1  Customer B       -1.5     0.182426  No Churn
2  Customer C        0.0     0.500000     Churn
3  Customer D        1.5     0.817574     Churn
4  Customer E        4.0     0.982014     Churn
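One way the table above could be produced is sketched below. The raw scores are copied from the table; the 0.5 threshold with `>=` is an assumption, inferred from Customer C being labeled Churn at a probability of exactly 0.5.

```python
import numpy as np
import pandas as pd

# Raw scores taken from the table above
scores = pd.DataFrame({
    "customer": ["Customer A", "Customer B", "Customer C", "Customer D", "Customer E"],
    "raw_score": [-4.0, -1.5, 0.0, 1.5, 4.0],
})

# Sigmoid: maps any real score to a probability between 0 and 1
scores["probability"] = 1 / (1 + np.exp(-scores["raw_score"]))

# Assumed threshold of 0.5: probabilities at or above it are labeled Churn
scores["decision"] = np.where(scores["probability"] >= 0.5, "Churn", "No Churn")

print(scores)
```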

Mini Example

If:

\[ z = 0 \]

then:

\[ p = 0.5 \]


If:

\[ z = 2 \]

then:

\[ p \approx 0.88 \]


If:

\[ z = -2 \]

then:

\[ p \approx 0.12 \]

Important

Logistic regression works in three steps:

  1. Compute a score \(z\)
  2. Convert score into probability \(p\)
  3. Apply threshold to make decision

This completes the conceptual foundation before moving into implementation.

Storytelling Perspective

Imagine a telecom company evaluating customers for churn.

Each customer gets a score:

  • many complaints → increases score
  • long tenure → decreases score
  • high monthly charges → increases score

So:

  • Customer A → \(z = -3.2\) → very unlikely to churn
  • Customer B → \(z = 0.4\) → uncertain
  • Customer C → \(z = 2.1\) → very likely to churn

But these are just scores, not probabilities.


Intuition Through Examples

Let’s examine how different values of \(z\) behave.

import numpy as np
import pandas as pd

z_values = [-5, -2, -1, 0, 1, 2, 5]
p_values = 1 / (1 + np.exp(-np.array(z_values)))

df_sigmoid = pd.DataFrame({
    "score_z": z_values,
    "probability": p_values
})

df_sigmoid
score_z probability
0 -5 0.006693
1 -2 0.119203
2 -1 0.268941
3 0 0.500000
4 1 0.731059
5 2 0.880797
6 5 0.993307

Interpretation

Score (\(z\)) Probability
very negative close to 0
0 0.5
very positive close to 1

Logistic regression is not linear in probability — it is linear in log-odds

Understanding Odds

From Probability to Odds

Instead of working directly with probability, logistic regression uses odds:

\[ \text{odds} = \frac{p}{1-p} \]

Example 1:

If:

\[ p = 0.75 \]

then:

\[ \text{odds} = \frac{0.75}{0.25} = 3 \]

Interpretation:

The event is 3 times more likely to happen than not happen

Example 2: Winning Game

Saying that the odds in favor of my team (FC Barcelona) winning a game are 4 to 1 means:

  • 4 of 5 times they win
  • 1 of 5 times they lose.

Example 3: Losing Game

Saying that the odds in favor of my team (Real Madrid) winning a game are 1 to 4 means:

  • 1 of 5 times they win
  • 4 of 5 times they lose.
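The two football examples can be checked with a short sketch that converts odds quoted as "wins to losses" into a probability and back. The function names are hypothetical helpers written for this illustration.

```python
def odds_to_probability(wins, losses):
    """Convert odds quoted as 'wins to losses' into a probability of winning."""
    return wins / (wins + losses)

def probability_to_odds(p):
    """Convert a probability (0 <= p < 1) into odds in favor."""
    return p / (1 - p)

p_barcelona = odds_to_probability(4, 1)  # odds of 4 to 1
p_madrid = odds_to_probability(1, 4)     # odds of 1 to 4

print(p_barcelona, p_madrid)                       # 0.8 0.2
print(round(probability_to_odds(p_barcelona), 2))  # 4.0
print(round(probability_to_odds(p_madrid), 2))     # 0.25
```

Odds of 4 to 1 and 1 to 4 describe mirror-image situations: the probabilities 0.8 and 0.2 sum to 1, and the odds 4 and 0.25 are reciprocals of each other.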

Odds Table

probabilities = [0.1, 0.3, 0.5, 0.7, 0.9]

odds_table = pd.DataFrame({
    "probability": probabilities
})

odds_table["odds"] = odds_table["probability"] / (1 - odds_table["probability"])

odds_table
probability odds
0 0.1 0.111111
1 0.3 0.428571
2 0.5 1.000000
3 0.7 2.333333
4 0.9 9.000000
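
Important

  • in this table, probability increases in equal steps of 0.2
  • odds increase non-linearly, from about 0.11 to 9

Unlike probability, odds have no upper limit, which brings us one step closer to a quantity that a linear model can describe.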

From Odds to Log-Odds

Why Take the Log?

Odds are always positive:

\[ 0 < \text{odds} < \infty \]

But we want a model that can handle:

\[ -\infty < \text{value} < +\infty \]

So we take the logarithm:

\[ \log\left(\frac{p}{1-p}\right) \]

This is called the log-odds or logit.
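A minimal sketch of the transformation, reusing the probabilities from the odds table, shows what the logarithm changes. The variable names are illustrative.

```python
import numpy as np

# Same probabilities as in the odds table
probabilities = np.array([0.1, 0.3, 0.5, 0.7, 0.9])

odds = probabilities / (1 - probabilities)
log_odds = np.log(odds)

for p, o, lo in zip(probabilities, odds, log_odds):
    print(f"p={p:.1f}  odds={o:.3f}  log-odds={lo:+.3f}")
```

The log-odds are symmetric around zero: a probability of 0.5 maps to 0, and complementary probabilities such as 0.1 and 0.9 map to values of equal size and opposite sign. Odds alone do not have this symmetry (0.11 versus 9).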

Logistic Regression Equation

Now we connect everything: Logistic regression models the log-odds as a linear function of features:

\[ \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1x_1 + \cdots + \beta_kx_k \]

This equation tells us:

Each feature changes the log-odds, not the probability directly
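The link between this equation and the sigmoid function used earlier can be worked out in a few steps. Writing \(z = \beta_0 + \beta_1x_1 + \cdots + \beta_kx_k\) for the right-hand side and solving for \(p\):

\[ \log\left(\frac{p}{1-p}\right) = z \]

\[ \frac{p}{1-p} = e^{z} \]

\[ p = e^{z}(1 - p) \quad\Rightarrow\quad p\,(1 + e^{z}) = e^{z} \]

\[ p = \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}} \]

So modeling the log-odds as a linear score and applying the sigmoid to that score are two ways of writing the same relationship.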

Interpreting Coefficients Properly

Direction of Effect

  • \(\beta > 0\) → increases probability
  • \(\beta < 0\) → decreases probability

Magnitude via Odds Ratio

To interpret magnitude, we use:

\[ e^{\beta} \]

sometimes called the odds ratio or \(\exp(\beta)\).


Example

If:

\[ \beta = 0.7 \]

then:

\[ e^{0.7} \approx 2.01 \]

\[\downarrow\]

A one-unit increase in this feature roughly doubles the odds of the event, holding the other features constant


Business Translation

Suppose:

  • feature = number of support calls
  • coefficient = 0.7
  • \(e^{0.7} \approx 2\)

Then:

Each additional support call roughly doubles the odds of churn, holding the other features constant
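The sketch below illustrates why the coefficient is read on the odds scale and not on the probability scale. It applies the same odds ratio to several starting probabilities; these starting values and the helper `add_one_call` are hypothetical and chosen only for illustration.

```python
import numpy as np

beta = 0.7                 # coefficient from the example above
odds_ratio = np.exp(beta)  # multiplicative change in odds per extra call

def add_one_call(p_before):
    """Probability after one more support call, given the probability before."""
    odds_after = p_before / (1 - p_before) * odds_ratio
    return odds_after / (1 + odds_after)

print("odds ratio:", round(odds_ratio, 3))

# Hypothetical starting probabilities, chosen only for illustration
for p in (0.05, 0.20, 0.50, 0.80):
    print(f"p before={p:.2f} -> p after={add_one_call(p):.3f}")
```

The odds are multiplied by the same factor every time, but the change in probability depends on where the customer starts: it is largest near 0.5 and small near 0 or 1.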

Bringing Everything Together

Logistic regression works as a pipeline:

Step 1 | Linear Score

\[ z = \beta_0 + \beta_1x_1 + \cdots \]


Step 2 | Convert to Probability

\[ p = \frac{1}{1 + e^{-z}} \]


Step 3 | Make Decision

\[ \hat{y} = \begin{cases} 1, & p \geq \text{threshold} \\ 0, & p < \text{threshold} \end{cases} \]

Test your understanding | Odds

Question 1 | Probability to Odds

A customer has a 60% probability of churn.

  • What are the odds of churn?
  • Interpret the result

Question 2 | Log-Odds to Probability

A model gives:

\[ \log(\text{odds}) = 2 \]

  • Convert to odds
  • Convert to probability
  • Interpret the result

Question 1 | Solution

Recall:

\[ \text{odds} = \frac{p}{1-p} \]

\[ p = 0.6 \]

\[ \text{odds} = \frac{0.6}{0.4} = 1.5 \]

Interpretation:

  • The event is 1.5 times more likely to happen than not happen
  • Churn is more likely than staying, but not extremely strong

Question 2 | Solution

Recall:

\[ \text{odds} = e^{\text{log-odds}} \]

\[ \text{odds} = e^2 \approx 7.39 \]

Now convert to probability:

\[ p = \frac{7.39}{1 + 7.39} \approx 0.88 \]

Interpretation:

  • The probability is approximately 88%
  • This represents a very high likelihood of the event
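Both solutions can be verified numerically with a few lines that use only the standard library. The variable names are illustrative.

```python
import math

# Question 1: probability -> odds
p = 0.6
odds_q1 = p / (1 - p)

# Question 2: log-odds -> odds -> probability
log_odds = 2
odds_q2 = math.exp(log_odds)
p_q2 = odds_q2 / (1 + odds_q2)

print(round(odds_q1, 2))  # 1.5
print(round(odds_q2, 2))  # 7.39
print(round(p_q2, 2))     # 0.88
```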

Important: Final Intuition

  • The model builds a score
  • The sigmoid converts it into probability
  • The threshold converts it into action

Logistic regression is not about predicting 0 or 1
It is about estimating how likely something is to happen

This is what makes it powerful for real-world decision systems.

Tip: Video Explanation

Check out this video for a visual explanation of logistic regression:

Odd Ratios and Log-Odds Explained | Logistic Regression Intuition

Case Study 1: Customer Churn Prediction

In this section, we will use synthetic data to illustrate the entire logistic regression pipeline. We will:

  • train a logistic regression model
  • interpret results using:
    • coefficients and \(e^{\beta}\)
    • confusion matrix
    • revenue matrix

Download the Dataset

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

# Load the synthetic churn dataset from the course repository
df = pd.read_csv("https://raw.githubusercontent.com/hovhannisyan91/data_analytics_with_python/refs/heads/main/data/regression/logistic_regression/synthetic_churn_data.csv")
print(f"Dataset shape: {df.shape}")

# Optional: save a local copy (this relative path assumes the course repository layout)
# df.to_csv("../../lab/python/data/regression/logistic_regression/synthetic_churn_data.csv", index=False)

df.head()
Dataset shape: (1000, 6)

Observing the Dataframe

Before modeling, we should always inspect the dataset.

df.info()
<class 'pandas.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   tenure           1000 non-null   int64  
 1   monthly_charges  1000 non-null   float64
 2   support_calls    1000 non-null   int64  
 3   contract_type    1000 non-null   int64  
 4   churn            1000 non-null   int64  
 5   education_level  1000 non-null   str    
dtypes: float64(1), int64(4), str(1)
memory usage: 53.8 KB

The dataset contains both numeric and categorical variables.

Column | Meaning
tenure | Number of months the customer has stayed with the company
monthly_charges | Monthly amount paid by the customer
support_calls | Number of support calls made by the customer
contract_type | Contract type, where 0 = prepaid and 1 = postpaid
education_level | Customer education category
churn | Target variable, where 1 = churn and 0 = no churn

The target variable is:

\[ y = \text{churn} \]

where:

1: customer churned (left the service)
0: customer did not churn (stayed with the service)

Exploratory Data Analysis

Target Variable Distribution

df['churn'].value_counts(normalize=True)
churn
0    0.633
1    0.367
Name: proportion, dtype: float64

Churn Rate by Contract Type

import plotly.express as px

# Share of churned customers within each contract type
churn_rate_by_contract = df.groupby('contract_type')['churn'].mean().reset_index()

fig = px.bar(churn_rate_by_contract, x='contract_type', y='churn', title='Churn Rate by Contract Type')
fig.update_layout(
    xaxis_title='Contract Type (0 = Prepaid, 1 = Postpaid)',
    yaxis_title='Churn Rate',
    template='plotly_white'
)
fig.show()

As expected, the churn rate is lower for postpaid customers (contract_type = 1) than for prepaid customers (contract_type = 0).

Tip

Think about why this might be the case. What are the possible reasons for this difference in churn rates between prepaid and postpaid customers?

Churn Rate by Education Level

import plotly.express as px

# Share of churned customers within each education level
churn_rate_by_education = df.groupby('education_level')['churn'].mean().reset_index()

fig = px.bar(churn_rate_by_education, x='education_level', y='churn', title='Churn Rate by Education Level')
fig.update_layout(
    xaxis_title='Education Level',
    yaxis_title='Churn Rate',
    template='plotly_white'
)
fig.show()

  • The lowest observed churn rate is 12.40%.
  • The highest observed churn rate is 63.60%.

Data Preprocessing

We now separate the dataset into input variables and the target variable.

X = df.drop(columns=["churn"])
y = df["churn"]

  • X contains all the features that we will use to predict churn.
  • y contains the target variable indicating whether a customer churned or not.

Encoding Categorical Variables

Logistic regression, like most machine learning models, requires numeric input.

The column education_level contains text categories, so we convert it into dummy variables.

Before encoding, the education_level column has 4 categories. We will create 3 new binary columns to represent them, dropping one category to avoid multicollinearity. As the base category, we will use education_level = "High School".

To make sure that "High School" is the baseline category, we enforce the category order using pd.Categorical before encoding.

education_order = ["High School", "Bachelor", "Master", "PhD"]

df["education_level"] = pd.Categorical(
    df["education_level"],
    categories=education_order,
    ordered=True
)

education_dummies = pd.get_dummies(
    df["education_level"],
    prefix="education",
    drop_first=True,
    dtype=int
)

education_dummies.head()
education_Bachelor education_Master education_PhD
0 0 0 0
1 1 0 0
2 0 1 0
3 0 0 0
4 0 1 0

The remaining dummy variables compare each education level against that baseline.

Since the baseline is High School:

Dummy Variable | Interpretation
education_Bachelor | Bachelor compared with High School
education_Master | Master compared with High School
education_PhD | PhD compared with High School

Once we have the dummy variables, we can concatenate them back to the original dataframe and drop the original education_level column. Pay attention to the axis=1, which indicates that we are concatenating columns (not rows).

X_encoded = pd.concat([X.drop(columns=["education_level"]), education_dummies], axis=1)

X_encoded.head()
tenure monthly_charges support_calls contract_type education_Bachelor education_Master education_PhD
0 29 46.867736 6 0 0 0 0
1 15 74.163421 6 0 1 0 0
2 8 83.347822 3 1 0 1 0
3 21 45.788769 2 0 0 0 0
4 19 33.935607 7 0 0 1 0

Checking the final set of features

print("Final set of features:")
print(X_encoded.dtypes)
Final set of features:
tenure                  int64
monthly_charges       float64
support_calls           int64
contract_type           int64
education_Bachelor      int64
education_Master        int64
education_PhD           int64
dtype: object

Making sure that all the features are numeric is crucial for the logistic regression model to work properly. If there were any remaining categorical variables, we would need to encode them as well before proceeding with modeling.

Train-Test Split

Before training the model, we divide the data into two parts:

  1. Training set
  2. Test set

The training set is used to teach the model.
The test set is used to evaluate whether the model can make good predictions on new, unseen data.

X_train, X_test, y_train, y_test = train_test_split(
    X_encoded,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

In machine learning, we should not evaluate the model on the same data that was used for training.

If we train and test the model on the same dataset, the model may appear to perform very well simply because it has already seen those examples. This does not tell us whether the model can generalize to new customers.

The train-test split helps us answer the following question:

Can the model make accurate predictions for customers it has not seen before?

For example, if we have 1,000 customers, using test_size=0.2 means:

Dataset Part | Percentage | Number of Customers
Training set | 80% | 800
Test set | 20% | 200

The model learns from the training set and is evaluated on the test set.

random_state=42

The train-test split is random by default.

This means that every time we run the code, train_test_split may select different rows for the training and test sets.

As a result, the model results may slightly change every time we run the notebook.

By setting:

random_state=42

we make the split reproducible.

This means that every time we run the code, we get the same training and testing sets.

The number 42 is not mathematically special. It is simply a commonly used fixed number.
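A minimal sketch of this reproducibility, using a small array of hypothetical customer IDs instead of the churn data, compares which rows land in the test set under the same and under a different seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical customer IDs 0..19, used only to see which rows land in the test set
ids = np.arange(20)

_, test_a = train_test_split(ids, test_size=0.2, random_state=42)
_, test_b = train_test_split(ids, test_size=0.2, random_state=42)
_, test_c = train_test_split(ids, test_size=0.2, random_state=7)

print("random_state=42:", test_a)
print("random_state=42:", test_b)
print("random_state=7: ", test_c)
print("same seed, same test rows:", np.array_equal(test_a, test_b))
```

Any fixed integer gives the same guarantee; what matters is that the value stays the same between runs.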

stratify=y

The argument stratify=y is very important for classification problems.

It tells train_test_split to preserve the same class distribution in both the training set and the test set.

For example, suppose the full dataset has the following churn distribution:

Class | Meaning | Percentage
0 | Did not churn | 80%
1 | Churned | 20%

With stratify=y, the training and test sets will keep approximately the same balance:

Dataset Part | Did Not Churn | Churned
Training set | 80% | 20%
Test set | 80% | 20%

This is especially important when the target variable is imbalanced.

In churn prediction, the number of customers who churn is usually much smaller than the number of customers who do not churn.

Without stratification, the test set may accidentally contain too few churned customers or too many churned customers.

That would make the model evaluation unreliable.

For example, if the test set contains very few churned customers, the model may look better than it really is.
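The effect of stratification can be sketched on a made-up target with the 80% / 20% balance from the example above. The arrays `X_demo` and `y_demo` are hypothetical placeholders, not the churn dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 1,000 customers, 20% churners
y_demo = np.array([1] * 200 + [0] * 800)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

# Stratified split: class shares are preserved in the test set
_, _, _, y_test_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Plain random splits with different seeds: the churn share in the test set varies
plain_rates = []
for seed in range(5):
    _, _, _, y_test_plain = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    plain_rates.append(y_test_plain.mean())

print("stratified test churn rate:", y_test_strat.mean())
print("unstratified test churn rates:", np.round(plain_rates, 3))
```

With stratification the test churn rate stays at the overall 20%, while the unstratified splits can drift away from it depending on the seed.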

Output of train_test_split()

The function returns four objects:

X_train, X_test, y_train, y_test

Each one has a specific purpose.

  • X_train: contains the input features for the training data. This is the data the model uses to learn patterns.
  • X_test: contains the input features for the test data. This allows us to test how well the model performs on unseen customers.
  • y_train: contains the true target values for the training data. This is what the model tries to predict during training.
  • y_test: contains the true target values for the test data. This is what we use to evaluate the model’s predictions on the test set.

Let’s check the shapes of these objects to confirm that the split was done correctly.

train_size = len(X_train)
test_size = len(X_test)

train_churn_rate = y_train.mean()
test_churn_rate = y_test.mean()

print("Training rows:", train_size)
print("Test rows:", test_size)
print("Training churn rate:", round(train_churn_rate, 3))
print("Test churn rate:", round(test_churn_rate, 3))
Training rows: 800
Test rows: 200
Training churn rate: 0.368
Test churn rate: 0.365

Thus, the train-test split is used to check whether the model can generalize.

The training set contains 800 customers.

The test set contains 200 customers.

The churn rate in the training set is 36.75%, while the churn rate in the test set is 36.5%.

Because we used stratification, these two percentages should be close.

Training the Logistic Regression Model

model = LogisticRegression(max_iter=1000)

model.fit(X_train, y_train)
LogisticRegression(max_iter=1000)

max_iter=1000 gives the solver enough iterations to converge. Logistic regression finds its coefficients through an iterative optimization, and the scikit-learn default of 100 iterations can be too few, especially when the dataset has many features or the features are on very different scales.

The model estimates coefficients for each feature.

Internally, logistic regression models the log-odds of churn:

\[ \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1x_1 + \cdots + \beta_kx_k \]

where \(p\) is the probability of churn.
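
To make the link between log-odds and probability concrete, here is a minimal sketch in plain Python. The log-odds values in the loop are made up for illustration and are not outputs of the model above; the helper names `sigmoid` and `log_odds` are also just illustrative.

```python
import math

def sigmoid(z):
    """Convert log-odds z into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    """Convert a probability p into log-odds (the logit)."""
    return math.log(p / (1.0 - p))

# Hypothetical log-odds values, chosen only for illustration
for z in (-2.0, 0.0, 2.0):
    print(f"log-odds {z:+.1f} -> probability {sigmoid(z):.3f}")

# The two functions are inverses of each other
p = 0.25
print(abs(sigmoid(log_odds(p)) - p) < 1e-12)
```

A log-odds of 0 corresponds to a probability of 0.5, negative log-odds to probabilities below 0.5, and positive log-odds to probabilities above 0.5.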

Predicting Probabilities on the Test Set

probs = model.predict_proba(X_test)
probs[:5]
array([[0.98318855, 0.01681145],
       [0.58272749, 0.41727251],
       [0.52510491, 0.47489509],
       [0.3840094 , 0.6159906 ],
       [0.67936502, 0.32063498]])

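The output probs has two columns: the first is the predicted probability of class 0 (no churn) and the second is the predicted probability of class 1 (churn). Each row sums to 1.

We keep only the second column, using [:, 1], because class 1 represents churn.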

probs = probs[:, 1]
probs[:5]
array([0.01681145, 0.41727251, 0.47489509, 0.6159906 , 0.32063498])

Converting the Probabilities into Binary Predictions

threshold = 0.5
preds = (probs >= threshold).astype(int)
preds[:5]
array([0, 0, 0, 1, 0])

The threshold converts probabilities into class predictions.

\[ \hat{y} = \begin{cases} 1, & p \geq 0.5 \\ 0, & p < 0.5 \end{cases} \]
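
The rule above can be sketched in plain Python using the five probabilities printed earlier. The alternative threshold of 0.4 is a hypothetical value, included only to show that the same probabilities can lead to different class predictions.

```python
# First five predicted churn probabilities shown above
probs = [0.01681145, 0.41727251, 0.47489509, 0.6159906, 0.32063498]

def classify(probabilities, threshold):
    """Return 1 where the probability is at or above the threshold, else 0."""
    return [1 if p >= threshold else 0 for p in probabilities]

preds_default = classify(probs, 0.5)  # [0, 0, 0, 1, 0], matching the output above
preds_lower = classify(probs, 0.4)    # [0, 1, 1, 1, 0]

print(preds_default)
print(preds_lower)
```

Lowering the threshold from 0.5 to 0.4 flags two more of these five customers as likely churners.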

Previewing the Predictions

prediction_results = X_test.copy()

prediction_results["actual_churn"] = y_test.values
prediction_results["predicted_probability"] = probs
prediction_results["predicted_churn"] = preds

prediction_results.head()
tenure monthly_charges support_calls contract_type education_Bachelor education_Master education_PhD actual_churn predicted_probability predicted_churn
361 34 55.062693 2 1 0 0 0 0 0.016811 0
5 23 103.493024 4 0 0 0 1 0 0.417273 0
692 25 75.778342 5 0 0 1 0 1 0.474895 0
708 6 112.717782 5 1 0 1 0 0 0.615991 1
841 26 67.601813 7 1 0 0 1 0 0.320635 0

Coefficient Interpretation with Exponentiated Betas

The raw coefficients are log-odds coefficients.

To make them easier to interpret, we exponentiate them.

coef_df = pd.DataFrame({
    "feature": X_encoded.columns,
    "beta": model.coef_[0],
    "exp_beta": np.exp(model.coef_[0])
})

coef_df = coef_df.sort_values("exp_beta", ascending=False)

coef_df
feature beta exp_beta
2 support_calls 0.569350 1.767118
4 education_Bachelor 0.122790 1.130646
5 education_Master 0.112264 1.118808
1 monthly_charges 0.020344 1.020552
0 tenure -0.055288 0.946213
6 education_PhD -0.226263 0.797508
3 contract_type -1.228894 0.292616

Interpretation

The table contains two important values:

Column Meaning
beta Effect on log-odds
exp_beta Odds multiplier, calculated as \(e^{\beta}\)

Rules for interpretation:

  • If exp_beta > 1, the feature increases the odds of churn
  • If exp_beta < 1, the feature decreases the odds of churn
  • If exp_beta is close to 1, the feature has almost no effect on churn odds

Support Calls
  • exp_beta = 1.77
  • Each additional support call multiplies churn odds by 1.77
  • This means churn odds increase by approximately 77%
  • This is the largest per-unit churn-increasing effect in the table

Business meaning: frequent support calls are an early warning signal
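
A 77% increase in odds is not the same as a 77% increase in probability. The following sketch shows how the same odds multiplier changes the churn probability for different starting points. The baseline probabilities are hypothetical and the helper name `apply_odds_multiplier` is illustrative.

```python
def apply_odds_multiplier(p, multiplier):
    """Scale the odds of probability p by `multiplier` and convert back to a probability."""
    odds = p / (1.0 - p)
    new_odds = odds * multiplier
    return new_odds / (1.0 + new_odds)

support_calls_multiplier = 1.77  # exp_beta for support_calls from the table above

# Hypothetical baseline churn probabilities, chosen only for illustration
for baseline in (0.10, 0.30, 0.60):
    updated = apply_odds_multiplier(baseline, support_calls_multiplier)
    print(f"baseline {baseline:.2f} -> after one more support call {updated:.3f}")
```

The change in probability depends on where the customer starts, which is why the coefficient is interpreted in terms of odds rather than probability.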

Education: Bachelor
  • exp_beta = 1.13
  • Bachelor customers have 1.13 times the odds of churn compared with the baseline education group
  • This means approximately 13% higher odds of churn

The effect is positive but relatively small

Education: Master
  • exp_beta = 1.12
  • Master customers have 1.12 times the odds of churn compared with the baseline education group
  • This means approximately 12% higher odds of churn

The effect is also positive but relatively small

Education: PhD
  • exp_beta = 0.80
  • PhD customers have 0.80 times the odds of churn compared with the baseline education group
  • This means approximately 20% lower odds of churn
  • This result goes against the original assumption that higher education should always increase churn risk

Possible reason: other variables may explain churn better, or the synthetic relationship may not be strong enough

Monthly Charges
  • exp_beta = 1.02
  • Each one-unit increase in monthly charges multiplies churn odds by 1.02
  • This means approximately 2% higher odds of churn per unit
  • The single-unit effect is small, but larger price differences can matter

Example:

\[ 1.02^{10} \approx 1.22 \]

A 10-unit increase in monthly charges is associated with roughly 22% higher odds of churn.
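
The same arithmetic can be sketched for any coefficient and any size of change. The coefficients below are copied from the table above; the helper names are illustrative.

```python
import math

# Coefficients (beta) copied from the table above
betas = {
    "support_calls": 0.569350,
    "monthly_charges": 0.020344,
    "tenure": -0.055288,
    "contract_type": -1.228894,
}

def odds_multiplier(beta, units=1):
    """Odds multiplier for a change of `units` in the feature: exp(beta * units)."""
    return math.exp(beta * units)

def percent_change_in_odds(beta, units=1):
    """Percentage change in odds for a change of `units` in the feature."""
    return (odds_multiplier(beta, units) - 1.0) * 100.0

for name, beta in betas.items():
    print(f"{name}: x{odds_multiplier(beta):.3f} "
          f"({percent_change_in_odds(beta):+.1f}% odds per unit)")

# A 10-unit increase in monthly charges compounds the per-unit effect
print(f"monthly_charges +10: x{odds_multiplier(betas['monthly_charges'], 10):.3f}")
```

Using the unrounded coefficient gives a 10-unit multiplier of about 1.23, slightly above the 1.22 obtained from the rounded value 1.02.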

Tenure
  • exp_beta = 0.95
  • Each additional month of tenure multiplies churn odds by 0.95
  • This means churn odds decrease by approximately 5% per month

Business meaning: longer-tenure customers are more stable

Contract Type
  • exp_beta = 0.29
  • Postpaid contract customers have 0.29 times the odds of churn compared with prepaid customers
  • This means approximately 71% lower odds of churn
  • This is the strongest churn-reducing factor

Business meaning: long-term contracts strongly reduce churn risk

Retention Strategy

The company should prioritize customers who:

  • have many support calls
  • pay higher monthly charges
  • have short tenure
  • are on prepaid contracts

These customers are more likely to churn and are good candidates for retention campaigns.

Confusion Matrix

One of the most common ways to evaluate classification models is through the confusion matrix.

The confusion matrix compares actual churn outcomes with predicted churn outcomes.

| Outcome | Meaning |
|---|---|
| True Positive | Customer churned and model predicted churn |
| False Positive | Customer did not churn but model predicted churn |
| False Negative | Customer churned but model missed it |
| True Negative | Customer did not churn and model predicted no churn |

cm = confusion_matrix(y_test, preds)

tn, fp, fn, tp = cm.ravel()

cm
array([[103,  24],
       [ 23,  50]])
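
To see where each cell comes from, here is a minimal sketch that counts the four outcomes by hand. The labels are hypothetical and the function name `confusion_counts` is illustrative; the layout mirrors the scikit-learn output for labels 0 and 1.

```python
def confusion_counts(y_true, y_pred):
    """Count TN, FP, FN, TP for binary labels where 1 = churn and 0 = no churn."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# Hypothetical labels, chosen only for illustration
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)

# Rows are actual classes, columns are predicted classes
matrix = [[tn, fp],
          [fn, tp]]
print(matrix)
```

Read the same way, the matrix printed above contains 103 true negatives, 24 false positives, 23 false negatives, and 50 true positives.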

Classification Metrics

accuracy = accuracy_score(y_test, preds)
precision = precision_score(y_test, preds)
recall = recall_score(y_test, preds)

print("Accuracy:", round(accuracy, 3))
print("Precision:", round(precision, 3))
print("Recall:", round(recall, 3))
Accuracy: 0.765
Precision: 0.676
Recall: 0.685

Tip: Accuracy

The model accuracy is 76.5%, meaning the model classified 153 of the 200 test customers correctly.


Tip: Precision

The precision is 67.57%.

This means that among customers predicted as churners, 67.57% actually churned.

Tip: Recall

The recall is 68.49%.

This means that among all actual churners, the model identified 68.49%.
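
The three metrics can be checked by hand from the confusion matrix counts. This is a minimal sketch using the counts printed above; it reproduces the scikit-learn results without calling the library.

```python
# Counts read from the confusion matrix above
tn, fp, fn, tp = 103, 24, 23, 50

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted churners who churned
recall = tp / (tp + fn)                     # share of actual churners who were found

print("Accuracy:", round(accuracy, 3))
print("Precision:", round(precision, 3))
print("Recall:", round(recall, 3))
```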

ROC Curve and AUC

Beyond the confusion matrix, we can evaluate the logistic regression model, like any classifier that outputs probabilities or scores, using the ROC (Receiver Operating Characteristic) curve and the AUC (Area Under the Curve).

ROC Intuition

The confusion matrix depends on one selected threshold.

For example:

threshold = 0.5

But in business problems, the threshold may change depending on the strategy.

For churn prediction:

  • lower threshold → contact more customers
  • higher threshold → contact fewer customers
  • lower threshold may increase recall
  • higher threshold may increase precision

The ROC curve helps us understand how the model behaves across different threshold values.
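
The trade-off described above can be sketched on a small example. The labels and probabilities are hypothetical, and the function name `precision_recall_at` is illustrative.

```python
def precision_recall_at(y_true, probs, threshold):
    """Return (contacted, precision, recall) when predicting churn for probs >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, yhat in zip(y_true, preds) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(y_true, preds) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(y_true, preds) if y == 1 and yhat == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return tp + fp, precision, recall

# Hypothetical labels and predicted probabilities, chosen only for illustration
y_true = [1, 1, 0, 1, 0, 1, 0, 0]
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.2, 0.1]

for threshold in (0.3, 0.5, 0.75):
    contacted, precision, recall = precision_recall_at(y_true, probs, threshold)
    print(f"threshold {threshold}: contacted={contacted}, "
          f"precision={precision:.2f}, recall={recall:.2f}")
```

In this toy data, lowering the threshold contacts more customers and raises recall, while raising it contacts fewer customers and raises precision. On real data the precision side of this pattern is typical but not guaranteed at every step.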

Important: ROC Curve

The ROC curve compares two quantities:

Metric Meaning
True Positive Rate How many actual churners we correctly identify
False Positive Rate How many non-churners we incorrectly classify as churners

The formulas are:

\[ TPR = \frac{TP}{TP + FN} \]

\[ FPR = \frac{FP}{FP + TN} \]

In churn language:

  • TPR answers: among customers who actually churned, how many did we detect?
  • FPR answers: among customers who did not churn, how many did we incorrectly target?
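
As a quick check, both rates can be computed from the confusion matrix obtained earlier at threshold 0.5. This is a minimal sketch; the function name `roc_point` is illustrative.

```python
def roc_point(tn, fp, fn, tp):
    """Return (FPR, TPR) for one confusion matrix, i.e. one threshold."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return fpr, tpr

# Counts from the confusion matrix at threshold 0.5
fpr_at_half, tpr_at_half = roc_point(tn=103, fp=24, fn=23, tp=50)

print(f"TPR: {tpr_at_half:.3f}")  # 50 / 73
print(f"FPR: {fpr_at_half:.3f}")  # 24 / 127
```

Each threshold produces one such (FPR, TPR) pair, and the ROC curve traces these pairs as the threshold varies.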

ROC Curve Code

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

fpr, tpr, thresholds = roc_curve(y_test, probs)

auc_score = roc_auc_score(y_test, probs)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f"ROC Curve (AUC = {auc_score:.3f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random Model")

plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve for Logistic Regression Model")
plt.legend()
plt.grid(True)
plt.show()

The model AUC is 0.826.

This means there is an 82.56% chance that the model assigns a higher churn probability to a randomly selected churner than to a randomly selected non-churner.
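
The ranking interpretation of AUC can be verified directly by comparing every churner with every non-churner. The sketch below uses a small hypothetical example; the function name `pairwise_auc` is illustrative, and the nested loop is meant for explanation rather than for large datasets.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (churner, non-churner) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and scores, chosen only for illustration
y_true = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]

print(pairwise_auc(y_true, scores))  # 0.75: three of the four pairs are ranked correctly
```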


How to Interpret AUC

AUC Value Interpretation
0.50 No better than random guessing
0.60–0.70 Weak model
0.70–0.80 Acceptable model
0.80–0.90 Strong model
0.90+ Very strong model

Business Interpretation

  • If AUC is close to 0.5, the model cannot separate churners from non-churners well
  • If AUC is high, the model is good at ranking customers by churn risk
  • A high AUC does not automatically mean the campaign is profitable

We still need the confusion matrix and revenue matrix to choose the best threshold


Threshold Table from ROC Curve

The ROC curve gives many possible thresholds.

We can inspect some of them:

roc_threshold_table = pd.DataFrame({
    "threshold": thresholds,
    "false_positive_rate": fpr,
    "true_positive_rate": tpr
})

roc_threshold_table.head(10)
threshold false_positive_rate true_positive_rate
0 inf 0.000000 0.000000
1 0.942007 0.000000 0.013699
2 0.918466 0.000000 0.027397
3 0.914077 0.007874 0.027397
4 0.872713 0.007874 0.109589
5 0.865498 0.015748 0.109589
6 0.864796 0.015748 0.123288
7 0.863052 0.023622 0.123288
8 0.856353 0.023622 0.136986
9 0.836171 0.047244 0.136986

Summary

  • The ROC curve evaluates model ranking ability.
  • The revenue matrix evaluates business value.

\[\downarrow\]

  • ROC/AUC tells us whether the model separates churners from non-churners
  • Confusion matrix tells us the classification results at one threshold
  • Revenue matrix tells us whether the classification strategy creates profit

Revenue Matrix

There is a cost associated with contacting customers and a cost associated with losing them. From a data analytics perspective, we usually want to maximize revenue, not just accuracy. The revenue matrix lets us evaluate the model by the financial impact of its predictions.

Assume the company contacts customers predicted as likely to churn.

retention_cost = 10
retention_benefit = 50

revenue_from_saved_customers = tp * (retention_benefit - retention_cost)
cost_from_unnecessary_contacts = fp * retention_cost

net_profit = revenue_from_saved_customers - cost_from_unnecessary_contacts

print("Revenue from saved customers:", revenue_from_saved_customers)
print("Cost from unnecessary contacts:", cost_from_unnecessary_contacts)
print("Net profit:", net_profit)
Revenue from saved customers: 2000
Cost from unnecessary contacts: 240
Net profit: 1760

| Prediction Outcome | Business Meaning | Financial Effect |
|---|---|---|
| True Positive | Correctly targeted churner | retention_benefit - retention_cost |
| False Positive | Contacted non-churner unnecessarily | -retention_cost |
| False Negative | Missed churner | 0 |
| True Negative | Correctly ignored non-churner | 0 |

The model correctly targeted 50 churners.

These customers generated 2000 units of value after subtracting campaign cost.

The model also contacted 24 customers unnecessarily, creating a cost of 240.

The final estimated campaign profit is 1760.
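
Because the profit depends on the threshold, the same calculation can be repeated for several thresholds. The sketch below keeps the cost and benefit values used above and, like the calculation above, assumes that every correctly targeted churner is retained. The labels, probabilities, and function names are hypothetical.

```python
def campaign_profit(tp, fp, retention_cost=10, retention_benefit=50):
    """Net profit: gain from targeted churners minus cost of unnecessary contacts."""
    return tp * (retention_benefit - retention_cost) - fp * retention_cost

def profit_at_threshold(y_true, probs, threshold, retention_cost=10, retention_benefit=50):
    """Count TP and FP at a threshold and return the resulting campaign profit."""
    tp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 0)
    return campaign_profit(tp, fp, retention_cost, retention_benefit)

# Reproduces the result above for threshold 0.5 (tp = 50, fp = 24)
print(campaign_profit(tp=50, fp=24))  # 1760

# Hypothetical labels and probabilities, chosen only for illustration
y_true = [1, 1, 0, 1, 0, 1, 0, 0]
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.2, 0.1]

for threshold in (0.3, 0.5, 0.75):
    print(threshold, profit_at_threshold(y_true, probs, threshold))
```

With these assumed costs, a missed churner forgoes 40 units while an unnecessary contact costs only 10, so the lowest threshold is the most profitable in this toy example. A different cost structure could favor a higher threshold.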

Homework: Customer Churn Prediction

Use the same steps to build a logistic regression model on the Telco_Customer_Churn.csv dataset loaded below.

df = pd.read_csv("https://raw.githubusercontent.com/hovhannisyan91/data_analytics_with_python/refs/heads/main/data/regression/logistic_regression/Telco_Customer_Churn.csv")
df.head()
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No ... No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes ... Yes No No No One year No Mailed check 56.95 1889.5 No
2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes ... No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes ... Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No ... No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes

5 rows × 21 columns

We have customer information for a telecommunications company.

The data includes customer IDs, general customer information, the services each customer subscribed to, the contract type, and monthly charges. Because this is historical data, there is also a field stating whether each customer churned.

Field Descriptions:

  • customerID - Customer ID
  • gender - Whether the customer is a male or a female
  • SeniorCitizen - Whether the customer is a senior citizen or not (1, 0)
  • Partner - Whether the customer has a partner or not (Yes, No)
  • Dependents - Whether the customer has dependents or not (Yes, No)
  • tenure - Number of months the customer has stayed with the company
  • PhoneService - Whether the customer has a phone service or not (Yes, No)
  • MultipleLines - Whether the customer has multiple lines or not (Yes, No, No phone service)
  • InternetService - Customer’s internet service provider (DSL, Fiber optic, No)
  • OnlineSecurity - Whether the customer has online security or not (Yes, No, No internet service)
  • OnlineBackup - Whether the customer has online backup or not (Yes, No, No internet service)
  • DeviceProtection - Whether the customer has device protection or not (Yes, No, No internet service)
  • TechSupport - Whether the customer has tech support or not (Yes, No, No internet service)
  • StreamingTV - Whether the customer has streaming TV or not (Yes, No, No internet service)
  • StreamingMovies - Whether the customer has streaming movies or not (Yes, No, No internet service)
  • Contract - The contract term of the customer (Month-to-month, One year, Two year)
  • PaperlessBilling - Whether the customer has paperless billing or not (Yes, No)
  • PaymentMethod - The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
  • MonthlyCharges - The amount charged to the customer monthly
  • TotalCharges - The total amount charged to the customer
  • Churn - Whether the customer churned or not (Yes or No)