Data Analytics Bootcamp

Session 10: A/B Testing

Multivariate Analysis
Bivariate Analysis
Hypothesis Testing
Statistics
Design of Experiments

Outline

In this session we will cover the following topics:

  • What is A/B testing?
  • A/B testing steps
  • Statistics Review
  • Hypothesis testing with Python
  • A/A testing
Tip

Before you start, make sure to read the Statistics Session 6 materials, as we will be using the concepts of hypothesis testing and p-value in this session.

What is A/B Testing?

A/B Testing is a simplified term for a randomized controlled experiment, in which two variants (A and B) of a single object (product/service) are compared.

Have you ever seen the same website with multiple designs during a certain period of time?

Applications of A/B Testing

  • User Experience (UX): Testing Software Navigation, Color, Shape of the components
  • Marketing: Testing the content of a campaign
  • Drug Development: measuring the effect of the drug compared with either its competitors or placebo

\[\downarrow\]

practically everywhere

Important

In order to give an answer, we need to run an experiment!

Remember the Zen of Python: “In the face of ambiguity, refuse the temptation to guess.”

A/B Testing Steps

In general, A/B testing is done with four sequential steps:

  1. Choose and characterize metrics to evaluate your experiments:
    • What do you care about?
    • How do you want to measure the effect?
  2. Power Analysis:
    • Significance level (\(\alpha\))
    • Statistical power (\(1-\beta\))
    • Practical Significance level
    • Calculate the required sample size
  3. Sample for control/treatment groups and run the test
  4. Analyze the results and draw valid conclusions

Choosing metrics | Step 1

We have two types of metrics:

  • Invariant metrics - should not differ between the control and treatment groups (used as sanity checks)
  • Evaluation metrics - the metrics whose change we are interested in measuring

Four categories of metrics:

  • Sums and counts
  • Distribution (mean, median, percentiles)
  • Probability and rates (e.g. Click-through probability, Click-through rate)
  • Ratios: Return on Investment (RoI)
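To make the rate/probability distinction concrete, here is a minimal sketch on a hypothetical event log (the `events` table and its columns are assumptions for illustration, not course data):

```python
import pandas as pd

# Hypothetical impression-level event log (assumed schema)
events = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "clicked": [1, 0, 0, 1, 0, 1],
})

# Click-through rate: clicks per impression
ctr = events["clicked"].mean()

# Click-through probability: share of unique users who clicked at least once
ctp = events.groupby("user_id")["clicked"].max().mean()
```

Note the difference: the rate counts every impression, while the probability collapses repeat clicks by the same user.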

Power Analysis | Step 2

The power of the test (\(1-\beta\)) is the probability of rejecting the \(H_0\) when it is False.

Statistical Power

We use power to calculate the sample size we need. In general, we have the following parameters:

  • Power of the test (\(1-\beta\))
  • Significance level (\(\alpha\))
  • Effect size (\(\delta\))
  • Sample size (\(n\))
Note

If you fix any three of these, the fourth can be derived.

The rule of thumb for \(1-\beta\) is 0.8, meaning an 80% chance of rejecting \(H_0\) when it is false.

Effect Size

Effect size:

\[H_0: \mu_1=\mu_2\] \[H_1: \mu_1\ne\mu_2\]

Sometimes we want to reject the \(H_0\) with a certain effect, for example when \(|\mu_1-\mu_2|>\delta\)

Effect size use case

The news broadcasting company is testing whether users stay longer on their website with the new website design. The control group consists of visits to the old website, while the treatment group consists of visits to the new website. The new design will be considered effective if the difference in the average duration of the stay is more than 5.5 minutes; thus \(\mu_t-\mu_c >5.5\), the 5.5 here is the effect.

The effect that we want to detect is 5.5, while the effect size is standardized by the standard deviation: \[d = \frac{|\mu_t-\mu_c|}{\sigma}\]
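As a numeric illustration of the standardization, here is the effect size for the use case above, assuming a historical standard deviation of 12 minutes (\(\sigma\) is an assumption here; the use case only specifies the 5.5-minute effect):

```python
# Standardized effect size d = |mu_t - mu_c| / sigma
effect = 5.5   # minimum difference worth detecting, in minutes
sigma = 12     # assumed historical standard deviation, in minutes
d = effect / sigma
print(f"standardized effect size d = {d:.3f}")
```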

T-Value

The t-value measures the degree of difference between the groups relative to the variation within them.
Larger absolute t-values indicate a higher degree of difference between the groups.

P-Value

The p-value measures the probability of observing results at least this extreme by random chance. Therefore, the smaller the p-value, the stronger the evidence of a statistically significant difference between the two groups.

Sample size

The per-group sample size can be approximated by the formula below:

\[n= 2\left( \frac{Z_{1-\alpha/2}+Z_{1-\beta}}{d} \right)^2\]

where

  • \(Z_{1-\alpha/2}\) is the z-score corresponding to the desired confidence level (e.g., for a 95% confidence level, \(Z_{1-\alpha/2} \approx 1.96\))
  • \(Z_{1-\beta}\) is the z-score corresponding to the desired power (e.g., for 80% power, \(Z_{1-\beta} \approx 0.84\))
  • \(d\) is the standardized effect size, calculated as the difference in means divided by the standard deviation
  • \(n\) is the required sample size per group (the factor of 2 accounts for comparing two independent groups)
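The per-group sample size can be checked directly with normal quantiles (the factor of 2 accounts for the two independent groups); this sketch reuses the parameters from the statsmodels example later in this session (effect size 0.4, power 0.8, \(\alpha = 0.05\)):

```python
import math
from scipy.stats import norm

alpha, power, d = 0.05, 0.8, 0.4

z_alpha = norm.ppf(1 - alpha / 2)   # ~ 1.96 for a two-sided 5% level
z_beta = norm.ppf(power)            # ~ 0.84 for 80% power
n_per_group = 2 * ((z_alpha + z_beta) / d) ** 2

print(math.ceil(n_per_group))  # 99
```

The z-based answer (about 98.1 per group) slightly undershoots the t-distribution-based statsmodels result of 99.08, which is expected for moderate sample sizes.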

Random Sampling | Step 3

Once we have determined the required sample size, we can randomly assign users to either the control or treatment group.

This randomization helps to ensure that any differences observed between the groups can be attributed to the treatment effect rather than confounding variables.
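The `df_sampling` DataFrame used below is not constructed in these notes; a synthetic stand-in with the same schema can be built as follows (the distributions are assumptions, so its summary numbers will not match the outputs shown):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100
df_sampling = pd.DataFrame({
    "user_id": np.arange(1, n + 1),
    # Three categories with unequal weights, roughly like the data below
    "category": rng.choice(["A", "B", "C"], size=n, p=[0.5, 0.35, 0.15]),
    "score": np.round(rng.normal(75, 10, size=n), 2),
})
```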

Let’s say we have the following DataFrame:

print(f'The shape of the Dataframe: {df_sampling.shape}' )
print(f'The columns of the Dataframe: {df_sampling.columns}' )
df_sampling.head()
The shape of the Dataframe: (100, 3)
The columns of the Dataframe: Index(['user_id', 'category', 'score'], dtype='str')
user_id category score
0 1 B 79.00
1 2 A 65.95
2 3 C 71.22
3 4 B 87.99
4 5 A 71.44

Summary statistics of the score column:

df_sampling['score'].describe()[['mean', 'std', 'min', 'max']]
mean     74.859800
std       9.843752
min      53.680000
max     104.140000
Name: score, dtype: float64
df_sampling['category'].value_counts().sort_index()
category
A    53
B    33
C    14
Name: count, dtype: int64
df_sampling.groupby('category')['score'].describe()[['mean', 'std', 'min', 'max']]
mean std min max
category
A 74.200000 8.973076 60.29 94.97
B 76.272424 10.014058 58.25 96.28
C 74.027857 12.705514 53.68 104.14

Random Sampling

Suppose we want to take a random sample of 20 users from this DataFrame for our A/B test. We can use the sample method from pandas to do this:

random_sample = df_sampling.sample(n=20, random_state=42)
random_sample.shape
(20, 3)
random_sample.groupby('category')['score'].describe()[['mean', 'std', 'min', 'max']]
mean std min max
category
A 71.211250 8.747513 61.23 83.40
B 75.071429 6.553916 64.64 83.38
C 67.838000 10.015841 53.68 77.68
random_sample['category'].value_counts().sort_index()
category
A    8
B    7
C    5
Name: count, dtype: int64

Stratified Sampling | proportional to the category distribution

stratified_sample = (
    df_sampling
    .groupby("category", group_keys=False)
    .sample(frac=0.2, random_state=42)
    .reset_index(drop=True)
)
Important | Line-by-line explanation
.groupby("category", group_keys=False)
  • Splits data into strata (A, B, C…)
  • Each group is processed independently

.sample(frac=0.2, random_state=42)
  • Takes 20% from each category
  • Preserves distribution
  • Deterministic due to seed
stratified_sample.head()
user_id category score
0 39 A 78.14
1 84 A 77.19
2 93 A 92.24
3 29 A 89.63
4 86 A 86.06

Summary statistics of the score column for the stratified sample:

stratified_sample.groupby('category')['score'].describe()[['mean', 'std', 'min', 'max']]
mean std min max
category
A 76.688182 10.510287 60.29 92.24
B 78.715714 10.081969 66.79 92.68
C 63.650000 5.447054 57.73 68.45
stratified_sample['category'].value_counts().sort_index()
category
A    11
B     7
C     3
Name: count, dtype: int64

Checking the distribution of the category column in the original DataFrame and the stratified sample:

Original DataFrame category distribution:

df_sampling['category'].value_counts(normalize=True).sort_index()
category
A    0.53
B    0.33
C    0.14
Name: proportion, dtype: float64

Stratified Sample category distribution:

stratified_sample['category'].value_counts(normalize=True).sort_index()
category
A    0.523810
B    0.333333
C    0.142857
Name: proportion, dtype: float64

Stratified Sampling | Equal number of samples from each category

There might be situations where we want to ensure an equal number of samples from each category.

stratified_sample = (
    df_sampling
    .groupby("category", group_keys=False)
    .sample(n=10, random_state=42)
    .reset_index(drop=True)
)
Important

n must be less than or equal to the smallest group size in the original DataFrame. In this case, since category C has only 14 samples, we can sample at most 14 from each category.

df_sampling['category'].value_counts()
category
A    53
B    33
C    14
Name: count, dtype: int64

Splitting the data into control and treatment groups

Pure Random Splitting
  • Completely random
  • Does NOT preserve category distribution
  • Can introduce bias
rng = np.random.default_rng(42)

df_random_split = df_sampling.assign(
    group=rng.choice(["control", "treatment"], size=len(df_sampling))
)
df_random_split['group'].value_counts()
group
treatment    52
control      48
Name: count, dtype: int64
Stratified Splitting
df_stratified_split = (
    df_sampling
    .assign(
        group=lambda x: (
            x.groupby("category")["user_id"]
            .transform(
                lambda g: rng.permutation(
                    ["control"] * (len(g)//2) + 
                    ["treatment"] * (len(g) - len(g)//2)
                )
            )
        )
    )
)
df_stratified_split[['group', 'category']].value_counts()
group      category
treatment  A           27
control    A           26
treatment  B           17
control    B           16
treatment  C            7
control    C            7
Name: count, dtype: int64

Analyzing the results | Step 4

Recall the decision rules for hypothesis testing from Statistics Session 6:

(See the decision-rule table and its visual representation in the Statistics Session 6 materials.)
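The decision rule boils down to a single comparison; a minimal helper (hypothetical, not part of the course utilities) makes it explicit:

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Reject H0 if and only if the p-value falls below the significance level."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

print(decide(0.01))   # reject H0
print(decide(0.11))   # fail to reject H0
```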

Power Analysis With Python

Loading Packages

import numpy as np
import pandas as pd
import math
from statsmodels.stats.power import TTestIndPower
from statsmodels.stats.multitest import multipletests
from scipy.stats import ttest_ind
import scipy
import matplotlib.pyplot as plt
Tip

Do not forget to install the required packages before running the code.

Make sure that you are in the correct virtual environment, and run the following command in your terminal:

pip install statsmodels scipy

NOTE: run the above command in your terminal, with your virtual environment activated.

Calculating Sample Size

How large a sample do you need to detect an effect size of \(0.4\), with a power of \(0.8\) and a significance level of \(0.05\)?

You will run a two-independent-samples t-test.

N = TTestIndPower().solve_power(effect_size = 0.4, power = 0.8,
                            alpha = 0.05)

N
99.08032514659006

Note: the sample size is per group, i.e., roughly 100 observations in each of the control and treatment groups.

Sampling distributions

Sampling distribution of the means for two groups (control and treatment):

\[H_0: \mu_c=\mu_t\] \[H_1: \mu_c\ne\mu_t\]


Power

There is a direct relationship between power and effect size: increasing the effect size increases power.

TTestIndPower().plot_power(dep_var='nobs', 
                            nobs=np.array(range(5, 100)), 
                            effect_size=np.array([0.2, 0.5, 0.8]),
                            title='Power of t-Test')

Plot power against effect size using Python, with sample size = 100:

TTestIndPower().plot_power(dep_var='effect_size', nobs= [100],
                                 effect_size=np.arange(0.1, 1, 0.05),
                                 title='Power of t-Test')

Increasing sample size will also increase the power, as with the higher sample size the sampling distribution of the mean becomes narrower. Recall, the standard deviation of the sampling distribution (Standard Error) of the mean is calculated as: \[SE = \frac{\sigma}{\sqrt{n}}\]
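The shrinking standard error is easy to see numerically; this sketch uses the 13.7-minute historical standard deviation from the use case in the next section:

```python
import math

sigma = 13.7  # historical standard deviation, in minutes

# Quadrupling the sample size halves the standard error of the mean
for n in (30, 120, 480):
    print(n, round(sigma / math.sqrt(n), 3))
```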

Hypothesis Testing

Tip | Use case

The news streaming company is adding a new feature to the website. The effect the company is trying to detect is equal to 5 minutes.

  • It is a randomized experiment, meaning that every visitor to the site will have 0.5 probability of being in the treatment (new feature) group and 0.5 probability of being in control group (old design).
  • The minimum effect they want to detect is an increase by 5 minutes.
  • From the historical data they have estimated the standard deviation to be 13.7 minutes.

The hypothesis (one-sided, since the company is looking for an increase, matching alternative = 'larger' in the code below):

\[H_0: \mu_t = \mu_c\] \[H_1: \mu_t > \mu_c\]

Sample Size

Specifications:

  • \(\sigma = 13.7\)
  • Power: \((1-\beta) = 0.8\)
  • Significance level: \(\alpha = 0.05\)
  • Effect size: \(\frac{|\mu_t-\mu_c|}{\sigma} = \frac{5}{13.7} = 0.365\)

Determine the sample size for each group:

TTestIndPower().solve_power(effect_size = 0.365, power = 0.8,
                            alpha = 0.05, alternative = 'larger')
93.49756951363241

Loading the data

expr = pd.read_csv('../data/ab_testing/experiment.csv')
expr.head()
  user_id                           viewing_time  Group
0 4b5630ee914e848e8d07221556b0a2fb     38.354937  control
1 c01f179e4b57ab8bd9de309e6d576c48     49.534278  control
2 11946e7a3ed5e1776e81c0f0ecd383d0     35.468325  control
3 234a2a5581872457b9fe1187d1616b13     69.014875  control
4 dd4ad37ee474732a009111e3456e7ed7     51.547207  control
expr.groupby('Group')['viewing_time'].mean()
Group
control      48.386186
treatment    52.081302
Name: viewing_time, dtype: float64

T-test

T-test with scipy:

ctrl = expr[expr['Group'] == 'control']['viewing_time']
treatment = expr[expr['Group'] == 'treatment']['viewing_time']
test_res = ttest_ind(treatment, ctrl)
tstat, pvalue=test_res
f"t-statistics: {tstat:.4f}"
't-statistics: 1.6002'
f"p-value: {pvalue:.4f}"
'p-value: 0.1128'

Interpretation

We failed to reject \(H_0\) at \(\alpha = 0.05\): we have not observed the anticipated improvement.

How much improvement is there?

diff=treatment.mean() - ctrl.mean()
sd_pooled=math.sqrt((treatment.std()**2+ ctrl.std()**2)/2)

To find the detected effect size, calculate Cohen’s d.

\[\frac{\bar{x_t}-\bar{x_c}}{pooled \; SD}\]

\[\downarrow\]

\[\frac{\bar{x_t}-\bar{x_c}}{\sqrt{(s_t^2+s_c^2)/2}}\]

f"The detected Effect: {diff/sd_pooled:.4f}"
'The detected Effect: 0.3200'

Practical Significance

There can be cases where statistical tests show significance while, in reality, the difference is not practically meaningful.

Such effects can occur in the cases below:

  1. low variance between the two samples
  2. a huge sample size

Recall:

\[t_{value}=\frac{\bar{x_1}-\bar{x_2}}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}\]

Case 1 | low variance

case1=pd.read_csv("../data/ab_testing/case1.csv")
case1.head()
score1 score2
0 85 87
1 85 86
2 86 87
3 86 86
4 85 86
case1.describe().loc[['count','mean','std']]
score1 score2
count 20.000000 20.000000
mean 85.550000 86.400000
std 0.510418 0.502625

Case 2 | huge sample size

case2=pd.read_csv("../data/ab_testing/case2.csv")
case2.head()
score1 score2
0 88 95
1 89 88
2 91 93
3 94 87
4 87 89
case2.describe().loc[['count','mean','std']]
score1 score2
count 20.000000 20.000000
mean 90.650000 90.750000
std 2.777257 2.788605

Utils Functions

def measures(data):
    # Pull the group means and standard deviations from the two score columns
    x1 = data['score1'].mean()
    x2 = data['score2'].mean()
    s1 = data['score1'].std()
    s2 = data['score2'].std()
    return x1, x2, s1, s2

def ttest(x1, x2, s1, s2, n):
    # Two-sample t-test from summary statistics (equal group sizes n)
    t_value = (x1 - x2) / math.sqrt(s1**2 / n + s2**2 / n)
    # Two-sided p-value, with the conservative df = n - 1
    p_value = scipy.stats.t.sf(abs(t_value), df=n - 1) * 2
    return f't-value: {t_value:.4f}', f'p-value: {p_value:.4f}'

Case 1 | Experiment

# t-test for case 1
c1=measures(case1)
ttest(*c1,20)
('t-value: -5.3065', 'p-value: 0.0000')

Case 2 | Experiment

N=200

c2=measures(case2)
ttest(*c2,200)
('t-value: -0.3593', 'p-value: 0.7197')


N=2000

c2=measures(case2)
ttest(*c2,2000)
('t-value: -1.1363', 'p-value: 0.2560')

N=20000

c2=measures(case2)
ttest(*c2,20000)
('t-value: -3.5933', 'p-value: 0.0003')

A/A Testing

A/A testing is a type of experiment where two identical versions of a product or service are compared to each other. The purpose of A/A testing is to validate the experimental setup and ensure that there are no biases or confounding factors that could affect the results of future A/B tests.