Data Analytics Bootcamp
Session 02: Python Basic Syntax, Data Structures

Python
List
Tuple
Set
Dictionary
Pandas DataFrames

Arithmetic Operations

Let revenue be \(r\) and tax rate be \(t\).

The total revenue including tax is:

\[ \text{total} = r \times (1 + t) \]

r = 100
t = 0.2

total = r * (1 + t)
total
120.0

Python supports:

  • Addition +
  • Subtraction -
  • Multiplication *
  • Division /
  • Power **

Boolean values are either True or False.

100 > 50
100 == 50
100 != 50

Logical operators:

  • and
  • or
  • not

print((100 > 50) and (20 < 30))
True

Boolean logic becomes essential when filtering data.

Basic Data Structures

Before working comfortably with pandas, we must understand the core Python data structures.

Every DataFrame is built on top of them.

In analytics, these structures represent:

  • Collections of values
  • Observations
  • Attributes
  • Mappings between keys and values

List

A list is an ordered, mutable collection.

sales = [100, 200, 150, 200]
sales
[100, 200, 150, 200]

Properties:

  • Ordered
  • Indexed
  • Allows duplicates
  • Mutable

Access elements:

sales[0]
sales[-1]
200

Modify elements:

sales.append(300)
sales[1] = 250
sales
[100, 250, 150, 200, 300]

Remove elements:

sales.remove(150)
sales
[100, 250, 200, 300]

Length of the list:

len(sales)
4

Analytical Context

A list can represent:

  • Daily sales
  • Customer revenues
  • Monthly growth rates

If values are \(x_1, x_2, ..., x_n\), the total revenue is:

\[ \sum_{i=1}^{n} x_i \]

total = 0
for value in sales:
    total += value

total
850

Tuple

A tuple is ordered but immutable.

coordinates = (40.18, 44.51)
coordinates
(40.18, 44.51)

Properties:

  • Ordered
  • Indexed
  • Immutable

Why immutability matters:

  • Prevents accidental changes
  • Safe for constant data
  • Can be used as dictionary keys
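A brief sketch of the dictionary-key point (the second coordinate pair is made up for illustration). A list cannot be used as a key because it is mutable; a tuple can:

```python
# Map (latitude, longitude) pairs to city names.
# A list key here would raise TypeError: unhashable type.
city_by_coords = {
    (40.18, 44.51): "Yerevan",
    (41.72, 44.78): "Tbilisi",  # illustrative coordinates
}

print(city_by_coords[(40.18, 44.51)])  # Yerevan
```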

Set

A set stores unique values.

customer_ids = {1, 2, 3, 3, 4}
customer_ids
{1, 2, 3, 4}

Properties:

  • Unordered
  • No duplicates
  • Fast membership checking

Analytical use case:

  • Removing duplicates
  • Comparing segments

segment_a = {1, 2, 3}
segment_b = {3, 4, 5}

segment_a.intersection(segment_b)
{3}
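Beyond intersection, sets also support union and difference, which map naturally onto segment comparisons:

```python
segment_a = {1, 2, 3}
segment_b = {3, 4, 5}

either = segment_a.union(segment_b)        # customers in either segment
only_a = segment_a.difference(segment_b)   # customers only in segment_a

print(either)          # {1, 2, 3, 4, 5}
print(only_a)          # {1, 2}
print(3 in segment_a)  # True -- membership checks on sets are fast
```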

Dictionary

A dictionary maps keys to values.

customer = {
    "name": "Anna",
    "revenue": 150,
    "city": "Yerevan"
}

customer
{'name': 'Anna', 'revenue': 150, 'city': 'Yerevan'}

Properties:

  • Keys must be unique
  • Values can be any type
  • Fast lookup

Access values:

customer["revenue"]
150

Add or update:

customer["segment"] = "Premium"
customer["revenue"] = 200
customer
{'name': 'Anna', 'revenue': 200, 'city': 'Yerevan', 'segment': 'Premium'}

Remove:

del customer["city"]
customer
{'name': 'Anna', 'revenue': 200, 'segment': 'Premium'}

From Dictionary to Structured Data

A collection of dictionaries can represent tabular data:

customers = [
    {"name": "Anna", "revenue": 150},
    {"name": "David", "revenue": 220},
    {"name": "Mariam", "revenue": 90}
]

customers
[{'name': 'Anna', 'revenue': 150},
 {'name': 'David', 'revenue': 220},
 {'name': 'Mariam', 'revenue': 90}]

This structure is very close to what pandas formalizes.
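To see why this is already "almost a table", a small sketch totals the revenue column by looping over the records:

```python
customers = [
    {"name": "Anna", "revenue": 150},
    {"name": "David", "revenue": 220},
    {"name": "Mariam", "revenue": 90},
]

total = 0
for row in customers:        # each dictionary is one row
    total += row["revenue"]  # each key acts like a column name

print(total)  # 460
```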


Transition to Pandas

A pandas DataFrame is conceptually:

  • A dictionary of columns
  • Each column behaves like a labeled list

flowchart LR
    A[List] --> C[Dictionary]
    C --> D[DataFrame]


Creating a DataFrame

import pandas as pd

data = {
    "name": ["Anna", "David", "Mariam"],
    "revenue": [150, 220, 90],
    "city": ["Yerevan", "Tbilisi", "Warsaw"]
}

df = pd.DataFrame(data)
df
name revenue city
0 Anna 150 Yerevan
1 David 220 Tbilisi
2 Mariam 90 Warsaw

Inspecting Structure

df.info()
df.shape
df.columns
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   name     3 non-null      object
 1   revenue  3 non-null      int64
 2   city     3 non-null      object
dtypes: int64(1), object(2)
memory usage: 204.0+ bytes
Index(['name', 'revenue', 'city'], dtype='object')

Basic DataFrame Manipulation

Selecting Columns

df["revenue"]
df[["name", "revenue"]]
name revenue
0 Anna 150
1 David 220
2 Mariam 90

Adding a Column

Suppose tax rate \(t = 0.2\).

\[ \text{revenue\_after\_tax} = r \times (1 - t) \]

t = 0.2
df["revenue_after_tax"] = df["revenue"] * (1 - t)
df
name revenue city revenue_after_tax
0 Anna 150 Yerevan 120.0
1 David 220 Tbilisi 176.0
2 Mariam 90 Warsaw 72.0

Removing a Column

df.drop("city", axis=1)
name revenue revenue_after_tax
0 Anna 150 120.0
1 David 220 176.0
2 Mariam 90 72.0

To modify permanently:

df = df.drop("city", axis=1)
df
name revenue revenue_after_tax
0 Anna 150 120.0
1 David 220 176.0
2 Mariam 90 72.0

Filtering Rows

df[df["revenue"] > 100]
name revenue revenue_after_tax
0 Anna 150 120.0
1 David 220 176.0

Equivalent SQL:

SELECT *
FROM customers
WHERE revenue > 100;

Updating Values

df.loc[df["revenue"] < 100, "segment"] = "Low"
df.loc[df["revenue"] >= 100, "segment"] = "High"
df
name revenue revenue_after_tax segment
0 Anna 150 120.0 High
1 David 220 176.0 High
2 Mariam 90 72.0 Low

This applies conditional logic to structured data.

Analytical Flow

flowchart LR
    A[Python Structures] --> B[Dictionary of Lists]
    B --> C[DataFrame]
    C --> D[Select]
    C --> E[Filter]
    C --> F[Transform]
    D --> G[Insights]
    E --> G
    F --> G

We have moved from:

  • Lists
  • Dictionaries
  • Sets
  • Tuples

To:

  • Structured tabular data
  • Column manipulation
  • Row filtering
  • Feature creation
Important

Understanding the foundations makes pandas intuitive instead of magical.

Next, after conditions and loops, we will go deeper into selection, filtering, and aggregation using real datasets.

Important

Mutable vs Immutable

Understanding mutability is essential for writing correct analytical code.

Mutable Objects

Mutable objects can be changed after creation.

  • list
  • dict
  • set
  • pandas DataFrame

Example:

sales = [100, 200, 150]
sales[0] = 300
sales
[300, 200, 150]

The original object is modified.


Immutable Objects

Immutable objects cannot be changed after creation.

  • int
  • float
  • str
  • bool
  • tuple

Example:

x = 10
x = x + 5
x
15

Here, a new object is created. The original value is not modified.

Trying to modify a tuple:

coordinates = (40.18, 44.51)
# coordinates[0] = 41   # Error

Why This Matters in Data Analytics

Mutability affects:

  • Function behavior
  • Memory references
  • Unexpected side effects
  • pandas chained assignments

If you modify a mutable object inside a function, the original data may change.

Understanding this prevents subtle analytical bugs.
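A sketch of that side effect (the function names are illustrative): a function that appends to a list changes the caller's data, while rebinding an int does not:

```python
def add_bonus(values):
    values.append(999)   # mutates the caller's list in place

def increment(x):
    x = x + 1            # rebinds a local name; the caller's value is untouched
    return x

sales = [100, 200]
add_bonus(sales)
print(sales)             # [100, 200, 999] -- the original list changed

n = 10
increment(n)
print(n)                 # 10 -- the immutable int did not change
```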

Conditional Statements

Conditional statements allow a program to make decisions. In data analytics, decisions appear everywhere:

  • Filtering rows
  • Classifying customers
  • Detecting anomalies
  • Creating segments
  • Handling missing values

At the core of these operations lies the if statement.

The Basic if Statement

revenue = 150

if revenue > 100:
    print("High revenue")
else:
    print("Normal revenue")
High revenue

Structure:

  • if keyword
  • A condition that evaluates to True or False
  • A colon :
  • An indented block of code

If the condition is True, the indented block runs.
If it is False, the else block runs instead; without an else, nothing happens.

Indentation | Why It Matters

In Python, indentation defines structure.
It is not optional formatting — it is syntax.

revenue = 150

if revenue > 100:
    print("High revenue")

print("Analysis complete")
High revenue
Analysis complete

Only the indented line belongs to the if block.

If indentation is incorrect, Python raises an error:

revenue = 150

if revenue > 100:
print("High revenue")   # IndentationError

Best practice:

  • Use 4 spaces per indentation level
  • Never mix tabs and spaces
  • Keep indentation consistent

Using elif for Multiple Conditions

When there are more than two possible categories, use elif.

revenue = 180

if revenue > 200:
    print("Very high revenue")
elif revenue > 100:
    print("High revenue")
elif revenue > 50:
    print("Medium revenue")
else:
    print("Low revenue")
High revenue

Important principles:

  • Conditions are evaluated from top to bottom
  • The first True condition executes
  • Remaining conditions are skipped
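A sketch of why ordering matters: with checks from broadest to most specific, the broad branch captures values the specific branch was meant for:

```python
revenue = 250

# Most specific condition first: correct
if revenue > 200:
    label_good = "Very high"
elif revenue > 100:
    label_good = "High"
else:
    label_good = "Low"

# Broadest condition first: revenue > 100 is already True, so "High" wins
if revenue > 100:
    label_bad = "High"
elif revenue > 200:
    label_bad = "Very high"   # never reached for revenue = 250
else:
    label_bad = "Low"

print(label_good, label_bad)  # Very high High
```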

Boolean Expressions and Comparison Operators

Every conditional statement depends on a Boolean expression.

A Boolean expression evaluates to either True or False.

100 > 50
100 == 50
100 != 50
True

Common comparison operators:

  • > greater than
  • < less than
  • >= greater than or equal
  • <= less than or equal
  • == equal
  • != not equal

These operators form the foundation of filtering logic in data analysis.

Combining Conditions with Logical Operators

Often, a single condition is not enough.

Python provides logical operators:

  • and
  • or
  • not

revenue = 150
is_active = True

if revenue > 100 and is_active:
    print("Target customer")
Target customer

Rules:

  • and → both conditions must be True
  • or → at least one condition must be True
  • not → reverses the Boolean value

Example:

is_active = False

if not is_active:
    print("Customer is inactive")
Customer is inactive

Logical operators become extremely important when filtering datasets with multiple criteria.
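A sketch with made-up customer records, combining and with or (the parentheses make the intent explicit):

```python
# Target customers: active AND (in the EU region OR revenue above 200)
customers = [
    {"name": "Anna",   "revenue": 150, "region": "EU", "active": True},
    {"name": "David",  "revenue": 220, "region": "US", "active": True},
    {"name": "Mariam", "revenue": 90,  "region": "EU", "active": False},
]

targets = []
for c in customers:
    if c["active"] and (c["region"] == "EU" or c["revenue"] > 200):
        targets.append(c["name"])

print(targets)  # ['Anna', 'David']
```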

Nested Conditional Statements

Conditionals can be placed inside other conditionals.

revenue = 150
region = "EU"

if revenue > 100:
    if region == "EU":
        print("High EU revenue")
    else:
        print("High non-EU revenue")
High EU revenue

Notice how indentation increases with each nested block.

Each indentation level represents a new logical layer.

Deep nesting reduces readability.
In data analytics, clarity is preferred over complexity.

Common Mistakes

1. Using = instead of ==

if revenue = 100:   # SyntaxError

Correct:

if revenue == 100:
    print("Equal")

= assigns a value.
== compares values.


2. Forgetting the colon

if revenue > 100
    print("High")

The colon is mandatory.


3. Incorrect indentation

if revenue > 100:
print("High")

Python requires consistent indentation.


Visualizing Conditional Flow

flowchart TD
    A[Start] --> B{Condition True?}
    B -->|Yes| C[Execute Block]
    B -->|No| D[Skip Block]
    C --> E[Continue]
    D --> E

Indentation defines what belongs to the decision branch.


Why Conditional Logic Matters in Analytics

Conditional logic is the basis of:

  • Data filtering
  • Segmentation
  • Rule-based scoring
  • Feature creation
  • Data validation

Every WHERE clause in SQL is conceptually an if statement applied to rows.

Understanding conditional statements deeply ensures that later pandas filtering feels natural and intuitive.
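A sketch of that correspondence, assuming pandas is installed: the same rule written as a row-by-row if statement and as a pandas Boolean mask (in SQL: SELECT name FROM customers WHERE revenue > 100;):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Anna", "David", "Mariam"],
                   "revenue": [150, 220, 90]})

# Row by row: an if statement applied to each row
kept_loop = []
for name, rev in zip(df["name"], df["revenue"]):
    if rev > 100:
        kept_loop.append(name)

# Whole column at once: a Boolean mask
kept_mask = df[df["revenue"] > 100]["name"].tolist()

print(kept_loop)  # ['Anna', 'David']
print(kept_mask)  # ['Anna', 'David']
```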

Loops

Important

Conditional statements allow decisions.

Loops allow repetition.

In data analytics, repetition appears constantly:

  • Iterating over values
  • Applying rules to many observations
  • Aggregating manually
  • Cleaning records
  • Transforming data

The most common loop in Python is the for loop.


The Basic for Loop

sales = [100, 200, 150]

for value in sales:
    print(value)
100
200
150

Structure:

  • for keyword
  • A temporary variable (value)
  • The keyword in
  • An iterable object (sales)
  • A colon :
  • An indented block

The loop runs once for each element in the collection.


How a for Loop Works

flowchart TD
    A[Start] --> B[Take first element]
    B --> C[Execute indented block]
    C --> D{More elements?}
    D -->|Yes| B
    D -->|No| E[Stop]

Each iteration processes one element.

Loop With Accumulation

Loops are often used to compute totals.

sales = [100, 200, 150]

total = 0

for value in sales:
    total = total + value

total
450

Mathematically, if values are \(x_1, x_2, ..., x_n\):

\[ \text{Total} = \sum_{i=1}^{n} x_i \]

This manual summation mirrors what sum() does internally.


Loop With Conditional Logic

You can combine loops and conditions.

sales = [100, 200, 150, 50]

for value in sales:
    if value > 120:
        print("High sale:", value)
High sale: 200
High sale: 150

Now we are:

  • Iterating
  • Evaluating
  • Filtering

This is conceptually similar to a SQL WHERE clause.


Looping Over Dictionaries

Loops are not limited to lists.

customer = {
    "name": "Anna",
    "revenue": 150,
    "city": "Yerevan"
}

for key in customer:
    print(key, ":", customer[key])
name : Anna
revenue : 150
city : Yerevan

You can also iterate over key–value pairs:

for key, value in customer.items():
    print(key, value)
name Anna
revenue 150
city Yerevan

The range() Function

Sometimes you need numeric iteration.

range(5) generates the numbers 0, 1, 2, 3, 4:

for i in range(5):
    print(i)
0
1
2
3
4

flowchart LR
    A[range 0 to 4] --> B[0]
    A --> C[1]
    A --> D[2]
    A --> E[3]
    A --> F[4]
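range() also accepts start, stop, and step arguments, which is useful for custom numeric sequences:

```python
evens = list(range(2, 10, 2))     # start at 2, stop before 10, step by 2
countdown = list(range(5, 0, -1)) # a negative step counts down

print(evens)      # [2, 4, 6, 8]
print(countdown)  # [5, 4, 3, 2, 1]
```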


Nested Loops

Loops can be nested inside each other.

for i in range(3):
    for j in range(2):
        print("i =", i, ", j =", j)
i = 0 , j = 0
i = 0 , j = 1
i = 1 , j = 0
i = 1 , j = 1
i = 2 , j = 0
i = 2 , j = 1

Indentation increases with nesting.

Each additional level increases computational complexity.


Common Mistakes With Loops

1. Forgetting indentation

for value in sales:
print(value)

Indentation is required.


2. Modifying a collection while iterating

This can lead to unexpected behavior.
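A sketch of the problem, using example values, and a safe alternative that builds a new list instead:

```python
sales = [100, 50, 30, 200]

# Buggy: removing items while iterating shifts positions, so elements get skipped
for value in sales:
    if value < 100:
        sales.remove(value)
print(sales)  # [100, 30, 200] -- 30 was skipped, not removed

# Safe: iterate over the original and collect what you want to keep
raw = [100, 50, 30, 200]
cleaned = []
for value in raw:
    if value >= 100:
        cleaned.append(value)
print(cleaned)  # [100, 200]
```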

Analytical Perspective

In analytics, loops are useful for:

  • Custom feature engineering
  • Rule-based transformations
  • Processing API responses
  • Working with nested structures

However:

For tabular data, pandas vectorized operations are usually faster and cleaner than loops.

Loops are foundational knowledge.
Vectorization is analytical optimization.
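A sketch of the same transformation both ways, assuming pandas is installed; the vectorized form replaces the explicit loop:

```python
import pandas as pd

sales = [100, 200, 150]

# Loop version: one value at a time
taxed_loop = []
for value in sales:
    taxed_loop.append(value * 1.2)

# Vectorized version: one expression applied to the whole column at once
taxed_vec = (pd.Series(sales) * 1.2).tolist()

print(taxed_loop == taxed_vec)  # True
```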

Summary

You now understand:

  • Basic for loop structure
  • Loop flow
  • Accumulation logic
  • Combining loops with conditions
  • Iterating over dictionaries
  • Using range()
  • Nested loops

For more information on loops, see the official Python documentation.

List Comprehension

List comprehension allows us to create new lists in a concise and readable way.

It replaces many simple loops.

Basic Structure

new_list = [expression for item in iterable]

Equivalent traditional loop:

sales = [100, 200, 150]

new_sales = []

for value in sales:
    new_sales.append(value * 1.2)

new_sales

Now using list comprehension:

sales = [100, 200, 150]

new_sales = [value * 1.2 for value in sales]
new_sales

The result is identical, but the syntax is cleaner.


With Conditional Filtering

You can add a condition:

sales = [100, 200, 150, 50]

high_sales = [value for value in sales if value > 120]
high_sales

Traditional version:

high_sales = []

for value in sales:
    if value > 120:
        high_sales.append(value)

high_sales

List comprehension combines:

  • Iteration
  • Conditional filtering
  • Transformation

In one readable line.


Mathematical Interpretation

If values are \(x_1, x_2, ..., x_n\), and we want only those where \(x_i > 120\):

\[ \{ x_i \mid x_i > 120 \} \]

List comprehension expresses this directly in code.


Conditional Expression Inside Comprehension

You can also transform conditionally:

sales = [100, 200, 150, 50]

labels = ["High" if value > 120 else "Low" for value in sales]
labels

This mirrors segmentation logic.


When to Use List Comprehension

Use when:

  • You are transforming a list
  • You are filtering values
  • The logic is simple and readable

Avoid when:

  • Logic becomes too complex
  • Multiple nested conditions reduce clarity

Readability is more important than brevity.

Conceptual Flow

flowchart LR
    A[Original List] --> B[Iterate]
    B --> C{Condition?}
    C -->|Yes| D[Transform]
    C -->|No| E[Skip or Alternate]
    D --> F[New List]
    E --> F

List comprehension is structured iteration with transformation.

Train Yourself

Given:

revenues = [120, 250, 80, 310, 95]

  1. Create a new list with revenues after applying 10% tax.
  2. Create a list containing only revenues greater than 100.
  3. Create a list labeling each revenue as "High" if > 200, otherwise "Normal".

Use list comprehension only.


Why This Matters for Pandas

List comprehension is conceptually similar to:

  • Creating new columns
  • Applying transformations
  • Conditional feature engineering

Soon, you will see how pandas vectorizes this behavior.
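As a preview, assuming pandas is installed, the same labeling logic can be written as a comprehension or as the .loc assignments used earlier in this session:

```python
import pandas as pd

sales = [100, 200, 150, 50]

# List comprehension version
labels_lc = ["High" if value > 120 else "Low" for value in sales]

# pandas version: Boolean masks with .loc
df = pd.DataFrame({"sales": sales})
df.loc[df["sales"] > 120, "label"] = "High"
df.loc[df["sales"] <= 120, "label"] = "Low"

print(labels_lc)             # ['Low', 'High', 'High', 'Low']
print(df["label"].tolist())  # ['Low', 'High', 'High', 'Low']
```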

Homework

This homework integrates:

  • Arithmetic operations
  • Boolean logic
  • Lists, sets, dictionaries
  • Conditional statements
  • Loops
  • List comprehension
  • Basic pandas DataFrame manipulation

You will simulate a small revenue analytics pipeline and submit your work as a Jupyter Notebook (.ipynb) file.

Note

Create a 02_python_fundamentals_for_data_analytics.ipynb file and complete the following tasks. Once finished, push to GitHub and share the link.

Scenario

You are analyzing weekly transaction data for a small company.

The company wants to:

  • Adjust revenues for tax
  • Apply discount rules
  • Classify customers
  • Remove duplicate IDs
  • Prepare data for structured analysis

Part 1

Given:

revenues = [120, 250, 80, 310, 95]
tax_rate = 0.18
discount_rate = 0.10

Let revenue be \(r\).

Final revenue formula:

\[ \text{final} = r \times (1 + \text{tax\_rate}) \times (1 - \text{discount\_rate}) \]

Tasks

  1. Compute revenues including tax.
  2. Compute final revenues after tax and discount.
  3. Create a Boolean list indicating whether revenue > 100.
  4. Create a Boolean list indicating whether revenue is between 100 and 300.
  5. Add Markdown explanation: Why do parentheses matter in the formula?

Part 2

Using:

sales = [120, 250, 80, 310, 95]

  1. Compute total revenue manually using a loop.
  2. Compute average revenue.
  3. Identify the maximum value without using max().
  4. Count how many values are greater than 150.
  5. Create a new list of revenues after adding 5% commission.

Part 3

Segmentation rules:

  • "Premium" if revenue > 250
  • "Standard" if 100 < revenue ≤ 250
  • "Low" otherwise

  1. Create a list of segment labels.
  2. Count how many customers fall into each segment.
  3. Add Markdown explanation: Why does the order of elif statements matter?

Part 4

Given:

customer_ids = [1, 2, 3, 3, 4, 5, 5, 6]

  1. Remove duplicates using a set.
  2. Convert back to a list.
  3. Compare lengths before and after deduplication.
  4. Explain in Markdown why sets are unordered.

Part 5

Create a dictionary:

customer = {
    "name": "Anna",
    "revenue": 250,
    "city": "Yerevan"
}

  1. Add a new key "segment" based on revenue.
  2. Update revenue to 300.
  3. Remove "city".
  4. Loop over the dictionary and print key-value pairs.
  5. Add Markdown explanation: Why are dictionaries useful for structured data?

Part 6

Using:

revenues = [120, 250, 80, 310, 95]

  1. Create a list of revenues after 10% tax.
  2. Create a list of revenues greater than 100.
  3. Create a list labeling each revenue as:
    • "High" if > 200
    • "Normal" otherwise
  4. Compare list comprehension vs loop in Markdown.

Part 7

Convert your data into a DataFrame.

import pandas as pd

data = {
    "revenue": revenues
}

df = pd.DataFrame(data)
df

  1. Add column "revenue_after_tax".
  2. Add column "segment" using conditional logic.
  3. Filter rows where revenue > 100.
  4. Remove the "revenue_after_tax" column.
  5. Print the shape of the DataFrame.
  6. Add Markdown explanation: Why is this easier than manual loops?

Bonus Reflection

Answer briefly in Markdown:

  • Difference between mutable and immutable objects
  • Why loops are less efficient than pandas vectorization
  • How Boolean logic relates to SQL WHERE
  • Why understanding lists helps understand DataFrames

Submission Requirements

  • Submit a .ipynb file
  • Use both code cells and Markdown cells
  • All code must execute without errors
  • Clearly label each section

Notebook structure should be clean and readable.

Analytical Flow

flowchart LR
    A[Raw Revenues] --> B[Arithmetic Transformations]
    B --> C[Conditional Classification]
    C --> D[Deduplication]
    D --> E[Dictionary Structure]
    E --> F[DataFrame]
    F --> G[Filtering & Transformation]