Table of Contents

Best Ways to Download Python
Basic Syntax Rules
Import Statement
Introduction
Input and Output
Assignment Table
Garbage Collection
The For Loop
The While Loop
Loop Control Statements
Nested Loops
What is The Fastest Data Structure?
Specialized and External Structures
What is The Fastest Data Structure For Inserting More Than 1,000 Rows?
Missing Values
Identifying Missing Values
Handling Missing Values
Using pandas to Identify Missing Values
Using pandas to Count Missing Values
Using pandas to Handle Missing Values
Accessing Columns from df in pandas
Common Attribute Notation
Boolean
Namespace
Finding the Maximum in a Dataset
Dictionary
Linear Regression
Random Forest
Heteroscedasticity
Statistics: Standard Deviation
Linear Regression Deep Dive Part 1
NumPy
Matrices
Probability
Skewness
Gaussian Distribution
Probability Density Function
Gradient Descent
Limiting Data Ranges
Data Chunking
Splitting Data For Training and Testing
Root Mean Square Error
Outliers
Linear Regression Experiment Using Tensorflow
Loss Function
R-Squared
One-Sample Z-Test
Some Math




Best Ways to Download Python:



1. Use the official installer from the official Python website
2. Use Homebrew and type in: brew install python
3. Be sure to use pyenv to avoid issues with multiple Python versions.



Basic Syntax Rules:



1. Indentation: Python uses whitespace (indentation) instead of curly brackets to define the scope of loops, functions, and classes.
2. Case Sensitivity: Variable names are case-sensitive (e.g., name and Name are different).
3. Naming: Names must start with a letter or underscore. They cannot start with a number or be a reserved keyword (like if, for, or return).


Import Statement


In Python, the import statement is used to bring code from one module or package into your current script, allowing you to reuse functions, classes, and variables.


Ways to Import

1. Standard Import: Imports the entire module. You must prefix names with the module name.

import math
print(math.pi)

2. Specific Import: Imports only certain items directly into your namespace, so no prefix is needed.

from math import pi
print(pi)

3. Import With Alias: Renames a module or function, often used for brevity or to avoid name conflicts.

import pandas as pd
import numpy as np

4. Import All: Imports everything from a module. This is generally not recommended as it can clutter your namespace and cause unexpected bugs.

from math import *



Introduction



Variables and Data Types: Learn how to store information using variables and understand basic types like integers (int), decimal numbers (float), text (str), and true/false values (bool).

Common Data Types

Integers (int): Whole numbers like 10 or -5
Floating-point numbers (float): Decimals like 3.14 or 2.0.
Strings (str): Text enclosed in single or double quotes, like "Hello".
Booleans (bool): Logical values True or False.


Basic Operators: Master arithmetic operators (+, -, *, /) and comparison operators (==, !=, <, >) to perform calculations and check conditions.


Input and Output

Use print() to display information and input() to receive user data (which defaults to a string)


Assignment Table


In Python, variables are essentially names or labels that refer to objects or data values stored in computer memory. Unlike many other programming languages, you do not need to "declare" a variable with a specific type; a variable is created the moment you assign a value to it

Naming Rules

To create a valid variable name, you must follow these rules:

1. Must start with a letter or an underscore (_)
2. Cannot start with a number
3. Can only contain alphanumeric characters and underscores (A-Z, a-z, 0-9, and _).
4. Cannot be a Python keyword (like if, else, or while).


Action | Example Code
Basic Assignment | x = 5
Multiple Assignment | x, y, z = "Apple", "Banana", "Cherry"
Same Value Assignment | x = y = z = "Orange"
Type Checking | print(type(x))
Casting (Manual Type) | x = str(3) (stores "3" as text)


Basic Reassignment


To reassign, simply place the variable name on the left and the new value or expression on the right.
x = 10 # Initial assignment
x = "Hello" # Reassigned to a different type


Key Reassignment Techniques


1. Augmented Assignment: You can update a variable's value based on its current value using operators like +=, -=, or *=.

score = 5
score += 1 # Equivalent to score = score + 1

2. Multiple Assignment & Swapping: Python allows you to reassign multiple variables in one line, which is commonly used to swap values without a temporary variable.

a, b = 1, 2
a, b = b, a # a is now 2, b is now 1

3. Global and Nonlocal Reassignment: To reassign a variable from an outer scope inside a function, you must use the global or nonlocal keywords; otherwise, Python creates a new local variable instead.

count = 0
def increment():
    global count  # Without this line, count += 1 would raise UnboundLocalError
    count += 1

Concepts

Reference vs. Value: Variables in Python are pointers to objects in memory. Reassigning a variable changes which object it points to; it does not change the original object itself.

Shared References: If two variables point to the same object (e.g., y = x), reassigning x will not affect y. y will continue to point to the original object.

Garbage Collection: When a value no longer has any variables pointing to it, it becomes eligible for garbage collection to free up memory
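The difference between reassigning a name and changing a shared object is easy to miss, so here is a minimal sketch of both cases. The variable names are illustrative only.

```python
# Case 1: reassignment. x is pointed at a new object; y keeps the original.
x = [1, 2, 3]
y = x              # y and x refer to the same list object
x = [9, 9, 9]      # x now refers to a different list
print(y)           # [1, 2, 3]

# Case 2: mutation. The shared object itself changes, so both names see it.
a = [1, 2, 3]
b = a
a.append(4)        # modifies the one list that a and b both refer to
print(b)           # [1, 2, 3, 4]
print(a is b)      # True: still the same object
```

Reassignment never affects other names, while in-place changes to a mutable object (lists, dictionaries, sets) are visible through every name that refers to it.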








Python Garbage Collection



Python handles memory management automatically through a process called Garbage Collection (GC), which ensures that memory used by objects no longer needed by your program is reclaimed

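Reference Counting: This is Python’s main memory management tool (in CPython, the standard interpreter). Every object keeps track of how many other objects or variables point to it. As soon as this count hits zero, the object is immediately destroyed and its memory is freed. Objects that refer to each other in a cycle never reach zero this way, so a separate cyclic garbage collector (exposed through the gc module) finds and reclaims them.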
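As a rough illustration of reference counting, the sketch below uses a weak reference to observe when an object is freed. It assumes CPython, where an object is reclaimed as soon as its last strong reference disappears; other interpreters may free it later. The Node class is a hypothetical placeholder.

```python
import weakref

class Node:
    """Hypothetical class used only for this illustration."""

node = Node()
ref = weakref.ref(node)      # a weak reference does not keep the object alive

print(ref() is node)         # True: the object still exists

del node                     # remove the last strong reference
print(ref() is None)         # True in CPython: the object was freed right away
```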



The For Loop



The Python for statement iterates over the items of any sequence or iterable object, such as a list, string, or range. It is typically used when you know the number of iterations in advance.


fruits = ["apple", "banana", "cherry"]
for fruit in fruits:
    print(fruit) # Runs once for each item in the list



The While Loop


The Python while loop repeats a block of code as long as a specified condition remains True. It is best suited for scenarios where the number of iterations is unknown and depends on a dynamic condition
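A minimal sketch of a while loop whose number of iterations is not fixed in advance: it keeps adding successive integers until a running total reaches a limit. The variable names and the limit of 10 are arbitrary choices for illustration.

```python
count = 0
total = 0

# Keep looping while the condition is True; it is re-checked before every pass.
while total < 10:
    count += 1
    total += count     # running total: 1, 3, 6, 10

print(count, total)    # 4 10
```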






Loop Control Statements


You can modify the behavior of a loop using specific keywords:

1. break: Immediately terminates the loop and moves to the next statement in the program.
2. continue: Skips the current iteration and jumps to the beginning of the next one.
3. else: An optional block that runs once after the loop finishes normally (i.e., it didn't hit a break).
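The three keywords can be combined in one loop. The sketch below uses a hypothetical helper, first_even, that skips negative numbers with continue, stops at the first even number with break, and falls into the else block only when no break happened.

```python
def first_even(numbers):
    """Describe the first non-negative even number (hypothetical helper)."""
    for n in numbers:
        if n < 0:
            continue                 # skip this item and move to the next one
        if n % 2 == 0:
            result = f"found {n}"
            break                    # leave the loop immediately
    else:
        # Runs only when the loop finished without hitting break.
        result = "no even number"
    return result

print(first_even([-2, 3, 8, 10]))    # found 8
print(first_even([1, 3, 5]))         # no even number
```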


Nested Loops


You can place one loop inside another, known as nested loops. This is often used for working with multi-dimensional data structures like matrices.



for x in range(2):
    for y in range(3):
        print(f"Coordinates: {x}, {y}")


What is The Fastest Data Structure?


The "fastest" data structure in Python depends entirely on the operation you are performing (e.g., searching, inserting, or bulk numerical processing)


1. Dictionaries (dict): The fastest for key-based lookups, insertions, and deletions. It offers average \(O(1)\) constant-time performance because it is implemented as a highly optimized hash table in C.

*Side Note:

\(O(1)\) (Constant Time) is a computational complexity term used in computer science to describe an algorithm or operation that takes the same amount of time to execute, regardless of the size of the input data. 

2. Sets (set): The fastest way to check for uniqueness or membership ("Is X in this group?"). Like dictionaries, they use hash tables and are significantly faster than lists for searching.


3. Tuples (tuple): Faster to instantiate than lists and slightly more memory-efficient. They are ideal for fixed data that won't change.


4. Lists (list): Fastest for simple indexing (accessing by position) but slow for searching (\(O(n)\)) or inserting/deleting at the beginning of the list.

Specialized and External Structures

1. collections.deque: The fastest for adding or removing items from both ends of a sequence. While lists are \(O(n)\) for pop(0), a deque is \(O(1)\).

2. NumPy Arrays (numpy.array): The fastest for numerical computations and massive datasets. They use contiguous memory and vectorized operations, often outperforming standard lists by 10–100x for math-heavy tasks.

3. Polars DataFrames: Currently one of the fastest libraries for structured data manipulation at scale, outperforming Pandas due to its Rust backend and lazy evaluation.

4. heapq: The fastest way to maintain a priority queue where you frequently need to retrieve the smallest (or largest) item.
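To make two of the specialized structures concrete, here is a short sketch of collections.deque and heapq from the standard library. The sample values are arbitrary.

```python
from collections import deque
import heapq

# deque: cheap appends and pops at both ends
queue = deque([1, 2, 3])
queue.appendleft(0)          # add to the left end
queue.append(4)              # add to the right end
first = queue.popleft()      # remove from the left end -> 0

# heapq: a list kept in heap order so the smallest item is always at index 0
heap = [5, 1, 4, 2]
heapq.heapify(heap)          # rearrange the list in place into a heap
heapq.heappush(heap, 0)
smallest = heapq.heappop(heap)        # 0
next_smallest = heapq.heappop(heap)   # 1

print(first, list(queue), smallest, next_smallest)
```

heapq only provides a min-heap; a common workaround for retrieving the largest item is to push negated values.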




What is The Fastest Data Structure For Inserting More Than 1,000 Rows?

For inserting more than 1,000 values in Python, lists (list.append()) are generally the fastest for bulk loading, followed by sets or dictionaries for unique, fast-lookup data. Use lists when order matters and you need high-speed insertion, or use generators/deque for memory-efficient handling of large streams.


Fastest In-Memory Data Structures

1. List (list.append()): Best for general-purpose fast, ordered, bulk insertion (\(O(1)\) average).

2. Deque (collections.deque): Faster than a list for adding data to the beginning (left side) of the data structure.

3. Set/Dict: Fastest for ensuring uniqueness (no duplicates) and fast membership lookups, despite a slightly slower insertion speed than lists due to hashing.

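4. NumPy Arrays: If the data is numerical, building the array in one step (for example with np.array(values), or by pre-allocating and filling it with vectorized slice assignments) is typically much faster than growing a Python list item by item. Assigning elements one at a time in a Python loop usually loses that advantage.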




Missing Values


In Python, missing values are typically represented as NaN (Not a Number) for numerical data or None for object-type data. The Pandas library is the primary tool used to identify, remove, or fill these gaps during data cleaning.

Identifying Missing Values


Before handling data gaps, you must locate them.

- isnull() / isna(): These functions return a boolean mask where True indicates a missing value.

- sum() counts: Combining these with .sum() allows you to count missing values per column (e.g., df.isnull().sum())

- Visualizing: The Missingno library provides bar charts and matrix plots to help you see patterns in missing data


Handling Missing Values


Once identified, you can choose a strategy based on the data's nature:

1. Removal (Deletion)

dropna(): This method removes rows or columns containing any missing values.

Specific removal: Use how='all' to only drop rows where every value is missing, or subset=['column_name'] to drop based on specific features.

2. Filling (Imputation)

fillna(value): Replaces NaN with a constant, such as 0.

Statistical Imputation: Replaces gaps with the column's mean, median, or mode to maintain data distribution.

Time-Series Filling:

-ffill (Forward Fill): Carries the last valid observation forward.

-bfill (Backward Fill): Uses the next valid observation to fill backward

-Interpolation: Estimates values using linear or quadratic mathematical trends between data points.

Advanced Estimation

-For more complex datasets, the Scikit-learn Impute module offers tools like SimpleImputer or KNNImputer, which estimate missing values based on similar data points


Using pandas to Identify Missing Values


In Python's pandas library, df is the standard shorthand used by developers to represent a DataFrame object, which is a two-dimensional, table-like data structure with labeled rows and columns

- df.isna() or df.isnull(): Returns a DataFrame of the same shape where True indicates a missing value.

- df.notna() or df.notnull(): The opposite of above; returns True for valid, non-missing data

- df.info(): Provides a quick summary of the number of non-null entries in each column.


Using pandas to Count Missing Values


To get a quick overview of how much data is missing:

- Per column: df.isnull().sum()

- Total in DataFrame: df.isnull().sum().sum()

- Percentage per column: df.isnull().mean() * 100
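The counting expressions above can be tried on a small DataFrame. This is a minimal sketch; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both columns
df = pd.DataFrame({
    "age": [25, np.nan, 31, np.nan],
    "city": ["Oslo", "Lima", None, "Rome"],
})

missing_per_column = df.isna().sum()          # age: 2, city: 1
total_missing = int(df.isna().sum().sum())    # 3
percent_missing = df.isna().mean() * 100      # age: 50.0, city: 25.0

print(missing_per_column)
print(total_missing)
print(percent_missing)
```

The percentage works because the mean of a boolean column is the fraction of True values.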


Using pandas to Handle Missing Values


Once identified, you can either remove or replace the missing data.

A. Removing Data (dropna)

-Drop rows with any nulls: df.dropna()

-Drop columns with any nulls: df.dropna(axis=1)

-Drop rows where ALL values are null: df.dropna(how='all')

-Keep rows with a minimum number of non-nulls: df.dropna(thresh=2)

B. Filling Data (fillna)

Use fillna() to replace nulls with specific values:

-Constant value: df.fillna(0)

-Statistical imputation:

(*Sidenote: statistical imputation is the process of replacing missing data with estimated values using mathematical or machine learning techniques.)

-df['column'].fillna(df['column'].mean()) (median() works the same way; mode() returns a Series, so use df['column'].mode()[0])

-Forward fill: df.ffill() (replaces null with the previous valid value).

-Backward fill: df.bfill() (replaces null with the next valid value).
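A minimal sketch of the removal and filling options on a tiny DataFrame, so the effect of each call is visible. The data is invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

# Removal
dropped_any = df.dropna()             # keeps only row 2 (no missing values)
dropped_all = df.dropna(how="all")    # drops row 1, where every value is missing

# Filling
filled_zero = df.fillna(0)
filled_mean = df["a"].fillna(df["a"].mean())   # mean of 1.0 and 3.0 is 2.0
forward = df["a"].ffill()                      # 1.0, 1.0, 3.0
backward = df["a"].bfill()                     # 1.0, 3.0, 3.0

print(filled_mean.tolist(), forward.tolist(), backward.tolist())
```

None of these calls modify df; each returns a new object, so assign the result if you want to keep it.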


Accessing Columns from df in pandas

There are two primary ways to select a column from a DataFrame:

-Bracket Notation (df['column_name']): The most robust method. It works for all column names, including those with spaces, special characters, or names that match existing DataFrame methods like count or mean.

-Dot Notation (df.column_name): A more concise, "Pythonic" syntax that treats the column like an attribute. It only works if the column name is a valid Python identifier (no spaces or special symbols) and does not conflict with existing DataFrame attributes or methods
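The sketch below shows where dot notation works and where only bracket notation does. The column names are hypothetical and chosen to trigger each case.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [10, 20],
    "unit count": [1, 2],    # contains a space
    "count": [5, 6],         # same name as the DataFrame.count method
})

prices = df["price"]
same_prices = df.price       # works: "price" is a valid identifier with no conflict

units = df["unit count"]     # brackets required because of the space
counts = df["count"]         # brackets required: df.count is the method, not the column

print(prices.equals(same_prices))   # True
print(callable(df.count))           # True: dot notation returned the method
```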


Common Attribute Notation


df.shape: Returns a tuple representing the number of rows and columns (e.g., (100, 5))

df.columns: Accesses the labels of the columns

df.index: Accesses the labels for the rows

df.dtypes: Shows the data types of each column


Selection and Slicing in pandas


-df.loc[]: Used for label-based selection (accessing rows or columns by their names)

-df.iloc[]: Used for integer-location based selection (accessing data by its numerical position, starting from 0)

-df[0:5]: Standard Python slicing can be used on a df to select a range of rows
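A short sketch comparing the three selection styles on a DataFrame with string row labels. The names and scores are invented.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cy"], "score": [90, 75, 82]},
    index=["a", "b", "c"],
)

by_label = df.loc["b", "score"]     # row label "b", column label "score" -> 75
by_position = df.iloc[1, 1]         # second row, second column -> 75
first_two = df[0:2]                 # positional row slice: rows "a" and "b"
label_slice = df.loc["a":"b"]       # label slices include the end label

print(by_label, by_position)
print(first_two)
```

Note the asymmetry: positional slices exclude the stop position, while .loc label slices include the stop label.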


Boolean


In Python, a Boolean is a fundamental data type (bool) that represents one of two values: True or False. These values are essential for controlling the flow of a program through logic, comparisons, and conditions.

-Case Sensitivity: Boolean keywords must always be capitalized as True and False. Lowercase true or false are treated as ordinary undefined names and raise a NameError.


Logical Operators: These allow you to combine or negate Boolean values:

and: Returns True only if both sides are true.

or: Returns True if at least one side is true.

not: Reverses the value (not True becomes False).


Comparison Operators: Expressions that compare values result in a Boolean:

== (Equal to), != (Not equal to)

>(Greater than), < (Less than)

>= (Greater than or equal to), <= (Less than or equal to)


Truthy and Falsy Values

In Python, almost any object can be evaluated as a Boolean using the bool() function. This is known as "truthiness."

Falsy Values: These always evaluate to False

- The constant False

- None (represents the absence of a value)

- The number zero (0, 0.0)

- Empty sequences or collections: "" (string), [] (list), () (tuple), {} (dictionary), set()

Truthy Values: Almost everything else evaluates to True, including any non-zero number or non-empty string/collection
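The rules above can be checked directly with bool(). This is a small sketch; the list of sample values is arbitrary.

```python
falsy_samples = [False, None, 0, 0.0, "", [], (), {}, set()]
truthy_samples = [1, -1, "0", [0], " "]   # non-zero numbers, non-empty containers

falsy_results = [bool(v) for v in falsy_samples]
truthy_results = [bool(v) for v in truthy_samples]

print(falsy_results)    # every entry is False
print(truthy_results)   # every entry is True
```

The string "0" and the list [0] are truthy because they are non-empty, even though what they contain looks like zero.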



Booleans are primarily used in conditional statements (like if and else) to determine which code block should run

Example:

is_raining = True

if is_raining:
    print("Take an umbrella!") # This runs because is_raining is True
else:
    print("Enjoy the sun.")



Namespace

In Python, a namespace is a collection of names mapped to their corresponding objects. You can think of it as a dictionary where the keys are variable names and the values are the objects themselves

Namespaces ensure that names in a program are unique and do not conflict with each other. For example, you can have a variable named x in two different functions without them interfering because they exist in separate namespaces.

Types of Namespaces

Python manages several types of namespaces, which are typically searched in a specific order (the LEGB rule):

Built-in Namespace: Contains all of Python's built-in objects, like print(), len(), and exceptions. It is created when the interpreter starts and exists until it closes.

Global Namespace: Created for each module (file) when it is imported or run. It contains all names defined at the top level of the file.

Enclosing Namespace: Exists when there are nested functions. It contains names defined in the outer (enclosing) function that are accessible to inner functions

Local Namespace: Created whenever a function is called and contains names defined within that function. It is deleted once the function finishes executing.


Variable Scope and the LEGB Rule

When you reference a name, Python searches for it in this specific hierarchy:

1. Local: Inside the current function.

2. Enclosing: Inside any enclosing (outer) functions.

3. Global: At the module level.

4. Built-in: In Python's built-in namespace (names such as print and len).
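A minimal sketch of the lookup order, using the same name x at three levels. The function names are hypothetical.

```python
x = "global"

def outer():
    x = "enclosing"

    def inner():
        x = "local"
        return x            # found in the local namespace

    def inner_no_local():
        return x            # not local, so the enclosing function's x is used

    return inner(), inner_no_local()

def read_global():
    return x                # not local or enclosing, so the module-level x is used

print(outer())              # ('local', 'enclosing')
print(read_global())        # global
print(len("abc"))           # len is not defined here, so it comes from built-ins
```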


Finding the Maximum in a Dataset

To find the largest item in an iterable (like a list) or among multiple arguments, use the built-in max() function

Basic usage: max([1, 5, 3]) returns 5

With a key: Use the key parameter to find the maximum based on a specific property, such as string length: max(["apple", "banana"], key=len)

Dictionary maximum: To find the key with the highest value, use max(my_dict, key=my_dict.get)
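The three uses of max() side by side, as a quick sketch. The scores dictionary is made-up sample data.

```python
numbers = [1, 5, 3]
largest = max(numbers)                         # 5

words = ["apple", "banana"]
longest = max(words, key=len)                  # "banana" (6 letters vs 5)

scores = {"Ann": 82, "Bob": 95, "Cy": 78}
top_name = max(scores, key=scores.get)         # "Bob": the key with the highest value
top_score = scores[top_name]                   # 95

print(largest, longest, top_name, top_score)
```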

Finding the Maximum of a Mathematical Function

To find the local or global maximum of a continuous function \(f(x)\), the SciPy library is the standard professional tool.

scipy.optimize.minimize_scalar: Since optimization libraries usually look for a minimum, you find the maximum by negating your function (\(f(x)\rightarrow -f(x)\)).

Example

from scipy.optimize import minimize_scalar

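def negated_f(x):
    # f(x) = -x**2 + 4*x + 5 opens downward, so it has a maximum (at x = 2)
    return -(-x**2 + 4*x + 5) # Negated so that minimizing it maximizes f

res = minimize_scalar(negated_f)
print(f"Maximum occurs at x = {res.x}") # approximately 2
print(f"Maximum value = {-res.fun}") # approximately 9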

Large Datasets (NumPy)

For multi-dimensional arrays or performance-critical applications, use the NumPy max() method, which is optimized for speed

Global Max: np.max(array)

Max along axis: array.max(axis=0) (finds maximum in each column)
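A brief sketch of the axis argument on a small 2x3 array with arbitrary values.

```python
import numpy as np

array = np.array([[1, 9, 3],
                  [7, 2, 8]])

overall = np.max(array)           # 9: largest value in the whole array
per_column = array.max(axis=0)    # [7 9 8]: maximum of each column
per_row = array.max(axis=1)       # [9 8]: maximum of each row

print(overall, per_column, per_row)
```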


Dictionary

In Python, a dictionary is a built-in data structure used to store data in key-value pairs. You can think of it like a real-life dictionary where you look up a word (the key) to find its definition (the value)

Dictionaries are written with curly brackets {}. Each pair is separated by a colon :, and pairs are separated by commas.

# Creating a dictionary
user = {
    "name": "Alice",
    "age": 25,
    "city": "New York"
}

# Accessing a value
print(user["name"]) # Output: Alice

Essential Methods

Commonly used methods for manipulating dictionaries include:

-.get(key): Safely retrieves a value; if the key doesn't exist, it returns None (or a default you supply) instead of raising a KeyError.

-.keys(): Returns a list-like view of all keys

-.values(): Returns a view of all values.

-.items(): Returns pairs as tuples (key, value), often used for looping

-.update({key: value}): Adds or updates multiple items at once

-.pop(key): Removes the specified key and returns its value
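A short sketch that exercises the methods listed above on a small dictionary. The keys and values are sample data only.

```python
user = {"name": "Alice", "age": 25}

city = user.get("city")                       # None: the key is missing, no error
city_default = user.get("city", "unknown")    # "unknown": fallback value

user.update({"age": 26, "city": "New York"})  # change one key, add another

pairs = list(user.items())                    # (key, value) tuples in insertion order
for key, value in user.items():
    print(key, value)

removed = user.pop("age")                     # removes "age" and returns 26
keys = list(user.keys())                      # ["name", "city"]
values = list(user.values())                  # ["Alice", "New York"]
```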


Linear Regression

Linear regression is a foundational machine learning algorithm used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. In Python, it is typically implemented using libraries like scikit-learn, Statsmodels, or SciPy

Types of Linear Regression

Simple Linear Regression: Uses a single independent variable to predict an outcome (e.g., predicting house price based only on square footage)

Multiple Linear Regression: Uses two or more independent variables (e.g., predicting price based on square footage, location, and number of bedrooms)

Example

import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Sample data
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 4, 5, 4, 5])

# Create and fit model
model = LinearRegression()
model.fit(X, y)

# Predict
y_pred = model.predict(X)

# Plot
plt.scatter(X, y)
plt.plot(X, y_pred, color='red')
plt.show()


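Random Forest

A random forest is an ensemble model that trains many decision trees on random subsets of the rows and features, then combines their predictions (a majority vote for classification). The example below uses scikit-learn's RandomForestClassifier on the built-in Iris dataset.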

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Load data
iris = load_iris()
X, y = iris.data, iris.target

# 2. Split into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Initialize and train the model
# n_estimators is the number of trees in the forest
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# 4. Make predictions
predictions = model.predict(X_test)

# 5. Evaluate performance
accuracy = accuracy_score(y_test, predictions)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

Run this script in your terminal and you will see the output as

Model Accuracy: 100.00%


Returning to linear regression, let's fit a model with sample weights, which give more importance to certain data points.

import numpy as np
from sklearn.linear_model import LinearRegression

# Sample data
X = np.array([[1], [2], [4]])
y = np.array([10, 20, 60])

# Higher weights give more importance to specific points
weights = np.array([10, 10, 1])

model = LinearRegression()
model.fit(X, y, sample_weight=weights)

print(f"Intercept: {model.intercept_}")
print(f"Coefficient: {model.coef_}")

Running this script, we get the following output

Intercept: -6.086956521739133
Coefficient: [14.34782609]

Heteroscedasticity

Heteroscedasticity, or heteroskedasticity, occurs in statistics when the variance of residuals (errors) is not constant across all levels of an independent variable in a regression model. It often appears as a "fanning" shape in residual plots where data spread increases or decreases with the predictor. While estimates remain unbiased, they become inefficient and standard errors are unreliable
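One way to see the "fanning" pattern is to simulate it. The sketch below is an illustration under assumed settings, not a formal test: it generates data whose noise grows with x (the 0.5 factor, the seed, and the true line y = 2x + 1 are all arbitrary choices), fits a line with np.polyfit, and compares the residual spread in the lower and upper halves of x.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed so the run is repeatable
x = np.linspace(1, 10, 1000)
noise = rng.normal(scale=0.5 * x)        # error spread grows with x
y = 2.0 * x + 1.0 + noise

slope, intercept = np.polyfit(x, y, 1)   # ordinary least-squares line
residuals = y - (slope * x + intercept)

low_spread = residuals[x < 5.5].std()    # residual spread for small x
high_spread = residuals[x >= 5.5].std()  # residual spread for large x

print(f"slope={slope:.2f}, low spread={low_spread:.2f}, high spread={high_spread:.2f}")
```

With constant-variance noise the two spreads would be similar; here the upper half is clearly wider, which is what a fanning residual plot shows.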

Statistics: Standard Deviation

In Python, you can calculate standard deviation using built-in modules like statistics or external libraries like NumPy and Pandas. Standard deviation measures how spread out data points are from their mean; a low value indicates data is clustered closely to the average, while a high value suggests a wider spread


import statistics
data = [1, 5, 8, 12, 12, 13]
# Calculate sample standard deviation
print(statistics.stdev(data))
# Calculate population standard deviation
print(statistics.pstdev(data))

Using NumPy

NumPy is highly optimized for performance and is the standard for scientific computing

import numpy as np

data = [1, 2, 3, 4, 5]

# Population SD (default)
print(np.std(data))

# Sample SD
print(np.std(data, ddof=1))

Linear Regression Deep Dive Part 1

Linear Regression is expressed through the following mathematical formula:

\(y=\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\dots+\beta_{p}X_{p}+\epsilon\)

\(y\): The dependent or target variable you are trying to predict

\(X_{i}\): The independent variables or "features"

\(\beta_{0}\): The y-intercept (or bias term), representing the value of \(y\) when all \(X_{i}\) are zero

\(\beta_{i}\): The coefficients (or weights) that indicate the influence of each feature on the prediction

\(\epsilon\): The irreducible error or "noise" term

Several principles must be followed to make sure that our model is reliable. The first principle we will talk about is

Ordinary Least Squares (OLS): The standard method for finding the best line by minimizing the Sum of Squared Residuals (the squared vertical distances between data points and the line)

How OLS Works

The Residual: The distance between the actual data point and the regression line.

The Goal: OLS squares these distances (so positive and negative errors don't cancel each other out) and sums them up. The line with the smallest possible sum of squared errors is chosen.

The Math: The estimator calculates the coefficients (weights) for the independent variables. In simple linear regression, the slope \(\beta_{1}\) and intercept \(\beta_{0}\) can be calculated directly in closed form using calculus and linear algebra

The mathematical formula for a residual is:

Residual = Observed Value - Predicted Value, that is, \(\epsilon_{i} = y_{i} - \hat{y}_{i}\)

In linear regression, the observed value (denoted as \(y_{i}\)) is the actual, measured data point for the dependent variable collected in a study or experiment. It represents the "ground truth" that you use to train and evaluate your model

Optimizing residual values involves minimizing these prediction errors to achieve the best-fit line or optimal decision boundary

To find the best line, OLS defines a cost function, the Sum of Squared Residuals (\(RSS\)):

\(RSS(\beta_0, \beta_1) = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2\)

The following example uses least_squares to find the best-fit parameters for a non-linear model by minimizing the residuals

import numpy as np
from scipy.optimize import least_squares

# 1. Define the model (e.g., a simple exponential decay)
def model(params, x):
    a, b = params
    return a * np.exp(b * x)

# 2. Define the residual function (Difference: observed - predicted)
def residuals(params, x, y):
    return y - model(params, x)

# 3. Generate synthetic data with noise
x_data = np.linspace(0, 4, 50)
y_true = model([2.5, -0.5], x_data)
y_noisy = y_true + 0.2 * np.random.normal(size=len(x_data))

# 4. Optimize: Find params that minimize the sum of squares of residuals
initial_guess = [1.0, -1.0]
res = least_squares(residuals, initial_guess, args=(x_data, y_noisy))
print(f"Optimized Parameters: {res.x}")

Output (the exact values vary from run to run because the noise is random and no seed is set):

Optimized Parameters: [ 2.67966801 -0.52148193]



Advanced Optimization Options

Robust Loss Functions: You can use parameters like loss='huber' or loss='soft_l1' in least_squares to reduce the influence of outliers.

Bounds: You can constrain parameters (e.g., bounds=(0, 100)) to ensure they stay within physically meaningful limits.

Weighted Least Squares: If certain data points are more reliable than others, you can scale the residuals by their inverse standard deviation to perform weighted optimization.

The Calculus: Finding the Minimum


To find the \(\beta _{0}\) and \(\beta _{1}\) that minimize this sum, we use multivariable calculus. We calculate the partial derivatives of \(RSS\) with respect to both parameters and set them equal to zero.
1. Derivative with respect to \(\beta_{0}\):

\(\frac{\partial RSS}{\partial \beta_0} = \sum_{i=1}^{n} 2(y_i - \beta_0 - \beta_1 x_i)(-1) = 0\)

Dividing by \(-2\) and expanding the sum gives the first normal equation:

\(\sum_{i=1}^{n} y_i - n\beta_0 - \beta_1 \sum_{i=1}^{n} x_i = 0\)

Solving for \(\beta_{0}\) yields:

\(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\) (where \(\bar{y}\) and \(\bar{x}\) are the sample means)
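The same procedure applied to \(\beta_{1}\) sets \(\frac{\partial RSS}{\partial \beta_1} = \sum_{i=1}^{n} 2(y_i - \beta_0 - \beta_1 x_i)(-x_i) = 0\). Substituting \(\hat{\beta}_0\) gives the standard closed-form slope \(\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\). The sketch below turns both formulas into plain Python; ols_fit and rss are hypothetical helper names, and the sample data is the small five-point set used in the earlier scikit-learn example.

```python
def ols_fit(xs, ys):
    """Closed-form simple linear regression; returns (beta_0, beta_1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta_1 = s_xy / s_xx                 # slope
    beta_0 = y_bar - beta_1 * x_bar      # intercept, from the first normal equation
    return beta_0, beta_1

def rss(beta_0, beta_1, xs, ys):
    """Sum of squared residuals for a candidate line."""
    return sum((y - beta_0 - beta_1 * x) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

beta_0, beta_1 = ols_fit(xs, ys)
print(f"intercept={beta_0:.2f}, slope={beta_1:.2f}")    # intercept=2.20, slope=0.60

# Any other line gives a larger RSS than the fitted one.
best_rss = rss(beta_0, beta_1, xs, ys)
nudged_rss = rss(beta_0 + 0.1, beta_1, xs, ys)
```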

NumPy

NumPy (Numerical Python) is the foundational open-source library for scientific computing in Python. It provides the standard for handling large, multi-dimensional data structures and is a critical dependency for major data science tools like Pandas, SciPy, and scikit-learn. It is faster than Python lists for numerical data because it uses homogeneous, fixed-size memory blocks

Examples

1. Array Creation

-You can create arrays from existing Python lists or by using built-in functions for specific patterns

-From a list: Use np.array() to convert a standard list into a 1D or 2D array

-Constant values: np.zeros(shape) or np.ones(shape) creates arrays filled with 0s or 1s

-Numerical ranges: np.arange(start, stop, step) generates a sequence of numbers, similar to Python's range()
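The creation functions above in one place, as a minimal sketch with arbitrary shapes and values.

```python
import numpy as np

from_list = np.array([1, 2, 3])            # 1D array from a list
grid = np.array([[1, 2], [3, 4]])          # 2D array from a nested list

zeros = np.zeros((2, 3))                   # 2 rows, 3 columns of 0.0
ones = np.ones(4)                          # four 1.0 values

evens = np.arange(0, 10, 2)                # start 0, stop before 10, step 2

print(from_list.shape, grid.shape, zeros.shape)
print(evens)                               # [0 2 4 6 8]
```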


Core Features

ndarray Object: A powerful N-dimensional array object that is the center of the library.
Speed and Efficiency: NumPy arrays are stored in contiguous memory, which often makes numerical operations one to two orders of magnitude faster than equivalent loops over Python lists, depending on the operation and array size.
Vectorization: Eliminates the need for explicit for loops by applying operations across entire arrays simultaneously.
Broadcasting: A set of rules that allows mathematical operations between arrays of different shapes.
Mathematical Library: Includes comprehensive routines for linear algebra, Fourier transforms, and random number generation

Basic Math Operations

NumPy performs "vectorized" operations, meaning mathematical functions are applied element-wise without the need for manual loops.

Arithmetic: Basic operators like +, -, *, and / work directly on arrays

Aggregations: Use np.mean(), np.sum(), np.min(), and np.max() to analyze data

Universal Functions: NumPy includes advanced math like np.sin(), np.log(), and np.exp()
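A short sketch of element-wise arithmetic, an aggregation, a universal function, and broadcasting between arrays of different shapes. All values are arbitrary.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])

added = a + b                  # element-wise: [11. 22. 33.]
scaled = a * 2                 # the scalar is applied to every element
mean_b = np.mean(b)            # aggregation: 20.0
exp_zero = np.exp(np.zeros(2)) # universal function: e**0 = 1 for each element

# Broadcasting: a (2, 3) matrix plus a length-3 row adds the row to each matrix row
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
shifted = matrix + np.array([10, 20, 30])

print(added, scaled, mean_b)
print(shifted)
```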

Matrices

In Python, matrices are two-dimensional data structures represented as grids of rows and columns. While Python does not have a native "matrix" type, you can implement them using nested lists or the more powerful NumPy library.


# A 2x3 matrix

matrix = [
[1, 2, 3],
[4, 5, 6]
]

# Access element at row 1, column 2 (6)
print(matrix[1][2])

Comparison: Lists vs NumPy

Feature | Nested Lists | NumPy Arrays
Performance | Slower (uses loops) | Fast (vectorized C operations)
Memory | High overhead | Highly efficient
Slicing | Difficult | Powerful and intuitive
Use Case | Simple, small data | Scientific computing, AI, ML
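For comparison with the nested-list version, here is the same 2x3 matrix as a NumPy array, with a few operations that would need loops on nested lists.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

element = matrix[1, 2]            # row 1, column 2 (zero-based) -> 6
second_column = matrix[:, 1]      # every row, column 1 -> [2 5]
transposed = matrix.T             # shape (3, 2)
product = matrix @ transposed     # matrix multiplication, shape (2, 2)

print(element, second_column)
print(product)                    # [[14 32] [32 77]]
```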

Probability

At its simplest, probability is the number of favorable outcomes divided by the total number of possible outcomes (sample space). You can calculate this using standard Python arithmetic or boolean logic, where True is treated as 1 and False as 0

For example,

# Probability of drawing a heart from a 52-card deck
total_cards = 52

hearts = 13
prob_heart = hearts / total_cards # Result: 0.25
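The boolean approach mentioned above can be sketched with a fair six-sided die: each outcome is tested against the event, and summing the resulting True/False values counts the favorable outcomes. The event "roll is even" is just an example.

```python
rolls = [1, 2, 3, 4, 5, 6]                   # sample space of a fair die

is_even = [roll % 2 == 0 for roll in rolls]  # True for favorable outcomes
favorable = sum(is_even)                     # True counts as 1, False as 0 -> 3
prob_even = favorable / len(rolls)           # 3 / 6

print(prob_even)                             # 0.5
```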


Gaussian Distribution

The Gaussian distribution (or normal distribution) is a symmetric, bell-shaped probability distribution where most data points cluster around a central mean. Found throughout nature and statistics, it is heavily used because of the Central Limit Theorem, which states that the sum (or average) of many independent random variables with finite variance tends toward this shape, regardless of how the individual variables are distributed.

Key Indicators

-The Mean (\(\mu \)): Defines the center of the curve. The mean, median, and mode are all perfectly equal

-The Standard Deviation (\(\sigma \)): Defines the spread or width of the bell curve. A smaller \(\sigma \) results in a tall, narrow curve, while a larger \(\sigma \) results in a short, wide curve.

The Empirical Rule (68-95-99.7 Rule)

A defining feature of the Gaussian distribution is that data predictably falls within specific ranges of the mean:

-\(\sim 68\%\) of data falls within \(1\) standard deviation (\(\mu \pm 1\sigma\))

-\(\sim 95\%\) of data falls within \(2\) standard deviations (\(\mu \pm 2\sigma\))

-\(\sim 99.7\%\) of data falls within \(3\) standard deviations (\(\mu \pm 3\sigma\))
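The three percentages can be reproduced from the cumulative distribution function. This sketch uses statistics.NormalDist from the standard library (Python 3.8+); the helper name share_within is hypothetical.

```python
from statistics import NormalDist

def share_within(dist, k):
    """Fraction of the distribution within k standard deviations of the mean."""
    lower = dist.mean - k * dist.stdev
    upper = dist.mean + k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)

standard_normal = NormalDist(mu=0, sigma=1)

for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(standard_normal, k):.4f}")
# within 1 sd: 0.6827
# within 2 sd: 0.9545
# within 3 sd: 0.9973
```

The result does not depend on the particular mean or standard deviation, which is why the rule applies to every Gaussian distribution.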

Probability Density Function

A probability density function (PDF) is a mathematical function that describes the relative likelihood of a continuous random variable falling within a specific range of values

Unlike discrete variables (like rolling a die) where specific outcomes have a measurable probability, continuous variables (like height or time) have an infinite number of possibilities, meaning the probability of landing on exactly one specific value is zero.

To be considered a valid PDF, a function \(f(x)\) must meet two strict conditions:

-Non-negative: The function must be greater than or equal to zero for all possible values of \(x\) (i.e., \(f(x) \geq 0\))

-Total Area is 1: The total area underneath the curve of the function across its entire range must equal \(1\). This ensures that there is a \(100\%\) chance the variable will take on some value.

How Probabilities are Calculated

Instead of evaluating the function at a single point, you calculate probabilities over an interval \([a, b]\) by finding the area under the curve using calculus:

\(P(a \leq X \leq b) = \int_{a}^{b} f(x) \,dx\)

The highest points on the PDF graph represent the most likely ranges for the variable, while the lower points represent less likely ranges. Common examples of PDFs include the normal distribution (bell curve) and the uniform distribution.
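The area-under-the-curve idea can be tried out numerically. The sketch below is only illustrative: it assumes a standard normal PDF and approximates the integral with the trapezoidal rule, and the function names normal_pdf and integrate are hypothetical helpers, not library functions.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def integrate(f, a, b, steps=10_000):
    """Approximate the integral of f over [a, b] with the trapezoidal rule."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h

p = integrate(normal_pdf, -1.0, 1.0)
total_area = integrate(normal_pdf, -8.0, 8.0)
print(f"P(-1 <= X <= 1) is about {p:.4f}")      # about 0.6827
print(f"Total area is about {total_area:.4f}")  # about 1.0000
print(f"Density at x = 0: {normal_pdf(0.0):.4f}")
```

Note that the density at a single point (about 0.3989 at \(x = 0\)) is a height of the curve, not a probability; only the integral over an interval is a probability.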

Core Libraries for Probability

Most complex probability work in Python relies on these specialized libraries:

SciPy (scipy.stats): The primary library for statistical functions. It includes a large collection of probability distributions (Normal, Binomial, Poisson, etc.) with methods such as pmf (probability mass function, for discrete distributions), pdf (probability density function, for continuous distributions), cdf (cumulative distribution function), and rvs (random variate sampling).

NumPy (numpy.random): Used for efficient random number generation and simulating thousands or millions of trials at once.

Matplotlib & Seaborn: Essential for visualizing data shapes, such as plotting bell curves or histograms to understand distribution patterns.
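As a rough illustration of how these pieces fit together, the sketch below calls pmf, cdf, and rvs from scipy.stats and runs a small simulation with numpy.random. The coin-flip and die-roll scenarios, the seeds, and the variable names are made up for the example.

```python
import numpy as np
from scipy import stats

# pmf: probability of exactly 3 heads in 10 fair coin flips
p_three_heads = stats.binom.pmf(3, n=10, p=0.5)

# cdf: probability that a standard normal value is at most 1.96
p_below = stats.norm.cdf(1.96)

# rvs: draw 5 random samples from a standard normal (seeded for repeatability)
samples = stats.norm.rvs(loc=0, scale=1, size=5, random_state=0)

# numpy.random: simulate 100,000 die rolls and estimate P(roll == 6)
rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)  # upper bound 7 is exclusive
p_six = np.mean(rolls == 6)

print(f"P(3 heads in 10 flips) = {p_three_heads:.4f}")
print(f"P(Z <= 1.96) = {p_below:.4f}")
print(f"Estimated P(six) = {p_six:.4f}")
```

The simulated estimate should land close to the exact value of 1/6, and it gets closer as the number of simulated rolls grows.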

Gradient Descent

Gradient descent is a fundamental iterative optimization algorithm used in machine learning and deep learning to minimize a model's error (cost or loss). It adjusts model parameters by moving in the direction of steepest descent (the negative gradient) until it converges to a minimum, which may be a local rather than a global minimum.

Think of trying to walk down a foggy mountain into the lowest valley. You can't see the bottom, so you have to feel your way around, taking small steps in the direction that feels most steeply downhill.

1. Start: The model begins with random weights and biases

2. Calculate Gradient: It computes the "gradient," which is a mathematical vector representing the direction of steepest increase in error

3. Take a Step: To minimize the error, the algorithm takes a step in the exact opposite direction of the gradient

4. Repeat: This process is repeated iteratively until the algorithm reaches the bottom (the minimum error).

Gradient Descent Example

This example minimizes the function \(f(x) = x^2\) by repeatedly stepping against its derivative (gradient), \(f'(x) = 2x\), until the derivative is close to zero at the minimum \(x = 0\).
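A minimal sketch of such an optimizer is shown here. The starting point, learning rate, and step count are arbitrary choices for illustration, and gradient_descent is a hypothetical helper rather than a library function.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=50):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

# f(x) = x^2 has derivative f'(x) = 2x and its minimum at x = 0
x_min = gradient_descent(lambda x: 2 * x, x0=5.0)
print(f"x after 50 steps: {x_min:.6f}")  # very close to 0
```

With a learning rate of 0.1, each step multiplies x by 0.8, so the value shrinks steadily toward zero. A learning rate above 1.0 would make this particular function diverge instead.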


Limiting Data Ranges

Data can be limited or filtered in the following four ways:

1. Clamping Values (Hard Limits)

To force a number \(x\) to stay between a min and max value, use the built-in min() and max() functions.

Logic: max(minimum, min(value, maximum))

Example

val = 150
# Limit val between 0 and 100
limited_val = max(0, min(val, 100)) # Results in 100

For NumPy arrays, use the efficient numpy.clip() function:

import numpy as np
arr = np.array([-10, 5, 50, 150])
clipped_arr = np.clip(arr, 0, 100) # [0, 5, 50, 100]

2. Filtering Data (Removing Out-of-Range Items)

If you need to remove data points that fall outside a specific range, use list comprehensions or library-specific methods.

List Comprehension:

data = [1, 15, 22, 5, 45, 10]
filtered = [x for x in data if 10 <= x <= 30] # [15, 22, 10]

Pandas DataFrame: Use .loc or .query() to filter rows.

import pandas as pd
df = pd.DataFrame({'val': [1, 15, 50]})
# Only keep values between 10 and 30
df_limited = df.query('10 <= val <= 30')

3. Limiting Visualization Axes

When plotting with Matplotlib, you can "zoom in" by setting the X or Y axis limits using plt.xlim() and plt.ylim().

Example

import matplotlib.pyplot as plt

plt.plot([0, 10, 20, 30], [0, 100, 400, 900])
plt.xlim(5, 25) # Focus x-axis from 5 to 25
plt.ylim(0, 500) # Focus y-axis from 0 to 500
plt.show()

4. Input Validation

To ensure a user provides a number within a specific range, use a while loop.

Example

while True:
    try:
        user_input = int(input("Enter a number (1-10): "))
        if 1 <= user_input <= 10:
            break
        print("Out of range!")
    except ValueError:
        print("Invalid input!")

Data Chunking

To process a massive CSV file in Pandas, specify a chunksize and iterate through the results:

import pandas as pd

# Define the number of rows per chunk
chunk_size = 10000

# Create an iterator that yields one DataFrame per chunk
data_iterator = pd.read_csv('large_dataset.csv', chunksize=chunk_size)

# Process each chunk individually
for i, chunk in enumerate(data_iterator):
    # Perform operations (e.g., filtering, summing, or cleaning)
    processed_chunk = chunk[chunk['column_name'] > 100]

    # Write the header once with the first chunk, then append the rest
    processed_chunk.to_csv(
        'processed_data.csv',
        mode='w' if i == 0 else 'a',
        index=False,
        header=(i == 0),
    )

Advantages of Chunking

-Memory Efficiency: Dramatically reduces the memory footprint by only keeping a small fraction of the data in RAM at any given time.

-Parallel Processing: Some tools can process multiple chunks simultaneously across different processors to reduce overall computation time

-Scalability: Allows workflows originally built for small datasets to scale to much larger datasets by changing how the data is loaded
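The same idea works outside pandas. The sketch below is a hypothetical helper, iter_chunks, that splits any iterable into fixed-size pieces using only the standard library, so that only one chunk is held in memory at a time.

```python
from itertools import islice

def iter_chunks(iterable, chunk_size):
    """Yield lists of at most chunk_size items from any iterable."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk

# Sum ten numbers three at a time
running_total = 0
for chunk in iter_chunks(range(1, 11), chunk_size=3):
    running_total += sum(chunk)
print(running_total)  # 55
```

Because the helper consumes an iterator lazily, it can also wrap a file object to process a large text file a few lines at a time.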


Splitting Data For Training and Testing

Use a tool such as scikit-learn's train_test_split to divide the data into a training set (to teach the model) and a testing set (to evaluate it on data it has not seen).
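A minimal sketch of a split, assuming scikit-learn is installed. The toy data, the 80/20 ratio, and the random_state value are arbitrary choices for illustration.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with one feature each, and their labels
X = [[i] for i in range(10)]
y = [2 * i for i in range(10)]

# Hold out 20% of the rows for testing; random_state makes the shuffle repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 8 2
```

The model is then fitted on X_train and y_train only, and its metrics are computed on X_test and y_test.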


Root Mean Square Error

Root Mean Square Error (RMSE) is a standard metric for evaluating regression models in machine learning. It is the square root of the average squared difference between predicted and actual values, so it is expressed in the same units as the target. Because the errors are squared before averaging, RMSE penalizes large errors (such as those caused by outliers) more heavily than small ones.

Example

# root_mean_squared_error requires scikit-learn 1.4 or newer
from sklearn.metrics import root_mean_squared_error

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
rmse = root_mean_squared_error(y_true, y_pred)
print(f"RMSE: {rmse}")
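To see what the metric computes, the same value can be worked out by hand from the definition \(\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}\). This is a plain-Python sketch on the same sample values, not scikit-learn's implementation.

```python
import math

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

# Squared errors: 0.25, 0.25, 0.0, 1.0
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
mse = sum(squared_errors) / len(squared_errors)
rmse = math.sqrt(mse)
print(f"MSE: {mse}, RMSE: {rmse:.4f}")  # MSE: 0.375, RMSE: 0.6124
```

The single error of 1.0 contributes two thirds of the squared-error total, which shows how strongly larger errors dominate the result.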

Outliers

Outliers are data points that differ markedly from the rest of your data.

1. The Interquartile Range (IQR) Method

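This is a widely used, robust approach for finding and removing outliers. It sets boundaries around the middle 50% of the data: any value more than 1.5 × IQR below the first quartile (Q1) or above the third quartile (Q3) is treated as an outlier.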

import numpy as np
import pandas as pd

def remove_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Filter the DataFrame to only include non-outliers
    filtered_df = df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]
    return filtered_df
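To make the boundaries concrete, here is a small worked example in plain Python. The sample data is made up, and statistics.quantiles with method="inclusive" is used because it interpolates linearly in the same way as the default pandas quantile().

```python
import statistics

# Hypothetical sample with one extreme value
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

# Quartiles via linear interpolation between data points
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

kept = [x for x in data if lower_bound <= x <= upper_bound]
print(q1, q3, lower_bound, upper_bound)  # 3.25 7.75 -3.5 14.5
print(kept)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The value 100 lies above the upper bound of 14.5, so it is the only point dropped.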

2. Z-Score Method

This is another approach for finding outliers in roughly normally distributed data. It measures how many standard deviations a data point is from the mean. A data point with a Z-score greater than 3 or less than -3 is commonly considered an outlier.

import numpy as np
from scipy import stats

def remove_outliers_zscore(df, column, threshold=3):
    # Calculate Z-scores and take the absolute value
    z_scores = np.abs(stats.zscore(df[column]))
    # Keep only rows whose Z-score is below the threshold
    filtered_df = df[z_scores < threshold]
    return filtered_df
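The calculation behind the Z-score can be followed step by step in plain Python. This sketch uses made-up data and the population standard deviation (statistics.pstdev), which matches the default behaviour of scipy.stats.zscore.

```python
import statistics

# Hypothetical data: nineteen typical readings and one extreme value
data = [10] * 19 + [100]

mean = statistics.fmean(data)
std = statistics.pstdev(data)  # population standard deviation (ddof=0)

z_scores = [(x - mean) / std for x in data]
outliers = [x for x, z in zip(data, z_scores) if abs(z) > 3]
print(outliers)  # [100]
```

One caveat worth knowing: in very small samples no point can reach a Z-score of 3, because the extreme value itself inflates the standard deviation, so this method suits larger datasets.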

3. Machine Learning: Isolation Forest

For multi-dimensional datasets where outliers are harder to find with simple statistical rules, scikit-learn's IsolationForest is a good option.

Example

from sklearn.ensemble import IsolationForest

def detect_outliers_isolation_forest(df, features):
    iso = IsolationForest(contamination=0.1, random_state=42)
    # Fit the model and predict (1 for inliers, -1 for outliers)
    preds = iso.fit_predict(df[features])
    # Add predictions to the dataframe
    df_out = df.copy()
    df_out['is_outlier'] = preds
    return df_out


Linear Regression Experiment Using Tensorflow


Linear regression in TensorFlow is typically implemented using the high-level Keras API, where a single Dense layer with one neuron and no activation function represents the linear equation \(y = wx + b\).

The following example demonstrates how to train a model to learn the relationship between two sets of numbers (e.g., \(y = 2x - 1\))

import tensorflow as tf
import numpy as np

# 1. Prepare synthetic data
# X: Input features, y: Labels (y = 2x - 1)
X = np.array([-1.0, 0.0, 1.0, 2.0, 3.0, 4.0], dtype=float)
y = np.array([-3.0, -1.0, 1.0, 3.0, 5.0, 7.0], dtype=float)

# 2. Define the model architecture
# A single neuron in a sequential model handles the linear calculation
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(units=1)
])

# 3. Compile the model
# Use Mean Squared Error to measure loss and stochastic gradient descent (SGD) for optimization
model.compile(optimizer='sgd', loss='mean_squared_error')

# 4. Train the model
# 'epochs' is the number of full passes the model makes over the data
model.fit(X, y, epochs=500, verbose=0)

# 5. Make a prediction (pass a NumPy array, which works across Keras versions)
prediction = model.predict(np.array([10.0]), verbose=0)
print(f"Prediction for 10.0: {prediction[0][0]}")

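Run this Python script and the printed prediction should be close to 19.0 (since \(2 \times 10 - 1 = 19\)), though usually not exactly 19, because the model estimates the weights from only six points.

Note that this model reports a loss (MSE), not an accuracy, because it is a regression model. For classification models, an accuracy close to 1.0 (or 100%) in TensorFlow is usually good, but it can be a red flag (for example, a sign of overfitting or data leakage), depending on the dataset you are looking at.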

Loss Function

-Mean Squared Error (MSE / L2 Loss): The average of the squared differences between predicted (\(\hat{y}\)) and actual (\(y\)) values. It is highly sensitive to outliers.

-Mean Absolute Error (MAE / L1 Loss): The average of the absolute differences. It is more robust to outliers than MSE.

-Mean Squared Logarithmic Error (MSLE): Useful when targets have exponential growth or a long tail.

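For the sample values y_true = [3, -0.5, 2, 7] and y_pred = [2.5, 0.0, 2, 8] (the same ones used in the RMSE and R-Squared examples), the first two losses work out to: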

MSE: 0.375
MAE: 0.5
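These two values can be reproduced by hand. The following is a minimal sketch in plain Python, assuming the inputs are the y_true and y_pred lists from the RMSE example. The original script is not shown, so that pairing is an assumption, although it does yield exactly 0.375 and 0.5.

```python
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

# Per-sample errors: 0.5, -0.5, 0.0, -1.0
errors = [t - p for t, p in zip(y_true, y_pred)]
mse = sum(e ** 2 for e in errors) / len(errors)
mae = sum(abs(e) for e in errors) / len(errors)

print(f"MSE: {mse}")  # MSE: 0.375
print(f"MAE: {mae}")  # MAE: 0.5
```

Squaring shrinks the 0.5 errors to 0.25 but leaves the error of 1.0 unchanged, so MSE weights the largest error more heavily than MAE does.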

The following graph demonstrates how MSE (quadratic) and MAE (linear) behave as the prediction error increases.

R-Squared

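R-squared (\(R^{2}\)), also called the coefficient of determination, measures the proportion of the variance in the actual values that the model explains; a value of 1.0 is a perfect fit. The following example demonstrates how to calculate \(R^{2}\) using scikit-learn: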


Example

from sklearn.metrics import r2_score

# Actual values
y_true = [3, -0.5, 2, 7]

# Predicted values from your regression model
y_pred = [2.5, 0.0, 2, 8]

# Calculate R-squared
r2 = r2_score(y_true, y_pred)
print(f"R-squared: {r2}")

Run this in your terminal and the output is:

R-squared: 0.9486081370449679
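The same number can be derived from the definition \(R^{2} = 1 - \frac{SS_{res}}{SS_{tot}}\), where \(SS_{res}\) is the sum of squared prediction errors and \(SS_{tot}\) is the sum of squared deviations of the actual values from their mean. This is a plain-Python sketch for illustration, not scikit-learn's implementation.

```python
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

mean_true = sum(y_true) / len(y_true)                       # 2.875
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # 1.5
ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # 29.1875
r2 = 1 - ss_res / ss_tot
print(f"R-squared: {r2:.4f}")  # R-squared: 0.9486
```

In other words, the model's errors account for only about 5% of the total variation in y_true.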

One-Sample Z-Test

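A one-sample \(Z\)-test checks whether a sample mean differs significantly from a hypothesized population mean when the population standard deviation \(\sigma\) is known. The test statistic \(Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}\) measures how many standard errors the sample mean \(\bar{x}\) is from the hypothesized mean \(\mu_0\), where \(n\) is the sample size.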
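A minimal sketch of the calculation using only the standard library. The function name one_sample_z_test and the sample numbers are hypothetical, and the p-value shown is two-sided.

```python
import math
from statistics import NormalDist

def one_sample_z_test(sample_mean, pop_mean, pop_std, n):
    """Return the z statistic and the two-sided p-value."""
    standard_error = pop_std / math.sqrt(n)
    z = (sample_mean - pop_mean) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical numbers: 36 observations averaging 103 against a claimed mean of 100
z, p = one_sample_z_test(sample_mean=103, pop_mean=100, pop_std=9, n=36)
print(f"z = {z:.2f}, p = {p:.4f}")  # z = 2.00, p = 0.0455
```

At a significance level of 0.05, a p-value of about 0.0455 would lead to rejecting the null hypothesis that the population mean is 100.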



Some Math