Historical context:
Analogy: Single chef preparing a three-course meal must finish appetizer completely before starting main course, then dessert.
Critical distinction for Python performance:
Concurrency:
Analogy: One chef making soup and steak, puts soup to simmer, sears steak while soup cooks, switches back to stir soup.
Core concept:
Analogy: Head chef brings two assistants, three separate stations working simultaneously on appetizer, main course, and dessert. Meal ready in fraction of the time.
Parallelism:
Main parts:
Definition:
Physical vs. Logical Cores:
Physical Cores:
Logical Cores:
What it is:
The setup:
The problem:
The solution:
Logical cores (total workers available to system):
import os
nombre_coeurs_logiques = os.cpu_count()
print(f"Number of logical cores: {nombre_coeurs_logiques}")
# Number of logical cores: 14
Physical cores (actual computing units):
Install external library:
pip install psutil
Then check:
import psutil
nb_cores = psutil.cpu_count(logical=False)
print(f"Number of physical cores: {nb_cores}")
# Number of physical cores: 14
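`os.cpu_count()` can return `None` when the number of cores cannot be determined, so a worker count derived from it needs a fallback. The sketch below is one way to handle that. The helper name `default_workers` and its `reserve` parameter are hypothetical and not part of the course files.

```python
import os

def default_workers(reserve=1):
    """Return a worker count based on logical cores, never less than 1.

    `reserve` (hypothetical parameter) is the number of logical cores
    left free for the operating system and the parent process.
    """
    logical = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, logical - reserve)

if __name__ == "__main__":
    print(f"Logical cores: {os.cpu_count()}")
    print(f"Workers to use: {default_workers()}")
```

Whether to reserve a core is a judgment call. For long CPU-bound jobs it can keep the machine responsive, while for short benchmarks using every logical core is usually fine.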
We’ll compare four parallel-processing approaches against a sequential baseline (five scripts in total):
- 1_single_process.py - Sequential baseline
- 1_multiprocessing.py - Built-in multiprocessing module
- 1_concurrent.py - concurrent.futures.ProcessPoolExecutor
- 1_joblib.py - Joblib library
- 1_mpire.py - MPIRE library
with Pool(workers) as p:
    results = p.map(f, iterable)
What it is:
Key feature:
with concurrent.futures.ProcessPoolExecutor(max_workers=workers) as executor:
    results = list(executor.map(f, iterable))
What it is:
Key feature:
submit(), map()
results = Parallel(n_jobs=workers)(delayed(f)(i) for i in iterable)
What it is:
Key features:
with WorkerPool(n_jobs=workers) as pool:
    results = pool.map(f, iterable)
What it is:
Key features:
Use a virtual environment.
python -m venv adv_prog
You can then activate your new env:
- On Windows: `adv_prog\Scripts\activate`
- On macOS/Linux: `source adv_prog/bin/activate`
Once activated, install the required libraries using:
pip3 install -r requirements.txt
import time
from utils import f, workers, iterable
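The contents of `utils.py` and of the benchmark scripts are not shown, so the following is only a guess at what the sequential baseline might look like. `f` and `iterable` are stand-ins for the imported names, `run_sequential` is a hypothetical helper, and the sleep is shortened so the sketch runs quickly.

```python
import time

# Stand-ins for the names imported from utils.py (assumed, not shown in the source).
def f(x):
    time.sleep(0.001)  # shortened delay so the sketch finishes quickly
    return x * x

iterable = range(25)

def run_sequential(func, items):
    """Apply func to each item in order and return (results, elapsed_seconds)."""
    start = time.perf_counter()
    results = [func(x) for x in items]
    return results, time.perf_counter() - start

if __name__ == "__main__":
    results, elapsed = run_sequential(f, iterable)
    print("Took %.3f seconds" % elapsed)
```

`time.perf_counter()` is used here rather than `time.time()` because it is a monotonic clock intended for measuring short durations.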
run_all_1.sh - Shell script to execute all parallel processing benchmarks
#!/bin/bash
for f in 1_*.py; do
    echo "--- Running $f ---"
    python3 "$f"
    echo
    echo "=============================="
done
What it simulates:
Results:
import time

def f(x):
    time.sleep(1)  # wait 1 second without using the CPU
    return x * x

n = 25
iterable = range(n)
# --- Running 1_concurrent.py ---
# Took 3.080 seconds
# ==============================
# --- Running 1_joblib.py ---
# Took 3.125 seconds
# ==============================
# --- Running 1_mpire.py ---
# Took 3.034 seconds
# ==============================
# --- Running 1_multiprocessing.py ---
# Took 3.076 seconds
# ==============================
# --- Running 1_single_process.py ---
# Took 25.115 seconds
# ==============================
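The parallel timings of about 3 seconds fit a simple round-based model: with equal-length tasks, each worker takes one task per round, so the wall time is at least the number of rounds times the task duration. The source does not show the value of `workers`, so the sketch below tries several values. The function name `ideal_wall_time` is hypothetical.

```python
import math

def ideal_wall_time(n_tasks, task_seconds, workers):
    """Lower bound on wall time when equal-length tasks run in rounds."""
    return math.ceil(n_tasks / workers) * task_seconds

if __name__ == "__main__":
    for w in (1, 8, 10, 14):
        print(f"{w:2d} workers -> {ideal_wall_time(25, 1.0, w):.1f} s")
    # Worker counts for which 25 one-second tasks need exactly 3 rounds
    three_rounds = [w for w in range(1, 26) if ideal_wall_time(25, 1.0, w) == 3.0]
    print("3 rounds with workers in:", three_rounds)
```

Under this model, a 3-second result for 25 one-second tasks is consistent with 9 to 12 workers. The remaining 0.03 to 0.13 seconds in the measurements would then be process start-up and communication overhead.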
def f(x):
    time.sleep(0.01)
    return x * x

n = 2500
iterable = range(n)
# --- Running 1_concurrent.py ---
# Took 3.085 seconds
# ==============================
# --- Running 1_joblib.py ---
# Took 3.120 seconds
# ==============================
# --- Running 1_mpire.py ---
# Took 3.130 seconds
# ==============================
# --- Running 1_multiprocessing.py ---
# Took 3.075 seconds
# ==============================
# --- Running 1_single_process.py ---
# Took 25.099 seconds
# ==============================
What it simulates:
Scenario:
Results:
import math

def f(n):
    n = 9999991  # overrides the argument: every call tests the same number
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True

n = 1000
iterable = range(n)
# --- Running 1_concurrent.py ---
# Took 0.111 seconds
# ==============================
# --- Running 1_joblib.py ---
# Took 0.135 seconds
# ==============================
# --- Running 1_mpire.py ---
# Took 0.119 seconds
# ==============================
# --- Running 1_multiprocessing.py ---
# Took 0.051 seconds
# ==============================
# --- Running 1_single_process.py ---
# Took 0.037 seconds
# ==============================
n=100000
iterable = range(n)
# --- Running 1_concurrent.py ---
# Took 6.959 seconds
# ==============================
# --- Running 1_joblib.py ---
# Took 0.534 seconds
# ==============================
# --- Running 1_mpire.py ---
# Took 1.479 seconds
# ==============================
# --- Running 1_multiprocessing.py ---
# Took 0.447 seconds
# ==============================
# --- Running 1_single_process.py ---
# Took 3.761 seconds
# ==============================
What it simulates:
Scenario:
Results:
import math

def f(n):
    # Trial division by odd numbers up to sqrt(n)
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True

n = 1000
iterable = range(n)
# --- Running 1_concurrent.py ---
# Took 0.108 seconds
# ==============================
# --- Running 1_joblib.py ---
# Took 0.096 seconds
# ==============================
# --- Running 1_mpire.py ---
# Took 0.118 seconds
# ==============================
# --- Running 1_multiprocessing.py ---
# Took 0.044 seconds
# ==============================
# --- Running 1_single_process.py ---
# Took 0.000 seconds
n=1000000
iterable = range(n)
# --- Running 1_concurrent.py ---
# Took 70.225 seconds
# ==============================
# --- Running 1_joblib.py ---
# Took 1.204 seconds
# ==============================
# --- Running 1_mpire.py ---
# Took 11.225 seconds
# ==============================
# --- Running 1_multiprocessing.py ---
# Took 0.147 seconds
# ==============================
# --- Running 1_single_process.py ---
# Took 0.863 seconds
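One likely reason for the gap between `1_concurrent.py` (70 s) and `1_multiprocessing.py` (0.15 s) is batching. `Executor.map` defaults to `chunksize=1`, so each of the 1,000,000 items is sent to a worker as a separate task, while `multiprocessing.Pool.map` groups items into chunks automatically. The sketch below passes an explicit `chunksize` to `ProcessPoolExecutor.map`. The helper `pool_like_chunksize` imitates the heuristic used by CPython's `Pool.map` as I understand it, and the worker count of 4 is an arbitrary assumption.

```python
import math
from concurrent.futures import ProcessPoolExecutor

def is_prime(n):
    """Trial-division primality test, same approach as f in the benchmark."""
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    for i in range(3, math.isqrt(n) + 1, 2):
        if n % i == 0:
            return False
    return True

def pool_like_chunksize(n_items, n_workers):
    """Chunk size giving roughly 4 chunks per worker (hypothetical helper)."""
    chunksize, extra = divmod(n_items, n_workers * 4)
    if extra:
        chunksize += 1
    return max(1, chunksize)  # Executor.map requires chunksize >= 1

if __name__ == "__main__":
    workers = 4  # assumed value; the benchmark's setting is not shown
    n = 10_000
    chunk = pool_like_chunksize(n, workers)
    with ProcessPoolExecutor(max_workers=workers) as executor:
        results = list(executor.map(is_prime, range(n), chunksize=chunk))
    print(f"chunksize={chunk}, primes found: {sum(results)}")
```

With larger chunks, the cost of pickling and queueing is paid once per chunk instead of once per item. This matters most when each call to `f` is very cheap, as it is here.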
What it simulates:
Scenario:
import random
import string
import os

def f(x):
    filename = f"{x}.txt"
    path = 'data/'  # this directory must already exist
    letters = ''.join(random.choices(string.ascii_letters, k=200))
    with open(path + filename, "w") as file:
        file.write(letters)
    os.remove(path + filename)
    return filename

n = 1000
iterable = range(n)
# --- Running 1_concurrent.py ---
# Took 0.141 seconds
# ==============================
# --- Running 1_joblib.py ---
# Took 0.152 seconds
# ==============================
# --- Running 1_mpire.py ---
# Took 0.119 seconds
# ==============================
# --- Running 1_multiprocessing.py ---
# Took 0.118 seconds
# ==============================
# --- Running 1_single_process.py ---
# Took 0.099 seconds
# ==============================
n = 100000
iterable = range(n)
# --- Running 1_concurrent.py ---
# Took 8.297 seconds
# ==============================
# --- Running 1_joblib.py ---
# Took 7.609 seconds
# ==============================
# --- Running 1_mpire.py ---
# Took 7.312 seconds
# ==============================
# --- Running 1_multiprocessing.py ---
# Took 7.418 seconds
# ==============================
# --- Running 1_single_process.py ---
# Took 9.762 seconds
def f(x):
    filename = f"{x}.txt"
    path = 'data/'
    letters = ''.join(random.choices(string.ascii_letters, k=20000))  # 100x larger files
    with open(path + filename, "w") as file:
        file.write(letters)
    os.remove(path + filename)
    return filename

n = 1000
iterable = range(n)
# --- Running 1_concurrent.py ---
# Took 0.162 seconds
# ==============================
# --- Running 1_joblib.py ---
# Took 0.184 seconds
# ==============================
# --- Running 1_mpire.py ---
# Took 0.225 seconds
# ==============================
# --- Running 1_multiprocessing.py ---
# Took 0.140 seconds
# ==============================
# --- Running 1_single_process.py ---
# Took 0.630 seconds
# ==============================
n = 100000
iterable = range(n)
# --- Running 1_concurrent.py ---
# Took 10.911 seconds
# ==============================
# --- Running 1_joblib.py ---
# Took 8.809 seconds
# ==============================
# --- Running 1_mpire.py ---
# Took 10.175 seconds
# ==============================
# --- Running 1_multiprocessing.py ---
# Took 9.417 seconds
# ==============================
# --- Running 1_single_process.py ---
# Took 61.839 seconds
# ==============================
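The disk benchmark writes into a relative `data/` directory, so `open()` raises `FileNotFoundError` if that directory does not exist. A variant that takes the directory as a parameter can be run against a temporary directory instead. The function name `write_and_remove` and its parameters are hypothetical.

```python
import os
import random
import string
import tempfile

def write_and_remove(x, directory, k=200):
    """Write k random letters to <directory>/<x>.txt, delete the file, return its name."""
    filename = f"{x}.txt"
    full_path = os.path.join(directory, filename)
    letters = "".join(random.choices(string.ascii_letters, k=k))
    with open(full_path, "w") as file:
        file.write(letters)
    os.remove(full_path)
    return filename

if __name__ == "__main__":
    # The temporary directory is created for the run and removed afterwards
    with tempfile.TemporaryDirectory() as tmp:
        names = [write_and_remove(x, tmp) for x in range(10)]
        print(names[:3], "files left behind:", os.listdir(tmp))
```

To use it with a process pool, the extra argument can be bound with `functools.partial(write_and_remove, directory=tmp)`.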
What it simulates:
import numpy as np

def f(x):
    size_in_mb = 500
    num_elements = (size_in_mb * 1024 * 1024) // 8  # 8 bytes per float64
    big_array = np.random.rand(num_elements)
    return len(big_array)

n = 100
iterable = range(n)
# --- Running 1_concurrent.py ---
# Took 2.660 seconds
# ==============================
# --- Running 1_joblib.py ---
# Took 1.822 seconds
# ==============================
# --- Running 1_mpire.py ---
# Took 2.086 seconds
# ==============================
# --- Running 1_multiprocessing.py ---
# Took 2.052 seconds
# ==============================
# --- Running 1_single_process.py ---
# Took 13.789 seconds
# ==============================
def f(x):
    size_in_mb = 5000
    num_elements = (size_in_mb * 1024 * 1024) // 8  # 8 bytes per float64
    big_array = np.random.rand(num_elements)
    return len(big_array)

n = 10
iterable = range(n)
# --- Running 1_concurrent.py ---
# Took 8.304 seconds
# ==============================
# --- Running 1_joblib.py ---
# Took 7.740 seconds
# ==============================
# --- Running 1_mpire.py ---
# Took 7.183 seconds
# ==============================
# --- Running 1_multiprocessing.py ---
# Took 7.253 seconds
# ==============================
# --- Running 1_single_process.py ---
# Took 13.624 seconds
# ==============================
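The speedup drops from roughly 5x to 7x with 500 MB arrays to under 2x with 5000 MB arrays. One plausible factor is memory pressure: every worker holds its own array, so peak usage grows with the number of concurrent workers. The arithmetic can be sketched as follows. The function names are hypothetical, and the worker count of 10 is an assumption chosen because it matches `n = 10`.

```python
def array_bytes(size_in_mb):
    """Bytes occupied by the float64 array that f allocates."""
    num_elements = (size_in_mb * 1024 * 1024) // 8
    return num_elements * 8

def peak_gib(size_in_mb, concurrent_workers):
    """Approximate peak memory if every worker holds one array at once."""
    return array_bytes(size_in_mb) * concurrent_workers / 1024**3

if __name__ == "__main__":
    for size in (500, 5000):
        print(f"{size} MB arrays x 10 workers: about {peak_gib(size, 10):.2f} GiB")
```

If the larger figure exceeds the machine's RAM, the operating system has to compress or swap memory, which would erode the benefit of running the tasks in parallel. The source does not state how much memory the test machine has.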
What it simulates:
Scenario:
import numpy as np

def f(x):
    return np.mean(x)

n = 100000
big_array = np.random.rand(100000000)
iterable = np.array_split(big_array, n)
# --- Running 1_concurrent.py ---
# Took 7.952 seconds
# ==============================
# --- Running 1_joblib.py ---
# Took 1.563 seconds
# ==============================
# --- Running 1_mpire.py ---
# Took 1.475 seconds
# ==============================
# --- Running 1_multiprocessing.py ---
# Took 0.985 seconds
# ==============================
# --- Running 1_single_process.py ---
# Took 0.162 seconds
# ==============================
matrix = big_array.reshape(n, -1)
means = matrix.mean(axis=1)
# -- Numpy
# Took 0.043 seconds
# ==============================
n = 1000
iterable = np.array_split(big_array, n)
# --- Running 1_concurrent.py ---
# Took 0.650 seconds
# ==============================
# --- Running 1_joblib.py ---
# Took 1.037 seconds
# ==============================
# --- Running 1_mpire.py ---
# Took 0.229 seconds
# ==============================
# --- Running 1_multiprocessing.py ---
# Took 0.865 seconds
# ==============================
# --- Running 1_single_process.py ---
# Took 0.034 seconds
# ==============================
# -- Numpy
# Took 0.026 seconds
# ==============================
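In this benchmark every chunk has to be serialized, sent to a worker process, and deserialized before `np.mean` can run on it, and the mean of a small chunk is very cheap to compute. The sketch below estimates the payload using only the standard library. `array.array("d")` stands in for a NumPy float64 array, which is an approximation, and the function names are hypothetical.

```python
import pickle
from array import array

def payload_bytes(n_values):
    """Size in bytes of a pickled chunk of n_values 64-bit floats."""
    chunk = array("d", [0.5] * n_values)
    return len(pickle.dumps(chunk))

def total_raw_bytes(total_values):
    """Raw size of the full float64 array before any pickling overhead."""
    return total_values * 8

if __name__ == "__main__":
    per_chunk = 100_000_000 // 100_000  # 1000 values per chunk when n = 100000
    print(f"One pickled chunk: {payload_bytes(per_chunk)} bytes")
    print(f"Whole array: {total_raw_bytes(100_000_000) / 1e6:.0f} MB to transfer")
```

Roughly 800 MB crosses process boundaries in either configuration (`n = 100000` or `n = 1000`), which helps explain why the sequential loop and the vectorized NumPy call both beat every parallel variant here.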
From Part 1: CPU parallelism is not efficient here because of the overhead of transferring data between worker processes.
n = 100000
size = 100000000
big_array = np.random.rand(size)
iterable = np.array_split(big_array, n)
matrix = big_array.reshape(n, -1)
means = matrix.mean(axis=1)
# Took 0.033 seconds
GPU solution for matrix operations:
import torch
import time
import numpy as np

# Pick the best available device: NVIDIA GPU, Apple GPU, or CPU fallback
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
    print("No GPU detected")

big_tensor = torch.rand(size, device=device)
t = time.time()
matrix_tensor = big_tensor.reshape(n, -1)
means_tensor = matrix_tensor.mean(dim=1)
# Took 0.003 seconds
10x more data:
n = 100000
size = 1000000000
big_array = np.random.rand(size)
iterable = np.array_split(big_array, n)
matrix = big_array.reshape(n, -1)
means = matrix.mean(axis=1)
# Took 0.330 seconds (x10)
big_tensor = torch.rand(size, device=device)
t = time.time()
matrix_tensor = big_tensor.reshape(n, -1)
means_tensor = matrix_tensor.mean(dim=1)
# Took 0.005 seconds
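GPU work is usually queued asynchronously, so reading the clock right after `mean()` may measure only the time needed to launch the operation. That could be why ten times more data adds just 0.002 seconds above. A more careful measurement waits for the device before stopping the timer. The sketch below assumes a recent PyTorch that provides `torch.cuda.synchronize()` and `torch.mps.synchronize()`. The helper names are hypothetical, and the import is guarded so that the file still loads without PyTorch.

```python
import time

try:
    import torch
except ImportError:  # keeps the sketch importable without PyTorch
    torch = None

def synchronize(device):
    """Block until queued work on the device has finished (no-op on CPU)."""
    if device.type == "cuda":
        torch.cuda.synchronize()
    elif device.type == "mps":
        torch.mps.synchronize()

def timed_row_means(size, n, device):
    """Return (row means, elapsed seconds), including device execution time."""
    big_tensor = torch.rand(size, device=device)
    synchronize(device)  # finish the allocation before starting the clock
    start = time.perf_counter()
    means = big_tensor.reshape(n, -1).mean(dim=1)
    synchronize(device)  # wait for the computation before reading the clock
    return means, time.perf_counter() - start

if __name__ == "__main__" and torch is not None:
    if torch.cuda.is_available():
        device = torch.device("cuda")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
    else:
        device = torch.device("cpu")
    means, elapsed = timed_row_means(1_000_000, 1_000, device)
    print("Took %.3f seconds on %s" % (elapsed, device))
```

Running the function once before the timed call is also worthwhile, since the first GPU operation in a process often includes one-off initialization costs.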
Core concept:
Analogy:
GPU strength: Massive parallelism for simple, uniform tasks
GPU limitation: Not suited for complex, varied operations
Task: Find 3 most similar vectors for each new vector
3 approaches compared:
import time
import numpy as np
from utils_2 import nb_txt, dim, nb_new

def find_similar_numpy(new_texts_matrix, all_txt_matrix):
    # Cosine similarity: dot products divided by the product of the norms
    dot_product = np.dot(all_txt_matrix, new_texts_matrix.T)  # shape (nb_txt, nb_new)
    all_txt_norm = np.linalg.norm(all_txt_matrix, axis=1)
    new_texts_norm = np.linalg.norm(new_texts_matrix, axis=1)
    denominator = all_txt_norm[:, np.newaxis] * new_texts_norm[np.newaxis, :]
    similarity = dot_product / denominator
    # For each row, the 3 column indices with the highest similarity
    return np.argsort(-similarity, axis=1)[:, :3]

if __name__ == "__main__":
    existing_txt_np = np.random.rand(nb_txt, dim).astype(np.float32)
    new_txt_np = np.random.rand(nb_new, dim).astype(np.float32)
    t = time.time()
    closest_indices_np = find_similar_numpy(new_txt_np, existing_txt_np)
    print("Took %.3f seconds" % (time.time() - t))
import torch
import time
import numpy as np
from utils_2 import nb_txt, dim, nb_new

device = torch.device("cpu")

def find_similar_pytorch(new_texts_tensor, all_txt_tensor):
    # Cosine similarity: dot products divided by the product of the norms
    dot_product = torch.matmul(all_txt_tensor, new_texts_tensor.T)  # shape (nb_txt, nb_new)
    all_txt_norm = torch.linalg.norm(all_txt_tensor, dim=1)
    new_texts_norm = torch.linalg.norm(new_texts_tensor, dim=1)
    denominator = all_txt_norm.unsqueeze(1) * new_texts_norm.unsqueeze(0)
    similarity = dot_product / denominator
    # For each row, the 3 column indices with the highest similarity
    return torch.argsort(similarity, dim=1, descending=True)[:, :3]

if __name__ == "__main__":
    existing_txt_np = np.random.rand(nb_txt, dim).astype(np.float32)
    new_txt_np = np.random.rand(nb_new, dim).astype(np.float32)
    # from_numpy shares memory with the NumPy arrays (no copy)
    existing_txt_pt_cpu = torch.from_numpy(existing_txt_np)
    new_txt_pt_cpu = torch.from_numpy(new_txt_np)
    t = time.time()
    closest_indices_pt_cpu = find_similar_pytorch(new_txt_pt_cpu, existing_txt_pt_cpu)
    print("Took %.3f seconds" % (time.time() - t))
import torch
import time
import numpy as np
from utils_2 import nb_txt, dim, nb_new

# Pick the best available device: NVIDIA GPU, Apple GPU, or CPU fallback
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
    print("No GPU detected")

def find_similar_pytorch(new_texts_tensor, all_txt_tensor):
    # Cosine similarity: dot products divided by the product of the norms
    dot_product = torch.matmul(all_txt_tensor, new_texts_tensor.T)  # shape (nb_txt, nb_new)
    all_txt_norm = torch.linalg.norm(all_txt_tensor, dim=1)
    new_texts_norm = torch.linalg.norm(new_texts_tensor, dim=1)
    denominator = all_txt_norm.unsqueeze(1) * new_texts_norm.unsqueeze(0)
    similarity = dot_product / denominator
    # For each row, the 3 column indices with the highest similarity
    return torch.argsort(similarity, dim=1, descending=True)[:, :3]

if __name__ == "__main__":
    existing_txt_np = np.random.rand(nb_txt, dim).astype(np.float32)
    new_txt_np = np.random.rand(nb_new, dim).astype(np.float32)
    # Copy the data to the device before starting the timer
    existing_txt_gpu = torch.from_numpy(existing_txt_np).to(device)
    new_txt_gpu = torch.from_numpy(new_txt_np).to(device)
    t = time.time()
    closest_indices_gpu = find_similar_pytorch(new_txt_gpu, existing_txt_gpu)
    print("Took %.3f seconds" % (time.time() - t))
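A slow but transparent reference implementation is useful for checking the orientation of the vectorized versions on tiny inputs. In the functions above, `similarity` has shape `(nb_txt, nb_new)`, so sorting along axis 1 ranks the new vectors for each existing one. The stated task is the other way round: one row of three existing-vector indices per new vector. The pure-Python sketch below does that with `heapq.nlargest`. The function names are hypothetical.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_similar(new_vectors, existing_vectors, k=3):
    """For each new vector, return the indices of the k most similar existing vectors."""
    result = []
    for new in new_vectors:
        scores = [cosine(new, old) for old in existing_vectors]
        best = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
        result.append(best)
    return result

if __name__ == "__main__":
    existing = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    new = [[1.0, 0.1]]
    print(top_k_similar(new, existing))
```

In the vectorized code, the same orientation can be obtained by sorting along the other axis or by transposing `similarity` first. This changes what is sorted, so the timings reported below would not carry over unchanged.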
Parameters: nb_txt = 1000, dim = 100, nb_new = 100
# --- Running 2_numpy.py ---
# Took 0.002 seconds
# ==============================
# --- Running 2_pytorch_cpu.py ---
# Took 0.002 seconds
# ==============================
# --- Running 2_pytorch_gpu.py ---
# Took 1.196 seconds
Medium: nb_txt = 10000, dim = 100, nb_new = 1000
# --- Running 2_numpy.py ---
# Took 0.305 seconds
# ==============================
# --- Running 2_pytorch_cpu.py ---
# Took 0.054 seconds
# ==============================
# --- Running 2_pytorch_gpu.py ---
# Took 0.022 seconds
Large: nb_txt = 1000000, dim = 100, nb_new = 1000
# --- Running 2_numpy.py ---
# Took 30.882 seconds
# ==============================
# --- Running 2_pytorch_cpu.py ---
# Took 4.023 seconds
# ==============================
# --- Running 2_pytorch_gpu.py ---
# Took 0.026 seconds
Test 1: nb_txt = 100000, dim = 1000, nb_new = 1000
# --- Running 2_numpy.py ---
# Took 3.117 seconds
# ==============================
# --- Running 2_pytorch_cpu.py ---
# Took 0.444 seconds
# ==============================
# --- Running 2_pytorch_gpu.py ---
# Took 0.022 seconds
Test 2: nb_txt = 100000, dim = 1000, nb_new = 10000
# --- Running 2_numpy.py ---
# Took 41.750 seconds
# ==============================
# --- Running 2_pytorch_cpu.py ---
# Took 5.713 seconds
# ==============================
# --- Running 2_pytorch_gpu.py ---
# Took 0.263 seconds
Test: nb_txt = 100000, dim = 1000, nb_new = 100000
# --- Running 2_numpy.py ---
# ./run_all_2.sh: line 3: 35710 Killed: 9 python3 "$f"
# ==============================
# --- Running 2_pytorch_cpu.py ---
# ./run_all_2.sh: line 3: 35787 Killed: 9 python3 "$f"
# ==============================
# --- Running 2_pytorch_gpu.py ---
# Traceback (most recent call last):
# File "/Users/peltouz/Library/Mobile Documents/com~apple~CloudDocs/GitHub/Advanced Programming/2_pytorch_gpu.py", line 35, in <module>
# closest_indices_gpu = find_similar_pytorch(new_txt_gpu, existing_txt_gpu)
# File "/Users/peltouz/Library/Mobile Documents/com~apple~CloudDocs/GitHub/Advanced Programming/2_pytorch_gpu.py", line 18, in find_similar_pytorch
# dot_product = torch.matmul(all_txt_tensor, new_texts_tensor.T)
# RuntimeError: Invalid buffer size: 37.25 GiB
Killed: 9 (CPU):
RuntimeError: Invalid buffer size: 37.25 GiB (GPU):
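The 37.25 GiB in the error message matches the size of the full similarity matrix: 100,000 x 100,000 float32 values at 4 bytes each. The sketch below reproduces that arithmetic and estimates how many new vectors could be processed per batch under a given memory budget. The function names and the 1 GiB budget are assumptions for illustration.

```python
def similarity_matrix_gib(nb_txt, nb_new, bytes_per_value=4):
    """Size in GiB of an (nb_txt x nb_new) matrix of float32 values."""
    return nb_txt * nb_new * bytes_per_value / 1024**3

def max_batch_size(nb_txt, budget_gib, bytes_per_value=4):
    """Largest number of new vectors whose similarity block fits in budget_gib."""
    return int(budget_gib * 1024**3 // (nb_txt * bytes_per_value))

if __name__ == "__main__":
    print(f"Full matrix: {similarity_matrix_gib(100_000, 100_000):.2f} GiB")
    print(f"New vectors per 1 GiB batch: {max_batch_size(100_000, 1.0)}")
```

Processing the new vectors in batches of that size and keeping only the top 3 indices from each batch would bound the memory used by the similarity block. The sort itself needs additional working memory, so a real budget should leave some headroom.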