This chapter is a very brief introduction to the concepts of generator and matrix sparse which can be useful to optimize RAM usage in some cases. Also it will show how to use the full power of cores present in our computer by initiating students to parallel algebra.
Generator usages in Python is a simple way of creating iterators. The goal of this kind of function is to return an object for which we can iterate over it. When the function is called, the function code is not executed. Instead, the function will return a generator object. The difference in terms of syntax hold in the fact that we will the statement ‘yield’ instead of ‘return’. In comparaison to ‘return’, using ‘yield’ will pause the function and the control of that generator will be trasnfered to the object which calls this generator.
Generators are very useful when we deal with objects which can be too heavy to store all the information in the memory. As a generators is an iterator, it will load in the RAM only the iteration that it is reading, meaning that we can have no size limit if we want to iterate over an object that is to heavy. This first example is made with a txt file with 1000 rows. We will see how RAM is used behind our machine to understand how it can be relevant to use generators.
# Python
import os
import psutil
import re
import time
process = psutil.Process(os.getpid())
t = time.time()
r1 = process.memory_info().rss
def read_file_to_lower():
with open("data/text.txt", "r") as txt:
txt_list = txt.read().split('\n')
txt_list_low = [re.sub(' ','',row.lower()) for row in txt_list]
return txt_list_low
txt_clean = read_file_to_lower()
process.memory_info().rss - r1 # in bytes
## 110592
time.time() - t
## 0.01718425750732422
type(txt_clean)
## <class 'list'>
# Python
process = psutil.Process(os.getpid())
t = time.time()
r1 = process.memory_info().rss
def read_to_lower():
with open("data/text.txt", "r") as txt:
txt_line = txt.readline()
while txt_line:
txt_line = txt.readline()
yield re.sub(' ','',txt_line.lower())
memory_friendly_txt_clean = read_to_lower()
txt_clean = [line_clean for line_clean in memory_friendly_txt_clean]
process.memory_info().rss - r1 # in bytes
## 0
time.time() - t
## 0.018773794174194336
type(memory_friendly_txt_clean)
## <class 'generator'>
Note how the memory used during the calculation is very different.
When dealing with large array that are sparse (lot of element equal to 0), it is inneficient to store this object as a dense matrix. Large matrix are sometimes difficult to store in the memory because they are weighty, when they dealing with sparse we can store it and do calculation with them in a very efficient way using the ‘Compressed Sparse Column’ (CSC) format. See how it change the weight of the matrix when changing the format to sparse matrix .
# Python
import numpy as np
from scipy.sparse import csr_matrix
#from pympler import asizeof
X = np.zeros(10000**2)
idx = np.random.randint(1,10000**2,1000)
X[idx] = 1
X = X.reshape(10000,10000)
#asizeof.asizeof(X)/1e+6
sparse_X = csr_matrix(X)
sparse_X
#asizeof.asizeof(sparse_X)/1e+6
## <10000x10000 sparse matrix of type '<class 'numpy.float64'>'
## with 1000 stored elements in Compressed Sparse Row format>
# Python
import time
t = time.time()
sum_x = X+X
time.time() - t
## 0.08066511154174805
t = time.time()
sum_sparse_x = sparse_X+sparse_X
time.time() - t
## 0.006093502044677734
# R
library(Matrix)
X = matrix(0,10000,10000)
idx = round(runif(1000,1,10000**2))
X[idx] = 1
print(object.size(X),units = 'auto')
## 762.9 Mb
sparse_X = Matrix(X,sparse = T)
print(object.size(sparse_X),units = 'auto')
## 52.2 Kb
# R
library(tictoc)
## Error in library(tictoc): there is no package called 'tictoc'
tic()
## Error in tic(): could not find function "tic"
sum_x = X+X
toc()
## Error in toc(): could not find function "toc"
tic()
## Error in tic(): could not find function "tic"
sum_sparse_x = sparse_X+sparse_X
toc()
## Error in toc(): could not find function "toc"
Sometimes the tasks performed successively in a loop are slow, as long as one operation is not finished the computer does not move on to the next one. When the iterations of a loop are independent of each other, i.e. the iteration i+1 is insensitive to the results of the previous iterations, we can use parallel calculation. Your computer is composed of physical processors and logical processors (cores and threads) on which the tasks will be distributed.
Let’s make an example where the goal will be to make a bootstrap with a linear model. We will resample 500 times our dataset and estimate a model for each sample.
Let’s create first the dataset
# Python
import numpy as np
X = np.random.rand(100000,2)
y = 2 + 0.5*X[:,0].reshape(100000,1) + 0.2*X[:,1].reshape(100000,1) + np.random.normal(0,3,100000).reshape(100000,1)
# Python
from sklearn.linear_model import LinearRegression
def lm_bs():
idx_sample = np.random.randint(0,len(y),len(y))
y_sample = y[idx_sample]
X_sample = X[idx_sample]
reg = LinearRegression().fit(X_sample, y_sample)
coef_s = reg.coef_[0]
return(coef_s)
# Python
import time
t = time.time()
coefs = np.empty((500,2), float)
for i in range(500):
coefs[i,:] = lm_bs()
time.time() - t
## 3.0123157501220703
# Python
from joblib import Parallel, delayed
import multiprocessing
num_cores = multiprocessing.cpu_count()
print(num_cores)
## 12
t = time.time()
coefs = Parallel(n_jobs=5)(delayed(lm_bs)() for i in range(500))
time.time() - t
## 7.585024118423462
# R
X = matrix(runif(100000*2),100000,2)
y = 2 + 0.5*X[,1] + 0.2*X[,2] + rnorm(100000,0,3)
# R
lm_bs <- function(){
idx_sample = sample(seq(1,length(y)))
y_sample = y[idx_sample]
X_sample = X[idx_sample,]
reg = lm(y_sample ~ X_sample)
coef_s = reg$coefficients[-1]
return(coef_s)
}
# R
tic()
## Error in tic(): could not find function "tic"
coefs = matrix(0,500,2)
for(i in 1:500){
coefs[i,] = lm_bs()
}
toc()
## Error in toc(): could not find function "toc"
# R
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
detectCores()
## [1] 12
registerDoParallel(5)
# R
tic()
## Error in tic(): could not find function "tic"
coefs = foreach(i = 1:500,.packages = c('base')) %dopar% {
lm_bs()
}
toc()
## Error in toc(): could not find function "toc"
It is important to understand why a given function may produce an error. Determining the situations where errors may occur is not always easy, but it’s important to not be too lenient in order to avoid errors. In the examples, we will see how the flexibility of a function can lead to different results, some of which may be more efficient than others.
For example, if we expect that 99% of the time the result will contain something iterable, we would use the try/except approach. This will be faster if exceptions are truly exceptional. However, if the result is None more than 50% of the time, then using an ‘if’ statement is probably a better approach.
While an ‘if’ statement always has a cost, setting up a try/except block is relatively inexpensive. However, when an exception does occur, the cost is much higher.
# Python
import numpy as np
import random
# Let's create a function that create a dirty list
def create_dirty_list(None_prop):
list_ = list()
for i in range(10000):
if i<None_prop*10000:
list_.append([None])
else:
values = random.sample(range(1, 1000), random.sample(range(1, 50),1)[0])
# introduce some character
if i%10==0 :
values = [str(i) for i in values]
list_.append(values)
return list_
# randomly take two values to compute the ratio
def calc1(values):
output = values[random.sample(range(1, len(values)+1),1)[0]-1]/values[random.sample(range(1, len(values)+1),1)[0]-1]
return output
list_ = create_dirty_list(0.5)
# Store it in a list
results = [calc1(values) for values in list_]
## unsupported operand type(s) for /: 'NoneType' and 'NoneType'
# Let's change the function
def calc2(values):
if not any(value is None for value in values):
output = values[random.sample(range(1, len(values)+1),1)[0]-1]/values[random.sample(range(1, len(values)+1),1)[0]-1]
return output
results = [calc2(values) for values in list_]
## unsupported operand type(s) for /: 'str' and 'str'
# Let's change the function
def calc3(values):
if not any(value is None for value in values):
if all(isinstance(value,int) for value in values):
output = values[random.sample(range(1, len(values)+1),1)[0]-1]/values[random.sample(range(1, len(values)+1),1)[0]-1]
return output
results = [calc3(values) for values in list_]
# using try
def calc_try(values):
try:
output = values[random.sample(range(1, len(values)+1),1)[0]-1]/values[random.sample(range(1, len(values)+1),1)[0]-1]
except:
output = None
return output
results = [calc_try(values) for values in list_]
list_ = create_dirty_list(0.01)
t= time.time()
results = [calc3(values) for values in list_]
time.time()-t
## 0.10365605354309082
t= time.time()
results = [calc_try(values) for values in list_]
time.time()-t
## 0.07570314407348633
list_ = create_dirty_list(0.33)
t= time.time()
results = [calc3(values) for values in list_]
time.time()-t
## 0.07312226295471191
t= time.time()
results = [calc_try(values) for values in list_]
time.time()-t
## 0.07549452781677246
list_ = create_dirty_list(0.75)
t= time.time()
results = [calc3(values) for values in list_]
time.time()-t
## 0.03282761573791504
t= time.time()
results = [calc_try(values) for values in list_]
time.time()-t
## 0.07580995559692383
# R
# Let's create a function that create a list
create_dirty_list <- function(None_prop){
list_ <- list()
for(i in seq(1,10000,1)){
if(i<None_prop*10000){
list_[[i]] <- c(NA)
} else {
values <- sample(seq(1,10000,1),round(runif(1,1,50)))
# introduce some character
if(i%%10==0){
values <- as.character(values)
list_[[i]] <- values
}
}
}
return(list_)
}
# randomly take two values to compute the ratio
calc1 <- function(values){
output <- values[round(runif(1,1,length(values)))]/values[round(runif(1,1,length(values)))]
return(output)
}
list_ <- create_dirty_list(0.5)
# Store it in a list
results <- lapply(list_,FUN = calc1)
## Error in values[round(runif(1, 1, length(values)))]/values[round(runif(1, : non-numeric argument to binary operator
# Let's change the function
calc2 <- function(values){
if(!any(is.na(values))){
output <- values[round(runif(1,1,length(values)))]/values[round(runif(1,1,length(values)))]
return(output)
}
}
results <- lapply(list_,FUN = calc2)
## Error in values[round(runif(1, 1, length(values)))]/values[round(runif(1, : non-numeric argument to binary operator
# Let's change the function
calc3 <- function(values){
if(!any(is.na(values))){
if(is.numeric(values)){
output <- values[round(runif(1,1,length(values)))]/values[round(runif(1,1,length(values)))]
return(output)
}
}
}
results <- lapply(list_,FUN = calc3)
# using try
calc_try <- function(values){
output <- tryCatch(
{
values[round(runif(1,1,length(values)))]/values[round(runif(1,1,length(values)))]
}
,error=function(e){
return(NA)
})
return(output)
}
results <- lapply(list_,FUN = calc_try)
list_ <- create_dirty_list(0.01)
t <- Sys.time()
results <- lapply(list_,FUN = calc3)
Sys.time()-t
## Time difference of 0.02626657 secs
t <- Sys.time()
results <- lapply(list_,FUN = calc_try)
Sys.time()-t
## Time difference of 0.7108152 secs
# compare it
list_ = create_dirty_list(0.33)
t <- Sys.time()
results <- lapply(list_,FUN = calc3)
Sys.time()-t
## Time difference of 0.02208614 secs
t <- Sys.time()
results <- lapply(list_,FUN = calc_try)
Sys.time()-t
## Time difference of 0.5026259 secs
# compare it
list_ = create_dirty_list(0.75)
t <- Sys.time()
results <- lapply(list_,FUN = calc3)
Sys.time()-t
## Time difference of 0.01321411 secs
t <- Sys.time()
results <- lapply(list_,FUN = calc_try)
Sys.time()-t
## Time difference of 0.2554872 secs