This chapter is a very brief introduction to the concepts of generator and matrix sparse which can be useful to optimize RAM usage in some cases. Also it will show how to use the full power of cores present in our computer by initiating students to parallel algebra.

1 Python Generators

Generator usages in Python is a simple way of creating iterators. The goal of this kind of function is to return an object for which we can iterate over it. When the function is called, the function code is not executed. Instead, the function will return a generator object. The difference in terms of syntax hold in the fact that we will the statement ‘yield’ instead of ‘return’. In comparaison to ‘return’, using ‘yield’ will pause the function and the control of that generator will be trasnfered to the object which calls this generator.
Generators are very useful when we deal with objects which can be too heavy to store all the information in the memory. As a generators is an iterator, it will load in the RAM only the iteration that it is reading, meaning that we can have no size limit if we want to iterate over an object that is to heavy. This first example is made with a txt file with 1000 rows. We will see how RAM is used behind our machine to understand how it can be relevant to use generators.

# Python 
import os
import psutil
import re
import time

process = psutil.Process(os.getpid())

t = time.time()
r1 = process.memory_info().rss 
def read_file_to_lower():
  with open("data/text.txt", "r") as txt:
    txt_list = txt.read().split('\n')
    txt_list_low = [re.sub(' ','',row.lower()) for row in txt_list]
  return txt_list_low

txt_clean = read_file_to_lower()

process.memory_info().rss - r1 # in bytes
## 110592
time.time() - t
## 0.01718425750732422
type(txt_clean)
## <class 'list'>
# Python

process = psutil.Process(os.getpid())

t = time.time()
r1 = process.memory_info().rss 

def read_to_lower():
  with open("data/text.txt", "r") as txt:
    txt_line = txt.readline()
    while txt_line:
      txt_line = txt.readline()
      yield re.sub(' ','',txt_line.lower())


memory_friendly_txt_clean = read_to_lower()
txt_clean = [line_clean for line_clean in memory_friendly_txt_clean]

process.memory_info().rss - r1 # in bytes
## 0
time.time() - t

## 0.018773794174194336
type(memory_friendly_txt_clean)

## <class 'generator'>

Note how the memory used during the calculation is very different.

 

2 Sparse Matrix

When dealing with large array that are sparse (lot of element equal to 0), it is inneficient to store this object as a dense matrix. Large matrix are sometimes difficult to store in the memory because they are weighty, when they dealing with sparse we can store it and do calculation with them in a very efficient way using the ‘Compressed Sparse Column’ (CSC) format. See how it change the weight of the matrix when changing the format to sparse matrix .

# Python 
import numpy as np
from scipy.sparse import csr_matrix
#from pympler import asizeof

X = np.zeros(10000**2)
idx = np.random.randint(1,10000**2,1000)
X[idx] = 1

X = X.reshape(10000,10000)

#asizeof.asizeof(X)/1e+6

sparse_X = csr_matrix(X)
sparse_X
#asizeof.asizeof(sparse_X)/1e+6
## <10000x10000 sparse matrix of type '<class 'numpy.float64'>'
##  with 1000 stored elements in Compressed Sparse Row format>
# Python 
import time

t = time.time()
sum_x = X+X
time.time() - t
## 0.08066511154174805
t = time.time()
sum_sparse_x = sparse_X+sparse_X
time.time() - t
## 0.006093502044677734
# R
library(Matrix)

X = matrix(0,10000,10000)
idx = round(runif(1000,1,10000**2))

X[idx] = 1

print(object.size(X),units = 'auto')
## 762.9 Mb

sparse_X = Matrix(X,sparse = T)
print(object.size(sparse_X),units = 'auto')
## 52.2 Kb
# R
library(tictoc)
## Error in library(tictoc): there is no package called 'tictoc'

tic()
## Error in tic(): could not find function "tic"
sum_x = X+X
toc()
## Error in toc(): could not find function "toc"

tic()
## Error in tic(): could not find function "tic"
sum_sparse_x = sparse_X+sparse_X
toc()
## Error in toc(): could not find function "toc"

3 Parallel computing

Sometimes the tasks performed successively in a loop are slow, as long as one operation is not finished the computer does not move on to the next one. When the iterations of a loop are independent of each other, i.e. the iteration i+1 is insensitive to the results of the previous iterations, we can use parallel calculation. Your computer is composed of physical processors and logical processors (cores and threads) on which the tasks will be distributed.

Let’s make an example where the goal will be to make a bootstrap with a linear model. We will resample 500 times our dataset and estimate a model for each sample.

Let’s create first the dataset

# Python 
import numpy as np

X = np.random.rand(100000,2)
y = 2 + 0.5*X[:,0].reshape(100000,1) + 0.2*X[:,1].reshape(100000,1) + np.random.normal(0,3,100000).reshape(100000,1)
  • Create a function to compute OLS coefficients for a given sample
# Python 
from sklearn.linear_model import LinearRegression

def lm_bs():
    idx_sample = np.random.randint(0,len(y),len(y))
    y_sample = y[idx_sample]
    X_sample = X[idx_sample]
    reg = LinearRegression().fit(X_sample, y_sample)
    coef_s = reg.coef_[0]
    return(coef_s)
  • Run bootstrap with a loop
# Python
import time 

t = time.time()
coefs = np.empty((500,2), float)
for i in range(500):
    coefs[i,:] = lm_bs()
time.time() - t
## 3.0123157501220703
  • Check how many cores you can use
# Python

from joblib import Parallel, delayed
import multiprocessing
num_cores = multiprocessing.cpu_count()
print(num_cores)
## 12
  • Run the loop in parallel !
t = time.time()
coefs = Parallel(n_jobs=5)(delayed(lm_bs)() for i in range(500))
time.time() - t
## 7.585024118423462
# R

X = matrix(runif(100000*2),100000,2)
y = 2 + 0.5*X[,1] + 0.2*X[,2] + rnorm(100000,0,3)
  • Create a function to compute OLS coefficients for a given sample
# R 

lm_bs <- function(){
    idx_sample = sample(seq(1,length(y)))
    y_sample = y[idx_sample]
    X_sample = X[idx_sample,]
    reg = lm(y_sample ~ X_sample)
    coef_s = reg$coefficients[-1]
    return(coef_s)
}
  • Run bootstrap with a loop
# R

tic()
## Error in tic(): could not find function "tic"
coefs = matrix(0,500,2)
for(i in 1:500){
  coefs[i,] = lm_bs()
}
toc()
## Error in toc(): could not find function "toc"
  • Check how many cores you can use
# R
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
detectCores()
## [1] 12
registerDoParallel(5)
  • Run the loop in parallel !
# R

tic()
## Error in tic(): could not find function "tic"
coefs =  foreach(i = 1:500,.packages = c('base')) %dopar% {
                   lm_bs()
                   }

toc()
## Error in toc(): could not find function "toc"

4 Error and Exception Handling

It is important to understand why a given function may produce an error. Determining the situations where errors may occur is not always easy, but it’s important to not be too lenient in order to avoid errors. In the examples, we will see how the flexibility of a function can lead to different results, some of which may be more efficient than others.

For example, if we expect that 99% of the time the result will contain something iterable, we would use the try/except approach. This will be faster if exceptions are truly exceptional. However, if the result is None more than 50% of the time, then using an ‘if’ statement is probably a better approach.

While an ‘if’ statement always has a cost, setting up a try/except block is relatively inexpensive. However, when an exception does occur, the cost is much higher.

# Python 
import numpy as np
import random

# Let's create a function that create a dirty list
def create_dirty_list(None_prop):
  list_ = list()
  for i in range(10000):
    if i<None_prop*10000:
      list_.append([None])
    else:
      values = random.sample(range(1, 1000), random.sample(range(1, 50),1)[0])
      # introduce some character 
      if i%10==0 :
        values = [str(i) for i in values]
      list_.append(values)
  return list_

# randomly take two values to compute the ratio 
def calc1(values):
  output = values[random.sample(range(1, len(values)+1),1)[0]-1]/values[random.sample(range(1, len(values)+1),1)[0]-1]
  return output

list_ = create_dirty_list(0.5)
# Store it in a list
results = [calc1(values) for  values in list_]
## unsupported operand type(s) for /: 'NoneType' and 'NoneType'
# Let's change the function

def calc2(values):
  if not any(value is None for value in values): 
    output = values[random.sample(range(1, len(values)+1),1)[0]-1]/values[random.sample(range(1, len(values)+1),1)[0]-1]
    return output

results = [calc2(values) for values in list_]
## unsupported operand type(s) for /: 'str' and 'str'
# Let's change the function

def calc3(values):
  if not any(value is None for value in values): 
    if all(isinstance(value,int) for value in values):
      output = values[random.sample(range(1, len(values)+1),1)[0]-1]/values[random.sample(range(1, len(values)+1),1)[0]-1]
      return output 

results = [calc3(values) for values in list_]
# using try

def calc_try(values):
  try:
    output = values[random.sample(range(1, len(values)+1),1)[0]-1]/values[random.sample(range(1, len(values)+1),1)[0]-1]
  except:
    output = None
  return output
    
results = [calc_try(values) for values in list_]
  • Compare execution time
list_ = create_dirty_list(0.01)
t= time.time()
results = [calc3(values) for values in list_]
time.time()-t
## 0.10365605354309082
t= time.time()
results = [calc_try(values) for values in list_]
time.time()-t
## 0.07570314407348633
list_ = create_dirty_list(0.33)
t= time.time()
results = [calc3(values) for values in list_]
time.time()-t
## 0.07312226295471191
t= time.time()
results = [calc_try(values) for values in list_]
time.time()-t
## 0.07549452781677246
list_ = create_dirty_list(0.75)
t= time.time()
results = [calc3(values) for values in list_]
time.time()-t
## 0.03282761573791504
t= time.time()
results = [calc_try(values) for values in list_]
time.time()-t
## 0.07580995559692383
# R
# Let's create a function that create a list 
create_dirty_list <- function(None_prop){
  list_ <- list()
  for(i in seq(1,10000,1)){
    if(i<None_prop*10000){
      list_[[i]] <- c(NA)
      } else {
      values <- sample(seq(1,10000,1),round(runif(1,1,50)))
      # introduce some character 
      if(i%%10==0){
        values <- as.character(values)
      list_[[i]] <- values
      }
      }
  }
  return(list_)
  }

# randomly take two values to compute the ratio 
calc1 <- function(values){
  output <- values[round(runif(1,1,length(values)))]/values[round(runif(1,1,length(values)))]
  return(output)
}

list_ <- create_dirty_list(0.5)
# Store it in a list
results <- lapply(list_,FUN = calc1)
## Error in values[round(runif(1, 1, length(values)))]/values[round(runif(1, : non-numeric argument to binary operator
# Let's change the function

calc2 <- function(values){
  if(!any(is.na(values))){
    output <- values[round(runif(1,1,length(values)))]/values[round(runif(1,1,length(values)))]
    return(output)
  }
}


results <- lapply(list_,FUN = calc2)
## Error in values[round(runif(1, 1, length(values)))]/values[round(runif(1, : non-numeric argument to binary operator
# Let's change the function

calc3 <- function(values){
  if(!any(is.na(values))){
    if(is.numeric(values)){
      output <- values[round(runif(1,1,length(values)))]/values[round(runif(1,1,length(values)))]
      return(output)
    }
  }
}



results <- lapply(list_,FUN = calc3)
# using try

calc_try <- function(values){
  output <- tryCatch(
    {
      values[round(runif(1,1,length(values)))]/values[round(runif(1,1,length(values)))]
    }
    ,error=function(e){
      return(NA)
    })
  return(output)
}
    
results <- lapply(list_,FUN = calc_try)
  • Compare execution time
list_ <- create_dirty_list(0.01)
t <- Sys.time()
results <- lapply(list_,FUN = calc3)
Sys.time()-t
## Time difference of 0.02626657 secs
t <- Sys.time()
results <- lapply(list_,FUN = calc_try)
Sys.time()-t
## Time difference of 0.7108152 secs
# compare it 
list_ = create_dirty_list(0.33)
t <- Sys.time()
results <- lapply(list_,FUN = calc3)
Sys.time()-t
## Time difference of 0.02208614 secs
t <- Sys.time()
results <- lapply(list_,FUN = calc_try)
Sys.time()-t
## Time difference of 0.5026259 secs
# compare it 
list_ = create_dirty_list(0.75)
t <- Sys.time()
results <- lapply(list_,FUN = calc3)
Sys.time()-t
## Time difference of 0.01321411 secs
t <- Sys.time()
results <- lapply(list_,FUN = calc_try)
Sys.time()-t
## Time difference of 0.2554872 secs