This chapter builds on the notions of arrays and vectors introduced in the ‘Basics’ chapter. In Python, NumPy is the main package for working with arrays, while in R everything needed is contained in base R. The main advantage of vectors is that operations can be vectorized: instead of using a loop to run an operation on each element of an object, we apply it to whole blocks of data at once, which is more efficient.

Let’s first look at an example to see the benefit of vectorized operations:

# Python 

import numpy as np
import time 

arr = np.arange(2000000)

lst = list(range(2000000))

t = time.time()
arr_2 = arr*2
time.time() - t
## 0.006739139556884766

t = time.time()
lst_2 = [i*2  for i in lst]
time.time() - t
## 0.021797895431518555

# R

vec = c(1:2000000)

t = Sys.time()
vec_2 = vec*2
Sys.time() - t
## Time difference of 0.00257206 secs

t = Sys.time()
vec_2_loop <- rep(0,length(vec))
for(i in 1:length(vec)){
  vec_2_loop[i] = vec[i]*2
    }
Sys.time() - t
## Time difference of 0.1197219 secs
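A single time.time() measurement can be noisy. As a hedged sketch, the same Python comparison can be repeated with the standard timeit module, keeping the best of several runs. The array size `n` and the repetition counts are illustrative choices, not values from the text, and the timings will differ from machine to machine.

```python
import timeit
import numpy as np

n = 200_000  # assumption: smaller than in the text so the sketch runs quickly
arr = np.arange(n)
lst = list(range(n))

# Take the best of several repetitions to reduce measurement noise.
t_vec = min(timeit.repeat(lambda: arr * 2, number=10, repeat=3))
t_loop = min(timeit.repeat(lambda: [i * 2 for i in lst], number=10, repeat=3))

print(f"vectorized: {t_vec:.4f} s, list comprehension: {t_loop:.4f} s")

# Both approaches produce the same values.
same_values = (arr * 2).tolist() == [i * 2 for i in lst]
```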

1 Create arrays

The NumPy library provides multidimensional array and matrix data structures, along with methods to operate on them efficiently. If you do not have NumPy on your computer, open a terminal and run pip install numpy.

To use NumPy, load the module by running import numpy as np. You can also write just import numpy, but you will then need to write numpy.something for every call. To keep code standardized, we will always refer to NumPy as np.

NumPy allows fast and efficient calculations. The main difference between a NumPy array and a Python list is that all elements of a NumPy array must have the same type (the array is homogeneous), which lets NumPy use less memory to store the data.
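The two claims above (a single element type, and a smaller memory footprint) can be checked with a rough sketch. The variable names are illustrative, and the list size is only an estimate that adds the list object to its integer objects as reported by sys.getsizeof.

```python
import sys
import numpy as np

# Mixed numeric input is upcast to a single common dtype.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64

# Memory used by the raw data of an array of 1000 64-bit integers.
arr = np.arange(1000, dtype=np.int64)
array_bytes = arr.nbytes  # 1000 elements * 8 bytes each

# Rough memory of the equivalent list: the list object plus each int object.
lst = list(range(1000))
list_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(i) for i in lst)

print(array_bytes, list_bytes)
```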

Functions	Tasks
array	Create a NumPy array
ndim	Number of dimensions of the array
shape	Size of the array along each dimension (e.g., number of rows and columns)
size	Total number of elements in the array
dtype	Type of the elements in the array, e.g., int64, float64
reshape	Returns a reshaped array without changing the shape of the original
resize	Reshapes the array in place, changing the original (as the method arr.resize)
arange	Create a sequence of numbers in an array
itemsize	Size in bytes of each item
diag	Create a diagonal matrix
vstack	Stack arrays vertically
hstack	Stack arrays horizontally
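A short sketch of a few entries from the table, in particular the difference between reshape and the resize method. The array names are illustrative.

```python
import numpy as np

a = np.arange(6)

# Number of axes, shape, number of elements, bytes per element.
print(a.ndim, a.shape, a.size, a.itemsize)

# reshape returns a new array object; `a` keeps its own shape.
b = a.reshape(2, 3)
print(a.shape, b.shape)  # (6,) (2, 3)

# The resize method changes the array itself and returns None.
c = np.arange(6)
c.resize((3, 2))
print(c.shape)  # (3, 2)

# diag builds a diagonal matrix from a 1-D input.
d = np.diag([1, 2, 3])
print(d)
```

Note that the function np.resize(a, new_shape), unlike the method, returns a new array and leaves its input untouched.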

Starting from an existing list or vector:

# Python 

import numpy as np

arr = np.array([1, 2, 3, 4])
arr
## array([1, 2, 3, 4])

# R

arr = matrix(c(1, 2, 3, 4), 4, 1)
arr
##      [,1]
## [1,]    1
## [2,]    2
## [3,]    3
## [4,]    4

Create arrays of zeros, ones, or sequences:

# Python 

np.zeros([2, 3])
## array([[0., 0., 0.],
##        [0., 0., 0.]])


np.ones((2, 3))
## array([[1., 1., 1.],
##        [1., 1., 1.]])


np.arange(1, 7)
## array([1, 2, 3, 4, 5, 6])


np.arange(1, 7).reshape(2, 3)
## array([[1, 2, 3],
##        [4, 5, 6]])

## remember from the first chapter that arrays in R and Python aren't stored the same way; set order = 'F' to get the same result as in R


np.arange(1, 7).reshape([2, 3], order = 'F')
## array([[1, 3, 5],
##        [2, 4, 6]])

np.linspace(1, 4,num = 10)
## array([1.        , 1.33333333, 1.66666667, 2.        , 2.33333333,
##        2.66666667, 3.        , 3.33333333, 3.66666667, 4.        ])

# R

matrix(rep(0,2*3), 2, 3)
##      [,1] [,2] [,3]
## [1,]    0    0    0
## [2,]    0    0    0

matrix(rep(1,2*3), 2, 3)
##      [,1] [,2] [,3]
## [1,]    1    1    1
## [2,]    1    1    1

seq(1,6)
## [1] 1 2 3 4 5 6

matrix(seq(1,6),2,3)
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

seq(1, 4, length.out = 10)
##  [1] 1.000000 1.333333 1.666667 2.000000 2.333333 2.666667 3.000000 3.333333
##  [9] 3.666667 4.000000
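The storage-order difference noted in the Python comments above can be made explicit with a small sketch. NumPy fills arrays row by row by default (order 'C'), while R fills matrices column by column, which order='F' reproduces. The variable names are illustrative.

```python
import numpy as np

values = np.arange(1, 7)

row_major = values.reshape(2, 3)             # default order='C': fill row by row
col_major = values.reshape(2, 3, order="F")  # order='F': fill column by column, as R does

print(row_major)
print(col_major)

# Filling a (3, 2) array by row and transposing gives the same column-major layout.
same = np.array_equal(col_major, values.reshape(3, 2).T)
```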

1.1 Random array

# Python 

#generate an array
arr_rd = np.random.randn(4,5)

arr_rd
## array([[-1.33952175e+00,  1.22071591e+00,  5.76721935e-01,
##         -6.26253351e-01, -2.61185704e-03],
##        [ 7.05617478e-01,  1.13717427e-01,  1.74653248e-01,
##         -6.82634913e-01, -2.68689462e-01],
##        [ 1.12199458e+00,  3.49450329e-01,  2.66535086e-01,
##          2.50150232e-01, -2.58821150e-01],
##        [-5.52368746e-01,  1.05893184e-01, -9.28211938e-01,
##         -2.68808278e+00,  4.82327953e-01]])

# R

#generate a matrix
mat_rd <- matrix(rnorm(4*5),4,5)

mat_rd
##            [,1]       [,2]        [,3]       [,4]        [,5]
## [1,]  0.2843290 -1.9360957 -1.73290934 -0.7167748 -1.81599807
## [2,]  1.2245251  2.6357024 -2.30157759 -0.7038928  0.74748374
## [3,]  0.3941497  1.3683239 -0.20105003 -0.1347477 -0.04645494
## [4,] -2.2583160  0.5984619 -0.01868813 -0.8490810 -1.03206005

1.2 Indexing and slicing

# Python 

arr = np.arange(1,7).reshape(2,3)
arr[0]
## array([1, 2, 3])
arr[:2]
## array([[1, 2, 3],
##        [4, 5, 6]])
arr[1:]
## array([[4, 5, 6]])

# R


mat <- matrix(1:6,2,3,byrow = T)
mat[1,]
## [1] 1 2 3
mat[1:2,]
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
mat[2:dim(mat)[1],]
## [1] 4 5 6

1.3 Shape and size

# Python 

# .ndim gives the number of axes, or dimensions, of the array.
arr_rd.ndim
## 2

# Python 

# .size gives the total number of elements of the array. 
arr_rd.size
## 20

# Python 

# .shape gives a tuple of integers with the number of elements stored along each dimension of the array
arr_rd.shape
## (4, 5)

# R

# length gives the total number of elements of the array.
length(mat_rd)
## [1] 20

# R

# dim gives a vector of integers with the number of elements stored along each dimension of the array
dim(mat_rd)
## [1] 4 5

2 Modifying arrays

2.1 Add elements

# Python 
arr
## array([[1, 2, 3],
##        [4, 5, 6]])

# /!\ without an axis argument, np.append flattens the array to 1-D
arr_1d = np.append(arr, [7, 8, 9])
arr_1d
## array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# using insert you can add rows and columns 
np.insert(arr, len(arr), [7, 8, 9], axis = 0)
## array([[1, 2, 3],
##        [4, 5, 6],
##        [7, 8, 9]])

np.insert(arr, 2, [7, 8], axis = 1)
## array([[1, 2, 7, 3],
##        [4, 5, 8, 6]])

# R
mat
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6

rbind(mat,7:9)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9

cbind(mat,7:8)
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    7
## [2,]    4    5    6    8
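np.append only flattens when no axis is given. As a sketch, passing axis keeps the array two-dimensional and gives the same results as rbind and cbind above, provided the added values already have a compatible 2-D shape. The names `with_row` and `with_col` are illustrative.

```python
import numpy as np

arr = np.arange(1, 7).reshape(2, 3)

# The new row must be 2-D with a matching number of columns.
with_row = np.append(arr, [[7, 8, 9]], axis=0)

# The new column must be 2-D with a matching number of rows.
with_col = np.append(arr, [[7], [8]], axis=1)

print(with_row)
print(with_col)
```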

Combining/splitting arrays

# Python 
arr0 = np.zeros([2,3])
arr1 = np.ones([2,3])

arr01 = np.vstack([arr0,arr1])
arr01
## array([[0., 0., 0.],
##        [0., 0., 0.],
##        [1., 1., 1.],
##        [1., 1., 1.]])

#or 
np.concatenate([arr0,arr1], axis = 0)
## array([[0., 0., 0.],
##        [0., 0., 0.],
##        [1., 1., 1.],
##        [1., 1., 1.]])

np.hstack([arr0,arr1])
## array([[0., 0., 0., 1., 1., 1.],
##        [0., 0., 0., 1., 1., 1.]])

#or
np.concatenate([arr0,arr1], axis = 1)
## array([[0., 0., 0., 1., 1., 1.],
##        [0., 0., 0., 1., 1., 1.]])

np.hsplit(arr01,3)
## [array([[0.],
##        [0.],
##        [1.],
##        [1.]]), array([[0.],
##        [0.],
##        [1.],
##        [1.]]), array([[0.],
##        [0.],
##        [1.],
##        [1.]])]
np.vsplit(arr01,4)
## [array([[0., 0., 0.]]), array([[0., 0., 0.]]), array([[1., 1., 1.]]), array([[1., 1., 1.]])]

# R

mat0 = matrix(0,2,3)
mat1 = matrix(1,2,3)

mat01 = rbind(mat0,mat1)
mat01
##      [,1] [,2] [,3]
## [1,]    0    0    0
## [2,]    0    0    0
## [3,]    1    1    1
## [4,]    1    1    1

cbind(mat0,mat1)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    0    0    0    1    1    1
## [2,]    0    0    0    1    1    1

# horizontal
asplit(mat01,2)
## [[1]]
## [1] 0 0 1 1
## 
## [[2]]
## [1] 0 0 1 1
## 
## [[3]]
## [1] 0 0 1 1

#vertical
asplit(mat01,1)
## [[1]]
## [1] 0 0 0
## 
## [[2]]
## [1] 0 0 0
## 
## [[3]]
## [1] 1 1 1
## 
## [[4]]
## [1] 1 1 1

2.2 Delete elements

# Python 

np.delete(arr,1 , axis = 1)
## array([[1, 3],
##        [4, 6]])

np.delete(arr,1 , axis = 0)
## array([[1, 2, 3]])

# R

mat[,-2]
##      [,1] [,2]
## [1,]    1    3
## [2,]    4    6

mat[-2,]
## [1] 1 2 3

2.3 Sorting

# Python 

arr = np.random.randn(10)

arr
## array([ 0.55362894, -0.92812522, -1.06856489,  0.99160095,  0.88314684,
##        -1.47997287,  0.15545915, -1.50116766,  0.48026097,  0.61767938])
arr.sort()  # sorts in place and returns None, so nothing is printed

# Python 

arr = np.random.randn(4,3)

arr
## array([[-0.48539526, -0.06614056,  0.78217971],
##        [-0.0455197 , -0.73038539,  1.63142109],
##        [-0.96818455,  1.92365276,  2.28668514],
##        [-0.15303005,  0.98473509, -0.47546953]])
arr.sort(1)  # sorts in place along axis 1, i.e. within each row

# R

arr <- rnorm(10)

arr
##  [1]  0.8549672 -0.4093803  0.7205933  0.9980313  0.9626051 -0.5713735
##  [7] -0.2263444 -0.6893314 -0.7362330 -0.6039860
sort(arr)
##  [1] -0.7362330 -0.6893314 -0.6039860 -0.5713735 -0.4093803 -0.2263444
##  [7]  0.7205933  0.8549672  0.9626051  0.9980313

# R

arr <- matrix(rnorm(4*3),4,3)

arr
##             [,1]       [,2]       [,3]
## [1,] -0.63021358 -1.5681169 -0.3527171
## [2,]  0.06050562  1.9075538  0.5152889
## [3,] -0.07583089 -0.8443039 -0.4779911
## [4,]  0.88827136 -0.5283868  2.0469722
apply(arr,MARGIN = 2, FUN = sort)
##             [,1]       [,2]       [,3]
## [1,] -0.63021358 -1.5681169 -0.4779911
## [2,] -0.07583089 -0.8443039 -0.3527171
## [3,]  0.06050562 -0.5283868  0.5152889
## [4,]  0.88827136  1.9075538  2.0469722
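Because the sort method works in place and prints nothing, a small fixed example may make the behaviour easier to see. This sketch uses np.sort, which returns a sorted copy, and shows both axes. The input values are made up for illustration.

```python
import numpy as np

arr = np.array([[3, 1, 2],
                [0, 7, 1]])

sorted_rows = np.sort(arr, axis=1)  # new array, values sorted within each row
sorted_cols = np.sort(arr, axis=0)  # new array, values sorted within each column

in_place = arr.copy()
result = in_place.sort(axis=1)      # sorts in_place itself; result is None

print(sorted_rows)
print(sorted_cols)
```

Sorting with axis=0 is the counterpart of apply(arr, MARGIN = 2, FUN = sort) in the R example, which sorts each column.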

3 Conditional logic on arrays

Imagine two square matrices representing interactions between entities. Each matrix is a basic network: 1 means an interaction, 0 means no interaction. You want to know whether two entities that are linked in one network are also linked in the other. Instead of checking each pair of entities sequentially in a loop, you can multiply the two matrices element-wise and get the same answer.

# Python 

# first network, built from a nested list
list_1 = [[0,1,0,1],[1,0,1,0],[0,1,0,1],[1,0,1,0]]
network_1 = np.array(list_1)

# second network, built the same way
list_2 = [[0,1,0,0],[1,0,0,0],[0,0,0,1],[0,0,1,0]]
network_2 = np.array(list_2)

network_12 = network_1*network_2
network_12
## array([[0, 1, 0, 0],
##        [1, 0, 0, 0],
##        [0, 0, 0, 1],
##        [0, 0, 1, 0]])

# R

# first network, built from a list of row vectors
list_1 <- list(c(0,1,0,1),c(1,0,1,0),c(0,1,0,1),c(1,0,1,0))

network_1 <- do.call(rbind,list_1)

# second network, built the same way
list_2 = list(c(0,1,0,0),c(1,0,0,0),c(0,0,0,1),c(0,0,1,0))

network_2 <- do.call(rbind,list_2)

network_12 = network_1*network_2
network_12
##      [,1] [,2] [,3] [,4]
## [1,]    0    1    0    0
## [2,]    1    0    0    0
## [3,]    0    0    0    1
## [4,]    0    0    1    0
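The element-wise product works here because the matrices only contain 0 and 1. As an alternative sketch, the same question can be written with element-wise comparisons and the & operator, which also makes related questions easy to ask. The names `both` and `only_first` are illustrative.

```python
import numpy as np

network_1 = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]])
network_2 = np.array([[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]])

# True where a link exists in both networks.
both = (network_1 == 1) & (network_2 == 1)

# Links present in the first network but missing from the second.
only_first = (network_1 == 1) & (network_2 == 0)

# The boolean version agrees with the element-wise product.
same_as_product = np.array_equal(both.astype(int), network_1 * network_2)
```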

3.1 Select specific elements

# Python 

network_12 > 0
## array([[False,  True, False, False],
##        [ True, False, False, False],
##        [False, False, False,  True],
##        [False, False,  True, False]])

network_12[network_12>0]
## array([1, 1, 1, 1])

# R
 
network_12 > 0
##       [,1]  [,2]  [,3]  [,4]
## [1,] FALSE  TRUE FALSE FALSE
## [2,]  TRUE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE  TRUE
## [4,] FALSE FALSE  TRUE FALSE

network_12[network_12>0]
## [1] 1 1 1 1

One thing we often need is to modify the specific values in an array that satisfy a condition. The first step is to find which cells satisfy the condition. In Python, np.where is the common tool, while in R we use the which function.

See how the execution time changes with the size of the vector:

# Python 

arr = np.arange(200000)


t = time.time()
results = [(i if i%2==0 else 0) for i in arr]
time.time() - t
## 0.009582757949829102


t = time.time()
results = np.where(arr%2!=0,0,arr)
time.time() - t
## 0.0020558834075927734

# Python 

arr = np.arange(20000000)


t = time.time()
results = [(i if i%2==0 else 0) for i in arr]
time.time() - t
## 0.841174840927124


t = time.time()
results = np.where(arr%2!=0,0,arr)
time.time() - t
## 0.09565281867980957

# R

vec <- seq(1,200000)

t <- Sys.time()
results <- rep(0,length(vec))
for(i in 1:length(vec)){
  if(vec[i]%%2!=0){
    results[i] = 0
  } else {
    results[i] = vec[i]
  }
}
Sys.time() - t
## Time difference of 0.02257609 secs


t <- Sys.time()
results <- ifelse(vec%%2!=0,0,vec)
Sys.time() - t
## Time difference of 0.00338006 secs

# R

vec <- seq(1,20000000)

t <- Sys.time()
results <- rep(0,length(vec))
for(i in 1:length(vec)){
  if(vec[i]%%2!=0){
    results[i] = 0
  } else {
    results[i] = vec[i]
  }
}
Sys.time() - t
## Time difference of 1.775931 secs


t <- Sys.time()
results <- ifelse(vec%%2!=0,0,vec)
Sys.time() - t
## Time difference of 0.4151349 secs


## other way using which

t <- Sys.time()
vec[which(vec%%2!=0)] = 0
Sys.time() - t 
## Time difference of 0.172961 secs

3.2 Other conditional vectorization

# R
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Fast version of ifelse()
x <- c(-3:3, NA)
if_else(condition = x < 0,
        true      = "neg",
        false     = "pos",
        missing   = "NA")
## [1] "neg" "neg" "neg" "pos" "pos" "pos" "pos" "NA"


# Vectorised ifelse statements


x <- 1:10
case_when(
  x %% 6 == 0 ~ "fizz buzz",
  x %% 2 == 0 ~ "fizz",
  x %% 3 == 0 ~ "buzz",
  TRUE ~ as.character(x)
)
##  [1] "1"         "fizz"      "buzz"      "fizz"      "5"         "fizz buzz"
##  [7] "7"         "fizz"      "buzz"      "fizz"
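On the Python side, a rough counterpart of case_when is np.select, which takes a list of conditions and a list of values. This is a sketch of the same fizz buzz labelling. The names `conditions`, `choices` and `labels` are illustrative, and the default is converted to strings so that all outputs share one type.

```python
import numpy as np

x = np.arange(1, 11)

conditions = [x % 6 == 0, x % 2 == 0, x % 3 == 0]
choices = ["fizz buzz", "fizz", "buzz"]

# Conditions are checked in order and the first match wins, as in case_when.
labels = np.select(conditions, choices, default=x.astype(str))

print(labels)
```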

4 Algebra

# Python 
from numpy.linalg import inv, qr 
 
X = np.array([[0,1,5,1],[2,1,3,1],[2,1,9,6],[7,2,1,0],[8,3,5,5]]) 
Y = np.array([0,1,2,5,4])

X
## array([[0, 1, 5, 1],
##        [2, 1, 3, 1],
##        [2, 1, 9, 6],
##        [7, 2, 1, 0],
##        [8, 3, 5, 5]])
X.T.dot(X) 
## array([[121,  42,  71,  54],
##        [ 42,  16,  34,  23],
##        [ 71,  34, 141,  87],
##        [ 54,  23,  87,  63]])
# same as 
XtX = np.dot(X.T,X)
# same as 
X.T @ X
## array([[121,  42,  71,  54],
##        [ 42,  16,  34,  23],
##        [ 71,  34, 141,  87],
##        [ 54,  23,  87,  63]])

# inverse Matrix
inv(XtX)
## array([[ 0.23199214, -0.71424803,  0.11636773, -0.09879148],
##        [-0.71424803,  2.33481201, -0.37284193,  0.27469787],
##        [ 0.11636773, -0.37284193,  0.10787934, -0.11260311],
##        [-0.09879148,  0.27469787, -0.11260311,  0.15576444]])

# Identity matrix
np.diag(np.ones(3))
## array([[1., 0., 0.],
##        [0., 1., 0.],
##        [0., 0., 1.]])

# Python 

XtY = np.dot(X.T,Y)

# OLS 
Beta = inv(XtX).dot(XtY)
Beta
## array([ 1.17202187, -1.85550547,  0.42034337, -0.38384807])

# R

 
X <- do.call(rbind,list(c(0,1,5,1),c(2,1,3,1),c(2,1,9,6),c(7,2,1,0),c(8,3,5,5))) 
Y <- c(0,1,2,5,4)
 
t(X)%*%X
##      [,1] [,2] [,3] [,4]
## [1,]  121   42   71   54
## [2,]   42   16   34   23
## [3,]   71   34  141   87
## [4,]   54   23   87   63

XtX <- t(X)%*%X

# inverse Matrix
solve(XtX)
##             [,1]       [,2]       [,3]        [,4]
## [1,]  0.23199214 -0.7142480  0.1163677 -0.09879148
## [2,] -0.71424803  2.3348120 -0.3728419  0.27469787
## [3,]  0.11636773 -0.3728419  0.1078793 -0.11260311
## [4,] -0.09879148  0.2746979 -0.1126031  0.15576444

# Identity matrix
diag(3)
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1

# R


XtY = t(X)%*%Y

# OLS 
Beta = solve(XtX) %*% XtY
Beta
##            [,1]
## [1,]  1.1720219
## [2,] -1.8555055
## [3,]  0.4203434
## [4,] -0.3838481
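The OLS coefficients above are computed with an explicit inverse, which follows the textbook formula. As an alternative sketch, NumPy can solve the same problem without forming the inverse, which is generally preferred for numerical stability. The names `beta_solve`, `beta_lstsq` and `beta_inv` are illustrative.

```python
import numpy as np

X = np.array([[0, 1, 5, 1], [2, 1, 3, 1], [2, 1, 9, 6], [7, 2, 1, 0], [8, 3, 5, 5]])
Y = np.array([0, 1, 2, 5, 4])

# Solve the normal equations (X'X) beta = X'Y without forming the inverse.
beta_solve = np.linalg.solve(X.T @ X, X.T @ Y)

# Least-squares solver working on X directly.
beta_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X, Y, rcond=None)

# Reference: the explicit-inverse formula used in the text.
beta_inv = np.linalg.inv(X.T @ X) @ (X.T @ Y)

print(beta_solve)
print(beta_lstsq)
```

All three vectors should agree with Beta up to floating-point error.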

5 Operations on arrays

Operator	Equivalent	Description
+	np.add	Addition (e.g., 1 + 1 = 2)
-	np.subtract	Subtraction (e.g., 3 - 2 = 1)
-	np.negative	Unary negation (e.g., -2)
*	np.multiply	Multiplication (e.g., 2 * 3 = 6)
/	np.divide	Division (e.g., 3 / 2 = 1.5)
//	np.floor_divide	Floor division (e.g., 3 // 2 = 1)
**	np.power	Exponentiation (e.g., 2 ** 3 = 8)
%	np.mod	Modulus/remainder (e.g., 9 % 4 = 1)
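A quick sketch confirming that each operator in the table gives the same result as its NumPy function when applied element-wise to arrays. The arrays `a` and `b` are made-up values.

```python
import numpy as np

a = np.array([3, 9, 2])
b = np.array([2, 4, 3])

# Each operator is compared with the function listed next to it in the table.
checks = {
    "+": np.array_equal(a + b, np.add(a, b)),
    "-": np.array_equal(a - b, np.subtract(a, b)),
    "unary -": np.array_equal(-a, np.negative(a)),
    "*": np.array_equal(a * b, np.multiply(a, b)),
    "/": np.array_equal(a / b, np.divide(a, b)),
    "//": np.array_equal(a // b, np.floor_divide(a, b)),
    "**": np.array_equal(a ** b, np.power(a, b)),
    "%": np.array_equal(a % b, np.mod(a, b)),
}

floor_div = a // b   # element-wise: 3 // 2, 9 // 4, 2 // 3
remainder = a % b    # element-wise: 3 % 2, 9 % 4, 2 % 3

print(checks)
```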

# Python 

# number of links (the matrix is symmetric, so each link is counted twice, hence the division by 2)
network_12.sum()/2
## np.float64(2.0)
# same since there is only 0 and 1 
(network_12>0).sum()/2
## np.float64(2.0)

# share of link 
network_12.mean()
## np.float64(0.25)


# nb link by entities
network_12.sum(axis=1)
## array([1, 1, 1, 1])

# share of links for each entity (axis 1: one value per row)
network_12.mean(1)
## array([0.25, 0.25, 0.25, 0.25])

# R

# number of non-zero cells (the matrix is symmetric, so each link is counted twice)
sum(network_12)
## [1] 4
# same since there is only 0 and 1 
sum(network_12[which(network_12>0,arr.ind = T)]) # see the arr.ind to get the two coordinates
## [1] 4

# share of link 
mean(network_12)
## [1] 0.25


# nb link by entities
apply(network_12,MARGIN = 2,FUN = sum)
## [1] 1 1 1 1

# share of link across entities (by columns)
apply(network_12,MARGIN = 2,FUN = mean)
## [1] 0.25 0.25 0.25 0.25

5.1 Apply family (R)

The apply() family manipulates slices of data from matrices, arrays, lists and data frames in a repetitive way, avoiding explicit loop constructs. These functions act on an input list, matrix or array and apply a function with one or several optional arguments.

5.1.1 apply

apply takes a matrix as input, applies a function by row or by column, and returns a vector or a matrix.

# R

mat_1 = matrix(1:(4*4),4,4)

# by row
apply(mat_1,MARGIN = 1,FUN = sum)
## [1] 28 32 36 40

# by columns
apply(mat_1,MARGIN = 2,FUN = function(x){x**2})
##      [,1] [,2] [,3] [,4]
## [1,]    1   25   81  169
## [2,]    4   36  100  196
## [3,]    9   49  121  225
## [4,]   16   64  144  256

5.1.2 lapply

lapply takes a list as input, applies a function to each element and returns a list.

# R

list_1 = list(mat_1,seq(1,8,0.5))
list_1
## [[1]]
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16
## 
## [[2]]
##  [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0

# square every value in each list element
lapply(list_1,FUN = function(x){x**2})
## [[1]]
##      [,1] [,2] [,3] [,4]
## [1,]    1   25   81  169
## [2,]    4   36  100  196
## [3,]    9   49  121  225
## [4,]   16   64  144  256
## 
## [[2]]
##  [1]  1.00  2.25  4.00  6.25  9.00 12.25 16.00 20.25 25.00 30.25 36.00 42.25
## [13] 49.00 56.25 64.00

# sum each list element
lapply(list_1,FUN = sum)
## [[1]]
## [1] 136
## 
## [[2]]
## [1] 67.5

5.1.3 sapply

sapply takes a list as input, applies a function to each element and simplifies the result to a vector or a matrix when possible.

# R

list_1 = list(mat_1,seq(1,8,length.out = length(mat_1)))
list_1
## [[1]]
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16
## 
## [[2]]
##  [1] 1.000000 1.466667 1.933333 2.400000 2.866667 3.333333 3.800000 4.266667
##  [9] 4.733333 5.200000 5.666667 6.133333 6.600000 7.066667 7.533333 8.000000

# square every value in each list element (simplified to a matrix, one column per element)
sapply(list_1,FUN = function(x){x**2})
##       [,1]      [,2]
##  [1,]    1  1.000000
##  [2,]    4  2.151111
##  [3,]    9  3.737778
##  [4,]   16  5.760000
##  [5,]   25  8.217778
##  [6,]   36 11.111111
##  [7,]   49 14.440000
##  [8,]   64 18.204444
##  [9,]   81 22.404444
## [10,]  100 27.040000
## [11,]  121 32.111111
## [12,]  144 37.617778
## [13,]  169 43.560000
## [14,]  196 49.937778
## [15,]  225 56.751111
## [16,]  256 64.000000

# sum each list element (simplified to a vector)
sapply(list_1,FUN = sum)
## [1] 136  72

5.1.4 mapply

mapply is a ‘multivariate’ apply. Its main goal is to vectorize over the arguments of a function that does not normally accept vectors as arguments. Depending on the size of the outputs, it returns a matrix or a list.

# R

# returns a matrix (all length.out = 5)
mapply(FUN = function(x,y,z){seq(x,y,length.out = z)},1,1:5,5)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1 1.00  1.0 1.00    1
## [2,]    1 1.25  1.5 1.75    2
## [3,]    1 1.50  2.0 2.50    3
## [4,]    1 1.75  2.5 3.25    4
## [5,]    1 2.00  3.0 4.00    5

# returns a list (length.out goes from 1 to 5)
mapply(FUN = function(x,y,z){seq(x,y,length.out = z)},1,1:5,1:5)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1 2
## 
## [[3]]
## [1] 1 2 3
## 
## [[4]]
## [1] 1 2 3 4
## 
## [[5]]
## [1] 1 2 3 4 5
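For readers following the Python side, a rough counterpart of the second mapply call can be sketched with the built-in map, which accepts several iterables and walks them in parallel. The list names are illustrative. Unlike mapply, map does not recycle shorter arguments, so the constant start value is written out in full.

```python
import numpy as np

starts = [1, 1, 1, 1, 1]
stops = [1, 2, 3, 4, 5]
lengths = [1, 2, 3, 4, 5]

# map walks the three lists in parallel, like mapply walks its arguments.
sequences = list(map(lambda x, y, z: np.linspace(x, y, num=z), starts, stops, lengths))

for seq in sequences:
    print(seq)
```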

5.2 Map (Python)

In Python we can use map to do the same thing as the apply family in R, with a bit of manipulation.

The map function accepts different kinds of input. If the object is a 2-D array, it iterates over the rows, and it returns an iterator. You can then turn this iterator into a list or an array.

# Python

arr_1 = np.arange(1,17).reshape([4,4], order = 'F')
arr_1
## array([[ 1,  5,  9, 13],
##        [ 2,  6, 10, 14],
##        [ 3,  7, 11, 15],
##        [ 4,  8, 12, 16]])

# by row; to work by column, transpose the array first
map(np.sum,arr_1)
## <map object at 0x148296b30>
list(map(np.sum,arr_1))
## [np.int64(28), np.int64(32), np.int64(36), np.int64(40)]
np.fromiter(map(sum,arr_1), dtype = int)
## array([28, 32, 36, 40])

# with a list
lst_1 = [arr_1,np.linspace(1,8,num=arr_1.size)]
list(map(lambda x: x**2,lst_1))
## [array([[  1,  25,  81, 169],
##        [  4,  36, 100, 196],
##        [  9,  49, 121, 225],
##        [ 16,  64, 144, 256]]), array([ 1.        ,  2.15111111,  3.73777778,  5.76      ,  8.21777778,
##        11.11111111, 14.44      , 18.20444444, 22.40444444, 27.04      ,
##        32.11111111, 37.61777778, 43.56      , 49.93777778, 56.75111111,
##        64.        ])]
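As a complement to map, NumPy has np.apply_along_axis, which is closer in spirit to R's apply because the axis is chosen with an argument instead of a transpose. The sketch below reuses the same 4 x 4 array. The names `row_sums` and `col_sums` are illustrative.

```python
import numpy as np

arr_1 = np.arange(1, 17).reshape([4, 4], order="F")

# axis=1 hands each row to the function, like apply(mat_1, MARGIN = 1, FUN = sum) in R.
row_sums = np.apply_along_axis(np.sum, 1, arr_1)

# axis=0 hands each column to the function, like MARGIN = 2 in R.
col_sums = np.apply_along_axis(np.sum, 0, arr_1)

# For simple reductions, the axis argument of the method is the direct route.
same = np.array_equal(col_sums, arr_1.sum(axis=0))

print(row_sums, col_sums)
```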

Pandas’ apply function:

With pandas one can also use the apply method. The difference from map is that by default it works by column, since variables in a DataFrame are stored by column.

# Python

import pandas as pd

arr_1 = np.arange(1,17).reshape([4,4], order = 'F')

#transform the array into a DataFrame
df_1 = pd.DataFrame(arr_1)
df_1
##    0  1   2   3
## 0  1  5   9  13
## 1  2  6  10  14
## 2  3  7  11  15
## 3  4  8  12  16


# Apply the function by column
df_1.apply(sum)
## 0    10
## 1    26
## 2    42
## 3    58
## dtype: int64
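To work by row instead, the apply method takes an axis argument. A short sketch on the same DataFrame follows. The name `row_sums` is illustrative.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame(np.arange(1, 17).reshape([4, 4], order="F"))

# axis=1 passes each row to the function instead of each column.
row_sums = df_1.apply(np.sum, axis=1)

# The built-in reduction gives the same values.
same = bool((row_sums == df_1.sum(axis=1)).all())

print(row_sums)
```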

6 Exercises

6.1 Exercise 1

The series, \(1^{1} + 2^{2} + 3^{3} + ... + 10^{10} = 10405071317\).

Find the last ten digits of the series, \(1^{1} + 2^{2} + 3^{3} + ... + 1000^{1000}\).

6.2 Exercise 2

Try to vectorize exercises 1 and 2 from chapter 1; you can also compare your solution with the apply/map functions.

6.3 Exercise 3

A palindromic number reads the same both ways. The largest palindrome made from the product of two 2-digit numbers is 9009 = 91 × 99.

Find the largest palindrome made from the product of two 3-digit numbers.
