This chapter delves deeper into arrays and vectors, which were introduced in the ‘Basics’ chapter. In Python, the primary package for working with arrays is NumPy. The main advantage of arrays is that they support vectorized operations: an operation is applied to a whole block of data at once instead of looping over each element, which is usually much faster.

 

 

Let’s first take a look at an example to understand the benefits of using vectorized operations:

# Python 

import numpy as np
import time 

arr = np.arange(200000)

lst = list(range(200000))

# vectorized: multiply the whole array at once
t = time.time()
arr_2 = arr*2
time.time() - t
## 0.006211757659912109

# loop: multiply each element of the list in turn
t = time.time()
lst_2 = [i*2  for i in lst]
time.time() - t
## 0.013716936111450195

1 Create arrays

The NumPy library contains multidimensional array and matrix data structures and provides methods to efficiently operate on them. If you do not have numpy installed on your computer, you can install it by running pip install numpy in a terminal.

To use NumPy, you need to load the module by running import numpy as np. This lets you refer to NumPy as “np” throughout your code; np is the conventional alias, so using it keeps your code consistent with most NumPy documentation and examples.

NumPy allows for fast and efficient calculations. The main differences between NumPy arrays and Python lists are that all elements of a NumPy array must have the same data type (the array is homogeneous) and that an array uses less memory to store the same data.
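As a rough illustration of both points, the sketch below shows NumPy converting mixed inputs to a single data type and compares the memory footprint of an array with that of a list holding the same numbers. The variable names are illustrative, and the list estimate is approximate and specific to CPython (it counts the list's pointers plus each integer object).

```python
import sys
import numpy as np

# Mixing integers and a float: every element is stored as float64
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)

# The same 1000 integers stored as a list and as an array
lst = list(range(1000))
arr = np.arange(1000, dtype=np.int64)

# Bytes used by the array's data buffer: 1000 elements * 8 bytes each
array_bytes = arr.nbytes

# Approximate size of the list: the list object plus every int object it points to
list_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(i) for i in lst)

print(array_bytes, list_bytes)
```

The exact list size varies between Python versions, but it should come out several times larger than the array's buffer.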

Function   Task
array      Create a NumPy array
ndim       Number of dimensions (axes) of the array
shape      Size of the array along each dimension (e.g., number of rows and columns)
size       Total number of elements in the array
dtype      Type of the elements in the array, e.g., int64 or float64
reshape    Returns a reshaped array without modifying the original array
resize     Reshapes the array in place, changing the original array (method arr.resize)
arange     Create an array containing a sequence of numbers
itemsize   Size in bytes of each element
diag       Create a diagonal matrix
vstack     Stack arrays vertically
hstack     Stack arrays horizontally
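The difference between reshape and resize, together with a few of the attributes listed in the table, can be sketched as follows. This is a minimal example with made-up variable names; the dtype is set explicitly so that itemsize is the same on every platform.

```python
import numpy as np

arr = np.arange(1, 7, dtype=np.int64)    # array([1, 2, 3, 4, 5, 6])

# reshape returns a new array object with the requested shape; arr keeps its shape
reshaped = arr.reshape(2, 3)
print(arr.shape, reshaped.shape)

# the resize method changes the array itself and returns None
resized = np.arange(1, 7, dtype=np.int64)
resized.resize(3, 2)
print(resized.shape)

# attributes from the table
print(reshaped.ndim, reshaped.size, reshaped.itemsize, reshaped.dtype)

# diag builds a square matrix with the given values on the diagonal
diagonal = np.diag([1, 2, 3])
print(diagonal)
```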
  • Starting from an existing list or vector:
# Python 

import numpy as np

arr = np.array([1, 2, 3, 4])
arr
## array([1, 2, 3, 4])
  • Create arrays of zeros, ones, or sequences:
# Python 

np.zeros([2, 3])

## array([[0., 0., 0.],
##        [0., 0., 0.]])
np.ones((2, 3))

## array([[1., 1., 1.],
##        [1., 1., 1.]])
np.arange(1, 7)

## array([1, 2, 3, 4, 5, 6])
np.arange(1, 7).reshape(2, 3)
## array([[1, 2, 3],
##        [4, 5, 6]])

# Remember from the first chapter that arrays in R and Python are not stored in the same order; set order = 'F' to get the same result as in R
np.arange(1, 7).reshape([2, 3], order = 'F')
## array([[1, 3, 5],
##        [2, 4, 6]])
np.linspace(1, 4,num = 10)
## array([1.        , 1.33333333, 1.66666667, 2.        , 2.33333333,
##        2.66666667, 3.        , 3.33333333, 3.66666667, 4.        ])

1.1 Random array

# Python 

# generate a 4x5 array of draws from a standard normal distribution
arr_rd = np.random.randn(4,5)

arr_rd
## array([[ 0.70182419,  0.8359941 ,  0.99714696, -0.50124705, -1.68598557],
##        [-1.52689203, -0.51066088, -0.64745007, -0.32815722, -0.08960339],
##        [-0.42389929,  1.29003683,  0.85168218, -0.52424361, -2.57908576],
##        [-0.96121576,  0.44830129, -1.04795886,  0.23979259, -0.65850916]])

1.2 Indexing and slicing

# Python 

arr = np.arange(1,7).reshape(2,3)
arr[0]
## array([1, 2, 3])
arr[:2]
## array([[1, 2, 3],
##        [4, 5, 6]])
arr[1:]
## array([[4, 5, 6]])
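The examples above slice along the first axis (rows) only. As a small complementary sketch, a second index separated by a comma selects along the columns; the variable names here are illustrative. Note that basic slices are views on the original data, so writing to a slice also changes the array it came from.

```python
import numpy as np

arr = np.arange(1, 7).reshape(2, 3)   # [[1, 2, 3], [4, 5, 6]]

element = arr[1, 2]        # row 1, column 2
first_column = arr[:, 0]   # every row, column 0
block = arr[:, 1:]         # every row, columns 1 onward

print(element, first_column, block)

# block is a view: assigning to it also modifies arr
block[0, 0] = 20
print(arr)
```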

1.3 Shape and size

# Python 

# .ndim gives the number of axes, or dimensions, of the array.
arr_rd.ndim
## 2
# Python 

# .size gives the total number of elements of the array. 
arr_rd.size
## 20
# Python 

# .shape displays a tuple of integers with the number of elements stored along each dimension of the array
arr_rd.shape
## (4, 5)

2 Modifying arrays

2.1 Add elements

# Python 
arr
## array([[1, 2, 3],
##        [4, 5, 6]])

# /!\ without an axis argument, append flattens the array to 1D
arr_1d = np.append(arr, [7, 8, 9])
arr_1d
## array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# using insert, you can add rows and columns
np.insert(arr, len(arr), [7, 8, 9], axis = 0)
## array([[1, 2, 3],
##        [4, 5, 6],
##        [7, 8, 9]])
np.insert(arr, 2, [7, 8], axis = 1)
## array([[1, 2, 7, 3],
##        [4, 5, 8, 6]])
  • Combining/splitting arrays
# Python 
arr0 = np.zeros([2,3])
arr1 = np.ones([2,3])

arr01 = np.vstack([arr0,arr1])
arr01
## array([[0., 0., 0.],
##        [0., 0., 0.],
##        [1., 1., 1.],
##        [1., 1., 1.]])

# or
np.concatenate([arr0,arr1], axis = 0)
## array([[0., 0., 0.],
##        [0., 0., 0.],
##        [1., 1., 1.],
##        [1., 1., 1.]])
np.hstack([arr0,arr1])
## array([[0., 0., 0., 1., 1., 1.],
##        [0., 0., 0., 1., 1., 1.]])

# or
np.concatenate([arr0,arr1], axis = 1)
## array([[0., 0., 0., 1., 1., 1.],
##        [0., 0., 0., 1., 1., 1.]])
np.hsplit(arr01,3)
## [array([[0.],
##        [0.],
##        [1.],
##        [1.]]), array([[0.],
##        [0.],
##        [1.],
##        [1.]]), array([[0.],
##        [0.],
##        [1.],
##        [1.]])]
np.vsplit(arr01,4)
## [array([[0., 0., 0.]]), array([[0., 0., 0.]]), array([[1., 1., 1.]]), array([[1., 1., 1.]])]

2.2 Delete elements

# Python 

# delete the column at index 1
np.delete(arr, 1, axis = 1)
## array([[1, 3],
##        [4, 6]])
# delete the row at index 1
np.delete(arr, 1, axis = 0)
## array([[1, 2, 3]])

2.3 Sorting

# Python 

arr = np.random.randn(10)

arr
## array([ 0.17616954, -0.02056164,  0.89068703, -0.66124081,  1.22587833,
##         0.28249346, -1.18120814,  0.74530569, -0.10474117, -0.64490294])
# sort() sorts the array in place and returns None
arr.sort()
# Python 

arr = np.random.randn(4,3)

arr
## array([[ 0.01665733,  0.4053966 , -0.96790957],
##        [ 0.0319518 ,  0.2796655 ,  0.84289947],
##        [ 0.55898062,  0.75405859, -0.39571195],
##        [ 0.53092467,  0.04941761,  0.4985309 ]])
# sort along axis 1, i.e., sort the values within each row (in place)
arr.sort(1)
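Because the sort method works in place and prints nothing, it can help to compare it with np.sort, which returns a sorted copy. The following sketch uses a small fixed array (chosen here for illustration rather than random values) so the results are easy to check by eye.

```python
import numpy as np

arr = np.array([[3, 7, 2],
                [9, 1, 8]])

# np.sort returns a sorted copy and leaves arr untouched
sorted_rows = np.sort(arr, axis=1)    # sort the values within each row
sorted_cols = np.sort(arr, axis=0)    # sort the values within each column
print(sorted_rows)
print(sorted_cols)

# the sort method modifies the array in place and returns None
in_place = arr.copy()
returned = in_place.sort(axis=1)
print(returned, in_place)
```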

3 Conditional logic on arrays

Imagine you have two square matrices representing interactions between entities. Each matrix represents a basic network, with a value of 1 indicating an interaction and 0 indicating no interaction. You want to determine whether two entities that are linked in one network are also linked in the other. Instead of checking each pair of entities sequentially in a loop, you can multiply the two matrices element-wise: the product is 1 only where both networks have a link, which gives the same result more efficiently.

# Python 

# first network: adjacency matrix built from a nested list
list_1 = [[0,1,0,1],[1,0,1,0],[0,1,0,1],[1,0,1,0]]
network_1 = np.array(list_1)

# second network, built the same way
list_2 = [[0,1,0,0],[1,0,0,0],[0,0,0,1],[0,0,1,0]]
network_2 = np.array(list_2)

network_12 = network_1*network_2
network_12
## array([[0, 1, 0, 0],
##        [1, 0, 0, 0],
##        [0, 0, 0, 1],
##        [0, 0, 1, 0]])

3.1 Select specific elements

# Python 

network_12 > 0
## array([[False,  True, False, False],
##        [ True, False, False, False],
##        [False, False, False,  True],
##        [False, False,  True, False]])
network_12[network_12>0]
## array([1, 1, 1, 1])

One thing that is commonly needed is to modify specific values in an array that meet a certain condition. The first step is to identify which cells meet the condition. In Python, the function np.where is commonly used.

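The speed advantage of np.where over a list comprehension depends on the size of the array. In the timings below, np.where is roughly 4 times faster for 200,000 elements and roughly 13 times faster for 20 million elements.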

# Python 

arr = np.arange(200000)


t = time.time()
results = [(i if i%2==0 else 0) for i in arr]
time.time() - t

## 0.0794527530670166
t = time.time()
results = np.where(arr%2!=0,0,arr)
time.time() - t
## 0.018316268920898438
# Python 

arr = np.arange(20000000)


t = time.time()
results = [(i if i%2==0 else 0) for i in arr]
time.time() - t

## 7.452648401260376
t = time.time()
results = np.where(arr%2!=0,0,arr)
time.time() - t
## 0.5659301280975342
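The two steps described above (identify the cells that meet a condition, then modify them) can also be written out explicitly. The sketch below uses a small array and illustrative variable names; it shows np.where next to boolean-mask assignment, which is another common way to achieve the same replacement.

```python
import numpy as np

arr = np.arange(10)

# Step 1: identify the cells that meet the condition (a boolean mask)
is_odd = arr % 2 != 0

# Step 2, option a: np.where builds a new array and leaves arr unchanged
replaced = np.where(is_odd, 0, arr)

# Step 2, option b: assigning through the mask modifies the array in place
modified = arr.copy()
modified[is_odd] = 0

# Called with only a condition, np.where returns the indices where it holds
odd_positions = np.where(is_odd)[0]

print(replaced, modified, odd_positions)
```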

Exercise 1: Create two 4x4 NumPy arrays with random integers between 1 and 10. Perform element-wise multiplication of these arrays. Then, replace all elements greater than 5 in the resulting array with 0.

Solution:
arr1 = np.random.randint(1, 11, (4, 4))
arr2 = np.random.randint(1, 11, (4, 4))
arr_12 = arr1 * arr2
print(np.where(arr_12 > 5, 0, arr_12))
## [[3 0 0 0]
##  [0 1 0 0]
##  [0 0 0 0]
##  [0 0 0 4]]

4 Algebra

# Python 
from numpy.linalg import inv
 
X = np.array([[0,1,5,1],[2,1,3,1],[2,1,9,6],[7,2,1,0],[8,3,5,5]]) 
Y = np.array([0,1,2,5,4])

X
## array([[0, 1, 5, 1],
##        [2, 1, 3, 1],
##        [2, 1, 9, 6],
##        [7, 2, 1, 0],
##        [8, 3, 5, 5]])
X.T.dot(X)
## array([[121,  42,  71,  54],
##        [ 42,  16,  34,  23],
##        [ 71,  34, 141,  87],
##        [ 54,  23,  87,  63]])
# same as
XtX = np.dot(X.T,X)
# same as
X.T @ X
## array([[121,  42,  71,  54],
##        [ 42,  16,  34,  23],
##        [ 71,  34, 141,  87],
##        [ 54,  23,  87,  63]])

# inverse matrix
inv(XtX)

## array([[ 0.23199214, -0.71424803,  0.11636773, -0.09879148],
##        [-0.71424803,  2.33481201, -0.37284193,  0.27469787],
##        [ 0.11636773, -0.37284193,  0.10787934, -0.11260311],
##        [-0.09879148,  0.27469787, -0.11260311,  0.15576444]])
# Python 

XtY = np.dot(X.T,Y)

# OLS estimator: Beta = (X'X)^-1 X'Y
Beta = inv(XtX).dot(XtY)
Beta
## array([ 1.17202187, -1.85550547,  0.42034337, -0.38384807])
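As a side note, the same coefficients can be obtained without forming the inverse explicitly. The sketch below is one possible alternative, not the chapter's method: it reuses the X and Y defined above and compares the explicit inverse with np.linalg.solve (which solves the normal equations) and np.linalg.lstsq (which works directly on X and Y). Avoiding the explicit inverse is generally considered numerically safer.

```python
import numpy as np

X = np.array([[0,1,5,1],[2,1,3,1],[2,1,9,6],[7,2,1,0],[8,3,5,5]])
Y = np.array([0,1,2,5,4])

# Explicit inverse, as in the example above
beta_inv = np.linalg.inv(X.T @ X) @ (X.T @ Y)

# Solve the normal equations (X'X) b = X'Y without computing the inverse
beta_solve = np.linalg.solve(X.T @ X, X.T @ Y)

# Least-squares solver applied directly to X and Y
beta_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X, Y, rcond=None)

print(beta_inv)
print(beta_solve)
print(beta_lstsq)
```

All three approaches should agree up to floating-point rounding.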

5 Operations on arrays

Operator   Equivalent function   Description
+          np.add                Addition (e.g., 1 + 1 = 2)
-          np.subtract           Subtraction (e.g., 3 - 2 = 1)
-          np.negative           Unary negation (e.g., -2)
*          np.multiply           Multiplication (e.g., 2 * 3 = 6)
/          np.divide             Division (e.g., 3 / 2 = 1.5)
//         np.floor_divide       Floor division (e.g., 3 // 2 = 1)
**         np.power              Exponentiation (e.g., 2 ** 3 = 8)
%          np.mod                Modulus/remainder (e.g., 9 % 4 = 1)
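To make the correspondence in the table concrete, the short check below applies each operator and its equivalent function to two small arrays (values chosen here for illustration) and confirms that both give the same element-wise result.

```python
import numpy as np

a = np.array([9, 3, 2])
b = np.array([4, 2, 3])

# Each operator is applied element-wise and matches its NumPy function
checks = {
    "+": np.array_equal(a + b, np.add(a, b)),
    "-": np.array_equal(a - b, np.subtract(a, b)),
    "unary -": np.array_equal(-a, np.negative(a)),
    "*": np.array_equal(a * b, np.multiply(a, b)),
    "/": np.array_equal(a / b, np.divide(a, b)),
    "//": np.array_equal(a // b, np.floor_divide(a, b)),
    "**": np.array_equal(a ** b, np.power(a, b)),
    "%": np.array_equal(a % b, np.mod(a, b)),
}
print(checks)

floor_div = a // b     # element-wise floor division
remainder = a % b      # element-wise remainder
print(floor_div, remainder)
```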
# Python 

# keep only the upper triangle (above the diagonal) so that each link is counted once
network_12 = np.triu(network_12,1)

# number of links
network_12.sum()
## 2
# same result, since the matrix only contains 0 and 1
(network_12>0).sum()
## 2

# share of links
network_12.mean()
## 0.125

# number of links per entity (row sums)
network_12.sum(axis=1)
## array([1, 0, 1, 0])

# share of links per entity (row means)
network_12.mean(1)
## array([0.25, 0.  , 0.25, 0.  ])

5.1 Map

In Python, we can use map to do the same thing as the apply family in R, with a bit of manipulation.

map applies a function to each element of an iterable and returns an iterator. When the iterable is a 2D array, its elements are the rows, so the function is applied row by row. You can then transform this iterator into a list or an array.

# Python

arr_1 = np.arange(1,17).reshape([4,4], order = 'F')
arr_1
## array([[ 1,  5,  9, 13],
##        [ 2,  6, 10, 14],
##        [ 3,  7, 11, 15],
##        [ 4,  8, 12, 16]])

# by row; to work by column, you need to transpose the array
map(np.sum,arr_1)
## <map object at 0x7f15f99e0470>
list(map(np.sum,arr_1))
## [28, 32, 36, 40]

# with a list
lst_1 = [arr_1,np.linspace(1,8,num=arr_1.size)]

list(map(lambda x: x**2,lst_1))
## [array([[  1,  25,  81, 169],
##        [  4,  36, 100, 196],
##        [  9,  49, 121, 225],
##        [ 16,  64, 144, 256]]), array([ 1.        ,  2.15111111,  3.73777778,  5.76      ,  8.21777778,
##        11.11111111, 14.44      , 18.20444444, 22.40444444, 27.04      ,
##        32.11111111, 37.61777778, 43.56      , 49.93777778, 56.75111111,
##        64.        ])]
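For comparison, the sketch below shows how to get column-wise results with map by transposing the array, alongside NumPy's own axis-based alternatives. The use of np.apply_along_axis and the range function (max minus min) are illustrative choices, not part of the original example.

```python
import numpy as np

arr_1 = np.arange(1, 17).reshape([4, 4], order='F')

# map iterates over the rows of a 2D array
row_sums_map = list(map(np.sum, arr_1))

# transposing first makes map iterate over the columns
col_sums_map = list(map(np.sum, arr_1.T))

# vectorized equivalents using the axis argument
row_sums = arr_1.sum(axis=1)
col_sums = arr_1.sum(axis=0)

# np.apply_along_axis applies a function of a 1D array along the chosen axis
row_ranges = np.apply_along_axis(lambda r: r.max() - r.min(), 1, arr_1)

print(row_sums, col_sums, row_ranges)
```

For simple reductions such as sums, the axis argument is usually the more direct choice; map and np.apply_along_axis are mainly useful for functions that have no built-in axis-aware version.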
  • Pandas’ apply function:

When using pandas, you can also use the apply method. The difference between “apply” and “map” is that, by default, “apply” works on columns (axis=0), because variables in a dataframe are stored by columns.

# Python

import pandas as pd

arr_1 = np.arange(1,17).reshape([4,4], order = 'F')

#transform the array into a DataFrame
df_1 = pd.DataFrame(arr_1)
df_1
##    0  1   2   3
## 0  1  5   9  13
## 1  2  6  10  14
## 2  3  7  11  15
## 3  4  8  12  16

# Apply the function by column
df_1.apply(sum)

## 0    10
## 1    26
## 2    42
## 3    58
## dtype: int64
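To apply the function by row instead, apply accepts an axis argument. The following sketch rebuilds the same DataFrame and contrasts the default column-wise behaviour with axis=1; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame(np.arange(1, 17).reshape([4, 4], order='F'))

# default (axis=0): the function receives each column in turn
col_totals = df_1.apply(sum)

# axis=1: the function receives each row in turn
row_totals = df_1.apply(sum, axis=1)

print(col_totals.tolist())
print(row_totals.tolist())
```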

Exercise 2: Given a NumPy array of integers, create a new array where each element is the square of the original element if it’s even, and the original element if it’s odd. Solve this using a loop, NumPy’s vectorized operations, and map.

Solution:
arr = np.array([1, 2, 3, 4, 5, 6])

result_loop = []
for num in arr:
    if num % 2 == 0:
        result_loop.append(num ** 2)
    else:
        result_loop.append(num)
print(result_loop)
## [1, 4, 3, 16, 5, 36]
result_numpy = np.where(arr % 2 == 0, arr ** 2, arr)
print(result_numpy)
## [ 1  4  3 16  5 36]
result_map = list(map(lambda x: x ** 2 if x % 2 == 0 else x, arr))
print(result_map)
## [1, 4, 3, 16, 5, 36]