This chapter delves deeper into the concepts of arrays and vectors that were introduced in the ‘Basics’ chapter. In Python, the primary package for working with arrays is NumPy. The main advantage of arrays is that they support vectorized operations: an operation is applied to a whole block of data at once, rather than looping over each element in Python. This makes processing blocks of data more efficient.
Let’s first take a look at an example to understand the benefits of using vectorized operations:
# Python
import numpy as np
import time
arr = np.arange(200000)
lst = list(range(200000))
t = time.time()
arr_2 = arr*2
time.time() - t
## 0.006211757659912109
t = time.time()
lst_2 = [i*2 for i in lst]
time.time() - t
## 0.013716936111450195
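A single `time.time()` measurement can be noisy. As a rough sketch, the same comparison can be repeated with the standard-library `timeit` module, keeping the best of several runs. The variable names `t_numpy` and `t_list` are only illustrative, and the timings will differ from machine to machine.

```python
import timeit
import numpy as np

arr = np.arange(200_000)
lst = list(range(200_000))

# Run each statement 10 times per measurement, take 5 measurements,
# and keep the fastest one, which is less sensitive to background noise.
t_numpy = min(timeit.repeat(lambda: arr * 2, number=10, repeat=5)) / 10
t_list = min(timeit.repeat(lambda: [i * 2 for i in lst], number=10, repeat=5)) / 10

print(f"numpy: {t_numpy:.6f} s per run")
print(f"list : {t_list:.6f} s per run")
```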
The NumPy library contains multidimensional array and matrix data structures and provides methods to efficiently operate on them. If you do not have NumPy installed on your computer, you can install it by running `pip install numpy` in a terminal.
To use NumPy, you need to load the module by running `import numpy as np`. This allows you to refer to NumPy as `np` throughout your code, which is the standard convention.
NumPy allows for fast and efficient calculations. The main differences between NumPy arrays and Python lists are that all elements of a NumPy array must have the same type (the array is homogeneous) and that an array uses less memory to store the same data.
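The two differences can be checked directly. The following is a minimal sketch; the list total is only an approximation based on `sys.getsizeof`, and the exact byte counts for the list depend on the Python version.

```python
import sys
import numpy as np

# Mixing integers and a float: NumPy converts everything to one common type.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)   # float64

# Data of an array of 1,000 int64 values: 8 bytes per element.
arr = np.arange(1000, dtype=np.int64)
print(arr.nbytes)    # 8000

# A list stores references to separate Python int objects, so its footprint
# is the list itself plus every object it points to (approximate).
lst = list(range(1000))
list_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(i) for i in lst)
print(list_bytes > arr.nbytes)  # True
```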
Function or attribute | Task |
---|---|
array | Create a NumPy array |
ndim | Number of dimensions of the array |
shape | Size of the array along each dimension (e.g., number of rows and columns) |
size | Total number of elements in the array |
dtype | Type of the elements in the array, e.g., int64 or float64 |
reshape | Returns the array with a new shape; the original array keeps its shape |
resize | As an array method, changes the shape of the original array in place |
arange | Create a sequence of numbers in an array |
itemsize | Size in bytes of each element |
diag | Create a diagonal matrix |
vstack | Stack arrays vertically |
hstack | Stack arrays horizontally |
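A few rows of the table are easier to tell apart with a small example. This sketch contrasts `reshape` with the `resize` method and shows `itemsize` and `diag`; the array names are illustrative, and `refcheck=False` is passed only so that the in-place resize also works when run inside an interactive session.

```python
import numpy as np

a = np.arange(1, 7)

# reshape returns an array with the new shape; `a` itself is unchanged.
b = a.reshape(2, 3)
print(a.shape, b.shape)   # (6,) (2, 3)

# The resize method changes the array in place and returns None.
c = np.arange(1, 7)
c.resize((3, 2), refcheck=False)
print(c.shape)            # (3, 2)

# itemsize is the size in bytes of one element.
d = np.zeros(4, dtype=np.float64)
print(d.itemsize)         # 8

# diag builds a diagonal matrix from a 1-D sequence.
diag_matrix = np.diag([1, 2, 3])
print(diag_matrix)
```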
# Python
import numpy as np
arr = np.array([1, 2, 3, 4])
arr
## array([1, 2, 3, 4])
# Python
np.zeros([2, 3])
## array([[0., 0., 0.],
## [0., 0., 0.]])
np.ones((2, 3))
## array([[1., 1., 1.],
## [1., 1., 1.]])
np.arange(1, 7)
## array([1, 2, 3, 4, 5, 6])
np.arange(1, 7).reshape(2, 3)
## array([[1, 2, 3],
## [4, 5, 6]])
# Remember from the first chapter that arrays in R and Python are not stored the same way; set order = 'F' to fill by column, as R does
np.arange(1, 7).reshape([2, 3], order = 'F')
## array([[1, 3, 5],
## [2, 4, 6]])
np.linspace(1, 4,num = 10)
## array([1. , 1.33333333, 1.66666667, 2. , 2.33333333,
## 2.66666667, 3. , 3.33333333, 3.66666667, 4. ])
# Python
# generate a 4x5 array of draws from a standard normal distribution
arr_rd = np.random.randn(4,5)
arr_rd
## array([[ 0.70182419, 0.8359941 , 0.99714696, -0.50124705, -1.68598557],
## [-1.52689203, -0.51066088, -0.64745007, -0.32815722, -0.08960339],
## [-0.42389929, 1.29003683, 0.85168218, -0.52424361, -2.57908576],
## [-0.96121576, 0.44830129, -1.04795886, 0.23979259, -0.65850916]])
# Python
arr = np.arange(1,7).reshape(2,3)
arr[0]
## array([1, 2, 3])
arr[:2]
## array([[1, 2, 3],
## [4, 5, 6]])
arr[1:]
## array([[4, 5, 6]])
# Python
# .ndim gives the number of axes, or dimensions, of the array.
arr_rd.ndim
## 2
# Python
# .size gives the total number of elements of the array.
arr_rd.size
## 20
# Python
# .shape displays a tuple of integers with the number of elements stored along each dimension of the array
arr_rd.shape
## (4, 5)
# Python
arr
## array([[1, 2, 3],
## [4, 5, 6]])
# /!\ without an axis argument, np.append converts the array to a 1-D array
arr_1d = np.append(arr, [7, 8, 9])
arr_1d
## array([1, 2, 3, 4, 5, 6, 7, 8, 9])
# with np.insert you can add rows and columns
np.insert(arr, len(arr), [7, 8, 9], axis = 0)
## array([[1, 2, 3],
## [4, 5, 6],
## [7, 8, 9]])
np.insert(arr, 2, [7, 8], axis = 1)
## array([[1, 2, 7, 3],
## [4, 5, 8, 6]])
# Python
arr0 = np.zeros([2,3])
arr1 = np.ones([2,3])
arr01 = np.vstack([arr0,arr1])
arr01
## array([[0., 0., 0.],
## [0., 0., 0.],
## [1., 1., 1.],
## [1., 1., 1.]])
# or
np.concatenate([arr0,arr1], axis = 0)
## array([[0., 0., 0.],
## [0., 0., 0.],
## [1., 1., 1.],
## [1., 1., 1.]])
np.hstack([arr0,arr1])
## array([[0., 0., 0., 1., 1., 1.],
## [0., 0., 0., 1., 1., 1.]])
# or
np.concatenate([arr0,arr1], axis = 1)
## array([[0., 0., 0., 1., 1., 1.],
## [0., 0., 0., 1., 1., 1.]])
np.hsplit(arr01,3)
## [array([[0.],
## [0.],
## [1.],
## [1.]]), array([[0.],
## [0.],
## [1.],
## [1.]]), array([[0.],
## [0.],
## [1.],
## [1.]])]
np.vsplit(arr01,4)
## [array([[0., 0., 0.]]), array([[0., 0., 0.]]), array([[1., 1., 1.]]), array([[1., 1., 1.]])]
# Python
np.delete(arr,1 , axis = 1)
## array([[1, 3],
## [4, 6]])
np.delete(arr,1 , axis = 0)
## array([[1, 2, 3]])
# Python
arr = np.random.randn(10)
arr
## array([ 0.17616954, -0.02056164, 0.89068703, -0.66124081, 1.22587833,
## 0.28249346, -1.18120814, 0.74530569, -0.10474117, -0.64490294])
# sort the array in place (the method returns None)
arr.sort()
# Python
arr = np.random.randn(4,3)
arr
## array([[ 0.01665733, 0.4053966 , -0.96790957],
## [ 0.0319518 , 0.2796655 , 0.84289947],
## [ 0.55898062, 0.75405859, -0.39571195],
## [ 0.53092467, 0.04941761, 0.4985309 ]])
# sort each row in place (along axis 1)
arr.sort(1)
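The `sort` method used above modifies the array in place. When the original order must be kept, `np.sort` returns a sorted copy instead. A minimal sketch with a small, fixed array (the values are illustrative):

```python
import numpy as np

arr = np.array([[3, 1, 2],
                [9, 7, 8]])

# np.sort returns a sorted copy and leaves the input untouched.
sorted_copy = np.sort(arr, axis=1)
print(arr[0])          # [3 1 2]
print(sorted_copy[0])  # [1 2 3]

# arr.sort sorts in place (here along each row) and returns None.
result = arr.sort(axis=1)
print(result)          # None
print(arr)
```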
Imagine you have two square matrices representing interactions between entities. Each matrix represents a basic network, with a value of 1 indicating interaction and 0 indicating no interaction. You want to determine if two entities that are linked in one network are also linked in another network. Instead of checking each entity’s interaction sequentially in a loop, you can use matrix operations to perform element-wise calculations and get the same result in a more efficient manner.
# Python
# build the first network from a nested list
list_1 = [[0,1,0,1],[1,0,1,0],[0,1,0,1],[1,0,1,0]]
network_1 = np.array(list_1)
# build the second network the same way
list_2 = [[0,1,0,0],[1,0,0,0],[0,0,0,1],[0,0,1,0]]
network_2 = np.array(list_2)
network_12 = network_1*network_2
network_12
## array([[0, 1, 0, 0],
## [1, 0, 0, 0],
## [0, 0, 0, 1],
## [0, 0, 1, 0]])
# Python
network_12 > 0
## array([[False, True, False, False],
## [ True, False, False, False],
## [False, False, False, True],
## [False, False, True, False]])
network_12[network_12>0]
## array([1, 1, 1, 1])
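For comparison, here is a sketch of the loop that the element-wise product replaces. It checks every pair of entities one by one and yields the same matrix; `common_loop` and `common_vec` are illustrative names.

```python
import numpy as np

network_1 = np.array([[0,1,0,1],[1,0,1,0],[0,1,0,1],[1,0,1,0]])
network_2 = np.array([[0,1,0,0],[1,0,0,0],[0,0,0,1],[0,0,1,0]])

# Loop version: visit each cell and keep a link only if it exists in both networks.
n = network_1.shape[0]
common_loop = np.zeros((n, n), dtype=int)
for i in range(n):
    for j in range(n):
        if network_1[i, j] == 1 and network_2[i, j] == 1:
            common_loop[i, j] = 1

# Vectorized version: one element-wise multiplication.
common_vec = network_1 * network_2

print(np.array_equal(common_loop, common_vec))  # True
```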
A common task is to modify the values in an array that meet a certain condition. The first step is to identify which cells meet the condition. In Python, the function `np.where` is commonly used for this.
Note that the speed advantage over a loop depends on the size of the vector: the gap widens as the array grows, as the two timings below show.
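Before looking at timings, it may help to see the two ways `np.where` can be called. The sketch below uses a small illustrative array: with one argument it returns the positions where the condition holds, and with three arguments it builds a new array by choosing between two values.

```python
import numpy as np

arr = np.array([3, 8, 1, 10, 6])

# One argument: a tuple of index arrays where the condition is True.
idx = np.where(arr > 5)
print(idx[0])       # [1 3 4]

# Three arguments: take 0 where the condition is True, else the original value.
replaced = np.where(arr > 5, 0, arr)
print(replaced)     # [3 0 1 0 0]

# Equivalent result with a boolean mask, modifying a copy in place.
arr_copy = arr.copy()
arr_copy[arr_copy > 5] = 0
print(np.array_equal(arr_copy, replaced))  # True
```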
# Python
arr = np.arange(200000)
t = time.time()
results = [(i if i%2==0 else 0) for i in arr]
time.time() - t
## 0.0794527530670166
t = time.time()
results = np.where(arr%2!=0,0,arr)
time.time() - t
## 0.018316268920898438
# Python
arr = np.arange(20000000)
t = time.time()
results = [(i if i%2==0 else 0) for i in arr]
time.time() - t
## 7.452648401260376
t = time.time()
results = np.where(arr%2!=0,0,arr)
time.time() - t
## 0.5659301280975342
Exercise 1: Create two 4x4 NumPy arrays with random integers between 1 and 10. Perform element-wise multiplication of these arrays. Then, replace all elements greater than 5 in the resulting array with 0.
arr1 = np.random.randint(1, 11, (4, 4))
arr2 = np.random.randint(1, 11, (4, 4))
arr_12 = arr1 * arr2
print(np.where(arr_12 > 5, 0, arr_12))
## [[3 0 0 0]
## [0 1 0 0]
## [0 0 0 0]
## [0 0 0 4]]
# Python
from numpy.linalg import inv, qr
X = np.array([[0,1,5,1],[2,1,3,1],[2,1,9,6],[7,2,1,0],[8,3,5,5]])
Y = np.array([0,1,2,5,4])
X
## array([[0, 1, 5, 1],
## [2, 1, 3, 1],
## [2, 1, 9, 6],
## [7, 2, 1, 0],
## [8, 3, 5, 5]])
X.T.dot(X)
## array([[121, 42, 71, 54],
## [ 42, 16, 34, 23],
## [ 71, 34, 141, 87],
## [ 54, 23, 87, 63]])
# same as
XtX = np.dot(X.T,X)
# same as
X.T @ X
## array([[121, 42, 71, 54],
## [ 42, 16, 34, 23],
## [ 71, 34, 141, 87],
## [ 54, 23, 87, 63]])
# inverse matrix
inv(XtX)
## array([[ 0.23199214, -0.71424803, 0.11636773, -0.09879148],
## [-0.71424803, 2.33481201, -0.37284193, 0.27469787],
## [ 0.11636773, -0.37284193, 0.10787934, -0.11260311],
## [-0.09879148, 0.27469787, -0.11260311, 0.15576444]])
# Python
XtY = np.dot(X.T,Y)
# OLS estimator: Beta = (X'X)^-1 X'Y
Beta = inv(XtX).dot(XtY)
Beta
## array([ 1.17202187, -1.85550547, 0.42034337, -0.38384807])
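Computing the inverse explicitly mirrors the textbook formula, but it is not the only option. As a sketch of two alternatives, `np.linalg.solve` solves the normal equations without forming the inverse, and `np.linalg.lstsq` works directly on `X` and `Y`. Both are generally preferred for numerical stability; on this small example all three should agree up to rounding.

```python
import numpy as np

X = np.array([[0,1,5,1],[2,1,3,1],[2,1,9,6],[7,2,1,0],[8,3,5,5]])
Y = np.array([0,1,2,5,4])

# Textbook formula, as above: (X'X)^-1 X'Y
beta_inv = np.linalg.inv(X.T @ X) @ (X.T @ Y)

# Solve the normal equations (X'X) beta = X'Y without inverting X'X.
beta_solve = np.linalg.solve(X.T @ X, X.T @ Y)

# Least-squares solver applied directly to X and Y.
beta_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X, Y, rcond=None)

print(np.allclose(beta_inv, beta_solve))
print(np.allclose(beta_inv, beta_lstsq))
```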
Operator | Equivalent | Description |
---|---|---|
+ | np.add | Addition (e.g., 1 + 1 = 2) |
- | np.subtract | Subtraction (e.g., 3 - 2 = 1) |
- | np.negative | Unary negation (e.g., -2) |
* | np.multiply | Multiplication (e.g., 2 * 3 = 6) |
/ | np.divide | Division (e.g., 3 / 2 = 1.5) |
// | np.floor_divide | Floor division (e.g., 3 // 2 = 1) |
** | np.power | Exponentiation (e.g., 2 ** 3 = 8) |
% | np.mod | Modulus/remainder (e.g., 9 % 4 = 1) |
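On arrays, each operator in the table is applied element by element and gives the same result as the corresponding function. A short sketch with two illustrative arrays:

```python
import numpy as np

a = np.array([9, 3, 2])
b = np.array([4, 2, 3])

# Operator and function forms give identical results.
print(np.array_equal(a + b, np.add(a, b)))       # True
print(np.array_equal(a * b, np.multiply(a, b)))  # True

floor_div = a // b   # element-wise floor division: 2, 1, 0
remainder = a % b    # element-wise remainder: 1, 1, 2
power = a ** b       # element-wise exponentiation: 6561, 9, 8
negated = -a         # unary negation: -9, -3, -2

print(floor_div, remainder, power, negated)
```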
# Python
# keep only the upper triangle (k=1 excludes the diagonal) so each link is counted once
network_12 = np.triu(network_12,1)
# number of links
network_12.sum()
## 2
# same, since there are only 0s and 1s
(network_12>0).sum()
## 2
# share of links
network_12.mean()
## 0.125
# number of links per entity (row sums)
network_12.sum(axis=1)
## array([1, 0, 1, 0])
# share of links per entity (row means)
network_12.mean(1)
## array([0.25, 0. , 0.25, 0. ])
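The `axis` argument is easy to mix up. As a small sketch on an illustrative 2x3 array: `axis=0` collapses the rows and returns one value per column, while `axis=1` collapses the columns and returns one value per row.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

col_sums = m.sum(axis=0)    # one value per column: 5, 7, 9
row_sums = m.sum(axis=1)    # one value per row: 6, 15
row_means = m.mean(axis=1)  # one value per row: 2.0, 5.0
total = m.sum()             # no axis: sum over all elements, 21

print(col_sums, row_sums, row_means, total)
```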
In Python, we can use `map` to do the same thing as the apply family in R, with a bit of manipulation.
`map` applies a function to each element of an iterable and returns an iterator. Iterating over a 2-D array yields its rows, so `map` works row by row on an array. You can then convert this iterator into a list or an array.
# Python
arr_1 = np.arange(1,17).reshape([4,4], order = 'F')
arr_1
## array([[ 1, 5, 9, 13],
## [ 2, 6, 10, 14],
## [ 3, 7, 11, 15],
## [ 4, 8, 12, 16]])
# map works by row; to work by column, transpose the array first
map(np.sum,arr_1)
## <map object at 0x7f15f99e0470>
list(map(np.sum,arr_1))
## [28, 32, 36, 40]
# with a list
lst_1 = [arr_1,np.linspace(1,8,num=arr_1.size)]
list(map(lambda x: x**2,lst_1))
## [array([[ 1, 25, 81, 169],
## [ 4, 36, 100, 196],
## [ 9, 49, 121, 225],
## [ 16, 64, 144, 256]]), array([ 1. , 2.15111111, 3.73777778, 5.76 , 8.21777778,
## 11.11111111, 14.44 , 18.20444444, 22.40444444, 27.04 ,
## 32.11111111, 37.61777778, 43.56 , 49.93777778, 56.75111111,
## 64. ])]
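For simple reductions such as sums, the `axis` argument gives the same result as `map` without a Python-level loop. The sketch below compares the two on the same array, including the column case obtained by transposing; the variable names are illustrative.

```python
import numpy as np

arr_1 = np.arange(1, 17).reshape([4, 4], order='F')

# Row sums: map over the rows versus the axis argument.
row_sums_map = list(map(np.sum, arr_1))
row_sums_axis = arr_1.sum(axis=1)
print(np.array_equal(row_sums_map, row_sums_axis))  # True

# Column sums: map over the transposed array versus axis=0.
col_sums_map = list(map(np.sum, arr_1.T))
col_sums_axis = arr_1.sum(axis=0)
print(np.array_equal(col_sums_map, col_sums_axis))  # True
```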
When using pandas, you can also use apply functions. The difference between “apply” and “map” is that, by default, “apply” works on columns, because variables in a dataframe are stored by columns.
# Python
import pandas as pd
arr_1 = np.arange(1,17).reshape([4,4], order = 'F')
#transform the array into a DataFrame
df_1 = pd.DataFrame(arr_1)
df_1
## 0 1 2 3
## 0 1 5 9 13
## 1 2 6 10 14
## 2 3 7 11 15
## 3 4 8 12 16
# Apply the function by column
df_1.apply(sum)
## 0 10
## 1 26
## 2 42
## 3 58
## dtype: int64
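To apply a function by row instead, `apply` accepts an `axis` argument. A minimal sketch on the same data, assuming pandas is installed:

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame(np.arange(1, 17).reshape([4, 4], order='F'))

# Default (axis=0): the function receives each column.
col_sums = df_1.apply(np.sum)

# axis=1: the function receives each row.
row_sums = df_1.apply(np.sum, axis=1)

print(col_sums.tolist())  # [10, 26, 42, 58]
print(row_sums.tolist())  # [28, 32, 36, 40]
```

For plain sums, `df_1.sum(axis=1)` gives the same row totals directly.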
Exercise 1: Given a NumPy array of integers, create a new array where each element is the square of the original element if it’s even, and the original element if it’s odd. Solve this using a loop, NumPy’s vectorized operations, and `map`.
arr = np.array([1, 2, 3, 4, 5, 6])
result_loop = []
for num in arr:
    if num % 2 == 0:
        result_loop.append(num ** 2)
    else:
        result_loop.append(num)
print(result_loop)
## [1, 4, 3, 16, 5, 36]
result_numpy = np.where(arr % 2 == 0, arr ** 2, arr)
print(result_numpy)
## [ 1 4 3 16 5 36]
result_map = list(map(lambda x: x ** 2 if x % 2 == 0 else x, arr))
print(result_map)
## [1, 4, 3, 16, 5, 36]