This chapter introduces the basics of R and Python. The goal is to discover the types of objects available in each language and to understand the logic of control flow and function writing. The chapter ends with three exercises from Project Euler. Working in groups of two, solve at least one exercise in the language of your choice and compare different approaches.

 

1 R/Python

Both languages are well suited to data science and business analysis. From a purely machine learning perspective they serve similar purposes: both have packages or libraries dedicated to wrangling, preprocessing, and applying machine learning to data, and both are good choices for reproducible research. Things become really interesting when we look at their differences and at ways of combining the two languages.

1.1 R Strengths

The S language, a precursor to R, was developed by John Chambers (a statistician) at Bell Labs in 1976 as a programming language designed to implement statistics. The R statistical programming language was developed at the University of Auckland, New Zealand, to extend the capabilities of S. The developers of S and R were not software engineers or computer scientists; they were researchers who built tools in order to carry out research and communicate results more efficiently.

R is a language with roots in statistics, data analysis, data exploration, and data visualization. It has utilities for reporting and communication, including RMarkdown and Shiny.

R is expanding rapidly with the emergence of the tidyverse (tidyverse.org), a set of tools with a common programming interface that uses functional verbs (functions like mutate() and summarize()) to perform intuitive operations connected by the pipe (%>%). The tidyverse makes data mining efficient: iterating through an exploratory analysis is close to writing a paragraph that describes what we intend to do to the data.

1.2 Python Strengths

Python is a general-purpose programming language created by Guido van Rossum (a computer scientist) in 1991. The language was designed to be readable and to support many programming paradigms. One of Python’s greatest strengths is its flexibility: it handles web frameworks, database connectivity, networking, web scraping, and text and image processing. Many of these features are useful for machine learning tasks.

In essence, the foundations of Python are in computer science and mathematics. With well over 100,000 open source libraries, Python has one of the largest ecosystems of any programming language, making it a strong choice for those who want high flexibility.

Python has great data science libraries, including scikit-learn, the most popular machine learning library: it is easy to pick up, supports pipelines that simplify the machine learning workflow, and gathers most of the commonly needed algorithms in one place. Another very common library is TensorFlow, developed at Google for deep learning and commonly used for image recognition and natural language processing (NLTK is also very useful for NLP). Facebook has its own deep learning framework, PyTorch, a competitor of TensorFlow and of Keras (a high-level library designed for building neural networks efficiently).

  • R is selected for exploration because of the readability and efficiency of the tidyverse and data.table.

  • Python is preferred for machine learning because of the pipeline capabilities of scikit-learn, TensorFlow, Keras and PyTorch.

  • R is favored for communication because of its advanced reporting utilities, including RMarkdown, Shiny (interactive web apps), and ggplot2 for visualization.

2 Operations

2.1 Arithmetic Operations

Python and R both work like a classical calculator: using “+”, “-”, “*” and “/” we can do arithmetic operations in both languages.

# Python 

1+2
## 3
1-2
## -1
1/2
## 0.5
1*2
## 2
# R

1+2
## [1] 3
1-2
## [1] -1
1/2
## [1] 0.5
1*2
## [1] 2

 

We can also apply exponentiation, modulo and floor division easily in both languages. Note that R accepts “^” as well as “**” for exponentiation, whereas in Python “^” is the bitwise exclusive-or (XOR) operator.

# Python 

2**8 # exponentiation
## 256
2^8 == 2**8 # False
## False
8%3 # modulo
## 2
8//3 # floor division
## 2
# R

2**8 # exponentiation
## [1] 256
2^8 == 2**8 # TRUE
## [1] TRUE
8%%3 # modulo
## [1] 2
8%/%3 # floor division
## [1] 2

Operator (R)   Operator (Python)   Description
+              +                   Addition
-              -                   Subtraction
*              *                   Multiplication
/              /                   Division
^ or **        **                  Exponent
%%             %                   Modulo
%/%            //                  Floor division
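
As a quick check on the operators listed above, the sketch below contrasts Python's “^” (bitwise XOR) with “**” and shows how floor division and modulo fit together. The variable names are illustrative only.

```python
# In Python, ^ is bitwise XOR, not exponentiation.
xor_result = 2 ^ 8      # 0b0010 XOR 0b1000 = 0b1010
power_result = 2 ** 8   # 2 raised to the power 8

# Floor division and modulo satisfy: a == (a // b) * b + (a % b)
a, b = 8, 3
quotient, remainder = divmod(a, b)    # same as (a // b, a % b)

print(xor_result, power_result)       # 10 256
print(quotient, remainder)            # 2 2
print(a == quotient * b + remainder)  # True

# With a negative operand, // rounds toward minus infinity,
# so the remainder takes the sign of the divisor.
print(-8 // 3, -8 % 3)                # -3 1
```

R's %/% and %% follow the same rounding convention for negative operands.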

 

2.2 Comparison Operators

In order to compare some values, we can use comparison operators to investigate if a given value is equal to, not equal to, greater than, etc.

These operators are the same in both languages.

# Python 

2==8
## False
2!=8
## True
2<8
## True
2>8
## False
2<=8
## True
# R

2==8
## [1] FALSE
2!=8
## [1] TRUE
2<8
## [1] TRUE
2>8
## [1] FALSE
2<=8
## [1] TRUE

Operator (R/Python)   Description
<                     Less than
>                     Greater than
<=                    Less than or equal to
>=                    Greater than or equal to
==                    Equal to
!=                    Not equal to

 

2.3 Logical operators

# Python 
x = [True,True]
y = [True,False]

not x[0]
## False
x and y
## [True, False]
x or y
## [True, True]
# R

x = c(TRUE,TRUE)
y = c(TRUE,FALSE)

!x[1]
## [1] FALSE
x & y
## [1]  TRUE FALSE
x | y
## [1] TRUE TRUE

Operator (R)   Operator (Python)   Description
!              not                 Logical NOT
&              and                 Logical AND
|              or                  Logical OR
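
One caveat about the Python operators: “and” and “or” are not element-wise. Applied to two lists they return one of the operands whole, which happens to coincide with R's element-wise result for the values used above. A minimal sketch of an element-wise version using only built-ins follows (NumPy's logical functions are another option, not shown here).

```python
x = [True, True]
y = [True, False]

# 'and' returns the second operand when the first is truthy (a non-empty list),
# 'or' returns the first operand when it is truthy.
print(x and y)   # [True, False]  (this is simply y)
print(x or y)    # [True, True]   (this is simply x)
print(y and x)   # [True, True]   (not the element-wise AND)

# Element-wise versions, comparable to R's & and |
elementwise_and = [a and b for a, b in zip(x, y)]
elementwise_or = [a or b for a, b in zip(x, y)]
print(elementwise_and)  # [True, False]
print(elementwise_or)   # [True, True]
```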

 

Exercise 1: Write a Python expression that checks if a number is both greater than 10 and less than 20.

Solution:
number = 15
result = 10 < number < 20
print(result)
## True

 

Exercise 2: Write a Python expression that checks if a string is either “yes” or “no”.

Solution:
string = "yes"
result = string == "yes" or string == "no"
print(result)
## True

 

2.4 Membership operators

“in” and “not in” are the membership operators in Python. They test whether a value is found in a sequence. In R, “%in%” plays a similar role, but the two do not always behave the same way: Python treats a string as a sequence of characters, so “in” can look for a substring, whereas R treats a string as a single element of a character vector.
In the example below, Python's “in” confirms that ‘world’ is part of ‘Hello world’. In R, “%in%” checks whether the string exactly matches one element of the vector, so we use stringr::str_detect to search for a substring. For a Python dictionary, “in” tests the keys, not the values.

# Python 
x = 'Hello world'
y = {1:'a',2:'b'}

print('world' in x)
## True

print(1 in y)
## True

print('a' in y)
## False
# R

x = 'Hello World'

print('Hello' %in% x)
## [1] FALSE
stringr::str_detect(x,'Hello')
## [1] TRUE

 

# Python 
x = ['Hello','World']

print('Hello' in x)
## True
# R

x = c('Hello','World')

print('Hello' %in% x)
## [1] TRUE
stringr::str_detect('Hello',x)
## [1]  TRUE FALSE

 

Exercise 1: Write a Python expression that checks if the letter “a” is present in the string “banana”.

Solution:
string = "banana"
result = "a" in string
print(result)
## True

 

Exercise 2: Write a Python expression that checks if the number 5 is not present in the list [1, 2, 3, 4].

Solution:
list_ = [1, 2, 3, 4]
result = 5 not in list_
print(result)
## True

3 Objects & Variables

In Python and R, we do not need to declare a variable before assigning a value to it. We can think of a variable as a name that refers to an object. Python and R differ, however, in the way they handle the link between variables and objects.

 

3.1 Some differences between R and Python

While Python is much appreciated for being a versatile language with an easy-to-understand syntax, R’s functionalities are developed by statisticians, thus giving it field-specific advantages.

In R, variables behave as if each one held its own copy of an object. If we assign an existing variable to a new variable and then modify one of them, R makes a copy at that point (copy-on-modify), so the two variables end up referring to two different objects.

Things in Python are more conventional, but since most of you learned R before Python, it is important to note that a variable refers to an object and that the two remain distinct entities. This means that two variables can refer to a single object. If two variables point to the same mutable object, such as a list, a modification made in place through one variable is also visible through the other. Immutable objects such as numbers and tuples cannot be modified in place: an operation like “+=” creates a new object and rebinds only the variable it is applied to.

# Python
a = 1
b = a
a += 1
print(a)
## 2
print(b)
## 1

a = [1,2]
b = a
a.append(3)
print(a)
## [1, 2, 3]
print(b)
## [1, 2, 3]

a = (1,2)
b = a
a += (3,)
print(a)
## (1, 2, 3)
print(b)
## (1, 2)
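
To see whether two names point to the same object, Python offers the “is” operator; making an explicit copy breaks the link. A short sketch (variable names are illustrative):

```python
a = [1, 2]
b = a            # b is a second name for the same list
c = a.copy()     # c is a new list with the same content (shallow copy)

a.append(3)

print(b is a)    # True: one object, two names
print(c is a)    # False: c is a separate object
print(b)         # [1, 2, 3]
print(c)         # [1, 2]

# Tuples are immutable: += builds a new object and rebinds the name
t = (1, 2)
u = t
t += (3,)
print(u is t)    # False
print(u)         # (1, 2)
```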

As a rule of thumb, we should avoid explicit for loops in R when a vectorized alternative exists: loops are comparatively slow because every iteration goes through function calls. We should vectorize, or use the apply family of functions, instead. Vectorization is king in R if we want fast code. Assuming we vectorize both our R and Python code (and other factors being equal), we should get the same order of magnitude in speed. For data larger than memory (the limit can be specified), R starts to become a poor choice.

The two languages also differ in how they treat missing or non-numeric values, as the standard deviation example below shows.

# Python 
import numpy as np
only_nb = [1,2,3,4]
np.std(only_nb)
## np.float64(1.118033988749895)
np.std(only_nb,ddof=1)
## np.float64(1.2909944487358056)
with_none = [1,2,3,None]
np.std(with_none)
## TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
with_char = [1,2,'3',4]
np.std(with_char)
## TypeError: the resolved dtypes are not compatible with add.reduce. Resolved (dtype('<U21'), dtype('<U21'), dtype('<U42'))
# R
only_nb <- c(1,2,3,4)
sd(only_nb)
## [1] 1.290994
with_na <- c(1,2,3,NA)
sd(with_na)
## [1] NA
with_char <- c(1,2,'3',4)
sd(with_char)
## [1] 1.290994

with_char <- c(1,2,'a',4)
sd(with_char)
## Warning in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm =
## na.rm): NAs introduced by coercion
## [1] NA

 

  • When computing a standard deviation, R does not raise an error if the data contain NA: it simply returns NA. It also converts strings to numeric values when possible. If a string cannot be converted, R returns NA, and it signals this with a warning rather than an error.
    We need to write our code carefully, since R can carry on computing for a while and finally return NA without stopping, whereas Python stops with an error.

  • R calculates the standard deviation with N - 1 as the denominator, whereas numpy uses N by default. To get the same result we need to pass ‘ddof=1’ to numpy.std.
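
The two denominators can be reproduced with the standard library alone. The sketch below computes both versions by hand and compares them with statistics.pstdev (population, N) and statistics.stdev (sample, N - 1); the helper name std_dev is ours.

```python
import math
import statistics

def std_dev(values, ddof=0):
    """Standard deviation with denominator N - ddof."""
    n = len(values)
    mean = sum(values) / n
    squared_dev = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_dev / (n - ddof))

only_nb = [1, 2, 3, 4]

print(std_dev(only_nb))           # denominator N, like the numpy.std default
print(std_dev(only_nb, ddof=1))   # denominator N - 1, like R's sd()

# The statistics module provides both variants
print(math.isclose(std_dev(only_nb), statistics.pstdev(only_nb)))         # True
print(math.isclose(std_dev(only_nb, ddof=1), statistics.stdev(only_nb)))  # True
```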

 

3.2 Textual and numerical variables

In R textual data are called ‘character’; in Python they are called strings, abbreviated as ‘str’. Either double quotes (") or single quotes (') can be used to define a textual variable in both languages, and we can also convert an existing variable to text if needed.

Numerical variables are divided into three types. In Python we have integer, float and complex. In R, a float is called ‘double’, referring to double-precision floating point. Without entering into details, it means that a number is represented with about 15 significant decimal digits. In many languages ‘float’ refers to single precision (about 7 significant digits); in Python, however, the data type ‘float’ is double precision, with about 15 significant digits. There are ways to get more precision, using Numpy for example.

Finally, we can easily check the data type of a variable with the command ‘type’ in Python, or ‘typeof’ in R.

# Python 

import sys 
sys.float_info.dig # number of significant decimal digits
## 15

a = 1
type(a)
## <class 'int'>

b = 1.1
type(b)
## <class 'float'>

c = 1.1+2j
type(c)
## <class 'complex'>

d = 'd'
type(d)
## <class 'str'>

# change the type
e = 2
f = str(e)
f
## '2'
float(f)
## 2.0
# R

a = 1
typeof(a)
## [1] "double"
a_int = 1L
typeof(a_int)
## [1] "integer"

b = 1.1
typeof(b)
## [1] "double"

c = 1.1+2i
typeof(c)
## [1] "complex"

d = 'd'
typeof(d)
## [1] "character"

# change the type
e = 2
f = as.character(e)
f
## [1] "2"
as.numeric(f)
## [1] 2
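
A practical consequence of the roughly 15 significant digits of a double: decimal fractions such as 0.1 are stored approximately, so exact equality tests on floats can fail. A minimal sketch using the standard library:

```python
import math
import sys

total = 0.1 + 0.2

print(total == 0.3)              # False: both sides carry rounding error
print(math.isclose(total, 0.3))  # True: comparison with a tolerance
print(sys.float_info.dig)        # 15
print(sys.float_info.epsilon)    # gap between 1.0 and the next representable float

# Python integers, by contrast, have arbitrary precision
print(2 ** 100)                  # 1267650600228229401496703205376
```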

 

Exercise 1: Create a variable containing your full name, then extract and print only your first name using string slicing.

Solution:
full_name = "John Smith"
first_name = full_name[:full_name.index(" ")]  # slice up to the first space; full_name.split()[0] gives the same result
print(first_name)
## John

 

Exercise 2: Write a Python expression that checks if ‘ab’ is present in the list [‘aa’,‘bb’,‘ab’,‘ba’].

Solution:
list_ = ['aa','bb','ab','ba']
result = 'ab' in list_
print(result)
## True

 

3.3 Other Data-Type (R)

There are many data types in both languages that we will discover throughout this course. For now, we focus on the basic data types that are directly available without any additional package or library. In R, the common data types are vector, matrix, list and data frame.

 

3.3.1 Vectors and Matrices

# R

vec = c(1, 2, 3)
length(which(vec==2)) # count elements of the vector which are exactly equal to 2
## [1] 1

vec = sort(vec,
           decreasing = TRUE)
vec
## [1] 3 2 1
# access the elements of a vector
vec[1]
## [1] 3
vec[which(vec==3)]
## [1] 3
vec[2:length(vec)]
## [1] 2 1
vec[length(vec):2]
## [1] 1 2
vec[-1] # a negative index drops that element (here the first one)
## [1] 2 1
# R

mat = matrix(c(1, 2, 3, 4, 5, 6),3,2)
dim(mat)
## [1] 3 2
mat
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

length(which(mat==2)) # count elements of the matrix which are exactly equal to 2
## [1] 1

vec = sort(mat,
           decreasing = TRUE)
vec
## [1] 6 5 4 3 2 1
# access the elements of a matrix
mat[1] # first element
## [1] 1
which(mat==3,
      arr.ind = TRUE) # get row and columns for each value that match
##      row col
## [1,]   3   1

mat[2:length(mat)] # gives us a vector
## [1] 2 3 4 5 6
mat[2:dim(mat)[1],]
##      [,1] [,2]
## [1,]    2    5
## [2,]    3    6
mat[length(mat):1]
## [1] 6 5 4 3 2 1

mat[1,-1] # first row, dropping the first column
## [1] 4
# modify

b = rep(0,3)
vec = c(vec,b)
vec
## [1] 6 5 4 3 2 1 0 0 0
mat_2 = rbind(vec,vec)
mat_2 
##     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## vec    6    5    4    3    2    1    0    0    0
## vec    6    5    4    3    2    1    0    0    0
mat_2 = cbind(mat_2,mat_2)
mat_2
##     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## vec    6    5    4    3    2    1    0    0    0     6     5     4     3     2
## vec    6    5    4    3    2    1    0    0    0     6     5     4     3     2
##     [,15] [,16] [,17] [,18]
## vec     1     0     0     0
## vec     1     0     0     0

t_mat = t(mat) # transposition
t_mat[2,3] = as.character(t_mat[2,3]) # all values turn to character


i_mat = diag(1, 3)
i_mat
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1

 

3.3.2 Data frame

A data frame in R is a table whose columns can store different types of data: characters, numerics or factors. Unlike a matrix, its set of columns is easy to extend: we can create a new column very easily using the operator ‘$’.

# R

df = data.frame(chr = letters[1:5], # letters is a vector of letter provided directly in R
                num = seq(1,5,1),
                fac = factor(c(rep('a',3),rep('b',2)))) 
class(df$chr) # before R 4.0, strings were converted to factors by default, which could be very inefficient; since R 4.0 they remain character
## [1] "character"
df$chr = as.character(df$chr)
class(df$chr)
## [1] "character"
colnames(df)
## [1] "chr" "num" "fac"
df
##   chr num fac
## 1   a   1   a
## 2   b   2   a
## 3   c   3   a
## 4   d   4   b
## 5   e   5   b
# access the element of a variable in a df
df$chr[3]
## [1] "c"
which(df==3,
      arr.ind = TRUE) 
##      row col
## [1,]   3   2

df[1:2] # gives us a df with the first two columns !
##   chr num
## 1   a   1
## 2   b   2
## 3   c   3
## 4   d   4
## 5   e   5
df[3:dim(df)[1],]
##   chr num fac
## 3   c   3   a
## 4   d   4   b
## 5   e   5   b

df[1,-1] # first row, dropping the first column
##   num fac
## 1   1   a
# modify

b = rep(0,3)
df = rbind(df,b) # introduce a new row in the data frame, note that we get a NA in the factor column because '0' is not a level
## Warning in `[<-.factor`(`*tmp*`, ri, value = 0): invalid factor level, NA
## generated
df$new_col = rep(2,6) # we can introduce new columns by specifying the name and set a value


df
##   chr num  fac new_col
## 1   a   1    a       2
## 2   b   2    a       2
## 3   c   3    a       2
## 4   d   4    b       2
## 5   e   5    b       2
## 6   0   0 <NA>       2
t_df = t(df) # transposition convert the df into a matrix
class(t_df)
## [1] "matrix" "array"

 

3.3.3 Arrays

An array is a generalization of the matrix, meaning that it can have more than two dimensions. We can modify an array as we modify a matrix.

# R

arr = array(seq(1,12,1), dim = c(2,2,3)) # create an array with 3 dimensions
arr
## , , 1
## 
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## , , 2
## 
##      [,1] [,2]
## [1,]    5    7
## [2,]    6    8
## 
## , , 3
## 
##      [,1] [,2]
## [1,]    9   11
## [2,]   10   12
# access the elements of an array
arr[1,,]
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    3    7   11
arr[,1,]
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
arr[,,1]
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

 

3.3.4 Lists

A list lets us gather an assortment of unrelated objects, possibly of different types, in a single object.

# R

lst = list(df = df,
           mat = mat,
           arr = arr) # a list holding a data frame, a matrix and an array

lst
## $df
##   chr num  fac new_col
## 1   a   1    a       2
## 2   b   2    a       2
## 3   c   3    a       2
## 4   d   4    b       2
## 5   e   5    b       2
## 6   0   0 <NA>       2
## 
## $mat
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
## 
## $arr
## , , 1
## 
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## , , 2
## 
##      [,1] [,2]
## [1,]    5    7
## [2,]    6    8
## 
## , , 3
## 
##      [,1] [,2]
## [1,]    9   11
## [2,]   10   12
lst[[1]] # first element of the list, here it is the df
##   chr num  fac new_col
## 1   a   1    a       2
## 2   b   2    a       2
## 3   c   3    a       2
## 4   d   4    b       2
## 5   e   5    b       2
## 6   0   0 <NA>       2
lst[[1]][1,] # first row of first element of the list
##   chr num fac new_col
## 1   a   1   a       2

 

3.4 Other Data-Type (Python)

In Python the most common are list, tuple, set and dictionary. Note that, so far, we have always modified a variable with ‘=’. Keep in mind that calling a method with ‘.’ (for example a.append(3)) can modify the object behind the variable in place.

 

3.4.1 Lists

Keep in mind that lists are mutable, as discussed in the section “Some differences between R and Python”.

# Python

a = [1, 2, 3]
a.count(2) # count elements of the list which are exactly equal to 2
## 1

a.sort(reverse = True)
a
## [3, 2, 1]
# access the element of a list
a[0]
## 3
a.index(3)
## 0
a[1:]
## [2, 1]
a[:1]
## [3]
a[0:-1]
## [3, 2]
a[:]
## [3, 2, 1]
# modify

b = [0, 0, 0]
list(zip(a,b)) # zip pairs the elements; it also works with more than 2 sequences, e.g. zip(a,b,c)
## [(3, 0), (2, 0), (1, 0)]
a.append(b)
a
## [3, 2, 1, [0, 0, 0]]
a[4:5] = ['a','b']
a
## [3, 2, 1, [0, 0, 0], 'a', 'b']
a.extend([4, 5, 6])
a
## [3, 2, 1, [0, 0, 0], 'a', 'b', 4, 5, 6]
a += [7,8] # works as extend
a
## [3, 2, 1, [0, 0, 0], 'a', 'b', 4, 5, 6, 7, 8]
a.insert(2,[1,2])
a
## [3, 2, [1, 2], 1, [0, 0, 0], 'a', 'b', 4, 5, 6, 7, 8]
a.remove(b)
a
## [3, 2, [1, 2], 1, 'a', 'b', 4, 5, 6, 7, 8]

a = a*2 # replicate the list n times
len(a) # number of elements in the list
## 22
# mutable

a = [1,2,3]
b = a
b[0] = 12
a
## [12, 2, 3]

Exercise 1: Given a list [5, 2, 8, 1, 9], sort the list in ascending order.

Solution:
my_list = [5, 2, 8, 1, 9]
my_list.sort()
print(my_list)
## [1, 2, 5, 8, 9]

 

Exercise 2: Given a list [‘apple’, ‘banana’, ‘apple’, ‘orange’], write code to count the number of times “apple” appears.

Solution:
my_list = ['apple', 'banana', 'apple', 'orange']
count = my_list.count('apple')
print(count)
## 2

 

3.4.2 Tuples

The main difference between lists and tuples is that a tuple is an immutable data type, which generally makes it slightly lighter and faster to use.

# Python

a = (1, 2, 3)
a.count(2) # count elements of the tuple which are exactly equal to 2
## 1

a
## (1, 2, 3)
# access the element of a tuple
a[0]
## 1
a.index(3)
## 2
a[1:]
## (2, 3)
a[:1]
## (1,)
a[0:-1]
## (1, 2)
a[:]
## (1, 2, 3)

# modify
a += (4,5) 
a
## (1, 2, 3, 4, 5)
a = a*2 # replicate the tuple n times
len(a) # number of elements in the tuple
## 10
# immutable

a = (1,2,3)
b = a
b += (4,5)
a
## (1, 2, 3)
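
Immutability means that a tuple cannot be changed in place: item assignment raises a TypeError, whereas “+=” quietly builds a new tuple. A short sketch (variable names are illustrative):

```python
a = (1, 2, 3)

try:
    a[0] = 12          # in-place modification is not allowed
except TypeError as err:
    message = str(err)
    print("TypeError:", message)

# A tuple can still contain mutable objects, which remain mutable
nested = (1, [2, 3])
nested[1].append(4)
print(nested)          # (1, [2, 3, 4])

# Tuples are hashable when their items are,
# so they can serve as dictionary keys, unlike lists
coordinates = {(0, 0): "origin", (1, 0): "east"}
print(coordinates[(0, 0)])   # origin
```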

 

3.4.3 Dictionaries

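A dictionary stores data as key-value pairs: each value is accessed through its key rather than through a position. Since Python 3.7, a dictionary preserves the order in which keys were inserted, but it is not sorted.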

# Python

a = {'a':1, 'b':2, 'c':3}
# access the element of a dictionary
a.keys()
## dict_keys(['a', 'b', 'c'])
a['a']
## 1
a.values()
## dict_values([1, 2, 3])
a.items()
## dict_items([('a', 1), ('b', 2), ('c', 3)])
a.get('a')
## 1
a.get('d',4) # set to 4 if the key 'd' is not detected
## 4

a.pop('a') # pop returns the value associated with the key 'a' and removes the (key, value) pair
## 1
a
## {'b': 2, 'c': 3}
# modify
a['a'] =1 
a.setdefault('d',0) # create new item with a default value
## 0

b = {'d':4,'e':5}
a.update(b) # update values from other dict
a
## {'b': 2, 'c': 3, 'a': 1, 'd': 4, 'e': 5}
a.clear() # remove all items
# mutable

a = {'a':1, 'b':2, 'c':3}
b = a
b['b'] = [12,14]
a
## {'a': 1, 'b': [12, 14], 'c': 3}

Exercise 1: Create a dictionary that assigns values to the keys “name”, “age”, and “city”. Then change the value of ‘city’ by assigning a list of two cities to this key. Finally, add a third city to the list using the append method.

Solution:
my_dict = {"name": "Pierre", "age": 29, "city": "Strasbourg"}
my_dict['city'] = ["Strasbourg","Schiltigheim"]
print(my_dict)
## {'name': 'Pierre', 'age': 29, 'city': ['Strasbourg', 'Schiltigheim']}
my_dict['city'].append('Colmar')

 

Exercise 2: Given a dictionary {‘a’: 1, ‘b’: 2, ‘c’: 3}, write code to add a new key-value pair “d”: 4.

Solution:
my_dict = {'a': 1, 'b': 2, 'c': 3}
my_dict["d"] = 4
print(my_dict)
## {'a': 1, 'b': 2, 'c': 3, 'd': 4}

 

3.4.4 Sets

Sets are unordered collections of unique elements. If we give the same element to a set several times, duplicated values are automatically removed.

# Python

a = {1, 2, 3}
a
## {1, 2, 3}
# access the element of a set

a[0] # since a set is unordered, we cannot access a given element by position
## TypeError: 'set' object is not subscriptable
# modify
b = {3,4,5}
a.update(b)  # update values from other set
a
## {1, 2, 3, 4, 5}
# mutable

a = {1, 2, 3}
b = a
b.update([12,14])
a
## {1, 2, 3, 12, 14}

Exercise 1: Create two sets, set1 and set2, with some overlapping elements. Then, find the intersection of the two sets.

Solution:
set1 = {1, 2, 3, 4}
set2 = {3, 4, 5, 6}
intersection = set1.intersection(set2)
print(intersection)
## {3, 4}

 

Exercise 2: Create a set from the list [1, 2, 2, 3, 4, 4, 5] and observe how duplicate values are handled.

Solution:
my_list = [1, 2, 2, 3, 4, 4, 5]
my_set = set(my_list)
print(my_set)
## {1, 2, 3, 4, 5}

 

3.4.5 Arrays

To manipulate arrays in Python we need the numpy package. This package is very useful and will be covered in other chapters.

Note first that in R we can build an array starting from a vector. R stores all arrays in column-major order, whereas numpy uses row-major order by default.

# Python

import numpy
arr = numpy.array([[1,4],[2,5],[3,6]])
arr
## array([[1, 4],
##        [2, 5],
##        [3, 6]])

type(arr)
## <class 'numpy.ndarray'>

vec = [1,2,3,4,5,6]

arr = numpy.reshape(vec,(3,2))
arr
## array([[1, 2],
##        [3, 4],
##        [5, 6]])
arr = numpy.reshape(vec,(3,2), order = 'F')
arr
## array([[1, 4],
##        [2, 5],
##        [3, 6]])
# R
arr = as.array(rbind(c(1,4),c(2,5),c(3,6)))
arr
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

vec = c(1,2,3,4,5,6)

arr = array(vec,dim = c(3,2))
arr
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

As you can see, R and Python do not fill a matrix from a vector in the same way: R fills it column by column, while numpy fills it row by row by default. Using “order = ‘F’” makes numpy fill the matrix column by column.
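
The difference between the two layouts comes down to how a flat position is mapped to a (row, column) pair. The pure-Python sketch below fills a 3 x 2 grid both ways; the function names are ours and do not correspond to any NumPy API.

```python
def fill_row_major(values, nrow, ncol):
    """Fill row by row (NumPy's default, order='C')."""
    grid = [[None] * ncol for _ in range(nrow)]
    for i, v in enumerate(values):
        grid[i // ncol][i % ncol] = v
    return grid

def fill_column_major(values, nrow, ncol):
    """Fill column by column (R's layout, NumPy's order='F')."""
    grid = [[None] * ncol for _ in range(nrow)]
    for i, v in enumerate(values):
        grid[i % nrow][i // nrow] = v
    return grid

vec = [1, 2, 3, 4, 5, 6]
print(fill_row_major(vec, 3, 2))     # [[1, 2], [3, 4], [5, 6]]
print(fill_column_major(vec, 3, 2))  # [[1, 4], [2, 5], [3, 6]]
```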

# Python

# access the element of an array

arr[0] # access the first row directly
## array([1, 4])
# modify
vec = [7,8]
arr = numpy.insert(arr, len(arr),vec,axis = 0)  # insert vec as a new last row
arr
## array([[1, 4],
##        [2, 5],
##        [3, 6],
##        [7, 8]])
# mutable

arr2 = arr
arr2[0] = [12,14]
arr
## array([[12, 14],
##        [ 2,  5],
##        [ 3,  6],
##        [ 7,  8]])

Exercise 1: Create a NumPy array with the values [[1, 2, 3], [4, 5, 6]]. Then, print the element at row 1, column 2 (using Python's zero-based indices).

Solution:
import numpy as np
my_array = np.array([[1, 2, 3], [4, 5, 6]])
print(my_array[1, 2])
## 6

 

Exercise 2: Create a NumPy array with the values [1, 2, 3, 4, 5, 6]. Then, reshape it into a 2x3 matrix.

Solution:
import numpy as np
my_array = np.array([1, 2, 3, 4, 5, 6])
reshaped_array = my_array.reshape(2, 3)
print(reshaped_array)
## [[1 2 3]
##  [4 5 6]]

 

3.4.6 Data Frame

Pandas data frames are also a very common data type in Python. The pandas package is covered in more depth in the following chapters.

# Python

import pandas

df = pandas.DataFrame(arr)
df
##     0   1
## 0  12  14
## 1   2   5
## 2   3   6
## 3   7   8

vec = [1,2,3,4,5,6]

df = pandas.DataFrame({'vec':vec,'vec1':range(2,8)})
df
##    vec  vec1
## 0    1     2
## 1    2     3
## 2    3     4
## 3    4     5
## 4    5     6
## 5    6     7
# Python

# access element of a Pandas Data Frame

df['vec'] 
## 0    1
## 1    2
## 2    3
## 3    4
## 4    5
## 5    6
## Name: vec, dtype: int64
# modify
vec2 = range(3,9)
df['vec2'] = vec2 # add values from other vector
df
##    vec  vec1  vec2
## 0    1     2     3
## 1    2     3     4
## 2    3     4     5
## 3    4     5     6
## 4    5     6     7
## 5    6     7     8
# mutable

df2 = df
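df.loc[0, 'vec'] = 30 # .loc modifies df in place; chained indexing such as df['vec'][0] = 30 is unreliable in recent pandas versions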
df2
##    vec  vec1  vec2
## 0   30     2     3
## 1    2     3     4
## 2    3     4     5
## 3    4     5     6
## 4    5     6     7
## 5    6     7     8

Exercise 1: Create a Pandas DataFrame from a dictionary with two columns: “Name” and “Age”.

Solution:
import pandas as pd
data = {"Name": ["Alice", "Bob"], "Age": [25, 24]}
df = pd.DataFrame(data)
print(df)
##     Name  Age
## 0  Alice   25
## 1    Bob   24

 

Exercise 2: Add a new column to a Pandas DataFrame that contains the square of ‘Age’.

Solution:
df["Squared_Age"] = df["Age"] ** 2
print(df)
##     Name  Age  Squared_Age
## 0  Alice   25          625
## 1    Bob   24          576

 

4 Control flow

There are two main control flow tools: choices and loops. ‘Choices’ are very useful for establishing rules or conditions. They can be used to modify a value according to a certain condition; more generally, they let us trigger certain actions in specific cases.

‘Loops’ let us execute actions repeatedly. They can be used to modify an object iteratively; more generally, they let us run a procedure several times. For example, we can create N similar objects, or modify each of the N rows of an object.

4.1 Choices

# Python 

# if, elif, else

n = 12

if n%2 == 0 :  
  print('n is an even number')
## n is an even number
  



if n != int(n): 
  print('n is not an integer')
elif  n%2 == 0 :
  print('n is an even number')
else:
  print('n is not an even number')
## n is an even number
# R

# if, else if, else

n = 12L

if(n%%2 == 0){  
  print('n is an even number')
}
## [1] "n is an even number"
  



if(!is.integer(n)){ 
  print('n is not an integer')
  } else if(n%%2 == 0){
    print('n is an even number')
    } else {
      print('n is not an even number')
      }
## [1] "n is an even number"

Exercise 1: Write a condition that returns “there is an ‘e’” if there is an ‘e’ in a given word, and “there is no ‘e’” otherwise.

Solution:
word = 'hello'
if 'e' in word:
  print("there is an 'e'")
else:
  print("there is no 'e'")
## there is an 'e'

 

Exercise 2: Write code that categorizes a word based on its length: “short” if the word has 3 or fewer characters, “medium” if it has between 4 and 6 characters, and “long” if it has 7 or more characters.

Solution:
word = 'hello'
if len(word) <= 3:
    print("short")
elif len(word) <= 6:
    print("medium")
else:
    print("long")
## medium

 

4.2 Loops

With a ‘for’ loop, we iterate over a predefined sequence, so the number of iterations is known in advance. Sometimes, however, we do not know how many steps are needed to perform a given task.

Imagine that we want to optimize a function: we do not know how many steps are needed to reach an optimum, but we can set a condition under which we consider that the algorithm has converged. For this kind of problem we can use a ‘while’ loop, which keeps iterating as long as its condition is true, that is, until the stopping criterion is met.

# Python 

seq = [1,2,None,4,None,6]
total = 0

for val in seq:
  if val is not None:
    total += val
    
total
## 13
# Python 
import random

total = 0
while total < 1:
  rnd = random.gauss(mu = 0, sigma = 1)
  if rnd < 0:
    pass
  else:
    total += rnd
  
total
## 2.106540591465749
# R
seq = c(1,2,NA,4,NA,6)
total = 0

for(val in seq){
  if(!is.na(val)){
    total = total + val
  }
}
    
total
## [1] 13
# R

total = 0

while(total < 1){
  rnd = rnorm(n=1, mean = 0, sd = 1)
  if(rnd < 0){
    # in R we can specify nothing instead of specifying 'pass'
    } else {
    total = total + rnd }
}
  
total
## [1] 1.315373
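
The convergence scenario described above can be sketched with Newton's method for a square root: the number of iterations is not known in advance, so the loop runs until successive estimates differ by less than a tolerance. The function name, tolerance and iteration cap are illustrative choices.

```python
def newton_sqrt(value, tol=1e-10, max_iter=100):
    """Approximate the square root of a positive number with Newton's method."""
    estimate = value          # initial guess
    n_iter = 0
    change = float("inf")
    # Iterate until the estimate stabilizes or the iteration cap is reached
    while change > tol and n_iter < max_iter:
        new_estimate = 0.5 * (estimate + value / estimate)
        change = abs(new_estimate - estimate)
        estimate = new_estimate
        n_iter += 1
    return estimate, n_iter

root, steps = newton_sqrt(2.0)
print(root)    # close to 1.41421356
print(steps)   # number of iterations actually needed
```

The cap max_iter guards against an endless loop if the stopping condition is never met.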

Exercise 1: Write a for loop that prints the numbers from 1 to 10.

Solution:
for i in range(1, 11):
    print(i)
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10

 

Exercise 2: Write a while loop that keeps prompting the user for input until they enter “quit”.

Solution:
user_input = ""
while user_input != "quit":
    user_input = input("Enter something (or 'quit'): ")

 

4.3 List, Set and Dict comprehensions (Python)

Comprehensions are a very common and appreciated feature of the Python language. Think of a comprehension as a loop whose output is stored directly in a list, set, or dict. We can use it as a filter, for example.

# Python 
# List
import time

lst = [1,2,3,4]

t = time.time() 
results = []
for val in lst:
  if val > 2:
    results.append(val)
time.time()-t
## 0.0019481182098388672

results
## [3, 4]

t = time.time() 
# the list comprehension below produces the same output as the loop above
results = [val for val in lst if val>2]
time.time()-t
## 0.0014700889587402344

results
## [3, 4]
# Python 
# Set
import time

st = {1,2,3,4}

t = time.time() 
results = set([])
for val in st:
  if val > 2:
    results.add(val)
time.time()-t
## 0.001961946487426758

results
## {3, 4}

t = time.time() 

# the set comprehension below produces the same output as the loop above
results = {val for val in st if val>2}
time.time()-t
## 0.0013701915740966797

results
## {3, 4}
# Python 
# Dict
import time

dct = {'a':1,'b':2,'c':3,'d':4}

t = time.time() 
results = dict([])
for val in dct:
  if dct[val] > 2:
    results.update({str(val): dct[val]})
time.time()-t
## 0.0019779205322265625

results
## {'c': 3, 'd': 4}

t = time.time() 
# the dict comprehension below produces the same output as the loop above
results = {str(val): dct[val] for val in dct if dct[val]>2}
time.time()-t
## 0.0012369155883789062

results
## {'c': 3, 'd': 4}
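
A single time.time() difference on a four-element collection is dominated by measurement noise, so the timings above should not be read as a benchmark. Below is a hedged sketch of a more robust comparison with the standard timeit module on a larger list (the list size and repetition count are arbitrary choices).

```python
import timeit

lst = list(range(10_000))

def with_loop():
    results = []
    for val in lst:
        if val > 2:
            results.append(val)
    return results

def with_comprehension():
    return [val for val in lst if val > 2]

# Both approaches build the same list
print(with_loop() == with_comprehension())   # True

# Each call below runs the function 200 times and returns the total time in seconds
loop_time = timeit.timeit(with_loop, number=200)
comp_time = timeit.timeit(with_comprehension, number=200)
print(f"loop: {loop_time:.4f}s, comprehension: {comp_time:.4f}s")
```

The comprehension is often somewhat faster, but the gap depends on the interpreter version and the workload.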

Exercise 1: Use list comprehension to create a new list containing only the even numbers from an existing list.

Solution:
numbers = [1, 2, 3, 4, 5, 6]
even_numbers = [x for x in numbers if x % 2 == 0]
print(even_numbers)
## [2, 4, 6]

 

Exercise 2: Use dictionary comprehension to create a dictionary where the keys are numbers from 1 to 5 and the values are their squares.

Solution:
squares = {x: x**2 for x in range(1, 6)}
print(squares)
## {1: 1, 2: 4, 3: 9, 4: 16, 5: 25}

 

5 Functions

Functions are very important in both languages. Coding something ourselves is sometimes more efficient than looking for a package and learning how it works.
Being able to write our own functions gives us more flexibility and a better understanding of what we are actually doing.

It is important not to reinvent the wheel, but it is equally important to be able to build it ourselves. Read the documentation carefully when using packages: packages are sometimes misleading, and we can spend a lot of time understanding how they work. Look at a package's source code when unsure of what a function does behind the scenes.
The full power of programming comes from being autonomous: reading, modifying and writing both new and pre-existing code.

 

5.1 General Functions

# Python 
import time

seq = [1,2,None,4,None,6]*120

def clean_sum(seq):
  total = 0

  for val in seq:
    if val is not None:
      total += val
  return total
  
t = time.time()
clean_sum(seq = seq)
## 1560
time.time() - t
## 0.0013480186462402344


def clean_sum2(seq):
  # sum only the values that are not None
  total = sum(val for val in seq if val is not None)
  return total
  
t = time.time()
clean_sum2(seq = seq)
## 1560
time.time() - t
## 0.001300811767578125
# R
seq = rep(c(1,2,NA,4,NA,6),120)


clean_sum <- function(seq){
  total = 0
  for(val in seq){
    if(!is.na(val)){
      total = total + val
    }
  }
  return(total)
}

t = Sys.time()
clean_sum(seq = seq)
## [1] 1560
Sys.time() - t
## Time difference of 0.004907131 secs



clean_sum2 <- function(seq){
  total <- sum(seq, na.rm = TRUE) # na.rm = TRUE drops the NA values before summing
  return(total)
}

t <- Sys.time()
clean_sum2(seq = seq)
## [1] 1560
Sys.time() - t
## Time difference of 0.0006980896 secs

Functions in Python are objects: like other objects, they can have attributes and methods. A function can contain its own variables and even other functions defined inside it.
Suppose we want to apply several transformations to data. We can create one function for each task that we want to perform, store the functions in a list, and apply them sequentially very easily.

Note that we can split the output into several variables by listing the variables on the left-hand side of the assignment (unpacking). We cannot do that directly in R, but the zeallot package addresses this.

# Python 

def add_two(nb):
   nb = [i+2 for i in nb]
   return nb
  
def square_nb(nb):
   nb = [i**2 for i in nb]
   return nb

def global_function(nb):
   for function in func_list:
      nb = function(nb)
   return nb

func_list = [add_two,square_nb]

x1, x2 = global_function([2,3])
x1
## 16
x2
## 25
# R
library(zeallot)

add_two <- function(nb){
  nb = nb + 2
  return(nb)
}
  
square_nb <- function(nb){
  nb = nb**2
  return(nb)
}

global_function <- function(nb){
  for(func in func_list){
    nb = func(nb)
      }
  return(nb)
}

func_list = c(add_two,square_nb)

c(x1, x2) %<-% global_function(c(2,3))
x1
## [1] 16
x2
## [1] 25
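
Because functions are objects, they can be stored, passed around, given attributes, and defined inside other functions. The sketch below illustrates these points with functools.reduce; compose, make_adder and the description attribute are hypothetical names introduced for illustration.

```python
from functools import reduce

def add_two(nb):
    return [i + 2 for i in nb]

def square_nb(nb):
    return [i ** 2 for i in nb]

def compose(functions):
    """Return a function that applies each function in order."""
    def pipeline(data):
        return reduce(lambda acc, func: func(acc), functions, data)
    return pipeline

def make_adder(k):
    """A function defined inside another one remembers k (a closure)."""
    def adder(nb):
        return [i + k for i in nb]
    return adder

# Functions carry attributes like any other object
add_two.description = "adds 2 to every element"
print(add_two.__name__, "-", add_two.description)

transform = compose([add_two, square_nb, make_adder(1)])
x1, x2 = transform([2, 3])   # unpacking of the returned list
print(x1, x2)                # 17 26
```

Passing the list of functions as an argument, rather than reading a global func_list, keeps the pipeline self-contained.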

Exercise 1: Write a function that takes a list of numbers as input and returns the average of the numbers. Check first that the list is not None or empty before computing the average.

Solution:
def average_list(numbers):
    if not numbers:
        return 0
    return sum(numbers) / len(numbers)

my_list = [1, 2, 3, 4, 5]
average = average_list(my_list)
print(average)
## 3.0

 

6 Exercises

6.1 Exercise 1

If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.

Find the sum of all the multiples of 3 or 5 below 10000.

6.2 Exercise 2

By listing the first six prime numbers: 2, 3, 5, 7, 11, and 13, we can see that the 6th prime is 13.

What is the 10 001st prime number?

6.3 Exercise 3

You are given the following information, but you may prefer to do some research for yourself.

  • 1 Jan 1900 was a Monday.
  • Thirty days has September, April, June and November. All the rest have thirty-one, Saving February alone, Which has twenty-eight, rain or shine. And on leap years, twenty-nine.
  • A leap year occurs on any year evenly divisible by 4, but not on a century unless it is divisible by 400.

How many Sundays fell on the first of the month during the twentieth century (1 Jan 1901 to 31 Dec 2000)?

 

Possible solutions in R and Python without vectorisation