This chapter introduces the basics of R and Python. The goal is to discover the different types of objects that each language offers, and to understand the logic of control flow and function writing. The chapter ends with three exercises from Project Euler. In groups of two, solve at least one exercise in the language of your choice and try to compare different approaches.
Both languages are great for business analysis. From a purely machine learning perspective, they can be used for similar purposes: both have packages or libraries dedicated to wrangling, preprocessing, and applying machine learning to data, and both are great choices for reproducible research. Where things become really interesting is in their differences, and in combining the two languages.
The S language, a precursor to R, was developed by the statistician John Chambers at Bell Labs in 1976 as a programming language designed to implement statistics. The R statistical programming language was developed at the University of Auckland, New Zealand, to extend the capabilities of S. S and R developers were not software engineers or computer scientists; they were researchers who developed tools in order to carry out research and communicate results more efficiently.
R is a language with roots in statistics, data analysis, data exploration, and data visualization. It has utilities for reporting and communication, including RMarkdown and Shiny.
R is expanding rapidly with the emergence of the tidyverse (tidyverse.org), a set of tools with a common programming interface that uses functional verbs (functions like mutate() and summarize()) to perform intuitive operations connected by the pipe (%>%). The tidyverse is very advantageous because it makes data mining very efficient: iterating through an exploratory analysis is as simple as writing a paragraph describing what we intend to do to the data.
The Python language is a general-purpose programming language created by the computer scientist Guido van Rossum in 1991. The language was developed to be readable and to cover many programming paradigms. One of Python’s greatest strengths is its flexibility: it handles web frameworks, database connectivity, networking, web scraping, and text and image processing, and many of those features are very useful for various tasks in machine learning.
In essence, the foundations of Python are in computer science and mathematics. With well over 100,000 open source libraries, Python has one of the largest ecosystems of any programming language, making it a strong choice for those who want high flexibility.
Python has great data science libraries, including scikit-learn, the most popular machine learning library: it is easy to pick up, supports pipelines that simplify the machine learning workflow, and has almost all of the algorithms one needs in one place. Another very common library is TensorFlow, developed by software engineers at Google for deep learning and commonly used for image recognition and natural language processing tasks (NLTK is also very useful for NLP). Facebook also has its own deep learning framework, PyTorch, which is a competitor of TensorFlow and Keras (the latter is designed for building neural networks efficiently).
R is selected for exploration due to the readability and efficiency of the tidyverse and data.table.
Python is preferred for machine learning because of the pipeline capabilities of scikit-learn, TensorFlow, Keras and PyTorch.
R is favored for communication because of the advanced reporting utilities including RMarkdown and Shiny (interactive web apps) and ggplot2 for visualization.
Python and R both work as a classical calculator: using “+”, “-”, “*” and “/”, we can do arithmetic operations in both languages.
# Python
1+2
## 3
1-2
## -1
1/2
## 0.5
1*2
## 2
# R
1+2
## [1] 3
1-2
## [1] -1
1/2
## [1] 0.5
1*2
## [1] 2
We can also apply exponentiation, modulo and floor division easily in both languages. Note that in R exponentiation can also be written “^”, whereas in Python “^” is the bitwise exclusive-or (XOR) operator.
# Python
2**8 # exponentiation
## 256
2^8 == 2**8 # False: in Python ^ is XOR, so 2^8 equals 10
## False
8%3 # modulo
## 2
8//3 # floor division
## 2
# R
2**8 # exponentiation
## [1] 256
2^8 == 2**8 # TRUE
## [1] TRUE
8%%3 # modulo
## [1] 2
8%/%3 # floor division
## [1] 2
| Operator (R) | Operator (Python) | Description |
|---|---|---|
| + | + | Addition |
| - | - | Subtraction |
| * | * | Multiplication |
| / | / | Division |
| ^ / ** | ** | Exponent |
| %% | % | Modulo |
| %/% | // | Floor division |
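As a quick check of the table above, the following minimal sketch (plain Python, with variable names chosen only for illustration) shows that `^` computes a bitwise XOR rather than a power, and that the built-in `divmod` returns the floor division and the modulo in one call.

```python
# Python
xor_result = 2 ^ 8                  # bitwise XOR: 0b0010 ^ 0b1000 = 0b1010
power_result = 2 ** 8               # exponentiation
quotient, remainder = divmod(8, 3)  # same as (8 // 3, 8 % 3)
print(xor_result, power_result, quotient, remainder)
```

Using `divmod` is only a convenience; writing `8 // 3` and `8 % 3` separately gives the same two values.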
In order to compare some values, we can use comparison operators to investigate if a given value is equal to, not equal to, greater than, etc.
These operators are the same in both languages.
# Python
2==8
## False
2!=8
## True
2<8
## True
2>8
## False
2<=8
## True
# R
2==8
## [1] FALSE
2!=8
## [1] TRUE
2<8
## [1] TRUE
2>8
## [1] FALSE
2<=8
## [1] TRUE
| Operator (R/Python) | Description |
|---|---|
| < | Less than |
| > | Greater than |
| <= | Less than or equal to |
| >= | Greater than or equal to |
| == | Equal to |
| != | Not equal to |
# Python
x = [True,True]
y = [True,False]
not x[0]
## False
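x and y # returns y because the list x is non-empty: this is NOT an element-wise 'and'
## [True, False]
x or y # returns x because the list x is non-empty: this is NOT an element-wise 'or'
## [True, True]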
# R
x = c(TRUE,TRUE)
y = c(TRUE,FALSE)
!x[1]
## [1] FALSE
x & y
## [1] TRUE FALSE
x | y
## [1] TRUE TRUE
| Operator (R) | Operator (Python) | Description |
|---|---|---|
| ! | not | not |
| & | and | and |
| \| | or | or |
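One caveat about the table above: R’s `&` and `|` are element-wise on vectors, while Python’s `and` and `or` evaluate whole objects. The sketch below, with illustrative variable names, uses values for which the two behaviours differ, and rebuilds the element-wise result with `zip` and a list comprehension.

```python
# Python
x = [True, False]
y = [True, True]

# 'and' / 'or' look at the whole lists: any non-empty list counts as true
whole_and = x and y   # evaluates to y
whole_or = x or y     # evaluates to x

# element-wise versions, comparable to R's x & y and x | y
elementwise_and = [a and b for a, b in zip(x, y)]
elementwise_or = [a or b for a, b in zip(x, y)]

print(whole_and, elementwise_and)
print(whole_or, elementwise_or)
```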
Exercise 1: Write a Python expression that checks if a number is both greater than 10 and less than 20.
number = 15
result = 10 < number < 20
print(result)
## True
Exercise 2: Write a Python expression that checks if a string is either “yes” or “no”.
string = "yes"
result = string == "yes" or string == "no"
print(result)
## True
‘in’ and ‘not in’ are membership operators in Python. They are used
to test whether a value or variable is found in a sequence. In R, ‘%in%’
plays the role of Python’s ‘in’, but be aware that the two do not always
behave the same way: in Python a string is itself a sequence of
characters, so ‘in’ can search inside it, whereas in R a string is a
single element of a character vector.
See how we can check that a word is in ‘Hello world’ using just ‘in’ with
Python. In R, ‘%in%’ checks whether the string exactly matches one element
of the vector, so we need a function such as stringr::str_detect to
search inside a string.
# Python
x = 'Hello world'
y = {1:'a',2:'b'}
print('world' in x)
## True
print(1 in y)
## True
print('a' in y)
## False
# R
x = 'Hello World'
print('Hello' %in% x)
## [1] FALSE
stringr::str_detect(x,'Hello')
## [1] TRUE
# Python
x = ['Hello','World']
print('Hello' in x)
## True
# R
x = c('Hello','World')
print('Hello' %in% x)
## [1] TRUE
stringr::str_detect('Hello',x)
## [1] TRUE FALSE
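To summarise the Python side of these comparisons, here is a small sketch (the variable names are illustrative) that contrasts a substring search, an exact element match, and a lookup in the values of a dictionary.

```python
# Python
sentence = 'Hello world'
words = sentence.split()          # ['Hello', 'world']
letters = {1: 'a', 2: 'b'}

substring_match = 'Hell' in sentence    # 'in' searches inside a string
element_match = 'Hello' in words        # exact match against the list elements
partial_element = 'Hell' in words       # no element is exactly 'Hell'
key_match = 'a' in letters              # 'in' on a dict tests the keys only
value_match = 'a' in letters.values()   # look in the values explicitly

print(substring_match, element_match, partial_element, key_match, value_match)
```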
Exercise 1: Write a Python expression that checks if the letter “a” is present in the string “banana”.
string = "banana"
result = "a" in string
print(result)
## True
Exercise 2: Write a Python expression that checks if the number 5 is not present in the list [1, 2, 3, 4].
list_ = [1, 2, 3, 4]
result = 5 not in list_
print(result)
## True
In Python and R, we do not need to declare a variable before assigning a value to it. We can think of a variable as a name that refers to an object. Python and R differ, however, in the way variables and objects are related.
While Python is much appreciated for being a versatile language with an easy-to-understand syntax, R’s functionalities are developed by statisticians, thus giving it field-specific advantages.
In R, a variable behaves as if it were the object itself. If we assign an existing variable to a new variable, the two behave as two different objects: modifying one does not change the other (R copies the object when it is modified).
Things in Python are more conventional, but since most of you learned R before Python, it is important to notice that a variable refers to an object while the two remain distinct entities. This means that two variables can refer to a single object. If two variables point to the same mutable object, a modification made in place through one variable is also visible through the other.
# Python
a = 1
b = a
a += 1 # integers are immutable: a now refers to a new object, b is unchanged
print(a)
## 2
print(b)
## 1
a = [1,2]
b = a
a.append(3) # the list is modified in place, and b refers to the same list
print(a)
## [1, 2, 3]
print(b)
## [1, 2, 3]
a = (1,2)
b = a
a += (3,) # tuples are immutable: a now refers to a new tuple, b is unchanged
print(a)
## (1, 2, 3)
print(b)
## (1, 2)
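When we want the R-like behaviour in Python (two independent objects), we have to ask for a copy explicitly. The sketch below is a minimal illustration with made-up variable names: the `is` operator tells us whether two names refer to the same object, and `list.copy()` creates a separate (shallow) copy.

```python
# Python
a = [1, 2]
b = a          # a second name for the same list
c = a.copy()   # an independent (shallow) copy of the list
a.append(3)

same_object = b is a      # b and a are the same object
copied_object = c is a    # c is a different object
print(same_object, copied_object)
print(b, c)
```

Note that `copy()` is shallow: nested lists inside the copy would still be shared with the original.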
In R we should mostly avoid for loops: they are very slow because they execute a function call at every iteration. We should vectorize and use the apply family of functions instead. Vectorization is king in R if we want fast code. Assuming we vectorize both our R and Python code (and other factors being equal), we should get roughly the same order of magnitude in speed. For data larger than memory (we can specify the limit), R starts to become a bad choice.
# Python
import numpy as np
only_nb = [1,2,3,4]
np.std(only_nb)
## np.float64(1.118033988749895)
np.std(only_nb,ddof=1)
## np.float64(1.2909944487358056)
with_none = [1,2,3,None]
np.std(with_none)
## TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
with_char = [1,2,'3',4]
np.std(with_char)
## TypeError: the resolved dtypes are not compatible with add.reduce. Resolved (dtype('<U21'), dtype('<U21'), dtype('<U42'))
# R
only_nb <- c(1,2,3,4)
sd(only_nb)
## [1] 1.290994
with_na <- c(1,2,3,NA)
sd(with_na)
## [1] NA
with_char <- c(1,2,'3',4)
sd(with_char)
## [1] 1.290994
with_char <- c(1,2,'a',4)
sd(with_char)
## Warning in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm =
## na.rm): NAs introduced by coercion
## [1] NA
When computing a standard deviation on data containing NA, R does not
raise an error and simply returns NA. See how it also converts strings to
numeric when possible. If converting the string to numeric is not
feasible, R returns NA with a warning message, not an error.
We need to write our code carefully, since R can keep computing for a
while and just return NA without stopping the process, which is not the
case in Python.
R calculates the standard deviation with N - 1 as the denominator, and numpy with N. To get the same result we need to use ‘ddof=1’ as argument in std.
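The two denominators can also be compared without NumPy. The following sketch uses only the standard library `statistics` module: `pstdev` divides by N (like `numpy.std` by default) and `stdev` divides by N - 1 (like `sd()` in R). The explicit removal of `None` is an illustrative choice that mimics what `na.rm = TRUE` does in R; it is not something Python does for us.

```python
# Python
import statistics

values = [1, 2, 3, None]
cleaned = [v for v in values if v is not None]  # drop missing values explicitly

population_sd = statistics.pstdev(cleaned)  # denominator N
sample_sd = statistics.stdev(cleaned)       # denominator N - 1
print(population_sd, sample_sd)
```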
In R textual data are called ‘character’; in Python they are strings, abbreviated ‘str’. Double (") or single (') quotes are used to define a textual value in both languages, and we can also convert an existing variable to text if needed.
Numerical variables are divided into three types. In Python we have integer, float and complex. In R float is called ‘double’, referring to double-precision floating point. Without entering into details, it means that numbers are stored with about 15 significant decimal digits. In other languages ‘float’ can refer to single precision (about 7 digits); in Python, however, the ‘float’ data type is double precision, with about 15 digits. There are ways to get more precision, using NumPy for example.
Finally, we can easily check the data type of a variable with the command ‘type’ in Python, or ‘typeof’ in R.
# Python
import sys
sys.float_info.dig # number of significant decimal digits of a float
## 15
a = 1
type(a)
## <class 'int'>
b = 1.1
type(b)
## <class 'float'>
c = 1.1+2j
type(c)
## <class 'complex'>
d = 'd'
type(d)
## <class 'str'>
# change the type
e = 2
f = str(e)
f
## '2'
float(f)
## 2.0
# R
a = 1
typeof(a)
## [1] "double"
a_int = 1L
typeof(a_int)
## [1] "integer"
b = 1.1
typeof(b)
## [1] "double"
c = 1.1+2i
typeof(c)
## [1] "complex"
d = 'd'
typeof(d)
## [1] "character"
# change the type
e = 2
f = as.character(e)
f
## [1] "2"
as.numeric(f)
## [1] 2
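The limited precision of floats has a practical consequence that is worth seeing once. The sketch below (standard library only, illustrative variable names) shows that an exact comparison of floats can fail because of rounding in the last digits, and that `math.isclose` is a safer way to compare them.

```python
# Python
import math

total = 0.1 + 0.2
exact_equal = total == 0.3               # rounding error makes this comparison fail
close_enough = math.isclose(total, 0.3)  # comparison with a relative tolerance
is_float = isinstance(total, float)      # alternative to comparing type(total)
print(total, exact_equal, close_enough, is_float)
```

The same rounding happens with R doubles, where `all.equal` plays a role comparable to `math.isclose`.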
Exercise 1: Create a variable containing your full name, then extract and print only your first name using string slicing.
full_name = "John Smith"
first_name = full_name[:full_name.index(" ")] # slice up to the first space
print(first_name)
## John
Exercise 2: Write a Python expression that checks if ‘ab’ is present in the list [‘aa’,‘bb’,‘ab’,‘ba’].
list_ = ['aa','bb','ab','ba']
result = 'ab' in list_
print(result)
## True
There are many data types in both languages that we will discover throughout this course. For now, we will focus only on the basic data types that are directly available without any other package or library. In R, the common data types are vector, matrix, list, and data frame.
# R
vec = c(1, 2, 3)
length(which(vec==2)) # count the elements of the vector that are exactly equal to 2
## [1] 1
vec = sort(vec,
decreasing = TRUE)
vec
## [1] 3 2 1
# access the elements of a vector
vec[1]
## [1] 3
vec[which(vec==3)]
## [1] 3
vec[2:length(vec)]
## [1] 2 1
vec[length(vec):2]
## [1] 1 2
vec[-1] # a negative index drops that element (here the first one)
## [1] 2 1
# R
mat = matrix(c(1, 2, 3, 4, 5, 6),3,2)
dim(mat)
## [1] 3 2
mat
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
length(which(mat==2)) # count the elements of the matrix that are exactly equal to 2
## [1] 1
vec = sort(mat,
decreasing = TRUE)
vec
## [1] 6 5 4 3 2 1
# access the elements of a matrix
mat[1] # first element
## [1] 1
which(mat==3,
arr.ind = TRUE) # get row and columns for each value that match
## row col
## [1,] 3 1
mat[2:length(mat)] # gives us a vector
## [1] 2 3 4 5 6
mat[2:dim(mat)[1],]
## [,1] [,2]
## [1,] 2 5
## [2,] 3 6
mat[length(mat):1]
## [1] 6 5 4 3 2 1
mat[1,-1] # first row, dropping the first column
## [1] 4
# modify
b = rep(0,3)
vec = c(vec,b)
vec
## [1] 6 5 4 3 2 1 0 0 0
mat_2 = rbind(vec,vec)
mat_2
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## vec 6 5 4 3 2 1 0 0 0
## vec 6 5 4 3 2 1 0 0 0
mat_2 = cbind(mat_2,mat_2)
mat_2
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## vec 6 5 4 3 2 1 0 0 0 6 5 4 3 2
## vec 6 5 4 3 2 1 0 0 0 6 5 4 3 2
## [,15] [,16] [,17] [,18]
## vec 1 0 0 0
## vec 1 0 0 0
t_mat = t(mat) # transposition
t_mat[2,3] = as.character(t_mat[2,3]) # all values turn to character
i_mat = diag(1, 3)
i_mat
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
A data frame in R is like a matrix whose columns can store different types of data: characters, numerics or factors. Contrary to a matrix, it does not have a fixed dimension, which means that we can create a column very easily using the operator ‘$’.
# R
df = data.frame(chr = letters[1:5], # letters is a vector of letter provided directly in R
num = seq(1,5,1),
fac = factor(c(rep('a',3),rep('b',2))))
class(df$chr) # before R 4.0 strings were automatically converted to factors, which can be very inefficient; since R 4.0 they stay character
## [1] "character"
df$chr = as.character(df$chr) # only needed with R < 4.0
class(df$chr)
## [1] "character"
colnames(df)
## [1] "chr" "num" "fac"
df
## chr num fac
## 1 a 1 a
## 2 b 2 a
## 3 c 3 a
## 4 d 4 b
## 5 e 5 b
# access the element of a variable in a df
df$chr[3]
## [1] "c"
which(df==3,
arr.ind = TRUE)
## row col
## [1,] 3 2
df[1:2] # gives us a df with the first two columns !
## chr num
## 1 a 1
## 2 b 2
## 3 c 3
## 4 d 4
## 5 e 5
df[3:dim(df)[1],]
## chr num fac
## 3 c 3 a
## 4 d 4 b
## 5 e 5 b
df[1,-1] # throwaway values
## num fac
## 1 1 a
# modify
b = rep(0,3)
df = rbind(df,b) # introduce a new row in the data frame, note that we get a NA in the factor column because '0' is not a level
## Warning in `[<-.factor`(`*tmp*`, ri, value = 0): invalid factor level, NA
## generated
df$new_col = rep(2,6) # we can introduce new columns by specifying the name and set a value
df
## chr num fac new_col
## 1 a 1 a 2
## 2 b 2 a 2
## 3 c 3 a 2
## 4 d 4 b 2
## 5 e 5 b 2
## 6 0 0 <NA> 2
t_df = t(df) # transposition convert the df into a matrix
class(t_df)
## [1] "matrix" "array"
An array is a generalization of the matrix, meaning that it can have more than two dimensions. We can modify arrays as we modify a matrix.
# R
arr = array(seq(1,12,1), dim = c(2,2,3)) # create an array with 3 dimensions
arr
## , , 1
##
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## , , 2
##
## [,1] [,2]
## [1,] 5 7
## [2,] 6 8
##
## , , 3
##
## [,1] [,2]
## [1,] 9 11
## [2,] 10 12
# access the elements of an array
arr[1,,]
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 3 7 11
arr[,1,]
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
arr[,,1]
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
A list lets us gather an assortment of unrelated objects, possibly of different types, in a single object.
# R
lst = list(df = df,
mat = mat,
arr = arr) # create a list holding three different objects
lst
## $df
## chr num fac new_col
## 1 a 1 a 2
## 2 b 2 a 2
## 3 c 3 a 2
## 4 d 4 b 2
## 5 e 5 b 2
## 6 0 0 <NA> 2
##
## $mat
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
##
## $arr
## , , 1
##
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## , , 2
##
## [,1] [,2]
## [1,] 5 7
## [2,] 6 8
##
## , , 3
##
## [,1] [,2]
## [1,] 9 11
## [2,] 10 12
lst[[1]] # first element of the list, here it is the df
## chr num fac new_col
## 1 a 1 a 2
## 2 b 2 a 2
## 3 c 3 a 2
## 4 d 4 b 2
## 5 e 5 b 2
## 6 0 0 <NA> 2
lst[[1]][1,] # first row of first element of the list
## chr num fac new_col
## 1 a 1 a 2
In Python the most common are list, tuple, set and dictionary. Note that, for the moment, we have always used ‘=’ to modify a variable. Keep in mind that calling a method with ‘.’ can modify the object behind the variable in Python.
Keep in mind that lists are mutable, as discussed in the Section “Difference between R and Python”.
# Python
a = [1, 2, 3]
a.count(2) # count elements of the list which are exactly equal to 2
## 1
a.sort(reverse = True)
a
## [3, 2, 1]
# access the element of a list
a[0]
## 3
a.index(3)
## 0
a[1:]
## [2, 1]
a[:1]
## [3]
a[0:-1]
## [3, 2]
a[:]
## [3, 2, 1]
# modify
b = [0, 0, 0]
list(zip(a,b)) # zip pairs the elements; it also works with more than 2 sequences, e.g. zip(a,b,c)
## [(3, 0), (2, 0), (1, 0)]
a.append(b)
a
## [3, 2, 1, [0, 0, 0]]
a[4:5] = ['a','b']
a
## [3, 2, 1, [0, 0, 0], 'a', 'b']
a.extend([4, 5, 6])
a
## [3, 2, 1, [0, 0, 0], 'a', 'b', 4, 5, 6]
a += [7,8] # works as extend
a
## [3, 2, 1, [0, 0, 0], 'a', 'b', 4, 5, 6, 7, 8]
a.insert(2,[1,2])
a
## [3, 2, [1, 2], 1, [0, 0, 0], 'a', 'b', 4, 5, 6, 7, 8]
a.remove(b)
a
## [3, 2, [1, 2], 1, 'a', 'b', 4, 5, 6, 7, 8]
a = a*2 # replicate the list n times
len(a) # number of elements in the list
## 22
# mutable
a = [1,2,3]
b = a
b[0] = 12
a
## [12, 2, 3]
Exercise 1: Given a list [5, 2, 8, 1, 9], sort the list in ascending order.
my_list = [5, 2, 8, 1, 9]
my_list.sort()
print(my_list)
## [1, 2, 5, 8, 9]
Exercise 2: Given a list [‘apple’, ‘banana’, ‘apple’, ‘orange’], write code to count the number of times “apple” appears.
my_list = ['apple', 'banana', 'apple', 'orange']
count = my_list.count('apple')
print(count)
## 2
The main difference between lists and tuples is that a tuple is an immutable data type, which makes it faster to use.
# Python
a = (1, 2, 3)
a.count(2) # count elements of the tuple which are exactly equal to 2
## 1
a
## (1, 2, 3)
# access the element of a tuple
a[0]
## 1
a.index(3)
## 2
a[1:]
## (2, 3)
a[:1]
## (1,)
a[0:-1]
## (1, 2)
a[:]
## (1, 2, 3)
# modify
a += (4,5)
a
## (1, 2, 3, 4, 5)
a = a*2 # replicate the tuple n times
len(a) # number of elements in the tuple
## 10
# immutable
a = (1,2,3)
b = a
b += (4,5)
a
## (1, 2, 3)
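Immutability can be checked directly. In the minimal sketch below (names are illustrative), assigning to an element of a tuple raises a `TypeError`, and, because tuples are immutable and therefore hashable, a tuple can be used as a dictionary key while a list cannot.

```python
# Python
point = (1, 2, 3)

try:
    point[0] = 12              # item assignment is not allowed on a tuple
    assignment_worked = True
except TypeError:
    assignment_worked = False

labels = {point: 'first point'}   # a tuple is hashable, so it can be a key

try:
    bad = {[1, 2, 3]: 'a list'}   # a list is mutable, hence not hashable
    list_as_key = True
except TypeError:
    list_as_key = False

print(assignment_worked, labels[(1, 2, 3)], list_as_key)
```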
A dictionary stores data as pairs made of a key and the value associated with this key. Elements are accessed by key rather than by position (since Python 3.7, dictionaries keep the insertion order of their keys).
# Python
a = {'a':1, 'b':2, 'c':3}
# access the element of a dictionary
a.keys()
## dict_keys(['a', 'b', 'c'])
a['a']
## 1
a.values()
## dict_values([1, 2, 3])
a.items()
## dict_items([('a', 1), ('b', 2), ('c', 3)])
a.get('a')
## 1
a.get('d',4) # returns 4 if the key 'd' is not found (the dictionary is not modified)
## 4
a.pop('a') # pop returns the value associated with the key 'a' and removes the pair (key, value)
## 1
a
## {'b': 2, 'c': 3}
# modify
a['a'] =1
a.setdefault('d',0) # create new item with a default value
## 0
b = {'d':4,'e':5}
a.update(b) # update values from other dict
a
## {'b': 2, 'c': 3, 'a': 1, 'd': 4, 'e': 5}
a.clear() # remove all items
# mutable
a = {'a':1, 'b':2, 'c':3}
b = a
b['b'] = [12,14]
a
## {'a': 1, 'b': [12, 14], 'c': 3}
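A common use of `get` with a default value is counting. The sketch below is an illustrative example, not part of the exercises: it builds a dictionary of counts in a single loop, and then iterates over the (key, value) pairs with `items()`.

```python
# Python
fruits = ['apple', 'banana', 'apple', 'orange']

counts = {}
for fruit in fruits:
    # get returns 0 the first time a fruit is seen, so no key check is needed
    counts[fruit] = counts.get(fruit, 0) + 1

# iterate over (key, value) pairs
summary = [f"{name}: {number}" for name, number in counts.items()]
print(counts)
print(summary)
```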
Exercise 1: Create a dictionary that assigns the keys “name”, “age”, and “city” to some values. Then change the value of ‘city’ by assigning a list of two cities to this key. Finally, add a third city to the list by using the append method.
my_dict = {"name": "Pierre", "age": 29, "city": "Strasbourg"}
my_dict['city'] = ["Strasbourg","Schiltigheim"]
print(my_dict)
## {'name': 'Pierre', 'age': 29, 'city': ['Strasbourg', 'Schiltigheim']}
my_dict['city'].append('Colmar')
Exercise 2: Given a dictionary {‘a’: 1, ‘b’: 2, ‘c’: 3}, write code to add a new key-value pair “d”: 4.
my_dict = {'a': 1, 'b': 2, 'c': 3}
my_dict["d"] = 4
print(my_dict)
## {'a': 1, 'b': 2, 'c': 3, 'd': 4}
Sets are unordered collections of unique elements. If we give the same element to a set multiple times, it automatically drops the duplicated values.
# Python
a = {1, 2, 3}
a
## {1, 2, 3}
# access the element of a set
a[0] # since a set is unordered, we cannot access a given element by position
## TypeError: 'set' object is not subscriptable
# modify
b = {3,4,5}
a.update(b) # update values from other set
a
## {1, 2, 3, 4, 5}
# mutable
a = {1, 2, 3}
b = a
b.update([12,14])
a
## {1, 2, 3, 12, 14}
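Besides `update`, sets support the usual operations of set algebra through operators. The following short sketch (illustrative names) shows union, intersection, difference and symmetric difference; unlike `update`, these operators return new sets and leave the originals untouched.

```python
# Python
a = {1, 2, 3}
b = {3, 4, 5}

union = a | b                 # elements in a or b
intersection = a & b          # elements in both
difference = a - b            # elements in a but not in b
symmetric_difference = a ^ b  # elements in exactly one of the two sets

print(union, intersection, difference, symmetric_difference)
```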
Exercise 1: Create two sets, set1 and set2, with some overlapping elements. Then, find the intersection of the two sets.
set1 = {1, 2, 3, 4}
set2 = {3, 4, 5, 6}
intersection = set1.intersection(set2)
print(intersection)
## {3, 4}
Exercise 2: Create a set from the list [1, 2, 2, 3, 4, 4, 5] and observe how duplicate values are handled.
my_list = [1, 2, 2, 3, 4, 4, 5]
my_set = set(my_list)
print(my_set)
## {1, 2, 3, 4, 5}
To manipulate arrays in Python we need the package numpy; this package is very useful and will be covered in other chapters.
First, see that we can make an array in R by starting from a vector. R represents all arrays in column-major order, which is not the default in Python.
# Python
import numpy
arr = numpy.array([[1,4],[2,5],[3,6]])
arr
## array([[1, 4],
## [2, 5],
## [3, 6]])
type(arr)
## <class 'numpy.ndarray'>
vec = [1,2,3,4,5,6]
arr = numpy.reshape(vec,(3,2))
arr
## array([[1, 2],
## [3, 4],
## [5, 6]])
arr = numpy.reshape(vec,(3,2), order = 'F')
arr
## array([[1, 4],
## [2, 5],
## [3, 6]])
# R
arr = as.array(rbind(c(1,4),c(2,5),c(3,6)))
arr
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
vec = c(1,2,3,4,5,6)
arr = array(vec,dim = c(3,2))
arr
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
As you can see, R and Python do not fill a matrix from a vector in the same way: R fills it by column while Python does it by row. Using “order = ‘F’” allows us to fill the matrix by column.
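To make the two filling orders concrete, the sketch below rebuilds both layouts with plain Python lists instead of numpy. This is only an illustration of the index arithmetic; the names `row_major` and `col_major` are hypothetical and numpy does not work with nested lists internally.

```python
# Python
vec = [1, 2, 3, 4, 5, 6]
n_rows, n_cols = 3, 2

# row-major (numpy default, order='C'): consecutive values fill a row
row_major = [[vec[r * n_cols + c] for c in range(n_cols)] for r in range(n_rows)]

# column-major (R, or order='F' in numpy): consecutive values fill a column
col_major = [[vec[c * n_rows + r] for c in range(n_cols)] for r in range(n_rows)]

print(row_major)
print(col_major)
```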
# Python
# access the element of an array
arr[0] # directly access the first row
## array([1, 4])
# modify
vec = [7,8]
arr = numpy.insert(arr, len(arr),vec,axis = 0) # insert vec as a new last row (returns a new array)
arr
## array([[1, 4],
## [2, 5],
## [3, 6],
## [7, 8]])
# mutable
arr2 = arr
arr2[0] = [12,14]
arr
## array([[12, 14],
## [ 2, 5],
## [ 3, 6],
## [ 7, 8]])
Exercise 1: Create a NumPy array with the values [[1, 2, 3], [4, 5, 6]]. Then, print the element at row 1, column 2.
import numpy as np
my_array = np.array([[1, 2, 3], [4, 5, 6]])
print(my_array[1, 2])
## 6
Exercise 2: Create a NumPy array with the values [1, 2, 3, 4, 5, 6]. Then, reshape it into a 2x3 matrix.
import numpy as np
my_array = np.array([1, 2, 3, 4, 5, 6])
reshaped_array = my_array.reshape(2, 3)
print(reshaped_array)
## [[1 2 3]
## [4 5 6]]
Pandas DataFrames are also a very common data type in Python. The package pandas is covered in more depth in the following chapters.
# Python
import pandas
df = pandas.DataFrame(arr)
df
## 0 1
## 0 12 14
## 1 2 5
## 2 3 6
## 3 7 8
vec = [1,2,3,4,5,6]
df = pandas.DataFrame({'vec':vec,'vec1':range(2,8)})
df
## vec vec1
## 0 1 2
## 1 2 3
## 2 3 4
## 3 4 5
## 4 5 6
## 5 6 7
# Python
# access element of a Pandas Data Frame
df['vec']
## 0 1
## 1 2
## 2 3
## 3 4
## 4 5
## 5 6
## Name: vec, dtype: int64
# modify
vec2 = range(3,9)
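df['vec2'] = vec2 # add values from another sequence as a new column
df
## vec vec1 vec2
## 0 1 2 3
## 1 2 3 4
## 2 3 4 5
## 3 4 5 6
## 4 5 6 7
## 5 6 7 8
# mutable
df2 = df
df.loc[0, 'vec'] = 30 # .loc avoids chained assignment, which pandas warns about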
df2
## vec vec1 vec2
## 0 30 2 3
## 1 2 3 4
## 2 3 4 5
## 3 4 5 6
## 4 5 6 7
## 5 6 7 8
Exercise 1: Create a Pandas DataFrame from a dictionary with two columns: “Name” and “Age”.
import pandas as pd
data = {"Name": ["Alice", "Bob"], "Age": [25, 24]}
df = pd.DataFrame(data)
print(df)
## Name Age
## 0 Alice 25
## 1 Bob 24
Exercise 2: Add a new column to a Pandas DataFrame that contains the square of ‘Age’.
df["Squared_Age"] = df["Age"] ** 2
print(df)
## Name Age Squared_Age
## 0 Alice 25 625
## 1 Bob 24 576
There are two main control flow tools: choices and loops. Choices are very useful for establishing rules or conditions. They can be used to modify a value according to a certain condition; more generally, they allow us to launch certain actions in specific cases.
Loops allow us to execute actions sequentially. They can be used to iteratively modify an object; more generally, they allow us to run a procedure several times. For example, we can create N similar objects, but we can also modify the N rows of an object.
# Python
# if, elif, else
n = 12
if n%2 == 0:
    print('n is an even number')
## n is an even number
if n != int(n):
    print('n is not an integer')
elif n%2 == 0:
    print('n is an even number')
else:
    print('n is not an even number')
## n is an even number
# R
# if, else if, else
n = 12L
if(n%%2 == 0){
  print('n is an even number')
}
## [1] "n is an even number"
if(!is.integer(n)){
  print('n is not an integer')
} else if(n%%2 == 0){
  print('n is an even number')
} else {
  print('n is not an even number')
}
## [1] "n is an even number"
Exercise 1: Write a condition that returns “there is an ‘e’” if there is an ‘e’ in a given word, and “there is no ‘e’” otherwise.
word = 'hello'
if 'e' in word:
    print("there is an 'e'")
else:
    print("there is no 'e'")
## there is an 'e'
Exercise 2: Categorize a word based on its length: “short” if the word has 3 or fewer characters, “medium” if the word has between 4 and 6 characters, and “long” if the word has 7 or more characters.
if len(word) <= 3:
    print("short")
elif len(word) <= 6:
    print("medium")
else:
    print("long")
## medium
Using ‘for’ loops, we iterate over a predefined number of iterations. Sometimes, however, we do not know how many steps we need to perform a given task.
Imagine that we want to optimize a function: we do not know how many steps are needed to reach an optimum, but we can set a condition under which we consider that the algorithm has converged. For this kind of exercise we can use ‘while’ loops, which keep iterating as long as a given condition holds, that is, until a stopping condition is satisfied.
# Python
seq = [1,2,None,4,None,6]
total = 0
for val in seq:
    if val is not None:
        total += val
total
## 13
# Python
import random
total = 0
while total < 1:
    rnd = random.gauss(mu = 0, sigma = 1)
    if rnd < 0:
        pass
    else:
        total += rnd
total
## 2.106540591465749
# R
seq = c(1,2,NA,4,NA,6)
total = 0
for(val in seq){
  if(!is.na(val)){
    total = total + val
  }
}
total
## [1] 13
# R
total = 0
while(total < 1){
  rnd = rnorm(n=1, mean = 0, sd = 1)
  if(rnd < 0){
    # in R we can leave the block empty instead of writing 'pass'
  } else {
    total = total + rnd
  }
}
total
## [1] 1.315373
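The convergence idea described above can be sketched with a deterministic example. The function below is a hypothetical illustration, not a library routine: it approximates a square root with Newton's method and stops either when the estimate is close enough (the convergence condition) or after a maximum number of steps (a safety net against infinite loops). The tolerance and the starting value are arbitrary choices.

```python
# Python
def newton_sqrt(target, tolerance=1e-10, max_steps=100):
    """Approximate the square root of a positive number with a while loop."""
    estimate = target if target > 1 else 1.0   # arbitrary starting point
    steps = 0
    # keep iterating while we have not converged and have steps left
    while abs(estimate * estimate - target) > tolerance and steps < max_steps:
        estimate = 0.5 * (estimate + target / estimate)   # Newton update
        steps += 1
    return estimate, steps

root, n_steps = newton_sqrt(2.0)
print(root, n_steps)
```

We do not know in advance how many iterations are needed, which is exactly the situation where a ‘while’ loop is more natural than a ‘for’ loop.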
Exercise 1: Write a for loop that prints the numbers from 1 to 10.
for i in range(1, 11):
    print(i)
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
Exercise 2: Write a while loop that keeps prompting the user for input until they enter “quit”.
user_input = ""
while user_input != "quit":
    user_input = input("Enter something (or 'quit'): ")
List comprehension is a very common and appreciated feature of the Python language. Think of it as a loop whose output is directly stored in a list, set, or dict. We can use it as a filter, for example.
# Python
# List
import time
lst = [1,2,3,4]
t = time.time()
results = []
for val in lst:
    if val > 2:
        results.append(val)
time.time()-t
## 0.0019481182098388672
results
## [3, 4]
t = time.time()
# the list comprehension below produces the same output as the loop above
results = [val for val in lst if val>2]
time.time()-t
## 0.0014700889587402344
results
## [3, 4]
# Python
# Set
import time
st = {1,2,3,4}
t = time.time()
results = set([])
for val in st:
    if val > 2:
        results.add(val)
time.time()-t
## 0.001961946487426758
results
## {3, 4}
t = time.time()
# the set comprehension below produces the same output as the loop above
results = {val for val in st if val>2}
time.time()-t
## 0.0013701915740966797
results
## {3, 4}
# Python
# Dict
import time
dct = {'a':1,'b':2,'c':3,'d':4}
t = time.time()
results = dict([])
for val in dct:
    if dct[val] > 2:
        results.update({str(val): dct[val]})
time.time()-t
## 0.0019779205322265625
results
## {'c': 3, 'd': 4}
t = time.time()
# the dict comprehension below produces the same output as the loop above
results = {str(val): dct[val] for val in dct if dct[val]>2}
time.time()-t
## 0.0012369155883789062
results
## {'c': 3, 'd': 4}
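Timings taken with a single `time.time()` difference on a four-element list are dominated by noise. As a hedged alternative, the sketch below uses the standard library `timeit` module to repeat each version many times on a larger list; the list size and the number of repetitions are arbitrary choices, and the measured times depend on the machine.

```python
# Python
import timeit

lst = list(range(1000))

def with_loop():
    results = []
    for val in lst:
        if val > 2:
            results.append(val)
    return results

def with_comprehension():
    return [val for val in lst if val > 2]

# total time, in seconds, for 200 calls of each version
loop_time = timeit.timeit(with_loop, number=200)
comprehension_time = timeit.timeit(with_comprehension, number=200)
print(loop_time, comprehension_time)
```

Both functions return the same list, so any difference in time comes only from the way the loop is written.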
Exercise 1: Use list comprehension to create a new list containing only the even numbers from an existing list.
numbers = [1, 2, 3, 4, 5, 6]
even_numbers = [x for x in numbers if x % 2 == 0]
print(even_numbers)
## [2, 4, 6]
Exercise 2: Use dictionary comprehension to create a dictionary where the keys are numbers from 1 to 5 and the values are their squares.
squares = {x: x**2 for x in range(1, 6)}
print(squares)
## {1: 1, 2: 4, 3: 9, 4: 16, 5: 25}
Functions are very important in both languages: being able to code
things ourselves is sometimes more efficient than looking for a package
and understanding it.
Being able to write our own functions gives us more flexibility and a
better understanding of what we are actually doing.
It is important not to reinvent the wheel, but it is also important to
be able to build it ourselves. Read the documentation carefully when
using packages: packages are sometimes misleading, and we can spend a lot
of time understanding how they work. Look at a package's source code when
we are not sure what a function does behind the scenes.
The full power of programming comes from the fact that we can be
autonomous by reading, modifying and writing new and pre-existing
code.
# Python
import time
seq = [1,2,None,4,None,6]*120
def clean_sum(seq):
    total = 0
    for val in seq:
        if val is not None:
            total += val
    return total
t = time.time()
clean_sum(seq = seq)
## 1560
time.time() - t
## 0.0013480186462402344
def clean_sum2(seq):
    # filter(None, seq) drops every falsy value: None, but also 0, which does not change a sum
    total = sum(filter(None, seq))
    return total
t = time.time()
clean_sum2(seq = seq)
## 1560
time.time() - t
## 0.001300811767578125
# R
seq = rep(c(1,2,NA,4,NA,6),120)
clean_sum <- function(seq){
  total = 0
  for(val in seq){
    if(!is.na(val)){
      total = total + val
    }
  }
  return(total)
}
t = Sys.time()
clean_sum(seq = seq)
## [1] 1560
Sys.time() - t
## Time difference of 0.004907131 secs
clean_sum2 <- function(seq){
  total <- sum(seq, na.rm = TRUE) # na.rm = TRUE drops the NA values before summing
  return(total)
}
t <- Sys.time()
clean_sum2(seq = seq)
## [1] 1560
Sys.time() - t
## Time difference of 0.0006980896 secs
Functions in Python are objects: like other objects, they can have
attributes and methods. Functions can hold data variables and even other
functions written inside of them.
Suppose we want to apply several transformations to data: we can create
multiple functions for the different tasks that we want to perform. We
can store functions in a list and apply them sequentially very easily.
Note that we can split the output into several variables by listing them before the assignment. We cannot do that in R directly, but the zeallot package tackles this problem.
# Python
def add_two(nb):
    nb = [i+2 for i in nb]
    return nb
def square_nb(nb):
    nb = [i**2 for i in nb]
    return nb
def global_function(nb):
    for function in func_list:
        nb = function(nb)
    return nb
func_list = [add_two,square_nb]
x1, x2 = global_function([2,3])
x1
## 16
x2
## 25
# R
library(zeallot)
add_two <- function(nb){
  nb = nb + 2
  return(nb)
}
square_nb <- function(nb){
  nb = nb**2
  return(nb)
}
global_function <- function(nb){
  for(func in func_list){
    nb = func(nb)
  }
  return(nb)
}
func_list = c(add_two,square_nb)
c(x1, x2) %<-% global_function(c(2,3))
x1
## [1] 16
x2
## [1] 25
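The statements that Python functions are objects, can carry attributes, and can contain other functions can be illustrated as follows. This is a sketch with hypothetical names (`make_adder`, `description`, `pipeline`): a function is defined inside another function, an attribute is attached to the returned function, and `functools.reduce` applies a list of functions one after the other, as an alternative to the explicit loop used in `global_function`.

```python
# Python
from functools import reduce

def make_adder(amount):
    def add(values):                 # a function written inside another function
        return [v + amount for v in values]
    return add                       # functions can be returned like any object

def square(values):
    return [v ** 2 for v in values]

add_two = make_adder(2)
add_two.description = "adds 2 to every element"   # functions accept attributes

pipeline = [add_two, square]         # functions stored in a list
result = reduce(lambda data, func: func(data), pipeline, [2, 3])
x1, x2 = result                      # unpack the output into two variables
print(add_two.description, x1, x2)
```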
Exercise 1: Write a function that takes a list of numbers as input and returns the average of the numbers. Check first that the list is not None or empty before computing the average.
def average_list(numbers):
    if not numbers:
        return 0
    return sum(numbers) / len(numbers)
my_list = [1, 2, 3, 4, 5]
average = average_list(my_list)
print(average)
## 3.0
If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.
Find the sum of all the multiples of 3 or 5 below 10000.
By listing the first six prime numbers: 2, 3, 5, 7, 11, and 13, we can see that the 6th prime is 13.
What is the 10 001st prime number?
You are given the following information, but you may prefer to do some research for yourself.
How many Sundays fell on the first of the month during the twentieth century (1 Jan 1901 to 31 Dec 2000)?