This chapter introduces the basics of R, focusing on the different types of objects that can be used, as well as control flow and function writing. The chapter concludes with three exercises from Project Euler. Students are encouraged to work in pairs.
Data science languages such as R and Python are highly effective for business analysis and machine learning. Both languages offer a wide range of tools and libraries for data wrangling, preprocessing, and machine learning, making them ideal for reproducible research. Additionally, the ability to combine languages can open up new possibilities for even more advanced analysis. Despite their similarities, each language also has its own unique features and strengths, making it worth considering both options when choosing a language for a specific project.
The R programming language has its origins in the S language, which was developed by statistician John Chambers at Bell Labs in 1976. The R language was later developed at the University of Auckland, New Zealand, to expand upon the capabilities of S. Both S and R were created by researchers for the purpose of more efficiently conducting research and communicating results.
R is a powerful language for statistics, data analysis, data exploration, and data visualization. It is equipped with tools for reporting and communication, such as RMarkdown and Shiny.
In recent years, R has seen significant growth with the emergence of the tidyverse (tidyverse.org), a collection of tools with a common programming interface that use functional verbs to perform intuitive operations, connected by the pipe operator. The tidyverse is particularly advantageous as it makes data mining more efficient and allows for easy iteration through exploratory analysis. The interface is designed so that a pipeline reads much like a paragraph describing the steps that you intend to take with the data.
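As a rough sketch of this style, the pipeline below chains three verbs with the pipe operator. It assumes the dplyr package (part of the tidyverse) is installed; the built-in mtcars dataset and the name mpg_by_cyl are used purely for illustration.

```r
library(dplyr)

# Keep cars with more than 4 cylinders, group them by cylinder count,
# then compute the average miles per gallon of each group.
mpg_by_cyl <- mtcars %>%
  filter(cyl > 4) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))

mpg_by_cyl
```

Read top to bottom, each line corresponds to one step of the description in the comment.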
R is used for exploration because of the readability and efficiency of the tidyverse and data.table.
R is used for communication because of the advanced reporting utilities such as RMarkdown, Shiny, and ggplot2 for visualization.
R’s CRAN repository offers a vast range of specialized packages for statistics, econometrics, biostatistics, epidemiology, and social science research.
R integrates seamlessly with R Markdown and knitr for producing reproducible research reports that combine data analysis, code, and narrative text into a single document.
R works as a classical calculator: using “+”, “-”, “*” and “/”, we can perform arithmetic operations.
We can also apply exponentiation, modulo, and floor division easily.
| Operator (R) | Description |
|---|---|
| + | Addition |
| - | Subtraction |
| * | Multiplication |
| / | Division |
| ^ or ** | Exponent |
| %% | Modulo |
| %/% | Floor Division |
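The operators in the table can be tried directly at the console. The short sketch below uses arbitrary illustrative values (the variable names are not from the course material).

```r
a <- 7
b <- 2

a + b    # addition: 9
a - b    # subtraction: 5
a * b    # multiplication: 14
a / b    # division: 3.5
a ^ b    # exponentiation: 49 (a ** b gives the same result)

remainder <- a %% b    # modulo: 1
quotient  <- a %/% b   # floor division: 3

# Floor division rounds towards minus infinity, not towards zero
negative_quotient <- (-7) %/% 2   # -4
```

Note that quotient * b + remainder always gives back a.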
To compare values, we use comparison operators to determine if a value is equal to, not equal to, greater than, etc.
| Operator (R) | Description |
|---|---|
| < | Less than |
| > | Greater than |
| <= | Less than or equal to |
| >= | Greater than or equal to |
| == | Equal to |
| != | Not equal to |
Logical operators negate or combine the results of such comparisons.
| Operator (R) | Description |
|---|---|
| ! | Logical NOT |
| & | Element-wise logical AND |
| && | Scalar logical AND (short-circuit) |
| \| | Element-wise logical OR |
| \|\| | Scalar logical OR (short-circuit) |
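The difference between the element-wise and the scalar operators is easier to see on a small example. The following is a minimal sketch with made-up values.

```r
x <- c(-3, 0, 4, 12)

# Comparisons are vectorised: one TRUE/FALSE per element
x > 0                               # FALSE FALSE  TRUE  TRUE

# & and | combine logical vectors element by element
small_positive <- x > 0 & x < 10    # FALSE FALSE  TRUE FALSE
zero_or_large  <- x == 0 | x > 10   # FALSE  TRUE FALSE  TRUE
not_positive   <- !(x > 0)          #  TRUE  TRUE FALSE FALSE

# && and || work on single values and stop as soon as the result is known
y <- 4
scalar_check <- y > 0 && y < 10     # TRUE
```

As a rule of thumb, & and | are used when filtering vectors, while && and || are used inside if conditions.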
Exercise 1: Write a Python expression that checks if a number is both greater than 10 and less than 20.
Exercise 2: Write a Python expression that checks if a string is either “yes” or “no”.
In R, membership is tested with the ‘%in%’ operator, which checks whether a particular value exists within a vector or list. Unlike Python’s ‘in’, R does not decompose strings into individual characters when using this operator. Instead, ‘%in%’ checks for an exact match between elements. For example, ‘Hello’ is not recognized as part of ‘Hello world’; it is only found if ‘Hello’ is itself an exact element of the vector or list being tested.
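A short sketch of this behaviour, using an illustrative vector named fruits. The last line uses grepl() with fixed = TRUE, which is one base R way to search for a substring when that is what is actually needed.

```r
fruits <- c("apple", "banana", "cherry")

exact_match   <- "banana" %in% fruits             # TRUE: exact element
partial_match <- "ban" %in% fruits                # FALSE: no partial matching
several       <- c("apple", "kiwi") %in% fruits   # TRUE FALSE: one result per value

# %in% compares whole elements, so a substring is not found
hello_in <- "Hello" %in% c("Hello world")         # FALSE

# Substring search is a different operation
hello_sub <- grepl("Hello", "Hello world", fixed = TRUE)   # TRUE
```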
Exercise:
Exercise 1: Write a Python expression that checks if the letter “a” is present in the string “banana”.
Exercise 2: Write a Python expression that checks if the number 5 is not present in the list [1, 2, 3, 4].
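R does not require declaring a variable before assigning a value to it. Variables can be thought of as names that refer to an object. However, R and Python differ in how names and objects behave in memory: R uses copy-on-modify semantics, so modifying an object through one name leaves any other name bound to the same value unchanged, whereas in Python two names bound to the same mutable object both see the change.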
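One concrete instance of this difference, assuming the text refers to R's copy-on-modify behaviour, can be sketched as follows (the variable names are illustrative).

```r
a <- c(1, 2, 3)   # no declaration needed, the name is created on assignment
b <- a            # both names now refer to the same values

b[1] <- 100       # modifying b makes R copy the vector first

a   # 1 2 3     (unchanged)
b   # 100 2 3
```

In Python, the analogous code on a list would change both names, because they would keep pointing at the same mutable object.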
In R, textual data is referred to as ‘character’. You can use either double quotes (") or single quotes (') when defining a textual variable, and it’s possible to explicitly convert a value to text with as.character() if needed.
Numerical variables in R are divided into three types. The equivalent of a float in Python is called ‘double’, referring to double-precision floating point, which means that the number carries about 15 to 16 significant decimal digits.
Finally, it’s easy to check the data type of a variable by using the command typeof in R.
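The following sketch shows typeof on a few illustrative values. The L suffix, which marks an integer literal, is standard R syntax that the text above does not mention.

```r
name <- "Ada"
type_name <- typeof(name)        # "character"
type_same <- typeof('Ada')       # "character": single quotes work too

type_dbl  <- typeof(3.14)        # "double"
type_two  <- typeof(2)           # "double": plain numbers are doubles by default
type_int  <- typeof(2L)          # "integer"
type_cplx <- typeof(1 + 2i)      # "complex"

# Explicit conversion to text
as_text <- as.character(42)      # "42"
type_text <- typeof(as_text)     # "character"
```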
Exercise 1: Create a variable containing your full name, then extract and print only your first name using string slicing.
Exercise 2: Write a Python expression that checks if ‘ab’ is present in the list [‘aa’,‘bb’,‘ab’,‘ba’].
Both languages offer many data types that will be covered throughout this course. For now, we will focus only on the basic data types that are available without any additional packages or libraries. In R, common data types include vector, matrix, list, and data frame.
# R
vec = c(1, 2, 3)
length(which(vec==2)) # count the elements of the vector that are exactly equal to 2
## [1] 1
vec = sort(vec,
decreasing = TRUE)
vec
## [1] 3 2 1
# access the elements of a vector
vec[1]
## [1] 3
vec[which(vec==3)]
## [1] 3
vec[2:length(vec)]
## [1] 2 1
vec[length(vec):2]
## [1] 1 2
vec[-1] # drop the first element
## [1] 2 1
# R
mat = matrix(c(1, 2, 3, 4, 5, 6),3,2)
dim(mat)
## [1] 3 2
mat
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
length(which(mat==2)) # count the elements of the matrix that are exactly equal to 2
## [1] 1
vec = sort(mat,
decreasing = TRUE)
vec
## [1] 6 5 4 3 2 1
# access the elements of a matrix
mat[1] # first element
## [1] 1
which(mat==3,
      arr.ind = TRUE) # get the row and column of each matching value
## row col
## [1,] 3 1
mat[2:length(mat)] # gives us a vector
## [1] 2 3 4 5 6
mat[2:dim(mat)[1],]
## [,1] [,2]
## [1,] 2 5
## [2,] 3 6
mat[length(mat):1]
## [1] 6 5 4 3 2 1
mat[1,-1] # first row, without the first column
## [1] 4
# modify
b = rep(0,3)
vec = c(vec,b)
vec
## [1] 6 5 4 3 2 1 0 0 0
mat_2 = rbind(vec,vec)
mat_2
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## vec 6 5 4 3 2 1 0 0 0
## vec 6 5 4 3 2 1 0 0 0
mat_2 = cbind(mat_2,mat_2)
mat_2
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## vec 6 5 4 3 2 1 0 0 0 6 5 4 3 2
## vec 6 5 4 3 2 1 0 0 0 6 5 4 3 2
## [,15] [,16] [,17] [,18]
## vec 1 0 0 0
## vec 1 0 0 0
t_mat = t(mat) # transposition
t_mat[2,3] = as.character(t_mat[2,3]) # all values turn to character
i_mat = diag(1, 3)
i_mat
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
Exercise 1: Given a list [5, 2, 8, 1, 9], sort the list in ascending order.
Exercise 2: Create a NumPy array with the values [[1, 2, 3], [4, 5, 6]]. Then, print the element at row 1, column 2.
Exercise 3: Create a NumPy array with the values [1, 2, 3, 4, 5, 6]. Then, reshape it into a 2x3 matrix.
A data frame in R is a table that, unlike a matrix, can store a different type of data in each column, such as character, numeric, or factor. Its dimensions are not fixed either, which means that we can easily create a new column using the operator ‘$’.
# R
df = data.frame(chr = letters[1:5], # letters is a built-in vector of lowercase letters
                num = seq(1,5,1),
                fac = factor(c(rep('a',3),rep('b',2))))
class(df$chr) # before R 4.0, strings were converted to factors by default (which can be inefficient); since R 4.0 they stay character
## [1] "character"
df$chr = as.character(df$chr)
class(df$chr)
## [1] "character"
colnames(df)
## [1] "chr" "num" "fac"
df
## chr num fac
## 1 a 1 a
## 2 b 2 a
## 3 c 3 a
## 4 d 4 b
## 5 e 5 b
# access the element of a variable in a df
df$chr[3]
## [1] "c"
which(df==3,
arr.ind = TRUE)
## row col
## [1,] 3 2
df[1:2] # gives a data frame with the first two columns
## chr num
## 1 a 1
## 2 b 2
## 3 c 3
## 4 d 4
## 5 e 5
df[3:dim(df)[1],]
## chr num fac
## 3 c 3 a
## 4 d 4 b
## 5 e 5 b
df[1,-1] # first row, without the first column
## num fac
## 1 1 a
# modify
b = rep(0,3)
df = rbind(df,b) # add a new row to the data frame; note that we get an NA in the factor column because '0' is not a level
## Warning in `[<-.factor`(`*tmp*`, ri, value = 0): invalid factor level, NA
## generated
df$new_col = rep(2,6) # we can introduce new columns by specifying the name and set a value
df
## chr num fac new_col
## 1 a 1 a 2
## 2 b 2 a 2
## 3 c 3 a 2
## 4 d 4 b 2
## 5 e 5 b 2
## 6 0 0 <NA> 2
t_df = t(df) # transposing converts the data frame into a matrix
class(t_df)
## [1] "matrix" "array"
Exercise 1: Create a Pandas DataFrame from a dictionary with two columns: “Name” and “Age”.
Exercise 2: Add a new column to a Pandas DataFrame that contains the square of ‘Age’.
An array is a generalization of the matrix: it can have more than two dimensions. We can modify an array in the same way as we modify a matrix.
# R
arr = array(seq(1,12,1), dim = c(2,2,3)) # create an array with 3 dimensions
arr
## , , 1
##
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## , , 2
##
## [,1] [,2]
## [1,] 5 7
## [2,] 6 8
##
## , , 3
##
## [,1] [,2]
## [1,] 9 11
## [2,] 10 12
# access the elements of an array
arr[1,,]
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 3 7 11
arr[,1,]
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
arr[,,1]
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
A list lets us gather a collection of unrelated objects into a single object; the objects it stores can be of different types.
# R
lst = list(df = df,
           mat = mat,
           arr = arr) # create a list holding a data frame, a matrix, and an array
lst
## $df
## chr num fac new_col
## 1 a 1 a 2
## 2 b 2 a 2
## 3 c 3 a 2
## 4 d 4 b 2
## 5 e 5 b 2
## 6 0 0 <NA> 2
##
## $mat
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
##
## $arr
## , , 1
##
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## , , 2
##
## [,1] [,2]
## [1,] 5 7
## [2,] 6 8
##
## , , 3
##
## [,1] [,2]
## [1,] 9 11
## [2,] 10 12
lst[[1]] # first element of the list, here it is the df
## chr num fac new_col
## 1 a 1 a 2
## 2 b 2 a 2
## 3 c 3 a 2
## 4 d 4 b 2
## 5 e 5 b 2
## 6 0 0 <NA> 2
lst[[1]][1,] # first row of first element of the list
## chr num fac new_col
## 1 a 1 a 2
In programming, there are two main control flow tools: conditional statements and loops.
Conditional statements, also known as choices, are useful for establishing rules or conditions. They allow for modifying a value according to a certain condition, and generally allow for certain actions to be taken in specific cases.
Loops, on the other hand, allow for sequential execution of actions. They can be used to iteratively modify an object, and generally allow for a procedure to be executed multiple times. For example, we can use loops to create multiple similar objects, or to modify multiple lines in a single object.
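A conditional statement in R can be sketched as follows. The values and labels are made up for illustration; ifelse() is the vectorised counterpart available in base R.

```r
x <- -4

# if / else if / else: exactly one branch is executed
if (x > 0) {
  label <- "positive"
} else if (x < 0) {
  label <- "negative"
} else {
  label <- "zero"
}
label   # "negative"

# ifelse() applies the same kind of rule to every element of a vector
values <- c(-2, 0, 5)
labels <- ifelse(values >= 0, "non-negative", "negative")
labels  # "negative" "non-negative" "non-negative"
```

The if statement expects a single TRUE or FALSE, which is why ifelse() is needed when the condition is a vector.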
Exercise 1: Write a condition that returns “there is an ‘e’” if there is an ‘e’ in a given word, and “there is no ‘e’” otherwise.
Exercise 2: Write a condition that categorizes a word based on its length: “short” if the word has 3 or fewer characters, “medium” if it has between 4 and 6 characters, and “long” if it has 7 or more characters.
With ‘for’ loops, we iterate over a predefined number of iterations. However, in certain situations we may not know in advance how many iterations are required to complete a task.
For example, when trying to optimize a function, we may not know how many steps are needed to reach an optimum, but we can set a condition for when the algorithm is considered to have converged. In such cases, we can use a ‘while’ loop, which will iterate until a given condition is met.
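The contrast between the two kinds of loop can be sketched as below. The convergence example uses Newton's iteration for the square root of 2, chosen here only as a simple illustration; the tolerance and variable names are assumptions.

```r
# for loop: the number of iterations is known in advance
squares <- numeric(5)
for (i in 1:5) {
  squares[i] <- i^2
}
squares   # 1 4 9 16 25

# while loop: iterate until a convergence condition is met
estimate  <- 1       # starting guess
tolerance <- 1e-8    # stop once estimate^2 is this close to 2
n_iter    <- 0
while (abs(estimate^2 - 2) > tolerance) {
  estimate <- (estimate + 2 / estimate) / 2   # Newton update
  n_iter   <- n_iter + 1
}
estimate  # close to sqrt(2)
```

A while loop can run forever if the condition is never met, so in practice it is common to also cap the number of iterations.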
Exercise 1: Write a for loop that prints the numbers from 1 to 10.
Exercise 2: Write a while loop that keeps prompting the user for input until they enter “quit”.
Functions are an important aspect of R. Being able to write our own functions can be more efficient than searching for and understanding a pre-existing package.
Writing our own functions gives us more flexibility and a better understanding of what we are doing. However, it is important not to reinvent the wheel and to use pre-existing packages when appropriate. It is crucial to read the documentation carefully when using packages, as their functions can be misleading and a lot of time can be lost trying to understand how they work. It is also beneficial to look at the package’s source code when unsure of what a function does behind the scenes.
The full power of programming comes from the ability to be autonomous by reading, modifying, and writing code, as well as reusing pre-existing code.
# R
values = rep(c(1,2,NA,4,NA,6),120) # 720 values, one third of them missing
# sum the non-missing values with an explicit loop
clean_sum <- function(values){
  total = 0
  for(val in values){
    if(!is.na(val)){
      total = total + val
    }
  }
  return(total)
}
t = Sys.time()
clean_sum(values)
## [1] 1560
Sys.time() - t
## Time difference of 0.001538992 secs
# same result with the vectorised sum(), ignoring missing values
clean_sum2 <- function(values){
  total <- sum(values, na.rm = TRUE)
  return(total)
}
t <- Sys.time()
clean_sum2(values)
## [1] 1560
Sys.time() - t
## Time difference of 0.000346899 secs
In R, it is not possible to unpack the output of a function into multiple variables directly. However, this functionality can be achieved using the zeallot package.
# R
library(zeallot)
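# add 2 to every element
add_two <- function(nb){
  nb = nb + 2
  return(nb)
}
# square every element
square_nb <- function(nb){
  nb = nb**2
  return(nb)
}
# apply each function of func_list (looked up in the global environment) in turn
global_function <- function(nb){
  for(func in func_list){
    nb = func(nb)
  }
  return(nb)
}
func_list = c(add_two,square_nb)
c(x1, x2) %<-% global_function(c(2,3)) # unpack the two results into x1 and x2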
x1
## [1] 16
x2
## [1] 25
Exercise 1: Write a function that takes a list of numbers as input and returns the average of the numbers. Check first that the list is not None before computing the average.
Exercise 1:
If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.
Find the sum of all the multiples of 3 or 5 below 10000.
Exercise 2:
By listing the first six prime numbers: 2, 3, 5, 7, 11, and 13, we can see that the 6th prime is 13.
What is the 10 001st prime number?
Exercise 3:
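You are given the following information, but you may prefer to do some research for yourself.
1 Jan 1900 was a Monday.
Thirty days has September, April, June and November. All the rest have thirty-one, saving February alone, which has twenty-eight, rain or shine, and on leap years, twenty-nine.
A leap year occurs on any year evenly divisible by 4, but not on a century unless it is divisible by 400.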
How many Sundays fell on the first of the month during the twentieth century (1 Jan 1901 to 31 Dec 2000)?