This chapter introduces the basics of R, focusing on the different types of objects that can be used, as well as control flow and function writing. The chapter concludes with three exercises from Project Euler. Students are encouraged to work in pairs.

 

1 R/Python

Data science languages such as R and Python are highly effective for business analysis and machine learning. Both languages offer a wide range of tools and libraries for data wrangling, preprocessing, and machine learning, making them ideal for reproducible research. Additionally, the ability to combine languages can open up new possibilities for even more advanced analysis. Despite their similarities, each language also has its own unique features and strengths, making it worth considering both options when choosing a language for a specific project.

1.1 R Strengths

The R programming language has its origins in the S language, which was developed by statistician John Chambers and colleagues at Bell Labs, starting in 1976. The R language was later developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, to expand upon the capabilities of S. Both S and R were created by researchers for the purpose of conducting research and communicating results more efficiently.

R is a powerful language for statistics, data analysis, data exploration, and data visualization. It is equipped with tools for reporting and communication, such as RMarkdown and Shiny.

In recent years, R has seen significant growth with the emergence of the tidyverse (tidyverse.org), a collection of tools with a common programming interface that use functional verbs to perform intuitive operations, connected by the pipe operator. The tidyverse is particularly advantageous as it makes data mining more efficient and allows for easy iteration through exploratory analysis. The interface is designed so that the code reads much like a paragraph describing the steps that you intend to take with the data.

  • Exploration is done using R for its readability and efficiency in the tidyverse and data.table.

  • R is used for communication because of the advanced reporting utilities such as RMarkdown, Shiny, and ggplot2 for visualization.

  • R’s CRAN repository offers a vast range of specialized packages for statistics, econometrics, biostatistics, epidemiology, and social science research.

  • R integrates seamlessly with R Markdown and knitr for producing reproducible research reports that combine data analysis, code, and narrative text into a single document.

2 Operations

2.1 Arithmetic Operations

R works like a classical calculator: the operators “+”, “-”, “*” and “/” perform arithmetic operations in both languages.

# R

1+2
## [1] 3
1-2
## [1] -1
1/2
## [1] 0.5
1*2
## [1] 2

 

We can also apply exponentiation, modulo, and floor division easily in both languages.

# R

2**8 # exponentiation
## [1] 256
2^8 == 2**8 # TRUE
## [1] TRUE
8%%3 # modulo
## [1] 2
8%/%3 # floor division
## [1] 2

Operator (R)   Description
+              Addition
-              Subtraction
*              Multiplication
/              Division
^ or **        Exponentiation
%%             Modulo
%/%            Floor division
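
For comparison, here is a minimal Python sketch of the same operations. It assumes Python 3, and the variable names are only illustrative. The main trap is that `^` is bitwise XOR in Python, so only `**` performs exponentiation, and floor division is written `//`.

```python
# Python counterparts of the R arithmetic operators above (Python 3 assumed)
addition = 1 + 2          # 3
subtraction = 1 - 2       # -1
division = 1 / 2          # 0.5 (true division)
multiplication = 1 * 2    # 2

power = 2 ** 8            # 256; exponentiation is ** only
xor = 2 ^ 8               # 10, because ^ is bitwise XOR in Python
modulo = 8 % 3            # 2
floor_division = 8 // 3   # 2

print(power, xor, modulo, floor_division)
```

Code ported from R that uses `^` for powers will therefore run in Python without an error but give a different result.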

 

2.2 Comparison Operators

To compare values, we use comparison operators to determine if a value is equal to, not equal to, greater than, etc.

# R

2==8
## [1] FALSE
2!=8
## [1] TRUE
2<8
## [1] TRUE
2>8
## [1] FALSE
2<=8
## [1] TRUE

Operator (R)   Description
<              Less than
>              Greater than
<=             Less than or equal to
>=             Greater than or equal to
==             Equal to
!=             Not equal to

 

2.3 Logical operators

# R

x = c(TRUE,TRUE)
y = c(TRUE,FALSE)

!x[1]
## [1] FALSE
x & y
## [1]  TRUE FALSE
x && y
## Error in x && y: 'length = 2' in coercion to 'logical(1)'
x | y
## [1] TRUE TRUE
x || y
## Error in x || y: 'length = 2' in coercion to 'logical(1)'

Operator (R)   Description
!              Logical NOT
&              Element-wise logical AND
&&             Logical AND for single (length-one) values
|              Element-wise logical OR
||             Logical OR for single (length-one) values
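
As a rough Python counterpart, the sketch below uses plain lists, with no NumPy assumed. Python's `and`, `or` and `not` act on single values, much like `&&` and `||` in R. An element-wise result has to be built explicitly, here with `zip` and a list comprehension.

```python
# Python counterparts of the R logical operators (a sketch using plain lists)
x = [True, True]
y = [True, False]

not_first = not x[0]                               # scalar NOT -> False

# `and` / `or` combine single values, like R's && and ||
scalar_and = x[0] and y[1]                         # False
scalar_or = x[0] or y[1]                           # True

# Element-wise versions, like R's & and |, need an explicit pass over the pairs
elementwise_and = [a and b for a, b in zip(x, y)]  # [True, False]
elementwise_or = [a or b for a, b in zip(x, y)]    # [True, True]

print(elementwise_and, elementwise_or)
```

Unlike `x && y` in R, writing `x and y` with two Python lists does not raise an error. It returns one of the two lists based on truthiness, which is rarely what is intended.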

Exercise 1: Write a Python expression that checks if a number is both greater than 10 and less than 20.

Solution:
number = 15
result = 10 < number < 20
print(result)
## True

 

Exercise 2: Write a Python expression that checks if a string is either “yes” or “no”.

Solution:
string = "yes"
result = string == "yes" or string == "no"
print(result)
## True

 

2.4 Membership operators

In R, the membership operator is ‘%in%’, the counterpart of Python’s ‘in’. It checks whether a particular value exists within a vector or list. Unlike Python’s ‘in’, it does not search inside strings: ‘%in%’ checks for an exact match between elements. For example, ‘%in%’ would not recognize ‘Hello’ as part of ‘Hello World’ unless ‘Hello’ is itself an element of the vector or list being tested. To search inside a string, use a function such as stringr::str_detect.

# R

x = 'Hello World'

print('Hello' %in% x)
## [1] FALSE
stringr::str_detect(x,'Hello')
## [1] TRUE

 

# R

x = c('Hello','World')

print('Hello' %in% x)
## [1] TRUE
stringr::str_detect('Hello',x)
## [1]  TRUE FALSE
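
The contrast with Python can be sketched as follows. The variable names are illustrative. Python's `in` performs a substring test when the right-hand side is a string, and an exact-element test when it is a list.

```python
# Python's `in`: substring test on strings, exact-element test on lists
text = "Hello World"
words = ["Hello", "World"]

in_string = "Hello" in text        # True: substring match
in_list = "Hello" in words         # True: an element equals "Hello"
partial_in_list = "Hell" in words  # False: no element equals "Hell"
in_wrapped = "Hello" in [text]     # False: mirrors 'Hello' %in% x in R

print(in_string, in_list, partial_in_list, in_wrapped)
```

The last line is the closest match to the first R example: a one-element container holding 'Hello World' does not contain the element 'Hello'.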

Exercise:

  • Verify if a number is divisible by 5.
  • Is the number ‘58’ present in the string ‘ak2vf6dGdS8wShiz9S56a5hd58f8Se’?
  • Check both of these conditions simultaneously

Exercise 1: Write a Python expression that checks if the letter “a” is present in the string “banana”.

Solution:
string = "banana"
result = "a" in string
print(result)
## True

 

Exercise 2: Write a Python expression that checks if the number 5 is not present in the list [1, 2, 3, 4].

Solution:
list_ = [1, 2, 3, 4]
result = 5 not in list_
print(result)
## True

 

3 Objects & Variables

R does not require declaring a variable before assigning a value to it. Variables can be thought of as names that refer to an object. However, there is a difference in the way objects and variables are stored in the computer’s memory in R compared to Python.

3.1 Textual and numerical variables

In R, textual data is referred to as ‘character’. You can use either double quotes (") or single quotes (') when defining a textual variable, and a value can be converted to text explicitly with as.character if needed.

Numerical variables in R are divided into three types: integer, double, and complex. The equivalent of a float in Python is called ‘double’, referring to double-precision floating point, which carries about 15 to 16 significant decimal digits (significant digits, not digits after the decimal point).

Finally, it’s easy to check the data type of a variable by using the command typeof in R.

# R

a = 1
typeof(a)
## [1] "double"
a_int = 1L
typeof(a_int)
## [1] "integer"

b = 1.1
typeof(b)
## [1] "double"

c = 1.1+2i
typeof(c)
## [1] "complex"

d = 'd'
typeof(d)
## [1] "character"

# change the type
e = 2
f = as.character(e)
f
## [1] "2"
as.numeric(f)
## [1] 2
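
For comparison, a minimal Python sketch of the same checks, using only the standard library. One difference worth noticing is that a bare `1` is an `int` in Python, whereas R stores it as a double unless the `L` suffix is used. The 15-digit figure is reported by `sys.float_info.dig` on platforms with IEEE 754 doubles, which is assumed here.

```python
import sys

a = 1            # int in Python (R: double, unless written 1L)
b = 1.1          # float, the counterpart of R's double
c = 1.1 + 2j     # complex (Python writes j where R writes i)
d = "d"          # str, the counterpart of R's character

type_names = [type(v).__name__ for v in (a, b, c, d)]
print(type_names)

# Number of decimal digits a float reliably preserves
float_digits = sys.float_info.dig
print(float_digits)

# change the type
e = 2
f = str(e)       # "2"
g = float(f)     # 2.0
```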

Exercise 1: Create a variable containing your full name, then extract and print only your first name (for example with string slicing or the split method).

Solution:
full_name = "John Smith"
first_name = full_name.split()[0] 
print(first_name)
## John

 

Exercise 2: Write a Python expression that checks if ‘ab’ is present in the list [‘aa’,‘bb’,‘ab’,‘ba’].

Solution:
list_ = ['aa','bb','ab','ba']
result = 'ab' in list_
print(result)
## True

 

3.2 Other Data-Type

There are many data types in both languages that will be covered throughout this course. For now, we will focus only on the basic data types that are available without any additional packages or libraries. In R, common data types include vectors, matrices, lists, and data frames.

 

3.2.1 Vectors and Matrices

# R

vec = c(1, 2, 3)
length(which(vec==2)) # count the elements of the vector that are exactly equal to 2
## [1] 1

vec = sort(vec,
           decreasing = TRUE)
vec
## [1] 3 2 1
# access the elements of a vector
vec[1]
## [1] 3
vec[which(vec==3)]
## [1] 3
vec[2:length(vec)]
## [1] 2 1
vec[length(vec):2]
## [1] 1 2
vec[-1] # a negative index drops that element (here the first one)
## [1] 2 1
# R

mat = matrix(c(1, 2, 3, 4, 5, 6),3,2)
dim(mat)
## [1] 3 2
mat
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

length(which(mat==2)) # count the elements of the matrix that are exactly equal to 2
## [1] 1

vec = sort(mat,
           decreasing = TRUE)
vec
## [1] 6 5 4 3 2 1
# access the elements of a matrix
mat[1] # first element
## [1] 1
which(mat==3,
      arr.ind = TRUE) # get row and columns for each value that match
##      row col
## [1,]   3   1

mat[2:length(mat)] # gives us a vector
## [1] 2 3 4 5 6
mat[2:dim(mat)[1],]
##      [,1] [,2]
## [1,]    2    5
## [2,]    3    6
mat[length(mat):1]
## [1] 6 5 4 3 2 1

mat[1,-1] # first row without the first column
## [1] 4
# modify

b = rep(0,3)
vec = c(vec,b)
vec
## [1] 6 5 4 3 2 1 0 0 0
mat_2 = rbind(vec,vec)
mat_2 
##     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## vec    6    5    4    3    2    1    0    0    0
## vec    6    5    4    3    2    1    0    0    0
mat_2 = cbind(mat_2,mat_2)
mat_2
##     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## vec    6    5    4    3    2    1    0    0    0     6     5     4     3     2
## vec    6    5    4    3    2    1    0    0    0     6     5     4     3     2
##     [,15] [,16] [,17] [,18]
## vec     1     0     0     0
## vec     1     0     0     0

t_mat = t(mat) # transposition
t_mat[2,3] = as.character(t_mat[2,3]) # all values turn to character


i_mat = diag(1, 3)
i_mat
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1
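
Indexing is where the two languages differ most, so here is a short Python sketch of the vector operations above using a plain list. R counts from 1 and Python from 0. A negative index drops an element in R, but counts from the end in Python.

```python
vec = sorted([1, 2, 3], reverse=True)  # [3, 2, 1]

first = vec[0]              # 3; R writes vec[1]
tail = vec[1:]              # [2, 1]; R writes vec[2:length(vec)]
reversed_tail = vec[:0:-1]  # [1, 2]; R writes vec[length(vec):2]
last = vec[-1]              # 1; in R, vec[-1] drops the first element instead
count_twos = vec.count(2)   # 1; R writes length(which(vec == 2))

print(first, tail, reversed_tail, last, count_twos)
```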

Exercise 1: Given a list [5, 2, 8, 1, 9], sort the list in ascending order.

Solution:
my_list = [5, 2, 8, 1, 9]
my_list.sort()
print(my_list)
## [1, 2, 5, 8, 9]

 

Exercise 2: Create a NumPy array with the values [[1, 2, 3], [4, 5, 6]]. Then, print the element at row 1, column 2 (using zero-based indices).

Solution:
import numpy as np
my_array = np.array([[1, 2, 3], [4, 5, 6]])
print(my_array[1, 2])
## 6

 

Exercise 3: Create a NumPy array with the values [1, 2, 3, 4, 5, 6]. Then, reshape it into a 2x3 matrix.

Solution:
import numpy as np
my_array = np.array([1, 2, 3, 4, 5, 6])
reshaped_array = my_array.reshape(2, 3)
print(reshaped_array)
## [[1 2 3]
##  [4 5 6]]

 

3.2.2 Data frame

A data frame in R is a two-dimensional table that, unlike a matrix, can store columns of different types such as character, numeric, or factor. Each column holds a single type, and all columns have the same length. A new column can easily be created with the operator ‘$’.

# R

df = data.frame(chr = letters[1:5], # letters is a built-in vector of lowercase letters
                num = seq(1,5,1),
                fac = factor(c(rep('a',3),rep('b',2))))
class(df$chr) # "character" since R 4.0; earlier versions converted strings to factors by default
## [1] "character"
df$chr = as.character(df$chr) # no effect here; only needed when the column is a factor (R < 4.0)
class(df$chr)
## [1] "character"
colnames(df)
## [1] "chr" "num" "fac"
df
##   chr num fac
## 1   a   1   a
## 2   b   2   a
## 3   c   3   a
## 4   d   4   b
## 5   e   5   b
# access the element of a variable in a df
df$chr[3]
## [1] "c"
which(df==3,
      arr.ind = TRUE) 
##      row col
## [1,]   3   2

df[1:2] # gives us a df with the first two columns !
##   chr num
## 1   a   1
## 2   b   2
## 3   c   3
## 4   d   4
## 5   e   5
df[3:dim(df)[1],]
##   chr num fac
## 3   c   3   a
## 4   d   4   b
## 5   e   5   b

df[1,-1] # first row without the first column
##   num fac
## 1   1   a
# modify

b = rep(0,3)
df = rbind(df,b) # introduce a new row in the data frame, note that we get a NA in the factor column because '0' is not a level
## Warning in `[<-.factor`(`*tmp*`, ri, value = 0): invalid factor level, NA
## generated
df$new_col = rep(2,6) # we can introduce new columns by specifying the name and set a value


df
##   chr num  fac new_col
## 1   a   1    a       2
## 2   b   2    a       2
## 3   c   3    a       2
## 4   d   4    b       2
## 5   e   5    b       2
## 6   0   0 <NA>       2
t_df = t(df) # transposing converts the data frame into a matrix
class(t_df)
## [1] "matrix" "array"

 

Exercise 1: Create a Pandas DataFrame from a dictionary with two columns: “Name” and “Age”.

Solution:
import pandas as pd  # requires the pandas package to be installed
data = {"Name": ["Alice", "Bob"], "Age": [25, 24]}
df = pd.DataFrame(data)
print(df)

 

Exercise 2: Add a new column to a Pandas DataFrame that contains the square of ‘Age’.

Solution:
df["Squared_Age"] = df["Age"] ** 2  # reuses the df created in Exercise 1
print(df)

 

3.2.3 Arrays

An array is a generalization of the matrix: it can have more than two dimensions. Arrays are modified in the same way as matrices.

# R

arr = array(seq(1,12,1), dim = c(2,2,3)) # create an array with 3 dimensions
arr
## , , 1
## 
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## , , 2
## 
##      [,1] [,2]
## [1,]    5    7
## [2,]    6    8
## 
## , , 3
## 
##      [,1] [,2]
## [1,]    9   11
## [2,]   10   12
# access the elements of an array
arr[1,,]
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    3    7   11
arr[,1,]
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
arr[,,1]
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

 

3.2.4 Lists

A list lets us gather a collection of unrelated objects into a single object. The elements of a list can be of different types.

# R

lst = list(df = df,
           mat = mat,
           arr = arr) # a list holding a data frame, a matrix, and an array

lst
## $df
##   chr num  fac new_col
## 1   a   1    a       2
## 2   b   2    a       2
## 3   c   3    a       2
## 4   d   4    b       2
## 5   e   5    b       2
## 6   0   0 <NA>       2
## 
## $mat
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
## 
## $arr
## , , 1
## 
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## , , 2
## 
##      [,1] [,2]
## [1,]    5    7
## [2,]    6    8
## 
## , , 3
## 
##      [,1] [,2]
## [1,]    9   11
## [2,]   10   12
lst[[1]] # first element of the list, here it is the df
##   chr num  fac new_col
## 1   a   1    a       2
## 2   b   2    a       2
## 3   c   3    a       2
## 4   d   4    b       2
## 5   e   5    b       2
## 6   0   0 <NA>       2
lst[[1]][1,] # first row of first element of the list
##   chr num fac new_col
## 1   a   1   a       2

 

 

4 Control flow

In programming, there are two main control flow tools: conditional statements and loops.

Conditional statements, also known as choices, are useful for establishing rules or conditions. They allow for modifying a value according to a certain condition, and generally allow for certain actions to be taken in specific cases.

Loops, on the other hand, allow for sequential execution of actions. They can be used to iteratively modify an object, and generally allow for a procedure to be executed multiple times. For example, we can use loops to create multiple similar objects, or to modify multiple lines in a single object.

4.1 Choices

# R

# if, else if, else

n = 12L

if(n%%2 == 0){  
  print('n is an even number')
}
## [1] "n is an even number"
  



if(!is.integer(n)){
  print('n is not an integer')
} else if(n%%2 == 0){
  print('n is an even number')
} else {
  print('n is not an even number')
}
## [1] "n is an even number"

 

Exercise 1: Write a condition that returns “there is an ‘e’” if there is an ‘e’ in a given word, and “there is no ‘e’” otherwise.

Solution:
word = 'hello'
if 'e' in word:
  print("there is an 'e'")
else:
  print("there is no 'e'")
## there is an 'e'

 

Exercise 2: Categorize a word based on its length: “short” if the word has 3 or fewer characters, “medium” if it has between 4 and 6 characters, and “long” if it has 7 or more characters.

Solution:
word = 'hello'  # same word as in Exercise 1
if len(word) <= 3:
    print("short")
elif len(word) <= 6:
    print("medium")
else:
    print("long")
## medium

 

4.2 Loops

With ‘for’ loops, we iterate over a predefined number of iterations. However, in certain situations we may not know in advance how many iterations are required to complete a task.

For example, when trying to optimize a function, we may not know how many steps are needed to reach an optimum, but we can set a condition for when the algorithm is considered to have converged. In such cases, we can use a ‘while’ loop, which will iterate until a given condition is met.

# R
seq = c(1,2,NA,4,NA,6)
total = 0

for(val in seq){
  if(!is.na(val)){
    total = total + val
  }
}
    
total
## [1] 13
# R

total = 0

while(total < 1){
  rnd = rnorm(n=1, mean = 0, sd = 1)
  if(rnd < 0){
    # an empty branch is allowed in R; no 'pass' keyword is needed
  } else {
    total = total + rnd
  }
}
  
total
## [1] 1.480187
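
The convergence case mentioned above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function `minimize`, its parameters (`step`, `tol`, `max_iter`) and the quadratic test function are hypothetical choices, not part of the course material. The loop stops once an update becomes smaller than the tolerance, with an iteration cap as a safety net.

```python
def minimize(grad, x0, step=0.1, tol=1e-8, max_iter=10_000):
    """Gradient descent that stops once the update is smaller than tol."""
    x = x0
    n_iter = 0
    change = float("inf")
    # The number of iterations is not known in advance, hence a while loop
    while change > tol and n_iter < max_iter:
        new_x = x - step * grad(x)
        change = abs(new_x - x)
        x = new_x
        n_iter += 1
    return x, n_iter

# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3
x_min, n_iter = minimize(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 4), n_iter)
```

The `max_iter` guard is a common precaution: without it, a condition that is never met would leave the loop running forever.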

 

Exercise 1: Write a for loop that prints the numbers from 1 to 10.

Solution:
for i in range(1, 11):
    print(i)
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10

 

Exercise 2: Write a while loop that keeps prompting the user for input until they enter “quit”.

Solution:
user_input = ""
while user_input != "quit":
    user_input = input("Enter something (or 'quit'): ")

 

5 Functions

Functions are an important aspect of R. Being able to write our own functions can be more efficient than searching for and understanding a pre-existing package.

Writing our own functions gives us more flexibility and a better understanding of what we are doing. However, it is important not to reinvent the wheel and to use pre-existing packages when appropriate. It is crucial to read the documentation carefully when using packages, because function names and default settings can be misleading and cost a lot of time to understand. It is also beneficial to look at the package’s source code when unsure of what a function does behind the scenes.

The full power of programming comes from the ability to be autonomous by reading, modifying, and writing code, as well as reusing pre-existing code.

 

5.1 General Functions

# R
seq = rep(c(1,2,NA,4,NA,6),120)


clean_sum <- function(seq){
  total = 0
  for(val in seq){
    if(!is.na(val)){
      total = total + val
    }
  }
  return(total)
}

t = Sys.time()
clean_sum(seq = seq)
## [1] 1560
Sys.time() - t
## Time difference of 0.001538992 secs



clean_sum2 <- function(seq){
  total <- sum(seq, na.rm = TRUE) # sum() skips NA values when na.rm = TRUE
  return(total)
}

t <- Sys.time()
clean_sum2(seq = seq)
## [1] 1560
Sys.time() - t
## Time difference of 0.000346899 secs
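
The same comparison can be sketched in Python. Python has no `NA`, so this example assumes missing values are encoded as `math.nan`. The function names mirror the R ones, and `timeit` from the standard library replaces the manual `Sys.time()` difference. Timings depend on the machine, so none are quoted.

```python
import math
import timeit

seq = [1, 2, math.nan, 4, math.nan, 6] * 120

def clean_sum(values):
    """Sum the values with an explicit loop, skipping NaN entries."""
    total = 0
    for val in values:
        if not math.isnan(val):
            total += val
    return total

def clean_sum2(values):
    """Same result using the built-in sum over a filtered generator."""
    return sum(val for val in values if not math.isnan(val))

print(clean_sum(seq), clean_sum2(seq))  # 1560 1560

# Time 100 calls of each version; results vary from run to run
loop_time = timeit.timeit(lambda: clean_sum(seq), number=100)
builtin_time = timeit.timeit(lambda: clean_sum2(seq), number=100)
print(loop_time, builtin_time)
```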

In R, it is not possible to unpack the output of a function into multiple variables directly. However, this functionality can be achieved using the zeallot package.

# R
library(zeallot)

add_two <- function(nb){
  nb = nb + 2
  return(nb)
}
  
square_nb <- function(nb){
  nb = nb**2
  return(nb)
}

func_list = c(add_two,square_nb) # functions to apply, in order

global_function <- function(nb){
  for(func in func_list){ # func_list is looked up in the global environment
    nb = func(nb)
  }
  return(nb)
}

c(x1, x2) %<-% global_function(c(2,3))
x1
## [1] 16
x2
## [1] 25
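
In Python, by contrast, unpacking is built into the language, so no extra package is needed. The sketch below mirrors the R example. Passing the functions as the argument `funcs` is an illustrative choice, not something taken from the R code.

```python
def add_two(nb):
    return nb + 2

def square_nb(nb):
    return nb ** 2

def global_function(numbers, funcs):
    """Apply each function in funcs, in order, to every number."""
    results = []
    for nb in numbers:
        for func in funcs:
            nb = func(nb)
        results.append(nb)
    return tuple(results)

# Tuple unpacking assigns each returned value to its own variable
x1, x2 = global_function((2, 3), funcs=[add_two, square_nb])
print(x1, x2)  # 16 25
```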

Exercise 1: Write a function that takes a list of numbers as input and returns the average of the numbers. Check first that the list is neither None nor empty before computing the average.

Solution:
def average_list(numbers):
    if not numbers:
        return 0
    return sum(numbers) / len(numbers)

my_list = [1, 2, 3, 4, 5]
average = average_list(my_list)
print(average)
## 3.0

 

 

6 Exercises

Exercise 1:

If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.

Find the sum of all the multiples of 3 or 5 below 10000.

 

 

Exercise 2:

By listing the first six prime numbers: 2, 3, 5, 7, 11, and 13, we can see that the 6th prime is 13.

What is the 10 001st prime number?

 

 

Exercise 3:

You are given the following information, but you may prefer to do some research for yourself.

  • 1 Jan 1900 was a Monday.
  • Thirty days has September, April, June and November. All the rest have thirty-one, Saving February alone, Which has twenty-eight, rain or shine. And on leap years, twenty-nine.
  • A leap year occurs on any year evenly divisible by 4, but not on a century unless it is divisible by 400.

How many Sundays fell on the first of the month during the twentieth century (1 Jan 1901 to 31 Dec 2000)?

 

Possible solutions in R without vectorisation