“It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data (Dasu and Johnson 2003). Data preparation is not just a first step, but must berepeated many times over the course of analysis as new problems come to light or new data is collected.” (Wickham, 2014).

The Tidyverse is a collection of differents packages which works together in a common philosophy, the biggest contributor to this collection is named Hadley Wickham. Hadley Wickham did so much for the R community, he published a lot of books that are highly recommended to read. Check Wikipedia: Haldey Wickham for his bibliograpy. Here is the list of libraries that are contained in the Tidyverse:

  • ggplot2: Visualization

  • dplyr: Data manipulation

  • tidyr: Data formatting

  • purrr: programming

  • readr: Data importation

  • tibble: Data tables

  • forcats: Qualitative variables

  • stringr: Strings The goal of the tidyverse is to make data analysis easier by simplifying data preparation. Tidyverse includes a new data type, Tibbles.

Tibbles can be considered a modern version of data.frame, but they differ in some aspects. They do not have row names and allow users to have column names with special characters (e.g. if the column name of x is ‘Col 1’, then to access this column you can use x$`Col 1` to access this variable). Tibbles also provide a good way to visualize data since it directly shows data as if you were running the ‘head()’ function before, displaying the first 10 rows.

Data.table is also a common choice when dealing with data and it is very fast. However, using Tibbles can be time-consuming when dealing with large tables.

1 Tibbles and data.tables

1.1 Basic Operations

1.1.1 Pipes

Dplyr is user-friendly as it uses very understandable words to manipulate data. It uses a specific way of writing calculations called the pipe operator: ‘%>%’. Pipes allow us to write sequential calculations in a readable way. Here is how we need to read R code that uses pipes: The code on the left side of the pipe operator is passed as the first argument to the function on the right side of the operator. This allows for a more readable and concise way of performing multiple operations on data, as it eliminates the need to create intermediate variables and makes the code more readable by showing the flow of data.

  • Usual Way
    funtion2(function1(dataset,arg1),arg2)

  • Using Pipes
    dataset %>% function1(arg1) %>% function2(arg2)

Note that we can write directly ‘%>%’ by using ‘ctrl + shift + m’

Here is list of function that are commonly used and actions they provide.

Functions Description
slice restrict dataset to specific row (see also slice_head, slice_tail, slice_min, slice_max)
slice_sample restrict dataset to random row
filter restrict dataset to row that pass a condition
select restrict dataset to specifiyed columns
rename change columns names
arrange sort dataset with respect to one columns
mutate modify or create a columns that apply a function
group_by group data before running calculation
ungroup ungroup data
summarise summarise data with respect to the level of agregation (groups)
count count numbers of row in a group
lead/lag shift the observations of a variable one notch backwards (for lead) or forwards (for lag)
distinct given a varaiable restrict the dataset to unique value of this variable
bind_rows/bind_cols add columns or rows to the dataset
left_join merge datatables with respect to variable(s) (see also full_join, inner_join, anti_join)

starts_with

  • Usual Way

# R


library(tidyverse)
## ── Attaching core tidyverse packages ────────────────────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.1     ✔ readr     2.1.6
## ✔ ggplot2   4.0.1     ✔ stringr   1.6.0
## ✔ lubridate 1.9.4     ✔ tibble    3.3.0
## ✔ purrr     1.2.0     ✔ tidyr     1.3.1
## ── Conflicts ──────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
covid_papers <- read_csv("data/covid_papers.csv")
## Rows: 148916 Columns: 19
## ── Column specification ──────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (17): _id, pmid, pmcid, doi, title, source, year_pub, month_pub, author...
## dbl   (1): cited
## date  (1): creation
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

covid_papers
## # A tibble: 148,916 × 19
##    `_id`            pmid  pmcid doi   title source year_pub month_pub creation  
##    <chr>            <chr> <chr> <chr> <chr> <chr>  <chr>    <chr>     <date>    
##  1 5f69d6fa65a250f… 3283… PMC7… 10.1… The … MED    2020     null      2020-08-25
##  2 5f69d6fa65a250f… 3219… PMC7… 10.1… Hypo… MED    2020     null      2020-03-21
##  3 5f69d6fa65a250f… 3277… PMC7… 10.1… The … MED    2020     null      2020-08-12
##  4 5f69d6fa65a250f… 3273… PMC7… 10.1… A co… MED    2020     null      2020-07-31
##  5 5f69d6fa65a250f… null  null  10.2… Viri… PPR    2020     null      2020-08-16
##  6 5f69d6fa65a250f… null  PMC7… null  Inte… PMC    2020     null      2020-09-13
##  7 5f69d6fa65a250f… 3239… PMC7… 10.1… A gl… MED    2020     null      2020-05-13
##  8 5f69d6fa65a250f… 3225… PMC7… 10.1… COVI… MED    2020     null      2020-04-08
##  9 5f69d6fa65a250f… 3272… PMC7… 10.1… The … MED    2020     null      2020-07-29
## 10 5f69d6fa65a250f… 3219… PMC7… 10.1… Labo… MED    2020     null      2020-03-21
## # ℹ 148,906 more rows
## # ℹ 10 more variables: authors_gender <chr>, journal <chr>, language <chr>,
## #   pubmodel <chr>, pubtype <chr>, mesh <chr>, keywords <chr>, cited <dbl>,
## #   has_data <chr>, oa <chr>

#which(is.na(covid_papers),arr.ind = T)
covid_papers = na.omit(covid_papers)

dim(slice(covid_papers,2))
## [1]  1 19
dim(slice_sample(covid_papers,prop = 0.001))
## [1] 148  19

t <- Sys.time()
doi_clean <- filter(covid_papers,doi != 'null')
Sys.time()-t
## Time difference of 0.014009 secs

Using Pipes

# R

dim(covid_papers %>% slice(2))
## [1]  1 19
dim(covid_papers %>% slice_sample(prop = 0.001))
## [1] 148  19

t = Sys.time()
doi_clean <- covid_papers %>% filter(doi != 'null')
Sys.time() - t
## Time difference of 0.01347208 secs

t = Sys.time()
doi_clean <- covid_papers[which(covid_papers$doi != 'null'),]
Sys.time() - t
## Time difference of 0.014498 secs

1.1.2 Select columns

# R using Tibbles
cp_tibble <-  as_tibble(covid_papers)

dim(cp_tibble[[4]]) # returns a vector
## NULL
dim(cp_tibble[4])# returns a tibble
## [1] 148913      1

dim(select(cp_tibble, doi)) # returns a tibble
## [1] 148913      1
dim(pull(cp_tibble, doi, name = title))   # returns a (named) vector
## NULL
dim(cp_tibble[, "doi"])        # returns a tibble
## [1] 148913      1
dim(cp_tibble[["doi"]])        # returns a vector
## NULL

dim(select(cp_tibble, doi, creation, title))
## [1] 148913      3
dim(select(cp_tibble, doi:title)) # select columns between doi and title
## [1] 148913      2

dim(select(cp_tibble, -doi, -creation))
## [1] 148913     17

cols <- c("doi", "creation")
dim(select(cp_tibble, !!cols)) # unquoting
## [1] 148913      2
dim(select(cp_tibble, -!!cols))
## [1] 148913     17
# R using data.tablee
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## The following object is masked from 'package:purrr':
## 
##     transpose
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
cp_datatable = as.data.table(covid_papers)

dim(cp_datatable[[4]]) # returns a vector
## NULL
dim(cp_datatable[, 4])  # returns a data.table
## [1] 148913      1

dim(cp_datatable[, list(doi)]) # returns a data.table
## [1] 148913      1
dim(cp_datatable[, .(doi)])    # returns a data.table
## [1] 148913      1
# . is an alias for list

dim(cp_datatable[, "doi"])     # returns a data.table
## [1] 148913      1
dim(cp_datatable[, doi])       # returns a vector
## NULL
dim(cp_datatable[["doi"]])     # returns a vector
## NULL

dim(cp_datatable[, .(doi, creation, title)])
## [1] 148913      3
dim(cp_datatable[, list(doi, creation, title)])
## [1] 148913      3
dim(cp_datatable[, doi:title]) # select columns between doi and title
## [1] 148913      2

dim(cp_datatable[, !c("doi", "creation")])
## [1] 148913     17

cols <- c("doi", "creation")
dim(cp_datatable[, ..cols]) # .. prefix means 'one-level up'
## [1] 148913      2
dim(cp_datatable[, !..cols]) # or cp_datatable[, -..cols]
## [1] 148913     17
  • using regular expression
# R using Tibbles
colnames(cp_tibble)[1:3] <- paste0('V',c(1:3))

select(cp_tibble, num_range("V", 1:2))
select(cp_tibble, title, everything()) # reorder columns
select(cp_tibble, contains("V"))
select(cp_tibble, ends_with("3"))
select(cp_tibble, matches(".2"))
select(cp_tibble, one_of(c("V1", "X")))
## Warning: Unknown columns: `X`
select(cp_tibble, -starts_with("doi"))
# R using data.tablee
colnames(cp_datatable)[1:3] <- paste0('V',c(1:3))

cols <- paste0("V", 1:2)
cols <- union("title", names(cp_datatable))
cols <- grep("V",   names(cp_datatable))
cols <- grep("3$",  names(cp_datatable))
cols <- grep(".2",  names(cp_datatable))
cols <- grep("^V1|X$",  names(cp_datatable))
cols <- grep("^(?!doi)", names(cp_datatable), perl = TRUE)
cp_datatable[, ..cols]
  • Filtering
cp_tibble = cp_tibble %>% mutate(year_pub = as.integer(year_pub))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `year_pub = as.integer(year_pub)`.
## Caused by warning:
## ! NAs introduced by coercion
# R using Tibbles


cp_tibble[3:4,]
slice(cp_tibble, 3:4) # same

cp_tibble[-(3:7),]
slice(cp_tibble, -(3:7)) # same

dim(filter(cp_tibble, year_pub > 2018))
filter(cp_tibble, title %in% c("Deep convolutional approaches for the analysis of Covid-19 using chest X-Ray images from portable devices",
                               "Evaluating the effect of city lock-down on controlling COVID-19 propagation through deep learning and network science models."))

filter(cp_tibble, year_pub == 2019, title == "Deep convolutional approaches for the analysis of Covid-19 using chest X-Ray images from portable devices")

dim(distinct(cp_tibble)) # distinct_all(cp_tibble)
dim(distinct(cp_tibble, across(c(V1, title)))) # returns selected cols

dim(tidyr::drop_na(cp_tibble, names(cp_tibble)[1:4]))


dim(filter(cp_tibble, grepl("deep learning", title)))
dim(filter(cp_tibble, dplyr::between(year_pub, 2015, 2017)))
dim(filter(cp_tibble, year_pub > 2015 & year_pub < 2017))
cp_datatable[,year_pub := as.integer(year_pub)]
## Warning in eval(jsub, SDenv, parent.frame()): NAs introduced by coercion
# R using data.table

cp_datatable[3:4,]
cp_datatable[3:4] # same

cp_datatable[!3:7,]
cp_datatable[-(3:7)] # same

cp_datatable[year_pub > 2018]
cp_datatable[title %chin% c("Deep convolutional approaches for the analysis of Covid-19 using chest X-Ray images from portable devices",
                               "Evaluating the effect of city lock-down on controlling COVID-19 propagation through deep learning and network science models.")] # fast %in% for character

cp_datatable[year_pub == 2019 & title == "Deep convolutional approaches for the analysis of Covid-19 using chest X-Ray images from portable devices"]

unique(cp_datatable)
unique(cp_datatable, by = c("V1", "title")) # returns all cols

na.omit(cp_datatable, cols = 1:4)  # fast S3 method with cols argument

cp_datatable[title %like% "deep learning"]
cp_datatable[year_pub %between% c(2015, 2017)]
cp_datatable[data.table::between(year_pub, 2015, 2017, incbounds = FALSE)]

On the other hand, data.table also provides convenience functions to filter rows based on a regular expression or to find values lying in one (or several) interval(s).

Below, we will see that data.table has two optimized mechanisms to filter rows efficiently (keys and indices).

  • Sorting
# R using Tibbles
head(cp_tibble %>% arrange(creation) %>% select(creation))
## # A tibble: 6 × 1
##   creation  
##   <date>    
## 1 1929-01-01
## 2 1933-02-01
## 3 1945-01-01
## 4 1946-12-01
## 5 1951-09-08
## 6 1951-12-01
head(cp_tibble %>% arrange(desc(creation)) %>% select(creation)) 
## # A tibble: 6 × 1
##   creation  
##   <date>    
## 1 2020-09-16
## 2 2020-09-16
## 3 2020-09-16
## 4 2020-09-16
## 5 2020-09-16
## 6 2020-09-16

head(cp_tibble %>% arrange(year_pub, desc(creation)) %>% select(year_pub,creation))
## # A tibble: 6 × 2
##   year_pub creation  
##      <int> <date>    
## 1     1820 2018-12-05
## 2     1821 2018-06-23
## 3     1830 2018-06-20
## 4     1832 2018-06-20
## 5     1833 2018-06-20
## 6     1835 2018-11-09
# R using data.tablee

head(cp_datatable[order(creation)][,creation])
head(cp_datatable[order(-creation)][,creation])

head(cp_datatable[order(year_pub, -creation)][,.(year_pub,creation)])
  • Summarise
# R using Tibbles
n_distinct(cp_tibble)
## [1] 148913

cp_tibble %>% summarise(sum(cited)) # returns a tibble
## # A tibble: 1 × 1
##   `sum(cited)`
##          <dbl>
## 1      1241805
cp_tibble %>% summarise(sum_citation = sum(cited)) # returns a tibble
## # A tibble: 1 × 1
##   sum_citation
##          <dbl>
## 1      1241805

summarise(cp_tibble, sum(cited), sd(cited))
## # A tibble: 1 × 2
##   `sum(cited)` `sd(cited)`
##          <dbl>       <dbl>
## 1      1241805        38.8

cp_tibble %>%
  summarise(sum_citation = sum(cited),
            sd_citation = sd(cited))
## # A tibble: 1 × 2
##   sum_citation sd_citation
##          <dbl>       <dbl>
## 1      1241805        38.8

cp_tibble %>%
  slice(1:4) %>%
  summarise(sum(cited))
## # A tibble: 1 × 1
##   `sum(cited)`
##          <dbl>
## 1           61

cp_tibble %>% summarise(dplyr::first(year_pub))
## # A tibble: 1 × 1
##   `dplyr::first(year_pub)`
##                      <int>
## 1                     2020
cp_tibble %>% summarise(dplyr::last(year_pub))
## # A tibble: 1 × 1
##   `dplyr::last(year_pub)`
##                     <int>
## 1                    2016
cp_tibble %>% summarise(nth(year_pub, 5))
## # A tibble: 1 × 1
##   `nth(year_pub, 5)`
##                <int>
## 1               2020
cp_tibble %>% summarise(n_distinct(language))
## # A tibble: 1 × 1
##   `n_distinct(language)`
##                    <int>
## 1                     31
# R using data.tablee

uniqueN(cp_datatable)
## [1] 148913

cp_datatable[, sum(cited)]    # returns a vector
## [1] 1241805
cp_datatable[, .(sum(cited))] # returns a data.table
##         V1
##      <num>
## 1: 1241805
cp_datatable[, .(sum_citation = sum(cited))] # returns a data.table
##    sum_citation
##           <num>
## 1:      1241805

cp_datatable[, .(sum(cited), sd(cited))]
##         V1      V2
##      <num>   <num>
## 1: 1241805 38.8293

cp_datatable[, .(sum_citation = sum(cited),
                 sd_citation  = sd(cited))]
##    sum_citation sd_citation
##           <num>       <num>
## 1:      1241805     38.8293

cp_datatable[1:4, sum(cited)]
## [1] 61

cp_datatable[, data.table::first(year_pub)]
## [1] 2020
cp_datatable[, data.table::last(year_pub)]
## [1] 2016
cp_datatable[5, year_pub]
## [1] 2020
cp_datatable[, uniqueN(language)]
## [1] 31

Dplyr helper functions for the summarise() function include first(), last(), n(), nth(), and n_distinct(). The data.table package also includes first(), last(), and uniqueN().

  • Modifiying

When it comes to modifying columns, data.table allows for the use of the column assignment symbol := to modify columns by reference without creating a copy. With dplyr, the results must be assigned.

# R using Tibbles

cp_tibble <- cp_tibble %>% rowwise() %>% mutate(oa = ifelse(oa == 'Y',1,0), log_cited = log(cited + 1))



# Create one column and remove the others
transmute(cp_tibble, title_lower = tolower(title))
## # A tibble: 148,913 × 1
## # Rowwise: 
##    title_lower                                                                  
##    <chr>                                                                        
##  1 the pathophysiology, diagnosis and treatment of corona virus disease 2019 (c…
##  2 hypothesis for potential pathogenesis of sars-cov-2 infection-a review of im…
##  3 the outbreak of the novel severe acute respiratory syndrome coronavirus 2 (s…
##  4 a comparative overview of covid-19, mers and sars: review article.           
##  5 virion structure and mechanism of propagation of coronaviruses including sar…
##  6 interaction of sars-cov-2 and other coronavirus with ace (angiotensin-conver…
##  7 a global treatments for coronaviruses including covid-19.                    
##  8 covid-19 infection: origin, transmission, and characteristics of human coron…
##  9 the kidney, covid-19, and the chemokine network: an intriguing trio.         
## 10 laboratory diagnosis of emerging human coronavirus infections - the state of…
## # ℹ 148,903 more rows

cp_tibble <- cp_tibble %>% select(-V1, -V2)

cols <- c('V3','doi')
cp_tibble <- select(cp_tibble, -one_of(cols))
# R using data.tablee
cp_datatable[, c('oa','log_cited') := .(ifelse(oa == 'Y',1,0),log(cited+1))]



# Create one column and remove the others
cp_datatable[, .(title_lower = tolower(title))]
##                                                                                                                                                         title_lower
##                                                                                                                                                              <char>
##      1:                                                                       the pathophysiology, diagnosis and treatment of corona virus disease 2019 (covid-19).
##      2:                                  hypothesis for potential pathogenesis of sars-cov-2 infection-a review of immune changes in patients with viral pneumonia.
##      3:                              the outbreak of the novel severe acute respiratory syndrome coronavirus 2 (sars-cov-2): a review of the current global status.
##      4:                                                                                          a comparative overview of covid-19, mers and sars: review article.
##      5: virion structure and mechanism of propagation of coronaviruses including sars-cov 2 (covid -19 ) and some meaningful points for drug or vaccine development
##     ---                                                                                                                                                            
## 148909:                                                                                                               bmt settings, infection and infection control
## 148910:                                                                                       early and acute complications and the principles of hsct nursing care
## 148911:                                                                                                                                            viral infections
## 148912:                                                                                                        international collaboration for global public health
## 148913:                                                                                                                    resource allocation and priority setting

cp_datatable[, c('V1', 'V2') := NULL]

cols <- c('V3','doi')
cp_datatable[, (cols) := NULL] # ! not cp_datatable[, cols := NULL]
  • Grouping by

The dplyr::group_by() function and the corresponding by and keyby statements in data.table allow for the manipulation of each group of observations and combining the results. The main difference between by and keyby is that keyby orders the results and creates a key that allows for faster subsetting (cf. the indexing and keys section). Below, we use one or the other arbitrarily.

The group_by() function takes an existing tibble and converts it into a grouped tibble where operations will always be performed “by group”. Using ungroup() removes grouping. With data.table, by is always used on the fly.

Note that it is possible to reorder the arguments in data.table: cp_datatable[i, j, by] <=> cp_datatable[i, by, j]. This is done below to better highlight the similarity with dplyr.

It’s worth noting that group_by() and by in data.table perform the same task of grouping the data by certain variable(s) and keyby is used to enhance the performance of data.table by creating keys on certain columns, also note that you have to use ungroup() function to remove the grouping in dplyr while in data.table you don’t need to do that.

# R using Tibbles
cp_tibble <-  as_tibble(covid_papers) %>% filter(pmid!='null')

cp_tibble %>%
  group_by(year_pub) %>%
  summarise(nb_language = length(unique(language)))
## # A tibble: 113 × 2
##    year_pub nb_language
##    <chr>          <int>
##  1 1830               1
##  2 1832               1
##  3 1833               1
##  4 1836               1
##  5 1842               1
##  6 1845               1
##  7 1857               1
##  8 1864               1
##  9 1874               1
## 10 1876               1
## # ℹ 103 more rows

# By several groups
cp_tibble %>%
  group_by(year_pub, source) %>%
  summarise(mean_cited = mean(cited))
## `summarise()` has grouped output by 'year_pub'. You can override using the `.groups`
## argument.
## # A tibble: 113 × 3
## # Groups:   year_pub [113]
##    year_pub source mean_cited
##    <chr>    <chr>       <dbl>
##  1 1830     MED             0
##  2 1832     MED             0
##  3 1833     MED             0
##  4 1836     MED             0
##  5 1842     MED             0
##  6 1845     MED             0
##  7 1857     MED             0
##  8 1864     MED             0
##  9 1874     MED             0
## 10 1876     MED             0
## # ℹ 103 more rows

# Assigning column name in by
cp_tibble %>%
  group_by(PY = as.numeric(year_pub)) %>%
  summarise(sum_cited = sum(cited))
## # A tibble: 113 × 2
##       PY sum_cited
##    <dbl>     <dbl>
##  1  1830         0
##  2  1832         0
##  3  1833         0
##  4  1836         0
##  5  1842         0
##  6  1845         0
##  7  1857         0
##  8  1864         0
##  9  1874         0
## 10  1876         0
## # ℹ 103 more rows

# Using a condition
cp_tibble %>%
  group_by(source == "MED") %>%
  summarise(mean(cited))
## # A tibble: 1 × 2
##   `source == "MED"` `mean(cited)`
##   <lgl>                     <dbl>
## 1 TRUE                       10.9

# Count number of observations for each group
cp_tibble %>%
  group_by(source) %>%
  tally()
## # A tibble: 1 × 2
##   source      n
##   <chr>   <int>
## 1 MED    113917

cp_tibble %>%
  group_by(source) %>%
  summarise(n())
## # A tibble: 1 × 2
##   source  `n()`
##   <chr>   <int>
## 1 MED    113917

cp_tibble %>%
  group_by(source) %>%
  group_size() # returns a vector
## [1] 113917

# Add a column with number of observations for each group
add_count(cp_tibble, year_pub) %>% select(year_pub,n)
## # A tibble: 113,917 × 2
##    year_pub     n
##    <chr>    <int>
##  1 2020     63760
##  2 2020     63760
##  3 2020     63760
##  4 2020     63760
##  5 2020     63760
##  6 2020     63760
##  7 2020     63760
##  8 2020     63760
##  9 2020     63760
## 10 2020     63760
## # ℹ 113,907 more rows

dim(cp_tibble %>%
  group_by(year_pub) %>%
  add_tally())
## [1] 113917     20
# R using data.table

cp_datatable <-  as.data.table(covid_papers)[pmid!='null',]

cp_datatable[, .(nb_language = length(unique(language))), by = "year_pub"]
##      year_pub nb_language
##        <char>       <int>
##   1:     2020          18
##   2:     2019           8
##   3:     2021           2
##   4:     2018           7
##   5:     2017          10
##  ---                     
## 109:     1833           1
## 110:     1845           1
## 111:     1916           1
## 112:     1885           1
## 113:     1832           1

# By several groups
cp_datatable[, keyby = .(year_pub, source),
     .(mean_cited = mean(cited))]
## Key: <year_pub, source>
##      year_pub source mean_cited
##        <char> <char>      <num>
##   1:     1830    MED 0.00000000
##   2:     1832    MED 0.00000000
##   3:     1833    MED 0.00000000
##   4:     1836    MED 0.00000000
##   5:     1842    MED 0.00000000
##  ---                           
## 109:     2017    MED 7.67789799
## 110:     2018    MED 5.22422209
## 111:     2019    MED 2.76006529
## 112:     2020    MED 2.28764115
## 113:     2021    MED 0.02105263

# Assigning column name in by
cp_datatable[, keyby = .(PY = as.numeric(year_pub)),
     .(sum_cited = sum(cited))]
## Key: <PY>
##         PY sum_cited
##      <num>     <num>
##   1:  1830         0
##   2:  1832         0
##   3:  1833         0
##   4:  1836         0
##   5:  1842         0
##  ---                
## 109:  2017     24838
## 110:  2018     17125
## 111:  2019     10146
## 112:  2020    145860
## 113:  2021         2

# Using a condition in by
cp_datatable[, keyby = source == "MED",
     mean(cited)]
## Key: <source>
##    source       V1
##    <lgcl>    <num>
## 1:   TRUE 10.88341

# Count number of observations for each group
cp_datatable[, .N, by = source]
##    source      N
##    <char>  <int>
## 1:    MED 113917
count(cp_tibble, source)
## # A tibble: 1 × 2
##   source      n
##   <chr>   <int>
## 1 MED    113917

# Add a column with number of observations for each group
dim(cp_datatable[, n := .N, by = source][])
## [1] 113917     20

1.2 More Complex manipulation

  • Select specific row by group
# R using Tibbles

cp_tibble <- cp_tibble %>% mutate(PY = as.numeric(year_pub),log_cited =log(cited+1))

cp_tibble %>%
  group_by(PY) %>%
  slice(1)
cp_tibble %>%
  group_by(PY) %>%
  slice(1, n())
cp_tibble %>%
  group_by(PY) %>%
  group_map(~ tail(.x, 2))
# R using data.table

cp_datatable[,c('PY','log_cited') := .(as.numeric(year_pub),log(cited+1))]


cp_datatable[, .SD[1], by = PY]
cp_datatable[, .SD[c(1, .N)], by = PY]
cp_datatable[, tail(.SD, 2), by = PY]
# R using Tibbles

cp_tibble %>%
  group_by(PY) %>%
  arrange(cited) %>%
  slice(1)
  
# R using data.table

cp_datatable[, .SD[which.min(cited)], by = PY]
  • Group counter
# R using Tibbles

cp_tibble %>%
  group_by(PY, language) %>%
  mutate(Grp = cur_group_id())
# R using data.table

cp_datatable[, Grp := .GRP, by = .(PY,  language)][]
cp_datatable[, Grp := NULL] # delete for consistency
  • Get row number of observation by group
# R using Tibbles
cp_tibble %>%
  group_by(PY) %>%
  mutate(cur_group_rows())
## # A tibble: 113,917 × 22
## # Groups:   PY [113]
##    `_id`            pmid  pmcid doi   title source year_pub month_pub creation  
##    <chr>            <chr> <chr> <chr> <chr> <chr>  <chr>    <chr>     <date>    
##  1 5f69d6fa65a250f… 3283… PMC7… 10.1… The … MED    2020     null      2020-08-25
##  2 5f69d6fa65a250f… 3219… PMC7… 10.1… Hypo… MED    2020     null      2020-03-21
##  3 5f69d6fa65a250f… 3277… PMC7… 10.1… The … MED    2020     null      2020-08-12
##  4 5f69d6fa65a250f… 3273… PMC7… 10.1… A co… MED    2020     null      2020-07-31
##  5 5f69d6fa65a250f… 3239… PMC7… 10.1… A gl… MED    2020     null      2020-05-13
##  6 5f69d6fa65a250f… 3225… PMC7… 10.1… COVI… MED    2020     null      2020-04-08
##  7 5f69d6fa65a250f… 3272… PMC7… 10.1… The … MED    2020     null      2020-07-29
##  8 5f69d6fa65a250f… 3219… PMC7… 10.1… Labo… MED    2020     null      2020-03-21
##  9 5f69d6fa65a250f… 3265… PMC7… 10.1… The … MED    2020     null      2020-07-14
## 10 5f69d6fa65a250f… 3273… PMC7… 10.1… Cycl… MED    2020     null      2020-08-01
## # ℹ 113,907 more rows
## # ℹ 13 more variables: authors_gender <chr>, journal <chr>, language <chr>,
## #   pubmodel <chr>, pubtype <chr>, mesh <chr>, keywords <chr>, cited <dbl>,
## #   has_data <chr>, oa <chr>, PY <dbl>, log_cited <dbl>,
## #   `cur_group_rows()` <int>
#cp_tibble %>%
#  group_by(PY) %>%
#  group_data() %>%
#  tidyr::unnest(.rows)
cp_tibble %>%
  group_by(PY) %>%
  summarize(cur_group_rows()[1])
## # A tibble: 113 × 2
##       PY `cur_group_rows()[1]`
##    <dbl>                 <int>
##  1  1830                113846
##  2  1832                113890
##  3  1833                113879
##  4  1836                113827
##  5  1842                113769
##  6  1845                113882
##  7  1857                113803
##  8  1864                113838
##  9  1874                113544
## 10  1876                113787
## # ℹ 103 more rows
cp_tibble %>%
  group_by(PY) %>%
  summarize(cur_group_rows()[c(1, n())])
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()` always returns
##   an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## `summarise()` has grouped output by 'PY'. You can override using the `.groups`
## argument.
## # A tibble: 226 × 2
## # Groups:   PY [113]
##       PY `cur_group_rows()[c(1, n())]`
##    <dbl>                         <int>
##  1  1830                        113846
##  2  1830                        113846
##  3  1832                        113890
##  4  1832                        113890
##  5  1833                        113879
##  6  1833                        113879
##  7  1836                        113827
##  8  1836                        113888
##  9  1842                        113769
## 10  1842                        113885
## # ℹ 216 more rows
# R using data.table

cp_datatable[, .I, by = PY] # returns a data.table
##            PY      I
##         <num>  <int>
##      1:  2020      1
##      2:  2020      2
##      3:  2020      3
##      4:  2020      4
##      5:  2020      5
##     ---             
## 113913:  1833 113879
## 113914:  1845 113882
## 113915:  1916 113883
## 113916:  1885 113889
## 113917:  1832 113890
cp_datatable[, .I[1], by = PY]
##         PY     V1
##      <num>  <int>
##   1:  2020      1
##   2:  2019   1814
##   3:  2021   2351
##   4:  2018  14295
##   5:  2017  19268
##  ---             
## 109:  1833 113879
## 110:  1845 113882
## 111:  1916 113883
## 112:  1885 113889
## 113:  1832 113890
cp_datatable[, .I[c(1, .N)], by = PY]
##         PY     V1
##      <num>  <int>
##   1:  2020      1
##   2:  2020 113896
##   3:  2019   1814
##   4:  2019 113915
##   5:  2021   2351
##  ---             
## 222:  1916 113883
## 223:  1885 113889
## 224:  1885 113889
## 225:  1832 113890
## 226:  1832 113890
  • Handle list-columns by group
# R using Tibbles

cp_tibble %>%
  group_by(PY) %>%
  summarise(list(cited))
## # A tibble: 113 × 2
##       PY `list(cited)`
##    <dbl> <list>       
##  1  1830 <dbl [1]>    
##  2  1832 <dbl [1]>    
##  3  1833 <dbl [1]>    
##  4  1836 <dbl [3]>    
##  5  1842 <dbl [2]>    
##  6  1845 <dbl [1]>    
##  7  1857 <dbl [1]>    
##  8  1864 <dbl [1]>    
##  9  1874 <dbl [2]>    
## 10  1876 <dbl [1]>    
## # ℹ 103 more rows

cp_tibble %>%
  group_by(PY) %>%
  group_nest()
## # A tibble: 113 × 2
##       PY                data
##    <dbl> <list<tibble[,20]>>
##  1  1830            [1 × 20]
##  2  1832            [1 × 20]
##  3  1833            [1 × 20]
##  4  1836            [3 × 20]
##  5  1842            [2 × 20]
##  6  1845            [1 × 20]
##  7  1857            [1 × 20]
##  8  1864            [1 × 20]
##  9  1874            [2 × 20]
## 10  1876            [1 × 20]
## # ℹ 103 more rows
# R using data.table

cp_datatable[, .(.(cited)),  by = PY]  # return V1 as a list
##         PY                          V1
##      <num>                      <list>
##   1:  2020        0,61, 0, 0, 1,79,...
##   2:  2019   0, 12,  5,  0,363,  0,...
##   3:  2021             0,0,0,0,0,1,...
##   4:  2018   3, 49, 11, 27, 60,123,...
##   5:  2017       27, 6, 2,22,10,11,...
##  ---                                  
## 109:  1833                           0
## 110:  1845                           0
## 111:  1916                           0
## 112:  1885                           0
## 113:  1832                           0
cp_datatable[, .(.(.SD)), by = PY] # subsets of the data
##         PY                     V1
##      <num>                 <list>
##   1:  2020 <data.table[63760x21]>
##   2:  2019  <data.table[3676x21]>
##   3:  2021    <data.table[95x21]>
##   4:  2018  <data.table[3278x21]>
##   5:  2017  <data.table[3235x21]>
##  ---                             
## 109:  1833     <data.table[1x21]>
## 110:  1845     <data.table[1x21]>
## 111:  1916     <data.table[1x21]>
## 112:  1885     <data.table[1x21]>
## 113:  1832     <data.table[1x21]>
  • Grouping sets (multiple by at once)
# R using data.table

rollup(cp_datatable,
       .(cited_med = median(cited)),
       by = c("pubmodel", "PY"))
##                      pubmodel    PY cited_med
##                        <char> <num>     <num>
##   1:       "Print-Electronic"  2020         0
##   2:                  "Print"  2020         0
##   3: "Electronic-eCollection"  2020         0
##   4:             "Electronic"  2020         0
##   5:       "Electronic-Print"  2020         0
##  ---                                         
## 224: "Electronic-eCollection"    NA         0
## 225:             "Electronic"    NA         0
## 226:       "Electronic-Print"    NA         3
## 227:                     null    NA         0
## 228:                     <NA>    NA         0

rollup(cp_datatable,
       .(cited_med = median(cited), .N),
       by = c("pubmodel", "PY"),
       id = TRUE)
##      grouping                 pubmodel    PY cited_med      N
##         <int>                   <char> <num>     <num>  <int>
##   1:        0       "Print-Electronic"  2020         0  40982
##   2:        0                  "Print"  2020         0  10862
##   3:        0 "Electronic-eCollection"  2020         0   4090
##   4:        0             "Electronic"  2020         0   7362
##   5:        0       "Electronic-Print"  2020         0    456
##  ---                                                         
## 224:        1 "Electronic-eCollection"    NA         0   6981
## 225:        1             "Electronic"    NA         0  11932
## 226:        1       "Electronic-Print"    NA         3    964
## 227:        1                     null    NA         0     32
## 228:        3                     <NA>    NA         0 113917

cube(cp_datatable,
     .(cited_med = median(cited), .N),
     by = c("pubmodel", "PY"),
     id = TRUE)
##      grouping                 pubmodel    PY cited_med      N
##         <int>                   <char> <num>     <num>  <int>
##   1:        0       "Print-Electronic"  2020         0  40982
##   2:        0                  "Print"  2020         0  10862
##   3:        0 "Electronic-eCollection"  2020         0   4090
##   4:        0             "Electronic"  2020         0   7362
##   5:        0       "Electronic-Print"  2020         0    456
##  ---                                                         
## 337:        2                     <NA>  1845         0      1
## 338:        2                     <NA>  1916         0      1
## 339:        2                     <NA>  1885         0      1
## 340:        2                     <NA>  1832         0      1
## 341:        3                     <NA>    NA         0 113917

groupingsets(cp_datatable,
             .(cited_med = median(cited), .N),
             by   = c("pubmodel", "PY"),
             sets = list("pubmodel", c("pubmodel", "PY")),
             id   = TRUE)
##      grouping                 pubmodel    PY cited_med     N
##         <int>                   <char> <num>     <num> <int>
##   1:        1       "Print-Electronic"    NA       0.0 61429
##   2:        1                  "Print"    NA       3.0 32579
##   3:        1 "Electronic-eCollection"    NA       0.0  6981
##   4:        1             "Electronic"    NA       0.0 11932
##   5:        1       "Electronic-Print"    NA       3.0   964
##  ---                                                        
## 223:        0                     null  2004       0.0     1
## 224:        0                     null  2016       0.0     4
## 225:        0                     null  2017       0.5     2
## 226:        0                     null  2014       0.0     2
## 227:        0                     null  2018       0.0     2
  • Using multiple Variables

To manipulate multiple columns, dplyr 1.0.0 introduced the across() function, which supersedes the _all, _at, and _if versions of summarise(), mutate(), and transmute(). In data.table, we use the .SD symbol which is a data.table containing the Subset of Data for each group, excluding the column(s) used in the ‘by’ argument. For example, cp_datatable[, .SD] is equivalent to cp_datatable itself. In the expression cp_datatable[, .SD, by = title], .SD contains all the cp_datatable columns (except the title column) for each unique value of the title column. The .SDcols argument allows to select specific columns to include in the .SD.

# R using Tibbles

# Summarise several columns
cp_tibble %>% summarise(across(c(PY, cited), mean))
## # A tibble: 1 × 2
##      PY cited
##   <dbl> <dbl>
## 1 2015.  10.9


# Summarise several columns by group
cp_tibble %>%
  group_by(source) %>%
  summarise(across(c(PY, cited), mean))
## # A tibble: 1 × 3
##   source    PY cited
##   <chr>  <dbl> <dbl>
## 1 MED    2015.  10.9

## using patterns (regex)
cp_tibble %>%
  group_by(source) %>%
  summarise(across(ends_with('cited'), mean))
## # A tibble: 1 × 3
##   source cited log_cited
##   <chr>  <dbl>     <dbl>
## 1 MED     10.9      1.10

# Summarise with more than one function by group
cp_tibble %>%
  group_by(source) %>%
  summarise(across(c(PY, cited),
                   list(median = median, mean = mean)))
## # A tibble: 1 × 5
##   source PY_median PY_mean cited_median cited_mean
##   <chr>      <dbl>   <dbl>        <dbl>      <dbl>
## 1 MED         2020   2015.            0       10.9

# Summarise using a condition
cp_tibble %>%
  summarise(across(where(is.numeric),
                   mean))
## # A tibble: 1 × 3
##   cited    PY log_cited
##   <dbl> <dbl>     <dbl>
## 1  10.9 2015.      1.10

cp_tibble  %>%
  group_by(source) %>%
  summarise(across(where(~ is.numeric(.x) && median(.x)>0), 
                        median))
## # A tibble: 1 × 2
##   source    PY
##   <chr>  <dbl>
## 1 MED     2020
# R using data.tablee


# Summarise several columns
cp_datatable[, lapply(.SD, mean),
     .SDcols = c('PY', 'cited')]
##         PY    cited
##      <num>    <num>
## 1: 2015.18 10.88341

# Summarise several columns by group
cp_datatable[, by = source,
     lapply(.SD, mean),
     .SDcols = c('PY', 'cited')]
##    source      PY    cited
##    <char>   <num>    <num>
## 1:    MED 2015.18 10.88341
## using patterns (regex)

cp_datatable[, by = source,
     lapply(.SD, mean),
     .SDcols = patterns("cited")]
##    source    cited log_cited
##    <char>    <num>     <num>
## 1:    MED 10.88341  1.103266


# Summarise with more than one function by group
cp_datatable[, by = source,
     c(lapply(.SD, median),
       lapply(.SD, mean)),
     .SDcols = c('PY', 'cited')]
##    source    PY cited      PY    cited
##    <char> <num> <num>   <num>    <num>
## 1:    MED  2020     0 2015.18 10.88341


# Summarise using a condition
cp_datatable[, sapply(cp_datatable, is.numeric),with = FALSE][,lapply(.SD, mean)]
##       cited      n      PY log_cited
##       <num>  <num>   <num>     <num>
## 1: 10.88341 113917 2015.18  1.103266

fun <- function(x) {is.numeric(x) && median(x)>0}

cp_datatable[, sapply(cp_datatable, FUN = fun),with = FALSE][,lapply(.SD, mean)]
##         n      PY
##     <num>   <num>
## 1: 113917 2015.18
  • Modify Multiples Variables

tobinary = function(x){ifelse(x=='Y',1,0)}

# Modify several columns (dropping the others)
head(cp_tibble %>% transmute(across(c(has_data, oa),
                 tobinary)))
## # A tibble: 6 × 2
##   has_data    oa
##      <dbl> <dbl>
## 1        0     1
## 2        0     1
## 3        0     1
## 4        0     1
## 5        0     1
## 6        0     1


# Modify several columns (keeping the others)
cp_tibble <- cp_tibble %>%
  mutate(across(all_of(c('pubtype','mesh')),
                 ~ tolower(str_replace_all(.x,'-',''))))

# Modify columns using a condition

t = Sys.time()
cp_tibble <- cp_tibble %>%
  mutate(across(where(is.character),
         toupper))
Sys.time() - t
## Time difference of 3.127825 secs
# Modify several columns (dropping the others)
head(cp_datatable[, lapply(.SD, tobinary),
     .SDcols = c('has_data', 'oa')])
##    has_data    oa
##       <num> <num>
## 1:        0     1
## 2:        0     1
## 3:        0     1
## 4:        0     1
## 5:        0     1
## 6:        0     1


# Modify several columns (keeping the others)
cp_datatable[, c('pubtype','mesh') := lapply(.SD, function(x){tolower(str_replace_all(x,'-',''))}),#cols <- setdiff(names(cp_datatable), "title")
     .SDcols = c('pubtype','mesh') ]

# Modify columns using a condition 
t = Sys.time()
for(j in  which(sapply(cp_datatable, class)=='character')){set(cp_datatable, i=NULL, j=j, value=toupper(cp_datatable[[j]]))}
Sys.time() - t 
## Time difference of 2.693229 secs

The use of cp_datatable[,j] in data.table is very flexible, as it allows for the passing of complex expressions in a straightforward way, or the combination of expressions with multiple outputs. This makes it easy to perform complex data manipulation and calculations on large datasets efficiently. The flexibility of the j-expression in data.table is a powerful feature that allows you to perform various data manipulation tasks with a simple and readable syntax.

  • Chain expressions

The dplyr workflow relies on the magrittr pipe operator (%>%). The magrittr package can also be used with data.table objects, but data.table has its own chaining system: cp_datatable[][][].

# R using Tibbles
# Expression chaining using %>%


head(cp_tibble %>%
  group_by(PY) %>%
  summarise(cited_med = median(cited)) %>%
  arrange(desc(cited_med)))
## # A tibble: 6 × 2
##      PY cited_med
##   <dbl>     <dbl>
## 1  1933      66.5
## 2  1948      34  
## 3  2002      23  
## 4  1956      21  
## 5  2000      21  
## 6  1999      20

head(cp_tibble %>%
  group_by(PY) %>%
  summarise(cited_med = median(cited)) %>%
  filter(cited_med > 10))
## # A tibble: 6 × 2
##      PY cited_med
##   <dbl>     <dbl>
## 1  1933      66.5
## 2  1942      14  
## 3  1948      34  
## 4  1956      21  
## 5  1964      13.5
## 6  1965      14
# R using data.tablee
# Expression chaining using cp_datatable[][] (recommended)

head(cp_datatable[, by = PY, 
     .(cited_med = median(cited))][order(-cited_med)])
##       PY cited_med
##    <num>     <num>
## 1:  1933      66.5
## 2:  1948      34.0
## 3:  2002      23.0
## 4:  2000      21.0
## 5:  1956      21.0
## 6:  1999      20.0

head(cp_datatable[, by = PY, 
      .(cited_med = median(cited))][cited_med > 10])
##       PY cited_med
##    <num>     <num>
## 1:  2012        13
## 2:  2009        13
## 3:  2010        14
## 4:  2011        12
## 5:  2008        15
## 6:  2007        14
  • More complex expression
# R using Tibbles

head(cp_tibble %>%
  mutate(is_en = ifelse(language =='ENG',1,0),
         oa_int = ifelse(oa =='Y',1,0)) %>%
  group_by(year_pub)  %>% 
  summarize(share_en = sum(is_en)/n(),
            share_en_oa = sum(is_en*oa_int)/n()) %>% 
  arrange(desc(year_pub)))
## # A tibble: 6 × 3
##   year_pub share_en share_en_oa
##   <chr>       <dbl>       <dbl>
## 1 2021        0.989       0.884
## 2 2020        0.956       0.685
## 3 2019        0.989       0.867
## 4 2018        0.985       0.829
## 5 2017        0.985       0.817
## 6 2016        0.978       0.791
# R using data.tablee

head(cp_datatable[,c('is_en','oa_int') := .(ifelse(language =='ENG',1,0),
                                       ifelse(oa =='Y',1,0))][
                                         ,by = year_pub,.(share_en = sum(is_en)/.N,
                                                          share_en_oa = sum(is_en*oa_int)/.N)])
##    year_pub  share_en share_en_oa
##      <char>     <num>       <num>
## 1:     2020 0.9555364   0.6848808
## 2:     2019 0.9893906   0.8669750
## 3:     2021 0.9894737   0.8842105
## 4:     2018 0.9853569   0.8288591
## 5:     2017 0.9851623   0.8173107
## 6:     2016 0.9775491   0.7907702
# Use multiple expressions (with cp_datatable[,{j}])
head(cp_datatable[, {is_en = ifelse(language =='ENG',1,0) 
                en_oa = is_en*oa_int
                .(pmid,en_oa = ifelse(en_oa == 1, 'Y','N')) # last list returned as a data.table
                }])
##        pmid  en_oa
##      <char> <char>
## 1: 32837036      Y
## 2: 32196410      Y
## 3: 32778421      Y
## 4: 32730205      Y
## 5: 32394467      Y
## 6: 32257431      Y

2 Reshape data

When you have a variable with multiple categories that are stored in one cell, such as the ‘pubtype’ variable in our dataset, you may want to create separate columns for each category. The opposite is also possible, where you have several columns with values that you want to transform into two columns: one with the factor and the other with the values.

To clean this variable and separate the string in each cell, you can use the tidyr package in R. The separate() function allows you to split a single column into multiple columns based on a delimiter. Additionally, the spread() function can be used to transform wide format data into long format data. This can be used to create separate rows for each pubtype.

# R using Tibbles
cp_tibble[1,'pubtype']
## # A tibble: 1 × 1
##   pubtype                                                              
##   <chr>                                                                
## 1 "{\"PUBTYPE\": [\"REVIEWARTICLE\", \"REVIEW\", \"JOURNAL ARTICLE\"]}"

clean_pubtype <- function(str){
  gsub('[{]|[}]|PUBTYPE|: |[[]|[]]','',str)
}

clean_pubtype(cp_tibble[1,'pubtype'])
## [1] "\"\"\"REVIEWARTICLE\", \"REVIEW\", \"JOURNAL ARTICLE\""

mcp_tibble <- cp_tibble %>% 
  rowwise() %>%
  mutate(pubtype = clean_pubtype(pubtype)) %>% 
  separate_rows(pubtype,sep = '", "') %>% 
  mutate(pubtype = gsub('["]','',pubtype))

mcp_tibble <- mcp_tibble %>%
  select(pmid, pubtype) %>%
  group_by(pmid) %>% 
  count(pubtype)

mcp_tibble
## # A tibble: 258,370 × 3
## # Groups:   pmid [113,917]
##    pmid     pubtype                                  n
##    <chr>    <chr>                                <int>
##  1 10022847 JOURNAL ARTICLE                          1
##  2 10022847 RESEARCH SUPPORT, NONU.S. GOV'T          1
##  3 10022847 RESEARCHARTICLE                          1
##  4 10023135 JOURNAL ARTICLE                          1
##  5 10023135 RESEARCH SUPPORT, NONU.S. GOV'T          1
##  6 10023135 RESEARCHARTICLE                          1
##  7 10023767 JOURNAL ARTICLE                          1
##  8 10023767 RESEARCH SUPPORT, NONU.S. GOV'T          1
##  9 10023767 RESEARCH SUPPORT, U.S. GOV'T, P.H.S.     1
## 10 10023767 RESEARCHARTICLE                          1
## # ℹ 258,360 more rows
# R using data.table
mcp_datatable <- cp_datatable[,pubtype :=  clean_pubtype(pubtype)][,lapply(.SD, function(x) unlist(tstrsplit(x, '", "'))), .SDcols = "pubtype", by = pmid][,pubtype :=  gsub('["]','',pubtype)]

mcp_datatable <- mcp_datatable[, .(count = .N), by = .(pmid,pubtype)]
mcp_datatable
##             pmid         pubtype count
##           <char>          <char> <int>
##      1: 32837036   REVIEWARTICLE     1
##      2: 32837036          REVIEW     1
##      3: 32837036 JOURNAL ARTICLE     1
##      4: 32196410   REVIEWARTICLE     1
##      5: 32196410 JOURNAL ARTICLE     1
##     ---                               
## 258366: 32091757          REVIEW     1
## 258367: 28590697  CHAPTERARTICLE     1
## 258368: 28590697          REVIEW     1
## 258369: 28590695  CHAPTERARTICLE     1
## 258370: 28590695          REVIEW     1

2.1 Cast data (from long to wide)


# R using Tibbles

# see also cast_dtm()
ccp_tibble = tidyr::spread(data  = mcp_tibble,
              key   = 'pubtype',
              value = 'n',
              fill  = 0)

dim(ccp_tibble)
## [1] 113917    116
# R using data.table

ccp_datatable <- dcast(mcp_datatable, pmid ~ pubtype, value.var = "count" ,fun.aggregate = sum) # aggregate by count
dim(ccp_datatable)
## [1] 113917    116

2.2 Melt data (from wide to long)

# R using Tibbles

mcp_tibble <- ccp_tibble %>% gather(key = Variable,
                                    value = Value,
                                    -pmid)
# same
mcp_tibble <- ccp_tibble %>% pivot_longer(cols = -pmid, names_to = "variable", values_to = "Value")
# R using data.table

mcp_datatable <- melt(ccp_datatable, id.vars = "pmid",
                      variable.name = "Variable",
                      value.name    = "Value")
    

2.3 Split

# R using Tibbles

group_split(cp_tibble, oa)
# R using data.table

split(cp_datatable, by = "oa") # S3 method

2.4 Split and transpose a vector/column

# R using Tibbles

vec <- c("A:a", "B:b", "C:c")
# vector not handled
tidyr::separate(tibble(vec), vec, c("V1", "V2"))
## # A tibble: 3 × 2
##   V1    V2   
##   <chr> <chr>
## 1 A     a    
## 2 B     b    
## 3 C     c
# R using data.table

vec <- c("A:a", "B:b", "C:c")
tstrsplit(vec, split = ":", keep = 2L) # works on vector
## [[1]]
## [1] "a" "b" "c"
setDT(tstrsplit(vec, split = ":"))[]
##        V1     V2
##    <char> <char>
## 1:      A      a
## 2:      B      b
## 3:      C      c

3 Indexing and Keys

3.1 Set key/index

Row subsetting in dplyr relies on the filter() and slice() functions. In addition to these methods, data.table provides two systems that make row filtering and join operations faster and more convenient: keys (primary ordered index) and indices (automatic secondary indexing).

The main differences between keys and indices are:

  • Keys involve physically reordering the data in memory, while indices store the order as an attribute.
  • Only one key can be used at a time, but multiple indices can coexist.
  • Keys are defined explicitly, while indices can be created manually or automatically (when using == or %in%).
  • Indices are used with the ‘on’ argument, while keys are optional but recommended for better readability.
# R using Tibbles

cp_tibble <- arrange(cp_tibble, year_pub) 
# R using data.table

setkey(cp_datatable, year_pub)
setindex(cp_datatable, year_pub)
  • Select rows

# R using Tibbles

dim(cp_tibble %>% filter(year_pub == "2010"))
## [1] 2190   21
dim(cp_tibble %>% filter(year_pub %in% c("2010", "2020")))
## [1] 65950    21
# R using data.table

dim(cp_datatable["2010", on = "year_pub"])
## [1] 2190   24
dim(cp_datatable[c("2010", "2020"), on = .(year_pub)]) # same as on = "year_pub"
## [1] 65950    24
  • Nomatch
# R using Tibbles

cp_tibble %>% filter(year_pub %in% c("2010", "2025"))
# R using data.table

# (default) returns a row with "D" even if not found
cp_datatable[c("2010", "2025"), on = "year_pub", nomatch = NA]
## no rows for unmatched values
cp_datatable[c("2010", "2025"), on = "year_pub", nomatch = 0]
  • Apply function on matched rows
# R using Tibbles

cp_tibble %>%
  filter(year_pub %in% c("2010", "2020")) %>%
  summarize(median = median(cited))
# R using data.table

cp_datatable[c("2010", "2020"), median(cited), on = "year_pub"]
  • Modify matched rows values
# R using Tibbles

cp_tibble <- cp_tibble %>%
  mutate(cited_ = base::replace(cited, year_pub == "2010", 0L)) %>%
  arrange(PY)
# R using data.table

cp_datatable["2010", cited_ := 0, on = "year_pub"]
  • Use keys in by
# R using Tibbles
head(cp_tibble %>%
  filter(oa != 'Y') %>%
  group_by(PY) %>%
  summarise(median(cited)))
## # A tibble: 6 × 2
##      PY `median(cited)`
##   <dbl>           <dbl>
## 1  1842               0
## 2  1874               0
## 3  1876               0
## 4  1881               0
## 5  1885               0
## 6  1888               0
# R using data.table

cp_datatable[!"Y", mean(cited), on = .(oa), by = .EACHI]
## Key: <oa>
##        oa       V1
##    <char>    <num>
## 1:      N 12.93581

head(cp_datatable[oa != "Y",
   by = PY,
   mean(cited)])
##       PY    V1
##    <num> <num>
## 1:  1842     0
## 2:  1874     0
## 3:  1876     0
## 4:  1881     0
## 5:  1885     0
## 6:  1888     0
  • Set multiple keys/indices
# R using Tibbles

cp_tibble <-cp_tibble %>%  arrange(PY, cited)
# R using data.table

setkey(cp_datatable, PY, cited) # or setkeyv(cp_datatable, c("PY", "cited"))
setindex(cp_datatable, PY, cited) # setindexv(cp_datatable, c("PY", "cited"))
  • Subset using multiple keys/indices
# R using Tibbles

dim(filter(cp_tibble, cited == 0, year_pub == "2010"))
## [1] 227  22
dim(filter(cp_tibble, cited == 0, year_pub %in% c("2010","2020")))
## [1] 50783    22
# R using data.table

dim(cp_datatable[.("2010", 0), on = .(year_pub, cited)])
## [1] 227  25
dim(cp_datatable[.(c("2010","2020"), 0), on = .(year_pub, cited)])
## [1] 50783    25
# using which = TRUE only returns the matching rows indices
dim(cp_datatable[.(c("2010","2020"), 0), on = .(year_pub, cited), which = TRUE])
## NULL
  • Remove keys/indices
# R using data.table

setkey(cp_datatable, NULL)
setindex(cp_datatable, NULL)

3.2 set*() modifications

In data.table, the set() functions such as set(), setcol() and setnames() modify objects by reference, which makes these operations fast and memory-efficient. However, if this is not the desired behavior, users can use the copy() function to create a new object. In contrast, the corresponding expressions in dplyr will be less memory-efficient as it creates a new object each time an operation is performed. It’s important to note that data.table’s way of modifying in place is very efficient in terms of time and memory usage when working with large datasets, but it can cause problems if you don’t want to modify your original data. Therefore, it’s recommended to use copy() or make a copy of the data.table before applying set() functions.

  • Replace values
# R using Tibbles

cp_tibble[1, 2] <- NA
head(cp_tibble[,2])
## # A tibble: 6 × 1
##   pmid    
##   <chr>   
## 1 <NA>    
## 2 29917937
## 3 29918028
## 4 29918489
## 5 29918401
## 6 29918490
# R using data.table

set(cp_datatable, i = 1L, j = 2L, value = NA)
head(cp_datatable[,2])
##        pmid
##      <char>
## 1:     <NA>
## 2: 29917937
## 3: 29918028
## 4: 29918489
## 5: 29918401
## 6: 29918490
  • Reorder rows
# R using Tibbles

cp_tibble <- cp_tibble %>% arrange(PY, desc(cited))
# R using data.table

setorder(cp_datatable, PY, -cited)
setorderv(cp_datatable, c("PY", "cited"), c(1, -1))
  • Modify colnames
# R using Tibbles

cp_tibble <- rename(cp_tibble, DOI = doi)
# R using data.table

setnames(cp_datatable, old = "doi", new = "DOI")
  • Reorder columns
# R using Tibbles

head(cp_tibble %>% select(PY, cited, DOI))
## # A tibble: 6 × 3
##      PY cited DOI  
##   <dbl> <dbl> <chr>
## 1  1830     0 NULL 
## 2  1832     0 NULL 
## 3  1833     0 NULL 
## 4  1836     0 NULL 
## 5  1836     0 NULL 
## 6  1836     0 NULL
# R using data.table

setcolorder(cp_datatable, c("PY", "cited", "DOI"))
  • Set operations
x <- data.table(c(1, 2, 2, 3, 3))
y <- data.table(c(2, 2, 3, 4, 4))
  • Intersection
# R using Tibbles

dplyr::intersect(x, y)
##       V1
##    <num>
## 1:     2
## 2:     3
# R using data.table

fintersect(x, y)
##       V1
##    <num>
## 1:     2
## 2:     3
fintersect(x, y, all = TRUE)
##       V1
##    <num>
## 1:     2
## 2:     2
## 3:     3
  • Difference
# R using Tibbles

dplyr::setdiff(x, y)
##       V1
##    <num>
## 1:     1
# R using data.table

fsetdiff(x, y)
##       V1
##    <num>
## 1:     1
fsetdiff(x, y, all = TRUE)
##       V1
##    <num>
## 1:     1
## 2:     3
  • Union
# R using Tibbles

dplyr::union(x, y)
##       V1
##    <num>
## 1:     1
## 2:     2
## 3:     3
## 4:     4
union_all(x, y)
##        V1
##     <num>
##  1:     1
##  2:     2
##  3:     2
##  4:     3
##  5:     3
##  6:     2
##  7:     2
##  8:     3
##  9:     4
## 10:     4
# R using data.table

funion(x, y)
##       V1
##    <num>
## 1:     1
## 2:     2
## 3:     3
## 4:     4
funion(x, y, all = TRUE)
##        V1
##     <num>
##  1:     1
##  2:     2
##  3:     2
##  4:     3
##  5:     3
##  6:     2
##  7:     2
##  8:     3
##  9:     4
## 10:     4
  • Equality
# R using Tibbles

setequal(x, x[order(-V1),])
## [1] TRUE
all_equal(x, x)
## Warning: `all_equal()` was deprecated in dplyr 1.1.0.
## ℹ Please use `all.equal()` instead.
## ℹ And manually order the rows/cols as needed
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## [1] TRUE
# R using data.table

fsetequal(x, x[order(-V1),])
## [1] TRUE
all.equal(x, x) # S3 method
## [1] TRUE

4 Join/Bind data sets

4.1 Join

Joining data in data.table can be done using keys, the ad-hoc ‘on’ argument, or the merge.data.table method. The three methods are presented below. As mentioned earlier, the ‘on’ and ‘by’ (in merge) arguments are optional with keyed data.tables, but are recommended to make the code more explicit. In the examples below, the x, y, and z data.tables can also be used with dplyr.

# R using Tibbles

x <- data.table(Id  = c("A", "B", "C", "C"),
                X1  = c(1L, 3L, 5L, 7L),
                XY  = c("x2", "x4", "x6", "x8"),
                key = "Id")

y <- data.table(Id  = c("A", "B", "B", "D"),
                Y1  = c(1L, 3L, 5L, 7L),
                XY  = c("y1", "y3", "y5", "y7"),
                key = "Id")
# R using data.table
  • Join matching rows from y to x
# R using Tibbles


left_join(x, y, by = "Id")
## Key: <Id>
##        Id    X1   XY.x    Y1   XY.y
##    <char> <int> <char> <int> <char>
## 1:      A     1     x2     1     y1
## 2:      B     3     x4     3     y3
## 3:      B     3     x4     5     y5
## 4:      C     5     x6    NA   <NA>
## 5:      C     7     x8    NA   <NA>
# R using data.table

y[x, on = "Id"]
## Key: <Id>
##        Id    Y1     XY    X1   i.XY
##    <char> <int> <char> <int> <char>
## 1:      A     1     y1     1     x2
## 2:      B     3     y3     3     x4
## 3:      B     5     y5     3     x4
## 4:      C    NA   <NA>     5     x6
## 5:      C    NA   <NA>     7     x8
merge(x, y, all.x = TRUE, by = "Id")
## Key: <Id>
##        Id    X1   XY.x    Y1   XY.y
##    <char> <int> <char> <int> <char>
## 1:      A     1     x2     1     y1
## 2:      B     3     x4     3     y3
## 3:      B     3     x4     5     y5
## 4:      C     5     x6    NA   <NA>
## 5:      C     7     x8    NA   <NA>
y[x] # requires keys
## Key: <Id>
##        Id    Y1     XY    X1   i.XY
##    <char> <int> <char> <int> <char>
## 1:      A     1     y1     1     x2
## 2:      B     3     y3     3     x4
## 3:      B     5     y5     3     x4
## 4:      C    NA   <NA>     5     x6
## 5:      C    NA   <NA>     7     x8
  • Join matching rows from x to y
# R using Tibbles

right_join(x, y, by = "Id")
## Key: <Id>
##        Id    X1   XY.x    Y1   XY.y
##    <char> <int> <char> <int> <char>
## 1:      A     1     x2     1     y1
## 2:      B     3     x4     3     y3
## 3:      B     3     x4     5     y5
## 4:      D    NA   <NA>     7     y7
# R using data.table

x[y, on = "Id"]
## Key: <Id>
##        Id    X1     XY    Y1   i.XY
##    <char> <int> <char> <int> <char>
## 1:      A     1     x2     1     y1
## 2:      B     3     x4     3     y3
## 3:      B     3     x4     5     y5
## 4:      D    NA   <NA>     7     y7
merge(x, y, all.y = TRUE, by = "Id")
## Key: <Id>
##        Id    X1   XY.x    Y1   XY.y
##    <char> <int> <char> <int> <char>
## 1:      A     1     x2     1     y1
## 2:      B     3     x4     3     y3
## 3:      B     3     x4     5     y5
## 4:      D    NA   <NA>     7     y7
x[y] # requires keys
## Key: <Id>
##        Id    X1     XY    Y1   i.XY
##    <char> <int> <char> <int> <char>
## 1:      A     1     x2     1     y1
## 2:      B     3     x4     3     y3
## 3:      B     3     x4     5     y5
## 4:      D    NA   <NA>     7     y7
  • Join matching rows from both x and y
# R using Tibbles

inner_join(x, y, by = "Id")
## Key: <Id>
##        Id    X1   XY.x    Y1   XY.y
##    <char> <int> <char> <int> <char>
## 1:      A     1     x2     1     y1
## 2:      B     3     x4     3     y3
## 3:      B     3     x4     5     y5
# R using data.table

x[y, on = "Id", nomatch = 0]
## Key: <Id>
##        Id    X1     XY    Y1   i.XY
##    <char> <int> <char> <int> <char>
## 1:      A     1     x2     1     y1
## 2:      B     3     x4     3     y3
## 3:      B     3     x4     5     y5
merge(x, y)
## Key: <Id>
##        Id    X1   XY.x    Y1   XY.y
##    <char> <int> <char> <int> <char>
## 1:      A     1     x2     1     y1
## 2:      B     3     x4     3     y3
## 3:      B     3     x4     5     y5
x[y, nomatch = 0] # requires keys
## Key: <Id>
##        Id    X1     XY    Y1   i.XY
##    <char> <int> <char> <int> <char>
## 1:      A     1     x2     1     y1
## 2:      B     3     x4     3     y3
## 3:      B     3     x4     5     y5
  • Join keeping all the rows
# R using Tibbles

full_join(x, y, by = "Id")
## Key: <Id>
##        Id    X1   XY.x    Y1   XY.y
##    <char> <int> <char> <int> <char>
## 1:      A     1     x2     1     y1
## 2:      B     3     x4     3     y3
## 3:      B     3     x4     5     y5
## 4:      C     5     x6    NA   <NA>
## 5:      C     7     x8    NA   <NA>
## 6:      D    NA   <NA>     7     y7
# R using data.table

merge(x, y, all = TRUE, by = "Id")
## Key: <Id>
##        Id    X1   XY.x    Y1   XY.y
##    <char> <int> <char> <int> <char>
## 1:      A     1     x2     1     y1
## 2:      B     3     x4     3     y3
## 3:      B     3     x4     5     y5
## 4:      C     5     x6    NA   <NA>
## 5:      C     7     x8    NA   <NA>
## 6:      D    NA   <NA>     7     y7
  • Return rows from x matching y
# R using Tibbles

semi_join(x, y, by = "Id")
## Key: <Id>
##        Id    X1     XY
##    <char> <int> <char>
## 1:      A     1     x2
## 2:      B     3     x4
# R using data.table

unique(x[y$Id, on = "Id", nomatch = 0])
## Key: <Id>
##        Id    X1     XY
##    <char> <int> <char>
## 1:      A     1     x2
## 2:      B     3     x4
unique(x[y$Id, nomatch = 0]) # requires keys
## Key: <Id>
##        Id    X1     XY
##    <char> <int> <char>
## 1:      A     1     x2
## 2:      B     3     x4
  • Return rows from x not matching y
# R using Tibbles

anti_join(x, y, by = "Id")
## Key: <Id>
##        Id    X1     XY
##    <char> <int> <char>
## 1:      C     5     x6
## 2:      C     7     x8
# R using data.table

x[!y, on = "Id"]
## Key: <Id>
##        Id    X1     XY
##    <char> <int> <char>
## 1:      C     5     x6
## 2:      C     7     x8
x[!y] # requires keys
## Key: <Id>
##        Id    X1     XY
##    <char> <int> <char>
## 1:      C     5     x6
## 2:      C     7     x8

4.2 Bind

  • Bind rows
# R using Tibbles

x <- data.table(1:3)
y <- data.table(4:6)
z <- data.table(7:9, 0L)

bind_rows(x, y)
##       V1
##    <int>
## 1:     1
## 2:     2
## 3:     3
## 4:     4
## 5:     5
## 6:     6
bind_rows(x, z) # always fills
##       V1    V2
##    <int> <int>
## 1:     1    NA
## 2:     2    NA
## 3:     3    NA
## 4:     7     0
## 5:     8     0
## 6:     9     0
# R using data.table

rbind(x, y)
##       V1
##    <int>
## 1:     1
## 2:     2
## 3:     3
## 4:     4
## 5:     5
## 6:     6
rbind(x, z, fill = TRUE)
##       V1    V2
##    <int> <int>
## 1:     1    NA
## 2:     2    NA
## 3:     3    NA
## 4:     7     0
## 5:     8     0
## 6:     9     0
  • Bind rows using a list
# R using Tibbles

bind_rows(list(x, y), .id = "id")
##        id    V1
##    <char> <int>
## 1:      1     1
## 2:      1     2
## 3:      1     3
## 4:      2     4
## 5:      2     5
## 6:      2     6
# R using data.table

rbindlist(list(x, y), idcol = TRUE)
##      .id    V1
##    <int> <int>
## 1:     1     1
## 2:     1     2
## 3:     1     3
## 4:     2     4
## 5:     2     5
## 6:     2     6
  • Bind columns
# R using Tibbles

bind_cols(x, y)
## New names:
## • `V1` -> `V1...1`
## • `V1` -> `V1...2`
##    V1...1 V1...2
##     <int>  <int>
## 1:      1      4
## 2:      2      5
## 3:      3      6
# R using data.table

base::cbind(x, y)
##       V1    V1
##    <int> <int>
## 1:     1     4
## 2:     2     5
## 3:     3     6

5 Other

5.1 Lead/Lag

# R using Tibbles

lag(1:10, n = 1, default = NA)
##  [1] NA  1  2  3  4  5  6  7  8  9
purrr::map(1:2, ~lag(1:10, n = .x))
## [[1]]
##  [1] NA  1  2  3  4  5  6  7  8  9
## 
## [[2]]
##  [1] NA NA  1  2  3  4  5  6  7  8
lead(1:10, n = 1, default = NA)
##  [1]  2  3  4  5  6  7  8  9 10 NA
# R using data.table

shift(1:10, n = 1,   fill = NA, type = "lag")
##  [1] NA  1  2  3  4  5  6  7  8  9
shift(1:10, n = 1:2, fill = NA, type = "lag") # multiple
## [[1]]
##  [1] NA  1  2  3  4  5  6  7  8  9
## 
## [[2]]
##  [1] NA NA  1  2  3  4  5  6  7  8
shift(1:10, n = 1,   fill = NA, type = "lead")
##  [1]  2  3  4  5  6  7  8  9 10 NA

6 Read/Write data

6.1 standart format

# R
library(tidyverse)

covid_papers <- read_csv("data/covid_papers.csv")
## Rows: 148916 Columns: 19
## ── Column specification ──────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (17): _id, pmid, pmcid, doi, title, source, year_pub, month_pub, author...
## dbl   (1): cited
## date  (1): creation
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

covid_papers
## # A tibble: 148,916 × 19
##    `_id`            pmid  pmcid doi   title source year_pub month_pub creation  
##    <chr>            <chr> <chr> <chr> <chr> <chr>  <chr>    <chr>     <date>    
##  1 5f69d6fa65a250f… 3283… PMC7… 10.1… The … MED    2020     null      2020-08-25
##  2 5f69d6fa65a250f… 3219… PMC7… 10.1… Hypo… MED    2020     null      2020-03-21
##  3 5f69d6fa65a250f… 3277… PMC7… 10.1… The … MED    2020     null      2020-08-12
##  4 5f69d6fa65a250f… 3273… PMC7… 10.1… A co… MED    2020     null      2020-07-31
##  5 5f69d6fa65a250f… null  null  10.2… Viri… PPR    2020     null      2020-08-16
##  6 5f69d6fa65a250f… null  PMC7… null  Inte… PMC    2020     null      2020-09-13
##  7 5f69d6fa65a250f… 3239… PMC7… 10.1… A gl… MED    2020     null      2020-05-13
##  8 5f69d6fa65a250f… 3225… PMC7… 10.1… COVI… MED    2020     null      2020-04-08
##  9 5f69d6fa65a250f… 3272… PMC7… 10.1… The … MED    2020     null      2020-07-29
## 10 5f69d6fa65a250f… 3219… PMC7… 10.1… Labo… MED    2020     null      2020-03-21
## # ℹ 148,906 more rows
## # ℹ 10 more variables: authors_gender <chr>, journal <chr>, language <chr>,
## #   pubmodel <chr>, pubtype <chr>, mesh <chr>, keywords <chr>, cited <dbl>,
## #   has_data <chr>, oa <chr>

Depending on the way string are encoded we may want to precise the type of encoding. (Some well know encoding character format: UTF-8; Unicode; ASCII ..)

we can sepcify this encoding in the read function.
# R
covid_papers <- read_csv("data/covid_papers.csv",locale = locale(encoding = "UTF-8"))

Other types of data can be import using other libraries I just show you some examples here without enterring into details that may be helpfull sometimes.

  • Excel data
# R
library(readxl)

covid_papers <- read_excel("data/covid_papers.xls", sheet = "Feuille1", range = "A1:Z25")
  • SAS, SPSS et Stata data
# R
library(haven)

#SAS data
covid_papers <- read_sas("data/covid_papers.sas")
covid_papers <- read_xpt("data/covid_papers.xpt")

#SPSS data
covid_papers <- read_sav("data/covid_papers.sav")
covid_papers <- read_por("data/covid_papers.por")

In order to save data in a specific format we just have to change the ‘read’ part of the function name by ‘write’. Here is an example for csv

# R

write_csv(new_data_set,"data/new_data.csv")

Also R provide a way to store objects in the same file. For this we use RData format, it’s very usefull if a project is written totaly in R because you can save the structure of your objects very easily, you can save all type of objects and load it later.

# R

save(covid_papers,file= 'covid_papers.RData')
load('covid_papers.RData')

6.2 Miscellaneous

  • Read / Write data

fread() and fwrite() are among the most powerful functions of data.table. They are not only incredibly fast (as seen in benchmarks), but they are also extremely robust. The few commands below only touch the surface, and there are many awesome features. For example, fread() accepts http and https URLs directly, as well as operating system commands such as sed and awk output. Make sure to check the documentation.

Here again, fread() and fwrite() are very versatile and can handle different file formats while dplyr delegates file reading and writing to the readr package with several specific functions (csv, tsv, delim, …). This means that dplyr requires the use of different functions for different file formats, whereas fread() and fwrite() can handle multiple file formats with a single function. This makes data.table’s file handling functions more efficient and easier to use.

  • Write data to a csv file
# R using Tibbles

readr::write_csv(cp_tibble, "cp_tibble.csv")
# R using data.table

fwrite(cp_datatable, "cp_datatable.csv")
  • Write data to a tab-delimited file
# R using Tibbles

readr::write_delim(cp_tibble, "cp_tibble.txt", delim = "\t")
# R using data.table

fwrite(cp_datatable, "cp_datatable.txt", sep = "\t")
  • Write list-column data to a csv file
# R using Tibbles
# R using data.table

fwrite(setDT(list(0, list(1:5))), "cp_datatable2.csv")
  • Read a csv / tab-delimited file
# R using Tibbles

readr::read_csv("cp_tibble.csv")

readr::read_delim("cp_tibble.txt", delim = "\t")
# R using data.table

fread("cp_datatable.csv")
# fread("cp_datatable.csv", verbose = TRUE) # full details
fread("cp_datatable.txt", sep = "\t")
  • Read a csv file selecting / droping columns
# R using Tibbles
# R using data.table

fread("cp_datatable.csv", select = c("pmid", "year_pub"))
fread("cp_datatable.csv", drop = "title")
  • Read and rbind several files
# R using Tibbles

c("cp_tibble.csv", "cp_tibble.csv") %>%
  purrr::map_dfr(readr::read_csv)
# R using data.table

rbindlist(lapply(c("cp_datatable.csv", "cp_datatable.csv"), fread))
# c("cp_datatable.csv", "cp_datatable.csv") %>% lapply(fread) %>% rbindlist

7 Exercise

Using first dplyr and the tidyverse, Use table_1.csv, table_2.csv and loc_cities.csv, we will need information from the three datasets.

  • Import datasets

  • Remove articles with no pmid and no DOI, and all articles before 1975

  • Merge the two datasets, pmid is unique for each paper

  • Create a new variable with the number of authors for each paper

  • plot distribution for the log(number of authors +1)

  • How many papers contains ‘deep learning’ or ‘machine learning’ and ‘neural network’ (also with a ‘s’ for neural networks) in their title ? Create a binary variable to save this information. What is the mean of authors for ML papers and non#ML papers ?

  • Transform has_data and oa into binary variable also, what is the share of ML paper that are oa

  • Clean up pubtype, for simplicity just get the first type

  • What is the pub type with the highest mean/sd of citation for each type of publication ? (use cited and the cleaned pub_type)

  • What are the most representative country by year ? In another tibble, keep only pmid and authors, and get the country for each author from the loc_cities.csv. You may want to separate rows for each authors to get all countries involved in the paper. If an author have multiple affiliations, take the first one.

  • Select the top 25 of countries involved in coronavirus research since 2001, plot the evolution on a bar chart with plot_ly

Solution with dplyr –>