“It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data (Dasu and Johnson 2003). Data preparation is not just a first step, but must berepeated many times over the course of analysis as new problems come to light or new data is collected.” (Wickham, 2014).
The Tidyverse is a collection of differents packages which works together in a common philosophy, the biggest contributor to this collection is named Hadley Wickham. Hadley Wickham did so much for the R community, he published a lot of books that are highly recommended to read. Check Wikipedia: Haldey Wickham for his bibliograpy. Here is the list of libraries that are contained in the Tidyverse:
ggplot2: Visualization
dplyr: Data manipulation
tidyr: Data formatting
purrr: programming
readr: Data importation
tibble: Data tables
forcats: Qualitative variables
stringr: Strings The goal of the tidyverse is to make data analysis easier by simplifying data preparation. Tidyverse includes a new data type, Tibbles.
Tibbles can be considered a modern version of data.frame, but they differ in some aspects. They do not have row names and allow users to have column names with special characters (e.g. if the column name of x is ‘Col 1’, then to access this column you can use x$`Col 1` to access this variable). Tibbles also provide a good way to visualize data since it directly shows data as if you were running the ‘head()’ function before, displaying the first 10 rows.
Data.table is also a common choice when dealing with data and it is very fast. However, using Tibbles can be time-consuming when dealing with large tables.
Dplyr is user-friendly as it uses very understandable words to manipulate data. It uses a specific way of writing calculations called the pipe operator: ‘%>%’. Pipes allow us to write sequential calculations in a readable way. Here is how we need to read R code that uses pipes: The code on the left side of the pipe operator is passed as the first argument to the function on the right side of the operator. This allows for a more readable and concise way of performing multiple operations on data, as it eliminates the need to create intermediate variables and makes the code more readable by showing the flow of data.
Usual Way
funtion2(function1(dataset,arg1),arg2)
Using Pipes
dataset %>% function1(arg1) %>% function2(arg2)
Note that we can write directly ‘%>%’ by using ‘ctrl + shift + m’
Here is list of function that are commonly used and actions they provide.
| Functions | Description |
|---|---|
| slice | restrict dataset to specific row (see also slice_head, slice_tail, slice_min, slice_max) |
| slice_sample | restrict dataset to random row |
| filter | restrict dataset to row that pass a condition |
| select | restrict dataset to specifiyed columns |
| rename | change columns names |
| arrange | sort dataset with respect to one columns |
| mutate | modify or create a columns that apply a function |
| group_by | group data before running calculation |
| ungroup | ungroup data |
| summarise | summarise data with respect to the level of agregation (groups) |
| count | count numbers of row in a group |
| lead/lag | shift the observations of a variable one notch backwards (for lead) or forwards (for lag) |
| distinct | given a varaiable restrict the dataset to unique value of this variable |
| bind_rows/bind_cols | add columns or rows to the dataset |
| left_join | merge datatables with respect to variable(s) (see also full_join, inner_join, anti_join) |
starts_with
# R
library(tidyverse)
## ── Attaching core tidyverse packages ────────────────────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.1 ✔ readr 2.1.6
## ✔ ggplot2 4.0.1 ✔ stringr 1.6.0
## ✔ lubridate 1.9.4 ✔ tibble 3.3.0
## ✔ purrr 1.2.0 ✔ tidyr 1.3.1
## ── Conflicts ──────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
covid_papers <- read_csv("data/covid_papers.csv")
## Rows: 148916 Columns: 19
## ── Column specification ──────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): _id, pmid, pmcid, doi, title, source, year_pub, month_pub, author...
## dbl (1): cited
## date (1): creation
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
covid_papers
## # A tibble: 148,916 × 19
## `_id` pmid pmcid doi title source year_pub month_pub creation
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <date>
## 1 5f69d6fa65a250f… 3283… PMC7… 10.1… The … MED 2020 null 2020-08-25
## 2 5f69d6fa65a250f… 3219… PMC7… 10.1… Hypo… MED 2020 null 2020-03-21
## 3 5f69d6fa65a250f… 3277… PMC7… 10.1… The … MED 2020 null 2020-08-12
## 4 5f69d6fa65a250f… 3273… PMC7… 10.1… A co… MED 2020 null 2020-07-31
## 5 5f69d6fa65a250f… null null 10.2… Viri… PPR 2020 null 2020-08-16
## 6 5f69d6fa65a250f… null PMC7… null Inte… PMC 2020 null 2020-09-13
## 7 5f69d6fa65a250f… 3239… PMC7… 10.1… A gl… MED 2020 null 2020-05-13
## 8 5f69d6fa65a250f… 3225… PMC7… 10.1… COVI… MED 2020 null 2020-04-08
## 9 5f69d6fa65a250f… 3272… PMC7… 10.1… The … MED 2020 null 2020-07-29
## 10 5f69d6fa65a250f… 3219… PMC7… 10.1… Labo… MED 2020 null 2020-03-21
## # ℹ 148,906 more rows
## # ℹ 10 more variables: authors_gender <chr>, journal <chr>, language <chr>,
## # pubmodel <chr>, pubtype <chr>, mesh <chr>, keywords <chr>, cited <dbl>,
## # has_data <chr>, oa <chr>
#which(is.na(covid_papers),arr.ind = T)
covid_papers = na.omit(covid_papers)
dim(slice(covid_papers,2))
## [1] 1 19
dim(slice_sample(covid_papers,prop = 0.001))
## [1] 148 19
t <- Sys.time()
doi_clean <- filter(covid_papers,doi != 'null')
Sys.time()-t
## Time difference of 0.014009 secs
Using Pipes
# R
dim(covid_papers %>% slice(2))
## [1] 1 19
dim(covid_papers %>% slice_sample(prop = 0.001))
## [1] 148 19
t = Sys.time()
doi_clean <- covid_papers %>% filter(doi != 'null')
Sys.time() - t
## Time difference of 0.01347208 secs
t = Sys.time()
doi_clean <- covid_papers[which(covid_papers$doi != 'null'),]
Sys.time() - t
## Time difference of 0.014498 secs
# R using Tibbles
cp_tibble <- as_tibble(covid_papers)
dim(cp_tibble[[4]]) # returns a vector
## NULL
dim(cp_tibble[4])# returns a tibble
## [1] 148913 1
dim(select(cp_tibble, doi)) # returns a tibble
## [1] 148913 1
dim(pull(cp_tibble, doi, name = title)) # returns a (named) vector
## NULL
dim(cp_tibble[, "doi"]) # returns a tibble
## [1] 148913 1
dim(cp_tibble[["doi"]]) # returns a vector
## NULL
dim(select(cp_tibble, doi, creation, title))
## [1] 148913 3
dim(select(cp_tibble, doi:title)) # select columns between doi and title
## [1] 148913 2
dim(select(cp_tibble, -doi, -creation))
## [1] 148913 17
cols <- c("doi", "creation")
dim(select(cp_tibble, !!cols)) # unquoting
## [1] 148913 2
dim(select(cp_tibble, -!!cols))
## [1] 148913 17
# R using data.tablee
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
## The following object is masked from 'package:purrr':
##
## transpose
## The following objects are masked from 'package:dplyr':
##
## between, first, last
cp_datatable = as.data.table(covid_papers)
dim(cp_datatable[[4]]) # returns a vector
## NULL
dim(cp_datatable[, 4]) # returns a data.table
## [1] 148913 1
dim(cp_datatable[, list(doi)]) # returns a data.table
## [1] 148913 1
dim(cp_datatable[, .(doi)]) # returns a data.table
## [1] 148913 1
# . is an alias for list
dim(cp_datatable[, "doi"]) # returns a data.table
## [1] 148913 1
dim(cp_datatable[, doi]) # returns a vector
## NULL
dim(cp_datatable[["doi"]]) # returns a vector
## NULL
dim(cp_datatable[, .(doi, creation, title)])
## [1] 148913 3
dim(cp_datatable[, list(doi, creation, title)])
## [1] 148913 3
dim(cp_datatable[, doi:title]) # select columns between doi and title
## [1] 148913 2
dim(cp_datatable[, !c("doi", "creation")])
## [1] 148913 17
cols <- c("doi", "creation")
dim(cp_datatable[, ..cols]) # .. prefix means 'one-level up'
## [1] 148913 2
dim(cp_datatable[, !..cols]) # or cp_datatable[, -..cols]
## [1] 148913 17
# R using Tibbles
colnames(cp_tibble)[1:3] <- paste0('V',c(1:3))
select(cp_tibble, num_range("V", 1:2))
select(cp_tibble, title, everything()) # reorder columns
select(cp_tibble, contains("V"))
select(cp_tibble, ends_with("3"))
select(cp_tibble, matches(".2"))
select(cp_tibble, one_of(c("V1", "X")))
## Warning: Unknown columns: `X`
select(cp_tibble, -starts_with("doi"))
# R using data.tablee
colnames(cp_datatable)[1:3] <- paste0('V',c(1:3))
cols <- paste0("V", 1:2)
cols <- union("title", names(cp_datatable))
cols <- grep("V", names(cp_datatable))
cols <- grep("3$", names(cp_datatable))
cols <- grep(".2", names(cp_datatable))
cols <- grep("^V1|X$", names(cp_datatable))
cols <- grep("^(?!doi)", names(cp_datatable), perl = TRUE)
cp_datatable[, ..cols]
cp_tibble = cp_tibble %>% mutate(year_pub = as.integer(year_pub))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `year_pub = as.integer(year_pub)`.
## Caused by warning:
## ! NAs introduced by coercion
# R using Tibbles
cp_tibble[3:4,]
slice(cp_tibble, 3:4) # same
cp_tibble[-(3:7),]
slice(cp_tibble, -(3:7)) # same
dim(filter(cp_tibble, year_pub > 2018))
filter(cp_tibble, title %in% c("Deep convolutional approaches for the analysis of Covid-19 using chest X-Ray images from portable devices",
"Evaluating the effect of city lock-down on controlling COVID-19 propagation through deep learning and network science models."))
filter(cp_tibble, year_pub == 2019, title == "Deep convolutional approaches for the analysis of Covid-19 using chest X-Ray images from portable devices")
dim(distinct(cp_tibble)) # distinct_all(cp_tibble)
dim(distinct(cp_tibble, across(c(V1, title)))) # returns selected cols
dim(tidyr::drop_na(cp_tibble, names(cp_tibble)[1:4]))
dim(filter(cp_tibble, grepl("deep learning", title)))
dim(filter(cp_tibble, dplyr::between(year_pub, 2015, 2017)))
dim(filter(cp_tibble, year_pub > 2015 & year_pub < 2017))
cp_datatable[,year_pub := as.integer(year_pub)]
## Warning in eval(jsub, SDenv, parent.frame()): NAs introduced by coercion
# R using data.table
cp_datatable[3:4,]
cp_datatable[3:4] # same
cp_datatable[!3:7,]
cp_datatable[-(3:7)] # same
cp_datatable[year_pub > 2018]
cp_datatable[title %chin% c("Deep convolutional approaches for the analysis of Covid-19 using chest X-Ray images from portable devices",
"Evaluating the effect of city lock-down on controlling COVID-19 propagation through deep learning and network science models.")] # fast %in% for character
cp_datatable[year_pub == 2019 & title == "Deep convolutional approaches for the analysis of Covid-19 using chest X-Ray images from portable devices"]
unique(cp_datatable)
unique(cp_datatable, by = c("V1", "title")) # returns all cols
na.omit(cp_datatable, cols = 1:4) # fast S3 method with cols argument
cp_datatable[title %like% "deep learning"]
cp_datatable[year_pub %between% c(2015, 2017)]
cp_datatable[data.table::between(year_pub, 2015, 2017, incbounds = FALSE)]
On the other hand, data.table also provides convenience functions to filter rows based on a regular expression or to find values lying in one (or several) interval(s).
Below, we will see that data.table has two optimized mechanisms to filter rows efficiently (keys and indices).
# R using Tibbles
head(cp_tibble %>% arrange(creation) %>% select(creation))
## # A tibble: 6 × 1
## creation
## <date>
## 1 1929-01-01
## 2 1933-02-01
## 3 1945-01-01
## 4 1946-12-01
## 5 1951-09-08
## 6 1951-12-01
head(cp_tibble %>% arrange(desc(creation)) %>% select(creation))
## # A tibble: 6 × 1
## creation
## <date>
## 1 2020-09-16
## 2 2020-09-16
## 3 2020-09-16
## 4 2020-09-16
## 5 2020-09-16
## 6 2020-09-16
head(cp_tibble %>% arrange(year_pub, desc(creation)) %>% select(year_pub,creation))
## # A tibble: 6 × 2
## year_pub creation
## <int> <date>
## 1 1820 2018-12-05
## 2 1821 2018-06-23
## 3 1830 2018-06-20
## 4 1832 2018-06-20
## 5 1833 2018-06-20
## 6 1835 2018-11-09
# R using data.tablee
head(cp_datatable[order(creation)][,creation])
head(cp_datatable[order(-creation)][,creation])
head(cp_datatable[order(year_pub, -creation)][,.(year_pub,creation)])
# R using Tibbles
n_distinct(cp_tibble)
## [1] 148913
cp_tibble %>% summarise(sum(cited)) # returns a tibble
## # A tibble: 1 × 1
## `sum(cited)`
## <dbl>
## 1 1241805
cp_tibble %>% summarise(sum_citation = sum(cited)) # returns a tibble
## # A tibble: 1 × 1
## sum_citation
## <dbl>
## 1 1241805
summarise(cp_tibble, sum(cited), sd(cited))
## # A tibble: 1 × 2
## `sum(cited)` `sd(cited)`
## <dbl> <dbl>
## 1 1241805 38.8
cp_tibble %>%
summarise(sum_citation = sum(cited),
sd_citation = sd(cited))
## # A tibble: 1 × 2
## sum_citation sd_citation
## <dbl> <dbl>
## 1 1241805 38.8
cp_tibble %>%
slice(1:4) %>%
summarise(sum(cited))
## # A tibble: 1 × 1
## `sum(cited)`
## <dbl>
## 1 61
cp_tibble %>% summarise(dplyr::first(year_pub))
## # A tibble: 1 × 1
## `dplyr::first(year_pub)`
## <int>
## 1 2020
cp_tibble %>% summarise(dplyr::last(year_pub))
## # A tibble: 1 × 1
## `dplyr::last(year_pub)`
## <int>
## 1 2016
cp_tibble %>% summarise(nth(year_pub, 5))
## # A tibble: 1 × 1
## `nth(year_pub, 5)`
## <int>
## 1 2020
cp_tibble %>% summarise(n_distinct(language))
## # A tibble: 1 × 1
## `n_distinct(language)`
## <int>
## 1 31
# R using data.tablee
uniqueN(cp_datatable)
## [1] 148913
cp_datatable[, sum(cited)] # returns a vector
## [1] 1241805
cp_datatable[, .(sum(cited))] # returns a data.table
## V1
## <num>
## 1: 1241805
cp_datatable[, .(sum_citation = sum(cited))] # returns a data.table
## sum_citation
## <num>
## 1: 1241805
cp_datatable[, .(sum(cited), sd(cited))]
## V1 V2
## <num> <num>
## 1: 1241805 38.8293
cp_datatable[, .(sum_citation = sum(cited),
sd_citation = sd(cited))]
## sum_citation sd_citation
## <num> <num>
## 1: 1241805 38.8293
cp_datatable[1:4, sum(cited)]
## [1] 61
cp_datatable[, data.table::first(year_pub)]
## [1] 2020
cp_datatable[, data.table::last(year_pub)]
## [1] 2016
cp_datatable[5, year_pub]
## [1] 2020
cp_datatable[, uniqueN(language)]
## [1] 31
Dplyr helper functions for the summarise() function include first(), last(), n(), nth(), and n_distinct(). The data.table package also includes first(), last(), and uniqueN().
When it comes to modifying columns, data.table allows for the use of the column assignment symbol := to modify columns by reference without creating a copy. With dplyr, the results must be assigned.
# R using Tibbles
cp_tibble <- cp_tibble %>% rowwise() %>% mutate(oa = ifelse(oa == 'Y',1,0), log_cited = log(cited + 1))
# Create one column and remove the others
transmute(cp_tibble, title_lower = tolower(title))
## # A tibble: 148,913 × 1
## # Rowwise:
## title_lower
## <chr>
## 1 the pathophysiology, diagnosis and treatment of corona virus disease 2019 (c…
## 2 hypothesis for potential pathogenesis of sars-cov-2 infection-a review of im…
## 3 the outbreak of the novel severe acute respiratory syndrome coronavirus 2 (s…
## 4 a comparative overview of covid-19, mers and sars: review article.
## 5 virion structure and mechanism of propagation of coronaviruses including sar…
## 6 interaction of sars-cov-2 and other coronavirus with ace (angiotensin-conver…
## 7 a global treatments for coronaviruses including covid-19.
## 8 covid-19 infection: origin, transmission, and characteristics of human coron…
## 9 the kidney, covid-19, and the chemokine network: an intriguing trio.
## 10 laboratory diagnosis of emerging human coronavirus infections - the state of…
## # ℹ 148,903 more rows
cp_tibble <- cp_tibble %>% select(-V1, -V2)
cols <- c('V3','doi')
cp_tibble <- select(cp_tibble, -one_of(cols))
# R using data.tablee
cp_datatable[, c('oa','log_cited') := .(ifelse(oa == 'Y',1,0),log(cited+1))]
# Create one column and remove the others
cp_datatable[, .(title_lower = tolower(title))]
## title_lower
## <char>
## 1: the pathophysiology, diagnosis and treatment of corona virus disease 2019 (covid-19).
## 2: hypothesis for potential pathogenesis of sars-cov-2 infection-a review of immune changes in patients with viral pneumonia.
## 3: the outbreak of the novel severe acute respiratory syndrome coronavirus 2 (sars-cov-2): a review of the current global status.
## 4: a comparative overview of covid-19, mers and sars: review article.
## 5: virion structure and mechanism of propagation of coronaviruses including sars-cov 2 (covid -19 ) and some meaningful points for drug or vaccine development
## ---
## 148909: bmt settings, infection and infection control
## 148910: early and acute complications and the principles of hsct nursing care
## 148911: viral infections
## 148912: international collaboration for global public health
## 148913: resource allocation and priority setting
cp_datatable[, c('V1', 'V2') := NULL]
cols <- c('V3','doi')
cp_datatable[, (cols) := NULL] # ! not cp_datatable[, cols := NULL]
The dplyr::group_by() function and the corresponding by and keyby statements in data.table allow for the manipulation of each group of observations and combining the results. The main difference between by and keyby is that keyby orders the results and creates a key that allows for faster subsetting (cf. the indexing and keys section). Below, we use one or the other arbitrarily.
The group_by() function takes an existing tibble and converts it into a grouped tibble where operations will always be performed “by group”. Using ungroup() removes grouping. With data.table, by is always used on the fly.
Note that it is possible to reorder the arguments in data.table: cp_datatable[i, j, by] <=> cp_datatable[i, by, j]. This is done below to better highlight the similarity with dplyr.
It’s worth noting that group_by() and by in data.table perform the same task of grouping the data by certain variable(s) and keyby is used to enhance the performance of data.table by creating keys on certain columns, also note that you have to use ungroup() function to remove the grouping in dplyr while in data.table you don’t need to do that.
# R using Tibbles
cp_tibble <- as_tibble(covid_papers) %>% filter(pmid!='null')
cp_tibble %>%
group_by(year_pub) %>%
summarise(nb_language = length(unique(language)))
## # A tibble: 113 × 2
## year_pub nb_language
## <chr> <int>
## 1 1830 1
## 2 1832 1
## 3 1833 1
## 4 1836 1
## 5 1842 1
## 6 1845 1
## 7 1857 1
## 8 1864 1
## 9 1874 1
## 10 1876 1
## # ℹ 103 more rows
# By several groups
cp_tibble %>%
group_by(year_pub, source) %>%
summarise(mean_cited = mean(cited))
## `summarise()` has grouped output by 'year_pub'. You can override using the `.groups`
## argument.
## # A tibble: 113 × 3
## # Groups: year_pub [113]
## year_pub source mean_cited
## <chr> <chr> <dbl>
## 1 1830 MED 0
## 2 1832 MED 0
## 3 1833 MED 0
## 4 1836 MED 0
## 5 1842 MED 0
## 6 1845 MED 0
## 7 1857 MED 0
## 8 1864 MED 0
## 9 1874 MED 0
## 10 1876 MED 0
## # ℹ 103 more rows
# Assigning column name in by
cp_tibble %>%
group_by(PY = as.numeric(year_pub)) %>%
summarise(sum_cited = sum(cited))
## # A tibble: 113 × 2
## PY sum_cited
## <dbl> <dbl>
## 1 1830 0
## 2 1832 0
## 3 1833 0
## 4 1836 0
## 5 1842 0
## 6 1845 0
## 7 1857 0
## 8 1864 0
## 9 1874 0
## 10 1876 0
## # ℹ 103 more rows
# Using a condition
cp_tibble %>%
group_by(source == "MED") %>%
summarise(mean(cited))
## # A tibble: 1 × 2
## `source == "MED"` `mean(cited)`
## <lgl> <dbl>
## 1 TRUE 10.9
# Count number of observations for each group
cp_tibble %>%
group_by(source) %>%
tally()
## # A tibble: 1 × 2
## source n
## <chr> <int>
## 1 MED 113917
cp_tibble %>%
group_by(source) %>%
summarise(n())
## # A tibble: 1 × 2
## source `n()`
## <chr> <int>
## 1 MED 113917
cp_tibble %>%
group_by(source) %>%
group_size() # returns a vector
## [1] 113917
# Add a column with number of observations for each group
add_count(cp_tibble, year_pub) %>% select(year_pub,n)
## # A tibble: 113,917 × 2
## year_pub n
## <chr> <int>
## 1 2020 63760
## 2 2020 63760
## 3 2020 63760
## 4 2020 63760
## 5 2020 63760
## 6 2020 63760
## 7 2020 63760
## 8 2020 63760
## 9 2020 63760
## 10 2020 63760
## # ℹ 113,907 more rows
dim(cp_tibble %>%
group_by(year_pub) %>%
add_tally())
## [1] 113917 20
# R using data.table
cp_datatable <- as.data.table(covid_papers)[pmid!='null',]
cp_datatable[, .(nb_language = length(unique(language))), by = "year_pub"]
## year_pub nb_language
## <char> <int>
## 1: 2020 18
## 2: 2019 8
## 3: 2021 2
## 4: 2018 7
## 5: 2017 10
## ---
## 109: 1833 1
## 110: 1845 1
## 111: 1916 1
## 112: 1885 1
## 113: 1832 1
# By several groups
cp_datatable[, keyby = .(year_pub, source),
.(mean_cited = mean(cited))]
## Key: <year_pub, source>
## year_pub source mean_cited
## <char> <char> <num>
## 1: 1830 MED 0.00000000
## 2: 1832 MED 0.00000000
## 3: 1833 MED 0.00000000
## 4: 1836 MED 0.00000000
## 5: 1842 MED 0.00000000
## ---
## 109: 2017 MED 7.67789799
## 110: 2018 MED 5.22422209
## 111: 2019 MED 2.76006529
## 112: 2020 MED 2.28764115
## 113: 2021 MED 0.02105263
# Assigning column name in by
cp_datatable[, keyby = .(PY = as.numeric(year_pub)),
.(sum_cited = sum(cited))]
## Key: <PY>
## PY sum_cited
## <num> <num>
## 1: 1830 0
## 2: 1832 0
## 3: 1833 0
## 4: 1836 0
## 5: 1842 0
## ---
## 109: 2017 24838
## 110: 2018 17125
## 111: 2019 10146
## 112: 2020 145860
## 113: 2021 2
# Using a condition in by
cp_datatable[, keyby = source == "MED",
mean(cited)]
## Key: <source>
## source V1
## <lgcl> <num>
## 1: TRUE 10.88341
# Count number of observations for each group
cp_datatable[, .N, by = source]
## source N
## <char> <int>
## 1: MED 113917
count(cp_tibble, source)
## # A tibble: 1 × 2
## source n
## <chr> <int>
## 1 MED 113917
# Add a column with number of observations for each group
dim(cp_datatable[, n := .N, by = source][])
## [1] 113917 20
# R using Tibbles
cp_tibble <- cp_tibble %>% mutate(PY = as.numeric(year_pub),log_cited =log(cited+1))
cp_tibble %>%
group_by(PY) %>%
slice(1)
cp_tibble %>%
group_by(PY) %>%
slice(1, n())
cp_tibble %>%
group_by(PY) %>%
group_map(~ tail(.x, 2))
# R using data.table
cp_datatable[,c('PY','log_cited') := .(as.numeric(year_pub),log(cited+1))]
cp_datatable[, .SD[1], by = PY]
cp_datatable[, .SD[c(1, .N)], by = PY]
cp_datatable[, tail(.SD, 2), by = PY]
# R using Tibbles
cp_tibble %>%
group_by(PY) %>%
arrange(cited) %>%
slice(1)
# R using data.table
cp_datatable[, .SD[which.min(cited)], by = PY]
# R using Tibbles
cp_tibble %>%
group_by(PY, language) %>%
mutate(Grp = cur_group_id())
# R using data.table
cp_datatable[, Grp := .GRP, by = .(PY, language)][]
cp_datatable[, Grp := NULL] # delete for consistency
# R using Tibbles
cp_tibble %>%
group_by(PY) %>%
mutate(cur_group_rows())
## # A tibble: 113,917 × 22
## # Groups: PY [113]
## `_id` pmid pmcid doi title source year_pub month_pub creation
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <date>
## 1 5f69d6fa65a250f… 3283… PMC7… 10.1… The … MED 2020 null 2020-08-25
## 2 5f69d6fa65a250f… 3219… PMC7… 10.1… Hypo… MED 2020 null 2020-03-21
## 3 5f69d6fa65a250f… 3277… PMC7… 10.1… The … MED 2020 null 2020-08-12
## 4 5f69d6fa65a250f… 3273… PMC7… 10.1… A co… MED 2020 null 2020-07-31
## 5 5f69d6fa65a250f… 3239… PMC7… 10.1… A gl… MED 2020 null 2020-05-13
## 6 5f69d6fa65a250f… 3225… PMC7… 10.1… COVI… MED 2020 null 2020-04-08
## 7 5f69d6fa65a250f… 3272… PMC7… 10.1… The … MED 2020 null 2020-07-29
## 8 5f69d6fa65a250f… 3219… PMC7… 10.1… Labo… MED 2020 null 2020-03-21
## 9 5f69d6fa65a250f… 3265… PMC7… 10.1… The … MED 2020 null 2020-07-14
## 10 5f69d6fa65a250f… 3273… PMC7… 10.1… Cycl… MED 2020 null 2020-08-01
## # ℹ 113,907 more rows
## # ℹ 13 more variables: authors_gender <chr>, journal <chr>, language <chr>,
## # pubmodel <chr>, pubtype <chr>, mesh <chr>, keywords <chr>, cited <dbl>,
## # has_data <chr>, oa <chr>, PY <dbl>, log_cited <dbl>,
## # `cur_group_rows()` <int>
#cp_tibble %>%
# group_by(PY) %>%
# group_data() %>%
# tidyr::unnest(.rows)
cp_tibble %>%
group_by(PY) %>%
summarize(cur_group_rows()[1])
## # A tibble: 113 × 2
## PY `cur_group_rows()[1]`
## <dbl> <int>
## 1 1830 113846
## 2 1832 113890
## 3 1833 113879
## 4 1836 113827
## 5 1842 113769
## 6 1845 113882
## 7 1857 113803
## 8 1864 113838
## 9 1874 113544
## 10 1876 113787
## # ℹ 103 more rows
cp_tibble %>%
group_by(PY) %>%
summarize(cur_group_rows()[c(1, n())])
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()` always returns
## an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## `summarise()` has grouped output by 'PY'. You can override using the `.groups`
## argument.
## # A tibble: 226 × 2
## # Groups: PY [113]
## PY `cur_group_rows()[c(1, n())]`
## <dbl> <int>
## 1 1830 113846
## 2 1830 113846
## 3 1832 113890
## 4 1832 113890
## 5 1833 113879
## 6 1833 113879
## 7 1836 113827
## 8 1836 113888
## 9 1842 113769
## 10 1842 113885
## # ℹ 216 more rows
# R using data.table
cp_datatable[, .I, by = PY] # returns a data.table
## PY I
## <num> <int>
## 1: 2020 1
## 2: 2020 2
## 3: 2020 3
## 4: 2020 4
## 5: 2020 5
## ---
## 113913: 1833 113879
## 113914: 1845 113882
## 113915: 1916 113883
## 113916: 1885 113889
## 113917: 1832 113890
cp_datatable[, .I[1], by = PY]
## PY V1
## <num> <int>
## 1: 2020 1
## 2: 2019 1814
## 3: 2021 2351
## 4: 2018 14295
## 5: 2017 19268
## ---
## 109: 1833 113879
## 110: 1845 113882
## 111: 1916 113883
## 112: 1885 113889
## 113: 1832 113890
cp_datatable[, .I[c(1, .N)], by = PY]
## PY V1
## <num> <int>
## 1: 2020 1
## 2: 2020 113896
## 3: 2019 1814
## 4: 2019 113915
## 5: 2021 2351
## ---
## 222: 1916 113883
## 223: 1885 113889
## 224: 1885 113889
## 225: 1832 113890
## 226: 1832 113890
# R using Tibbles
cp_tibble %>%
group_by(PY) %>%
summarise(list(cited))
## # A tibble: 113 × 2
## PY `list(cited)`
## <dbl> <list>
## 1 1830 <dbl [1]>
## 2 1832 <dbl [1]>
## 3 1833 <dbl [1]>
## 4 1836 <dbl [3]>
## 5 1842 <dbl [2]>
## 6 1845 <dbl [1]>
## 7 1857 <dbl [1]>
## 8 1864 <dbl [1]>
## 9 1874 <dbl [2]>
## 10 1876 <dbl [1]>
## # ℹ 103 more rows
cp_tibble %>%
group_by(PY) %>%
group_nest()
## # A tibble: 113 × 2
## PY data
## <dbl> <list<tibble[,20]>>
## 1 1830 [1 × 20]
## 2 1832 [1 × 20]
## 3 1833 [1 × 20]
## 4 1836 [3 × 20]
## 5 1842 [2 × 20]
## 6 1845 [1 × 20]
## 7 1857 [1 × 20]
## 8 1864 [1 × 20]
## 9 1874 [2 × 20]
## 10 1876 [1 × 20]
## # ℹ 103 more rows
# R using data.table
cp_datatable[, .(.(cited)), by = PY] # return V1 as a list
## PY V1
## <num> <list>
## 1: 2020 0,61, 0, 0, 1,79,...
## 2: 2019 0, 12, 5, 0,363, 0,...
## 3: 2021 0,0,0,0,0,1,...
## 4: 2018 3, 49, 11, 27, 60,123,...
## 5: 2017 27, 6, 2,22,10,11,...
## ---
## 109: 1833 0
## 110: 1845 0
## 111: 1916 0
## 112: 1885 0
## 113: 1832 0
cp_datatable[, .(.(.SD)), by = PY] # subsets of the data
## PY V1
## <num> <list>
## 1: 2020 <data.table[63760x21]>
## 2: 2019 <data.table[3676x21]>
## 3: 2021 <data.table[95x21]>
## 4: 2018 <data.table[3278x21]>
## 5: 2017 <data.table[3235x21]>
## ---
## 109: 1833 <data.table[1x21]>
## 110: 1845 <data.table[1x21]>
## 111: 1916 <data.table[1x21]>
## 112: 1885 <data.table[1x21]>
## 113: 1832 <data.table[1x21]>
# R using data.table
rollup(cp_datatable,
.(cited_med = median(cited)),
by = c("pubmodel", "PY"))
## pubmodel PY cited_med
## <char> <num> <num>
## 1: "Print-Electronic" 2020 0
## 2: "Print" 2020 0
## 3: "Electronic-eCollection" 2020 0
## 4: "Electronic" 2020 0
## 5: "Electronic-Print" 2020 0
## ---
## 224: "Electronic-eCollection" NA 0
## 225: "Electronic" NA 0
## 226: "Electronic-Print" NA 3
## 227: null NA 0
## 228: <NA> NA 0
rollup(cp_datatable,
.(cited_med = median(cited), .N),
by = c("pubmodel", "PY"),
id = TRUE)
## grouping pubmodel PY cited_med N
## <int> <char> <num> <num> <int>
## 1: 0 "Print-Electronic" 2020 0 40982
## 2: 0 "Print" 2020 0 10862
## 3: 0 "Electronic-eCollection" 2020 0 4090
## 4: 0 "Electronic" 2020 0 7362
## 5: 0 "Electronic-Print" 2020 0 456
## ---
## 224: 1 "Electronic-eCollection" NA 0 6981
## 225: 1 "Electronic" NA 0 11932
## 226: 1 "Electronic-Print" NA 3 964
## 227: 1 null NA 0 32
## 228: 3 <NA> NA 0 113917
cube(cp_datatable,
.(cited_med = median(cited), .N),
by = c("pubmodel", "PY"),
id = TRUE)
## grouping pubmodel PY cited_med N
## <int> <char> <num> <num> <int>
## 1: 0 "Print-Electronic" 2020 0 40982
## 2: 0 "Print" 2020 0 10862
## 3: 0 "Electronic-eCollection" 2020 0 4090
## 4: 0 "Electronic" 2020 0 7362
## 5: 0 "Electronic-Print" 2020 0 456
## ---
## 337: 2 <NA> 1845 0 1
## 338: 2 <NA> 1916 0 1
## 339: 2 <NA> 1885 0 1
## 340: 2 <NA> 1832 0 1
## 341: 3 <NA> NA 0 113917
groupingsets(cp_datatable,
.(cited_med = median(cited), .N),
by = c("pubmodel", "PY"),
sets = list("pubmodel", c("pubmodel", "PY")),
id = TRUE)
## grouping pubmodel PY cited_med N
## <int> <char> <num> <num> <int>
## 1: 1 "Print-Electronic" NA 0.0 61429
## 2: 1 "Print" NA 3.0 32579
## 3: 1 "Electronic-eCollection" NA 0.0 6981
## 4: 1 "Electronic" NA 0.0 11932
## 5: 1 "Electronic-Print" NA 3.0 964
## ---
## 223: 0 null 2004 0.0 1
## 224: 0 null 2016 0.0 4
## 225: 0 null 2017 0.5 2
## 226: 0 null 2014 0.0 2
## 227: 0 null 2018 0.0 2
To manipulate multiple columns, dplyr 1.0.0 introduced the across() function, which supersedes the _all, _at, and _if versions of summarise(), mutate(), and transmute(). In data.table, we use the .SD symbol which is a data.table containing the Subset of Data for each group, excluding the column(s) used in the ‘by’ argument. For example, cp_datatable[, .SD] is equivalent to cp_datatable itself. In the expression cp_datatable[, .SD, by = title], .SD contains all the cp_datatable columns (except the title column) for each unique value of the title column. The .SDcols argument allows to select specific columns to include in the .SD.
# R using Tibbles
# Summarise several columns
cp_tibble %>% summarise(across(c(PY, cited), mean))
## # A tibble: 1 × 2
## PY cited
## <dbl> <dbl>
## 1 2015. 10.9
# Summarise several columns by group
cp_tibble %>%
group_by(source) %>%
summarise(across(c(PY, cited), mean))
## # A tibble: 1 × 3
## source PY cited
## <chr> <dbl> <dbl>
## 1 MED 2015. 10.9
## using patterns (regex)
cp_tibble %>%
group_by(source) %>%
summarise(across(ends_with('cited'), mean))
## # A tibble: 1 × 3
## source cited log_cited
## <chr> <dbl> <dbl>
## 1 MED 10.9 1.10
# Summarise with more than one function by group
cp_tibble %>%
group_by(source) %>%
summarise(across(c(PY, cited),
list(median = median, mean = mean)))
## # A tibble: 1 × 5
## source PY_median PY_mean cited_median cited_mean
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 MED 2020 2015. 0 10.9
# Summarise using a condition
cp_tibble %>%
summarise(across(where(is.numeric),
mean))
## # A tibble: 1 × 3
## cited PY log_cited
## <dbl> <dbl> <dbl>
## 1 10.9 2015. 1.10
cp_tibble %>%
group_by(source) %>%
summarise(across(where(~ is.numeric(.x) && median(.x)>0),
median))
## # A tibble: 1 × 2
## source PY
## <chr> <dbl>
## 1 MED 2020
# R using data.tablee
# Summarise several columns
cp_datatable[, lapply(.SD, mean),
.SDcols = c('PY', 'cited')]
## PY cited
## <num> <num>
## 1: 2015.18 10.88341
# Summarise several columns by group
cp_datatable[, by = source,
lapply(.SD, mean),
.SDcols = c('PY', 'cited')]
## source PY cited
## <char> <num> <num>
## 1: MED 2015.18 10.88341
## using patterns (regex)
cp_datatable[, by = source,
lapply(.SD, mean),
.SDcols = patterns("cited")]
## source cited log_cited
## <char> <num> <num>
## 1: MED 10.88341 1.103266
# Summarise with more than one function by group
cp_datatable[, by = source,
c(lapply(.SD, median),
lapply(.SD, mean)),
.SDcols = c('PY', 'cited')]
## source PY cited PY cited
## <char> <num> <num> <num> <num>
## 1: MED 2020 0 2015.18 10.88341
# Summarise using a condition
cp_datatable[, sapply(cp_datatable, is.numeric),with = FALSE][,lapply(.SD, mean)]
## cited n PY log_cited
## <num> <num> <num> <num>
## 1: 10.88341 113917 2015.18 1.103266
fun <- function(x) {is.numeric(x) && median(x)>0}
cp_datatable[, sapply(cp_datatable, FUN = fun),with = FALSE][,lapply(.SD, mean)]
## n PY
## <num> <num>
## 1: 113917 2015.18
tobinary = function(x){ifelse(x=='Y',1,0)}
# Modify several columns (dropping the others)
head(cp_tibble %>% transmute(across(c(has_data, oa),
tobinary)))
## # A tibble: 6 × 2
## has_data oa
## <dbl> <dbl>
## 1 0 1
## 2 0 1
## 3 0 1
## 4 0 1
## 5 0 1
## 6 0 1
# Modify several columns (keeping the others)
cp_tibble <- cp_tibble %>%
mutate(across(all_of(c('pubtype','mesh')),
~ tolower(str_replace_all(.x,'-',''))))
# Modify columns using a condition
t = Sys.time()
cp_tibble <- cp_tibble %>%
mutate(across(where(is.character),
toupper))
Sys.time() - t
## Time difference of 3.127825 secs
# Modify several columns (dropping the others)
head(cp_datatable[, lapply(.SD, tobinary),
.SDcols = c('has_data', 'oa')])
## has_data oa
## <num> <num>
## 1: 0 1
## 2: 0 1
## 3: 0 1
## 4: 0 1
## 5: 0 1
## 6: 0 1
# Modify several columns (keeping the others)
cp_datatable[, c('pubtype','mesh') := lapply(.SD, function(x){tolower(str_replace_all(x,'-',''))}),#cols <- setdiff(names(cp_datatable), "title")
.SDcols = c('pubtype','mesh') ]
# Modify columns using a condition
t = Sys.time()
for(j in which(sapply(cp_datatable, class)=='character')){set(cp_datatable, i=NULL, j=j, value=toupper(cp_datatable[[j]]))}
Sys.time() - t
## Time difference of 2.693229 secs
The use of cp_datatable[,j] in data.table is very flexible, as it allows for the passing of complex expressions in a straightforward way, or the combination of expressions with multiple outputs. This makes it easy to perform complex data manipulation and calculations on large datasets efficiently. The flexibility of the j-expression in data.table is a powerful feature that allows you to perform various data manipulation tasks with a simple and readable syntax.
The dplyr workflow relies on the magrittr pipe operator (%>%). The magrittr package can also be used with data.table objects, but data.table has its own chaining system: cp_datatable[][][].
# R using Tibbles
# Expression chaining using %>%
head(cp_tibble %>%
group_by(PY) %>%
summarise(cited_med = median(cited)) %>%
arrange(desc(cited_med)))
## # A tibble: 6 × 2
## PY cited_med
## <dbl> <dbl>
## 1 1933 66.5
## 2 1948 34
## 3 2002 23
## 4 1956 21
## 5 2000 21
## 6 1999 20
head(cp_tibble %>%
group_by(PY) %>%
summarise(cited_med = median(cited)) %>%
filter(cited_med > 10))
## # A tibble: 6 × 2
## PY cited_med
## <dbl> <dbl>
## 1 1933 66.5
## 2 1942 14
## 3 1948 34
## 4 1956 21
## 5 1964 13.5
## 6 1965 14
# R using data.tablee
# Expression chaining using cp_datatable[][] (recommended)
head(cp_datatable[, by = PY,
.(cited_med = median(cited))][order(-cited_med)])
## PY cited_med
## <num> <num>
## 1: 1933 66.5
## 2: 1948 34.0
## 3: 2002 23.0
## 4: 2000 21.0
## 5: 1956 21.0
## 6: 1999 20.0
head(cp_datatable[, by = PY,
.(cited_med = median(cited))][cited_med > 10])
## PY cited_med
## <num> <num>
## 1: 2012 13
## 2: 2009 13
## 3: 2010 14
## 4: 2011 12
## 5: 2008 15
## 6: 2007 14
# R using Tibbles
head(cp_tibble %>%
mutate(is_en = ifelse(language =='ENG',1,0),
oa_int = ifelse(oa =='Y',1,0)) %>%
group_by(year_pub) %>%
summarize(share_en = sum(is_en)/n(),
share_en_oa = sum(is_en*oa_int)/n()) %>%
arrange(desc(year_pub)))
## # A tibble: 6 × 3
## year_pub share_en share_en_oa
## <chr> <dbl> <dbl>
## 1 2021 0.989 0.884
## 2 2020 0.956 0.685
## 3 2019 0.989 0.867
## 4 2018 0.985 0.829
## 5 2017 0.985 0.817
## 6 2016 0.978 0.791
# R using data.tablee
head(cp_datatable[,c('is_en','oa_int') := .(ifelse(language =='ENG',1,0),
ifelse(oa =='Y',1,0))][
,by = year_pub,.(share_en = sum(is_en)/.N,
share_en_oa = sum(is_en*oa_int)/.N)])
## year_pub share_en share_en_oa
## <char> <num> <num>
## 1: 2020 0.9555364 0.6848808
## 2: 2019 0.9893906 0.8669750
## 3: 2021 0.9894737 0.8842105
## 4: 2018 0.9853569 0.8288591
## 5: 2017 0.9851623 0.8173107
## 6: 2016 0.9775491 0.7907702
# Use multiple expressions (with cp_datatable[,{j}])
head(cp_datatable[, {is_en = ifelse(language =='ENG',1,0)
en_oa = is_en*oa_int
.(pmid,en_oa = ifelse(en_oa == 1, 'Y','N')) # last list returned as a data.table
}])
## pmid en_oa
## <char> <char>
## 1: 32837036 Y
## 2: 32196410 Y
## 3: 32778421 Y
## 4: 32730205 Y
## 5: 32394467 Y
## 6: 32257431 Y
When you have a variable with multiple categories that are stored in one cell, such as the ‘pubtype’ variable in our dataset, you may want to create separate columns for each category. The opposite is also possible, where you have several columns with values that you want to transform into two columns: one with the factor and the other with the values.
To clean this variable and separate the string in each cell, you can use the tidyr package in R. The separate() function allows you to split a single column into multiple columns based on a delimiter. Additionally, the spread() function can be used to transform wide format data into long format data. This can be used to create separate rows for each pubtype.
# R using Tibbles
cp_tibble[1,'pubtype']
## # A tibble: 1 × 1
## pubtype
## <chr>
## 1 "{\"PUBTYPE\": [\"REVIEWARTICLE\", \"REVIEW\", \"JOURNAL ARTICLE\"]}"
clean_pubtype <- function(str){
gsub('[{]|[}]|PUBTYPE|: |[[]|[]]','',str)
}
clean_pubtype(cp_tibble[1,'pubtype'])
## [1] "\"\"\"REVIEWARTICLE\", \"REVIEW\", \"JOURNAL ARTICLE\""
mcp_tibble <- cp_tibble %>%
rowwise() %>%
mutate(pubtype = clean_pubtype(pubtype)) %>%
separate_rows(pubtype,sep = '", "') %>%
mutate(pubtype = gsub('["]','',pubtype))
mcp_tibble <- mcp_tibble %>%
select(pmid, pubtype) %>%
group_by(pmid) %>%
count(pubtype)
mcp_tibble
## # A tibble: 258,370 × 3
## # Groups: pmid [113,917]
## pmid pubtype n
## <chr> <chr> <int>
## 1 10022847 JOURNAL ARTICLE 1
## 2 10022847 RESEARCH SUPPORT, NONU.S. GOV'T 1
## 3 10022847 RESEARCHARTICLE 1
## 4 10023135 JOURNAL ARTICLE 1
## 5 10023135 RESEARCH SUPPORT, NONU.S. GOV'T 1
## 6 10023135 RESEARCHARTICLE 1
## 7 10023767 JOURNAL ARTICLE 1
## 8 10023767 RESEARCH SUPPORT, NONU.S. GOV'T 1
## 9 10023767 RESEARCH SUPPORT, U.S. GOV'T, P.H.S. 1
## 10 10023767 RESEARCHARTICLE 1
## # ℹ 258,360 more rows
# R using data.table
mcp_datatable <- cp_datatable[,pubtype := clean_pubtype(pubtype)][,lapply(.SD, function(x) unlist(tstrsplit(x, '", "'))), .SDcols = "pubtype", by = pmid][,pubtype := gsub('["]','',pubtype)]
mcp_datatable <- mcp_datatable[, .(count = .N), by = .(pmid,pubtype)]
mcp_datatable
## pmid pubtype count
## <char> <char> <int>
## 1: 32837036 REVIEWARTICLE 1
## 2: 32837036 REVIEW 1
## 3: 32837036 JOURNAL ARTICLE 1
## 4: 32196410 REVIEWARTICLE 1
## 5: 32196410 JOURNAL ARTICLE 1
## ---
## 258366: 32091757 REVIEW 1
## 258367: 28590697 CHAPTERARTICLE 1
## 258368: 28590697 REVIEW 1
## 258369: 28590695 CHAPTERARTICLE 1
## 258370: 28590695 REVIEW 1
# R using Tibbles
# see also cast_dtm()
ccp_tibble = tidyr::spread(data = mcp_tibble,
key = 'pubtype',
value = 'n',
fill = 0)
dim(ccp_tibble)
## [1] 113917 116
# R using data.table
ccp_datatable <- dcast(mcp_datatable, pmid ~ pubtype, value.var = "count" ,fun.aggregate = sum) # aggregate by count
dim(ccp_datatable)
## [1] 113917 116
# R using Tibbles
mcp_tibble <- ccp_tibble %>% gather(key = Variable,
value = Value,
-pmid)
# same
mcp_tibble <- ccp_tibble %>% pivot_longer(cols = -pmid, names_to = "variable", values_to = "Value")
# R using data.table
mcp_datatable <- melt(ccp_datatable, id.vars = "pmid",
variable.name = "Variable",
value.name = "Value")
# R using Tibbles
group_split(cp_tibble, oa)
# R using data.table
split(cp_datatable, by = "oa") # S3 method
# R using Tibbles
vec <- c("A:a", "B:b", "C:c")
# vector not handled
tidyr::separate(tibble(vec), vec, c("V1", "V2"))
## # A tibble: 3 × 2
## V1 V2
## <chr> <chr>
## 1 A a
## 2 B b
## 3 C c
# R using data.table
vec <- c("A:a", "B:b", "C:c")
tstrsplit(vec, split = ":", keep = 2L) # works on vector
## [[1]]
## [1] "a" "b" "c"
setDT(tstrsplit(vec, split = ":"))[]
## V1 V2
## <char> <char>
## 1: A a
## 2: B b
## 3: C c
Row subsetting in dplyr relies on the filter() and slice() functions. In addition to these methods, data.table provides two systems that make row filtering and join operations faster and more convenient: keys (primary ordered index) and indices (automatic secondary indexing).
The main differences between keys and indices are:
# R using Tibbles
cp_tibble <- arrange(cp_tibble, year_pub)
# R using data.table
setkey(cp_datatable, year_pub)
setindex(cp_datatable, year_pub)
# R using Tibbles
dim(cp_tibble %>% filter(year_pub == "2010"))
## [1] 2190 21
dim(cp_tibble %>% filter(year_pub %in% c("2010", "2020")))
## [1] 65950 21
# R using data.table
dim(cp_datatable["2010", on = "year_pub"])
## [1] 2190 24
dim(cp_datatable[c("2010", "2020"), on = .(year_pub)]) # same as on = "year_pub"
## [1] 65950 24
# R using Tibbles
cp_tibble %>% filter(year_pub %in% c("2010", "2025"))
# R using data.table
# (default) returns a row with "D" even if not found
cp_datatable[c("2010", "2025"), on = "year_pub", nomatch = NA]
## no rows for unmatched values
cp_datatable[c("2010", "2025"), on = "year_pub", nomatch = 0]
# R using Tibbles
cp_tibble %>%
filter(year_pub %in% c("2010", "2020")) %>%
summarize(median = median(cited))
# R using data.table
cp_datatable[c("2010", "2020"), median(cited), on = "year_pub"]
# R using Tibbles
cp_tibble <- cp_tibble %>%
mutate(cited_ = base::replace(cited, year_pub == "2010", 0L)) %>%
arrange(PY)
# R using data.table
cp_datatable["2010", cited_ := 0, on = "year_pub"]
# R using Tibbles
head(cp_tibble %>%
filter(oa != 'Y') %>%
group_by(PY) %>%
summarise(median(cited)))
## # A tibble: 6 × 2
## PY `median(cited)`
## <dbl> <dbl>
## 1 1842 0
## 2 1874 0
## 3 1876 0
## 4 1881 0
## 5 1885 0
## 6 1888 0
# R using data.table
cp_datatable[!"Y", mean(cited), on = .(oa), by = .EACHI]
## Key: <oa>
## oa V1
## <char> <num>
## 1: N 12.93581
head(cp_datatable[oa != "Y",
by = PY,
mean(cited)])
## PY V1
## <num> <num>
## 1: 1842 0
## 2: 1874 0
## 3: 1876 0
## 4: 1881 0
## 5: 1885 0
## 6: 1888 0
# R using Tibbles
cp_tibble <-cp_tibble %>% arrange(PY, cited)
# R using data.table
setkey(cp_datatable, PY, cited) # or setkeyv(cp_datatable, c("PY", "cited"))
setindex(cp_datatable, PY, cited) # setindexv(cp_datatable, c("PY", "cited"))
# R using Tibbles
dim(filter(cp_tibble, cited == 0, year_pub == "2010"))
## [1] 227 22
dim(filter(cp_tibble, cited == 0, year_pub %in% c("2010","2020")))
## [1] 50783 22
# R using data.table
dim(cp_datatable[.("2010", 0), on = .(year_pub, cited)])
## [1] 227 25
dim(cp_datatable[.(c("2010","2020"), 0), on = .(year_pub, cited)])
## [1] 50783 25
# using which = TRUE only returns the matching rows indices
dim(cp_datatable[.(c("2010","2020"), 0), on = .(year_pub, cited), which = TRUE])
## NULL
# R using data.table
setkey(cp_datatable, NULL)
setindex(cp_datatable, NULL)
In data.table, the set() functions such as set(), setcol() and setnames() modify objects by reference, which makes these operations fast and memory-efficient. However, if this is not the desired behavior, users can use the copy() function to create a new object. In contrast, the corresponding expressions in dplyr will be less memory-efficient as it creates a new object each time an operation is performed. It’s important to note that data.table’s way of modifying in place is very efficient in terms of time and memory usage when working with large datasets, but it can cause problems if you don’t want to modify your original data. Therefore, it’s recommended to use copy() or make a copy of the data.table before applying set() functions.
# R using Tibbles
cp_tibble[1, 2] <- NA
head(cp_tibble[,2])
## # A tibble: 6 × 1
## pmid
## <chr>
## 1 <NA>
## 2 29917937
## 3 29918028
## 4 29918489
## 5 29918401
## 6 29918490
# R using data.table
set(cp_datatable, i = 1L, j = 2L, value = NA)
head(cp_datatable[,2])
## pmid
## <char>
## 1: <NA>
## 2: 29917937
## 3: 29918028
## 4: 29918489
## 5: 29918401
## 6: 29918490
# R using Tibbles
cp_tibble <- cp_tibble %>% arrange(PY, desc(cited))
# R using data.table
setorder(cp_datatable, PY, -cited)
setorderv(cp_datatable, c("PY", "cited"), c(1, -1))
# R using Tibbles
cp_tibble <- rename(cp_tibble, DOI = doi)
# R using data.table
setnames(cp_datatable, old = "doi", new = "DOI")
# R using Tibbles
head(cp_tibble %>% select(PY, cited, DOI))
## # A tibble: 6 × 3
## PY cited DOI
## <dbl> <dbl> <chr>
## 1 1830 0 NULL
## 2 1832 0 NULL
## 3 1833 0 NULL
## 4 1836 0 NULL
## 5 1836 0 NULL
## 6 1836 0 NULL
# R using data.table
setcolorder(cp_datatable, c("PY", "cited", "DOI"))
x <- data.table(c(1, 2, 2, 3, 3))
y <- data.table(c(2, 2, 3, 4, 4))
# R using Tibbles
dplyr::intersect(x, y)
## V1
## <num>
## 1: 2
## 2: 3
# R using data.table
fintersect(x, y)
## V1
## <num>
## 1: 2
## 2: 3
fintersect(x, y, all = TRUE)
## V1
## <num>
## 1: 2
## 2: 2
## 3: 3
# R using Tibbles
dplyr::setdiff(x, y)
## V1
## <num>
## 1: 1
# R using data.table
fsetdiff(x, y)
## V1
## <num>
## 1: 1
fsetdiff(x, y, all = TRUE)
## V1
## <num>
## 1: 1
## 2: 3
# R using Tibbles
dplyr::union(x, y)
## V1
## <num>
## 1: 1
## 2: 2
## 3: 3
## 4: 4
union_all(x, y)
## V1
## <num>
## 1: 1
## 2: 2
## 3: 2
## 4: 3
## 5: 3
## 6: 2
## 7: 2
## 8: 3
## 9: 4
## 10: 4
# R using data.table
funion(x, y)
## V1
## <num>
## 1: 1
## 2: 2
## 3: 3
## 4: 4
funion(x, y, all = TRUE)
## V1
## <num>
## 1: 1
## 2: 2
## 3: 2
## 4: 3
## 5: 3
## 6: 2
## 7: 2
## 8: 3
## 9: 4
## 10: 4
# R using Tibbles
setequal(x, x[order(-V1),])
## [1] TRUE
all_equal(x, x)
## Warning: `all_equal()` was deprecated in dplyr 1.1.0.
## ℹ Please use `all.equal()` instead.
## ℹ And manually order the rows/cols as needed
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## [1] TRUE
# R using data.table
fsetequal(x, x[order(-V1),])
## [1] TRUE
all.equal(x, x) # S3 method
## [1] TRUE
Joining data in data.table can be done using keys, the ad-hoc ‘on’ argument, or the merge.data.table method. The three methods are presented below. As mentioned earlier, the ‘on’ and ‘by’ (in merge) arguments are optional with keyed data.tables, but are recommended to make the code more explicit. In the examples below, the x, y, and z data.tables can also be used with dplyr.
# R using Tibbles
x <- data.table(Id = c("A", "B", "C", "C"),
X1 = c(1L, 3L, 5L, 7L),
XY = c("x2", "x4", "x6", "x8"),
key = "Id")
y <- data.table(Id = c("A", "B", "B", "D"),
Y1 = c(1L, 3L, 5L, 7L),
XY = c("y1", "y3", "y5", "y7"),
key = "Id")
# R using data.table
# R using Tibbles
left_join(x, y, by = "Id")
## Key: <Id>
## Id X1 XY.x Y1 XY.y
## <char> <int> <char> <int> <char>
## 1: A 1 x2 1 y1
## 2: B 3 x4 3 y3
## 3: B 3 x4 5 y5
## 4: C 5 x6 NA <NA>
## 5: C 7 x8 NA <NA>
# R using data.table
y[x, on = "Id"]
## Key: <Id>
## Id Y1 XY X1 i.XY
## <char> <int> <char> <int> <char>
## 1: A 1 y1 1 x2
## 2: B 3 y3 3 x4
## 3: B 5 y5 3 x4
## 4: C NA <NA> 5 x6
## 5: C NA <NA> 7 x8
merge(x, y, all.x = TRUE, by = "Id")
## Key: <Id>
## Id X1 XY.x Y1 XY.y
## <char> <int> <char> <int> <char>
## 1: A 1 x2 1 y1
## 2: B 3 x4 3 y3
## 3: B 3 x4 5 y5
## 4: C 5 x6 NA <NA>
## 5: C 7 x8 NA <NA>
y[x] # requires keys
## Key: <Id>
## Id Y1 XY X1 i.XY
## <char> <int> <char> <int> <char>
## 1: A 1 y1 1 x2
## 2: B 3 y3 3 x4
## 3: B 5 y5 3 x4
## 4: C NA <NA> 5 x6
## 5: C NA <NA> 7 x8
# R using Tibbles
right_join(x, y, by = "Id")
## Key: <Id>
## Id X1 XY.x Y1 XY.y
## <char> <int> <char> <int> <char>
## 1: A 1 x2 1 y1
## 2: B 3 x4 3 y3
## 3: B 3 x4 5 y5
## 4: D NA <NA> 7 y7
# R using data.table
x[y, on = "Id"]
## Key: <Id>
## Id X1 XY Y1 i.XY
## <char> <int> <char> <int> <char>
## 1: A 1 x2 1 y1
## 2: B 3 x4 3 y3
## 3: B 3 x4 5 y5
## 4: D NA <NA> 7 y7
merge(x, y, all.y = TRUE, by = "Id")
## Key: <Id>
## Id X1 XY.x Y1 XY.y
## <char> <int> <char> <int> <char>
## 1: A 1 x2 1 y1
## 2: B 3 x4 3 y3
## 3: B 3 x4 5 y5
## 4: D NA <NA> 7 y7
x[y] # requires keys
## Key: <Id>
## Id X1 XY Y1 i.XY
## <char> <int> <char> <int> <char>
## 1: A 1 x2 1 y1
## 2: B 3 x4 3 y3
## 3: B 3 x4 5 y5
## 4: D NA <NA> 7 y7
# R using Tibbles
inner_join(x, y, by = "Id")
## Key: <Id>
## Id X1 XY.x Y1 XY.y
## <char> <int> <char> <int> <char>
## 1: A 1 x2 1 y1
## 2: B 3 x4 3 y3
## 3: B 3 x4 5 y5
# R using data.table
x[y, on = "Id", nomatch = 0]
## Key: <Id>
## Id X1 XY Y1 i.XY
## <char> <int> <char> <int> <char>
## 1: A 1 x2 1 y1
## 2: B 3 x4 3 y3
## 3: B 3 x4 5 y5
merge(x, y)
## Key: <Id>
## Id X1 XY.x Y1 XY.y
## <char> <int> <char> <int> <char>
## 1: A 1 x2 1 y1
## 2: B 3 x4 3 y3
## 3: B 3 x4 5 y5
x[y, nomatch = 0] # requires keys
## Key: <Id>
## Id X1 XY Y1 i.XY
## <char> <int> <char> <int> <char>
## 1: A 1 x2 1 y1
## 2: B 3 x4 3 y3
## 3: B 3 x4 5 y5
# R using Tibbles
full_join(x, y, by = "Id")
## Key: <Id>
## Id X1 XY.x Y1 XY.y
## <char> <int> <char> <int> <char>
## 1: A 1 x2 1 y1
## 2: B 3 x4 3 y3
## 3: B 3 x4 5 y5
## 4: C 5 x6 NA <NA>
## 5: C 7 x8 NA <NA>
## 6: D NA <NA> 7 y7
# R using data.table
merge(x, y, all = TRUE, by = "Id")
## Key: <Id>
## Id X1 XY.x Y1 XY.y
## <char> <int> <char> <int> <char>
## 1: A 1 x2 1 y1
## 2: B 3 x4 3 y3
## 3: B 3 x4 5 y5
## 4: C 5 x6 NA <NA>
## 5: C 7 x8 NA <NA>
## 6: D NA <NA> 7 y7
# R using Tibbles
semi_join(x, y, by = "Id")
## Key: <Id>
## Id X1 XY
## <char> <int> <char>
## 1: A 1 x2
## 2: B 3 x4
# R using data.table
unique(x[y$Id, on = "Id", nomatch = 0])
## Key: <Id>
## Id X1 XY
## <char> <int> <char>
## 1: A 1 x2
## 2: B 3 x4
unique(x[y$Id, nomatch = 0]) # requires keys
## Key: <Id>
## Id X1 XY
## <char> <int> <char>
## 1: A 1 x2
## 2: B 3 x4
# R using Tibbles
anti_join(x, y, by = "Id")
## Key: <Id>
## Id X1 XY
## <char> <int> <char>
## 1: C 5 x6
## 2: C 7 x8
# R using data.table
x[!y, on = "Id"]
## Key: <Id>
## Id X1 XY
## <char> <int> <char>
## 1: C 5 x6
## 2: C 7 x8
x[!y] # requires keys
## Key: <Id>
## Id X1 XY
## <char> <int> <char>
## 1: C 5 x6
## 2: C 7 x8
# R using Tibbles
x <- data.table(1:3)
y <- data.table(4:6)
z <- data.table(7:9, 0L)
bind_rows(x, y)
## V1
## <int>
## 1: 1
## 2: 2
## 3: 3
## 4: 4
## 5: 5
## 6: 6
bind_rows(x, z) # always fills
## V1 V2
## <int> <int>
## 1: 1 NA
## 2: 2 NA
## 3: 3 NA
## 4: 7 0
## 5: 8 0
## 6: 9 0
# R using data.table
rbind(x, y)
## V1
## <int>
## 1: 1
## 2: 2
## 3: 3
## 4: 4
## 5: 5
## 6: 6
rbind(x, z, fill = TRUE)
## V1 V2
## <int> <int>
## 1: 1 NA
## 2: 2 NA
## 3: 3 NA
## 4: 7 0
## 5: 8 0
## 6: 9 0
# R using Tibbles
bind_rows(list(x, y), .id = "id")
## id V1
## <char> <int>
## 1: 1 1
## 2: 1 2
## 3: 1 3
## 4: 2 4
## 5: 2 5
## 6: 2 6
# R using data.table
rbindlist(list(x, y), idcol = TRUE)
## .id V1
## <int> <int>
## 1: 1 1
## 2: 1 2
## 3: 1 3
## 4: 2 4
## 5: 2 5
## 6: 2 6
# R using Tibbles
bind_cols(x, y)
## New names:
## • `V1` -> `V1...1`
## • `V1` -> `V1...2`
## V1...1 V1...2
## <int> <int>
## 1: 1 4
## 2: 2 5
## 3: 3 6
# R using data.table
base::cbind(x, y)
## V1 V1
## <int> <int>
## 1: 1 4
## 2: 2 5
## 3: 3 6
# R using Tibbles
lag(1:10, n = 1, default = NA)
## [1] NA 1 2 3 4 5 6 7 8 9
purrr::map(1:2, ~lag(1:10, n = .x))
## [[1]]
## [1] NA 1 2 3 4 5 6 7 8 9
##
## [[2]]
## [1] NA NA 1 2 3 4 5 6 7 8
lead(1:10, n = 1, default = NA)
## [1] 2 3 4 5 6 7 8 9 10 NA
# R using data.table
shift(1:10, n = 1, fill = NA, type = "lag")
## [1] NA 1 2 3 4 5 6 7 8 9
shift(1:10, n = 1:2, fill = NA, type = "lag") # multiple
## [[1]]
## [1] NA 1 2 3 4 5 6 7 8 9
##
## [[2]]
## [1] NA NA 1 2 3 4 5 6 7 8
shift(1:10, n = 1, fill = NA, type = "lead")
## [1] 2 3 4 5 6 7 8 9 10 NA
# R
library(tidyverse)
covid_papers <- read_csv("data/covid_papers.csv")
## Rows: 148916 Columns: 19
## ── Column specification ──────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): _id, pmid, pmcid, doi, title, source, year_pub, month_pub, author...
## dbl (1): cited
## date (1): creation
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
covid_papers
## # A tibble: 148,916 × 19
## `_id` pmid pmcid doi title source year_pub month_pub creation
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <date>
## 1 5f69d6fa65a250f… 3283… PMC7… 10.1… The … MED 2020 null 2020-08-25
## 2 5f69d6fa65a250f… 3219… PMC7… 10.1… Hypo… MED 2020 null 2020-03-21
## 3 5f69d6fa65a250f… 3277… PMC7… 10.1… The … MED 2020 null 2020-08-12
## 4 5f69d6fa65a250f… 3273… PMC7… 10.1… A co… MED 2020 null 2020-07-31
## 5 5f69d6fa65a250f… null null 10.2… Viri… PPR 2020 null 2020-08-16
## 6 5f69d6fa65a250f… null PMC7… null Inte… PMC 2020 null 2020-09-13
## 7 5f69d6fa65a250f… 3239… PMC7… 10.1… A gl… MED 2020 null 2020-05-13
## 8 5f69d6fa65a250f… 3225… PMC7… 10.1… COVI… MED 2020 null 2020-04-08
## 9 5f69d6fa65a250f… 3272… PMC7… 10.1… The … MED 2020 null 2020-07-29
## 10 5f69d6fa65a250f… 3219… PMC7… 10.1… Labo… MED 2020 null 2020-03-21
## # ℹ 148,906 more rows
## # ℹ 10 more variables: authors_gender <chr>, journal <chr>, language <chr>,
## # pubmodel <chr>, pubtype <chr>, mesh <chr>, keywords <chr>, cited <dbl>,
## # has_data <chr>, oa <chr>
Depending on the way string are encoded we may want to precise the type of encoding. (Some well know encoding character format: UTF-8; Unicode; ASCII ..)
we can sepcify this encoding in the read function.# R
covid_papers <- read_csv("data/covid_papers.csv",locale = locale(encoding = "UTF-8"))
Other types of data can be import using other libraries I just show you some examples here without enterring into details that may be helpfull sometimes.
# R
library(readxl)
covid_papers <- read_excel("data/covid_papers.xls", sheet = "Feuille1", range = "A1:Z25")
# R
library(haven)
#SAS data
covid_papers <- read_sas("data/covid_papers.sas")
covid_papers <- read_xpt("data/covid_papers.xpt")
#SPSS data
covid_papers <- read_sav("data/covid_papers.sav")
covid_papers <- read_por("data/covid_papers.por")
In order to save data in a specific format we just have to change the ‘read’ part of the function name by ‘write’. Here is an example for csv
# R
write_csv(new_data_set,"data/new_data.csv")
Also R provide a way to store objects in the same file. For this we use RData format, it’s very usefull if a project is written totaly in R because you can save the structure of your objects very easily, you can save all type of objects and load it later.
# R
save(covid_papers,file= 'covid_papers.RData')
load('covid_papers.RData')
fread() and fwrite() are among the most powerful functions of data.table. They are not only incredibly fast (as seen in benchmarks), but they are also extremely robust. The few commands below only touch the surface, and there are many awesome features. For example, fread() accepts http and https URLs directly, as well as operating system commands such as sed and awk output. Make sure to check the documentation.
Here again, fread() and fwrite() are very versatile and can handle different file formats while dplyr delegates file reading and writing to the readr package with several specific functions (csv, tsv, delim, …). This means that dplyr requires the use of different functions for different file formats, whereas fread() and fwrite() can handle multiple file formats with a single function. This makes data.table’s file handling functions more efficient and easier to use.
# R using Tibbles
readr::write_csv(cp_tibble, "cp_tibble.csv")
# R using data.table
fwrite(cp_datatable, "cp_datatable.csv")
# R using Tibbles
readr::write_delim(cp_tibble, "cp_tibble.txt", delim = "\t")
# R using data.table
fwrite(cp_datatable, "cp_datatable.txt", sep = "\t")
# R using Tibbles
# R using data.table
fwrite(setDT(list(0, list(1:5))), "cp_datatable2.csv")
# R using Tibbles
readr::read_csv("cp_tibble.csv")
readr::read_delim("cp_tibble.txt", delim = "\t")
# R using data.table
fread("cp_datatable.csv")
# fread("cp_datatable.csv", verbose = TRUE) # full details
fread("cp_datatable.txt", sep = "\t")
# R using Tibbles
# R using data.table
fread("cp_datatable.csv", select = c("pmid", "year_pub"))
fread("cp_datatable.csv", drop = "title")
# R using Tibbles
c("cp_tibble.csv", "cp_tibble.csv") %>%
purrr::map_dfr(readr::read_csv)
# R using data.table
rbindlist(lapply(c("cp_datatable.csv", "cp_datatable.csv"), fread))
# c("cp_datatable.csv", "cp_datatable.csv") %>% lapply(fread) %>% rbindlist
Using first dplyr and the tidyverse, Use
table_1.csv, table_2.csv and
loc_cities.csv, we will need information from the three
datasets.
Import datasets
Remove articles with no pmid and no DOI, and all articles before 1975
Merge the two datasets, pmid is unique for each paper
Create a new variable with the number of authors for each paper
plot distribution for the log(number of authors +1)
How many papers contains ‘deep learning’ or ‘machine learning’ and ‘neural network’ (also with a ‘s’ for neural networks) in their title ? Create a binary variable to save this information. What is the mean of authors for ML papers and non#ML papers ?
Transform has_data and oa into binary variable also, what is the share of ML paper that are oa
Clean up pubtype, for simplicity just get the first type
What is the pub type with the highest mean/sd of citation for each type of publication ? (use cited and the cleaned pub_type)
What are the most representative country by year ? In another tibble, keep only pmid and authors, and get the country for each author from the loc_cities.csv. You may want to separate rows for each authors to get all countries involved in the paper. If an author have multiple affiliations, take the first one.
Select the top 25 of countries involved in coronavirus research since 2001, plot the evolution on a bar chart with plot_ly