Exercice chap 5¶

Import datasets

import pandas as pd
import numpy as np
import time
import re
import seaborn as sns
import plotly.express as px

t = time.time()
table_1 = pd.read_csv('data/table_1.csv')
table_2 = pd.read_csv('data/table_2.csv')
cities = pd.read_csv('data/cities_loc.csv')
time.time() - t

2.387356758117676

Remove articles with no pmid and no DOI, and all articles before 1975

t = time.time()
table_1 = table_1.dropna(subset = ['pmid', 'doi'])
time.time() - t

t = time.time()
table_2 = table_2[table_2.year_pub > 1975].dropna(subset = ['pmid'])
time.time() - t

0.04082179069519043

Merge the two datasets, pmid is unique for each paper

t = time.time()
final_table = table_1.merge(table_2, how = 'left', on = 'pmid').dropna()
time.time() - t

final_table

Create a new variable with the number of authors for each paper

t = time.time()
final_table = final_table.assign(nb_aut = final_table.authors.map(lambda aut: len(re.findall('<AUTHOR>',aut))))
time.time() - t 
final_table[['pmid','nb_aut']]

t = time.time()
final_table = final_table[final_table.nb_aut > 0]
time.time() - t

0.024890422821044922

plot distribution for the log(number of authors +1)

sns.distplot(np.log(final_table.nb_aut +1), hist = False, kde = True,
                 kde_kws = {'linewidth': 3})

<AxesSubplot:xlabel='nb_aut'>

How many papers contains 'deep learning' or 'machine learning' and 'neural network' (also with a 's' for neural networks) in their title ? Create a binary variable to save this information. What is the mean of authors for ML papers and non#ML papers ?
Transform has_data and oa into binary variable also, what is the share of ML paper that are oa

t = time.time()
final_table[['title']] = final_table.title.str.lower()
final_table = final_table.assign(
    ML = np.where(final_table.title.str.contains('deep learning|machine learning|neural networks?'),1,0),
    has_data = np.where(final_table.has_data == 'Y',1,0),
    oa = np.where(final_table.oa == 'Y',1,0))
time.time() - t

final_table

np.sum(final_table.ML)
np.sum(final_table.oa)
np.sum(final_table.has_data)

final_table.groupby('ML').agg({'nb_aut':np.mean,
                               'oa':np.mean})

Clean up pubtype, for simplicity just get the first type

t = time.time()
final_table[['pub_type']] = final_table.pubtype.map(lambda pubtype: re.findall('\[\"(.*?)\"',pubtype)[0])
time.time() - t

final_table[['pubtype','pub_type']]

What is the pub type with the highest mean/sd of citation ? (use cited and the cleaned pub_type)

final_table.groupby('pub_type').agg({'cited':np.mean}).sort_values('cited',ascending = False)
final_table.groupby('pub_type').agg({'cited':lambda x : np.std(x,ddof = 1)}).sort_values('cited',ascending = False)

Which are the most representative country by year ? You may want to separate rows for each authors to get all countries involved in the paper, in an authors have multiple affiliations, take the first one. Store it in an other tibble, keep only pmid and authors, get the country for each author from the loc_cities.csv.

countries = list(cities.country.str.lower().drop_duplicates())

countries_pd = final_table[['pmid', 'authors']]

t = time.time()
countries_pd = countries_pd.assign(
    authors=countries_pd.authors.str.split('<AUTHOR>')).explode('authors')
time.time() - t
countries_pd

t = time.time()
countries_pd = countries_pd[(countries_pd.authors != '')
                                & (-countries_pd.authors.str.contains('<AFFILIATION>None'))]
time.time() - t


t = time.time()
countries_pd = countries_pd.assign(
    authors = countries_pd.authors.map(lambda aff: aff.split('<AFFILIATION>')[1])).drop_duplicates()
time.time() - t

countries_pd

t = time.time()
for abr, name in zip([' USA', ' UK', ' Korea'], ['United States', 'United Kingdom', 'South Korea']):
    countries_pd.authors = countries_pd.authors.map(lambda aff: re.sub(abr,name,aff))
    
countries_pd.authors = countries_pd.authors.str.lower()
time.time() - t

1.5909476280212402

t = time.time()
countries_pd = countries_pd[countries_pd.authors.map(lambda aff: any([country in aff for country in countries]))]
countries_matched = countries_pd.authors.map(lambda aff: [country for country in countries if country in aff][0])
countries_pd = countries_pd.assign(country = countries_matched).drop('authors', axis = 1)
time.time() - t

24.330344200134277

t = time.time()
countries_pd = countries_pd[countries_pd.country != ''].drop_duplicates()
time.time() - t
countries_pd

t = time.time()
countries_year = countries_pd.merge(final_table[['pmid','year_pub']], how = 'left', on = 'pmid')
time.time() - t
countries_year


countries_year = countries_year.groupby(['country', 'year_pub']).agg(n = ('pmid', len)).sort_values('n', ascending = False)
countries_year = countries_year.reset_index()
countries_year

Select the top 25 of countries involved in coronavirus research since 2001, plot the evolution on a bar chart with plot_ly

wide_countries = countries_year.pivot(index="year_pub", columns="country", values="n").reset_index()
wide_countries = wide_countries[wide_countries.year_pub > 2000]

countries_to_plot = list(wide_countries.fillna(0).drop('year_pub',axis = 1).apply(sum).sort_values()[-25:].index)

fig = px.bar(wide_countries, x="year_pub", y=countries_to_plot)
fig.update_yaxes(type="log")
fig.show()

	pmid	doi	title	source	language	pubtype	pubmodel	authors	year_pub	has_data	oa	cited
0	32837036.0	10.1007/s12291-020-00919-0	The Pathophysiology, Diagnosis and Treatment o...	MED	eng	{"pubType": ["review-article", "Review", "Jour...	"Print-Electronic"	<AUTHOR> Das SK Subir Kumar Das<AFFILIATION>De...	2020.0	N	Y	0.0
1	32196410.0	10.1080/22221751.2020.1746199	Hypothesis for potential pathogenesis of SARS-...	MED	eng	{"pubType": ["review-article", "Journal Articl...	"Print"	<AUTHOR> Lin L Ling Lin<AFFILIATION>Department...	2020.0	N	Y	61.0
2	32778421.0	10.1016/j.jiph.2020.07.011	The outbreak of the novel severe acute respira...	MED	eng	{"pubType": ["review-article", "Review", "Jour...	"Print-Electronic"	<AUTHOR> Bchetnia M Mbarka Bchetnia<AFFILIATIO...	2020.0	N	Y	0.0
3	32730205.0	10.1016/j.ijsu.2020.07.032	A comparative overview of COVID-19, MERS and S...	MED	eng	{"pubType": ["review-article", "Review", "Jour...	"Print-Electronic"	<AUTHOR> Liu J Jie Liu<AFFILIATION>Department ...	2020.0	N	Y	0.0
4	32394467.0	10.1002/jcp.29785	A global treatments for coronaviruses includin...	MED	eng	{"pubType": ["review-article", "Review", "Jour...	"Print-Electronic"	<AUTHOR> Yousefi B Bahman Yousefi<AFFILIATION>...	2020.0	N	Y	1.0
...	...	...	...	...	...	...	...	...	...	...	...	...
108956	7704896.0	10.1128/cmr.8.1.87-112.1995	Feline immunodeficiency virus: an interesting ...	MED	eng	{"pubType": ["Research Support, Non-U.S. Gov't...	"Print"	<AUTHOR> Bendinelli M M Bendinelli<AFFILIATION...	1995.0	N	N	208.0
108967	22696127.0	10.1007/s13244-012-0153-4	ECR 2012 Book of Abstracts - A - Postergraduat...	MED	eng	{"pubType": ["calendar", "Journal Article"]}	"Print"	<AUTHOR><AFFILIATION>None	2012.0	N	Y	2.0
108969	3368329.0	10.1093/nar/16.suppl.r315	Codon usage tabulated from the GenBank Genetic...	MED	eng	{"pubType": ["Comparative Study", "Research Su...	"Print"	<AUTHOR> Aota S S Aota<AFFILIATION>National In...	1988.0	N	N	90.0
108970	25412402.0	10.1002/14651858.cd000011.pub4	Interventions for enhancing medication adherence.	MED	eng	{"pubType": ["Meta-Analysis", "Research Suppor...	"Electronic"	<AUTHOR> Nieuwlaat R Robby Nieuwlaat<AFFILIATI...	2014.0	N	N	222.0
108971	21994484.0	10.1007/s13225-011-0088-y	European species of Hypocrea part II: species ...	MED	eng	{"pubType": ["research-article", "Journal Arti...	"Print"	<AUTHOR> Jaklitsch WM Walter M Jaklitsch<AFFIL...	2011.0	Y	Y	44.0

	pmid	nb_aut
0	32837036.0	1
1	32196410.0	4
2	32778421.0	4
3	32730205.0	7
4	32394467.0	6
...	...	...
108956	7704896.0	9
108967	22696127.0	1
108969	3368329.0	5
108970	25412402.0	16
108971	21994484.0	1

	pmid	doi	title	source	language	pubtype	pubmodel	authors	year_pub	has_data	oa	cited	nb_aut	ML
0	32837036.0	10.1007/s12291-020-00919-0	the pathophysiology, diagnosis and treatment o...	MED	eng	{"pubType": ["review-article", "Review", "Jour...	"Print-Electronic"	<AUTHOR> Das SK Subir Kumar Das<AFFILIATION>De...	2020.0	0	1	0.0	1	0
1	32196410.0	10.1080/22221751.2020.1746199	hypothesis for potential pathogenesis of sars-...	MED	eng	{"pubType": ["review-article", "Journal Articl...	"Print"	<AUTHOR> Lin L Ling Lin<AFFILIATION>Department...	2020.0	0	1	61.0	4	0
2	32778421.0	10.1016/j.jiph.2020.07.011	the outbreak of the novel severe acute respira...	MED	eng	{"pubType": ["review-article", "Review", "Jour...	"Print-Electronic"	<AUTHOR> Bchetnia M Mbarka Bchetnia<AFFILIATIO...	2020.0	0	1	0.0	4	0
3	32730205.0	10.1016/j.ijsu.2020.07.032	a comparative overview of covid-19, mers and s...	MED	eng	{"pubType": ["review-article", "Review", "Jour...	"Print-Electronic"	<AUTHOR> Liu J Jie Liu<AFFILIATION>Department ...	2020.0	0	1	0.0	7	0
4	32394467.0	10.1002/jcp.29785	a global treatments for coronaviruses includin...	MED	eng	{"pubType": ["review-article", "Review", "Jour...	"Print-Electronic"	<AUTHOR> Yousefi B Bahman Yousefi<AFFILIATION>...	2020.0	0	1	1.0	6	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
108956	7704896.0	10.1128/cmr.8.1.87-112.1995	feline immunodeficiency virus: an interesting ...	MED	eng	{"pubType": ["Research Support, Non-U.S. Gov't...	"Print"	<AUTHOR> Bendinelli M M Bendinelli<AFFILIATION...	1995.0	0	0	208.0	9	0
108967	22696127.0	10.1007/s13244-012-0153-4	ecr 2012 book of abstracts - a - postergraduat...	MED	eng	{"pubType": ["calendar", "Journal Article"]}	"Print"	<AUTHOR><AFFILIATION>None	2012.0	0	1	2.0	1	0
108969	3368329.0	10.1093/nar/16.suppl.r315	codon usage tabulated from the genbank genetic...	MED	eng	{"pubType": ["Comparative Study", "Research Su...	"Print"	<AUTHOR> Aota S S Aota<AFFILIATION>National In...	1988.0	0	0	90.0	5	0
108970	25412402.0	10.1002/14651858.cd000011.pub4	interventions for enhancing medication adherence.	MED	eng	{"pubType": ["Meta-Analysis", "Research Suppor...	"Electronic"	<AUTHOR> Nieuwlaat R Robby Nieuwlaat<AFFILIATI...	2014.0	0	0	222.0	16	0
108971	21994484.0	10.1007/s13225-011-0088-y	european species of hypocrea part ii: species ...	MED	eng	{"pubType": ["research-article", "Journal Arti...	"Print"	<AUTHOR> Jaklitsch WM Walter M Jaklitsch<AFFIL...	2011.0	1	1	44.0	1	0

	nb_aut	oa
ML
0	5.772364	0.726677
1	6.058559	0.801802

	pubtype	pub_type
0	{"pubType": ["review-article", "Review", "Jour...	review-article
1	{"pubType": ["review-article", "Journal Articl...	review-article
2	{"pubType": ["review-article", "Review", "Jour...	review-article
3	{"pubType": ["review-article", "Review", "Jour...	review-article
4	{"pubType": ["review-article", "Review", "Jour...	review-article
...	...	...
108956	{"pubType": ["Research Support, Non-U.S. Gov't...	Research Support, Non-U.S. Gov't
108967	{"pubType": ["calendar", "Journal Article"]}	calendar
108969	{"pubType": ["Comparative Study", "Research Su...	Comparative Study
108970	{"pubType": ["Meta-Analysis", "Research Suppor...	Meta-Analysis
108971	{"pubType": ["research-article", "Journal Arti...	research-article

	cited
pub_type
Consensus Development Conference	392.044048
Research Support, N.I.H., Intramural	144.450609
Research Support, U.S. Gov't, P.H.S.	97.102551
editorial	90.848164
Clinical Conference	84.303766
...	...
Equivalence Trial	NaN
Personal Narratives	NaN
Twin Study	NaN
Validation Studies	NaN
obituary	NaN

	pmid	authors
0	32837036.0
0	32837036.0	Das SK Subir Kumar Das<AFFILIATION>Department...
1	32196410.0
1	32196410.0	Lin L Ling Lin<AFFILIATION>Department of Infe...
1	32196410.0	Lu L Lianfeng Lu<AFFILIATION>Department of In...
...	...	...
108970	25412402.0	Jedraszewski D Dawn Jedraszewski<AFFILIATION>...
108970	25412402.0	Cotoi C Chris Cotoi<AFFILIATION>None
108970	25412402.0	Haynes RB R Brian Haynes<AFFILIATION>None
108971	21994484.0
108971	21994484.0	Jaklitsch WM Walter M Jaklitsch<AFFILIATION>D...

	pmid	authors
0	32837036.0	Department of Biochemistry, College of Medicin...
1	32196410.0	Department of Infectious Diseases, Peking Unio...
1	32196410.0	Department of Infectious Diseases, Peking Unio...
2	32778421.0	Université du Québec à Chicoutimi (UQAC), Dépa...
2	32778421.0	Centre de recherche, Institut universitaire de...
...	...	...
108952	18843651.0	Department of Anesthesiology, Pharmacology and...
108956	7704896.0	Department of Biomedicine, University of Pisa,...
108969	3368329.0	National Institute of Genetics, Mishima, Japan.
108970	25412402.0	Department of Clinical Epidemiology and Biosta...
108971	21994484.0	Department of Systematic and Evolutionary Bota...

	country	year_pub	n
0	united states	2020.0	11668.0
1	china	2020.0	7888.0
2	united kingdom	2020.0	6262.0
3	italy	2020.0	5910.0
4	india	2020.0	3297.0
...	...	...	...
1904	mexico	2003.0	1.0
1905	mexico	2005.0	1.0
1906	mexico	2010.0	1.0
1907	micronesia	2014.0	1.0
1908	lebanon	2014.0	1.0