Exercice chap 5

Go back to chapter 5

  • Import datasets
In [1]:
import pandas as pd
import numpy as np
import time
import re
import seaborn as sns
import plotly.express as px

t = time.time()
table_1 = pd.read_csv('data/table_1.csv')
table_2 = pd.read_csv('data/table_2.csv')
cities = pd.read_csv('data/cities_loc.csv')
time.time() - t
Out[1]:
2.387356758117676
  • Remove articles with no pmid and no DOI, and all articles before 1975
In [2]:
t = time.time()
table_1 = table_1.dropna(subset = ['pmid', 'doi'])
time.time() - t

t = time.time()
table_2 = table_2[table_2.year_pub > 1975].dropna(subset = ['pmid'])
time.time() - t
Out[2]:
0.04082179069519043
  • Merge the two datasets, pmid is unique for each paper
In [3]:
t = time.time()
final_table = table_1.merge(table_2, how = 'left', on = 'pmid').dropna()
time.time() - t

final_table
Out[3]:
pmid doi title source language pubtype pubmodel authors year_pub has_data oa cited
0 32837036.0 10.1007/s12291-020-00919-0 The Pathophysiology, Diagnosis and Treatment o... MED eng {"pubType": ["review-article", "Review", "Jour... "Print-Electronic" <AUTHOR> Das SK Subir Kumar Das<AFFILIATION>De... 2020.0 N Y 0.0
1 32196410.0 10.1080/22221751.2020.1746199 Hypothesis for potential pathogenesis of SARS-... MED eng {"pubType": ["review-article", "Journal Articl... "Print" <AUTHOR> Lin L Ling Lin<AFFILIATION>Department... 2020.0 N Y 61.0
2 32778421.0 10.1016/j.jiph.2020.07.011 The outbreak of the novel severe acute respira... MED eng {"pubType": ["review-article", "Review", "Jour... "Print-Electronic" <AUTHOR> Bchetnia M Mbarka Bchetnia<AFFILIATIO... 2020.0 N Y 0.0
3 32730205.0 10.1016/j.ijsu.2020.07.032 A comparative overview of COVID-19, MERS and S... MED eng {"pubType": ["review-article", "Review", "Jour... "Print-Electronic" <AUTHOR> Liu J Jie Liu<AFFILIATION>Department ... 2020.0 N Y 0.0
4 32394467.0 10.1002/jcp.29785 A global treatments for coronaviruses includin... MED eng {"pubType": ["review-article", "Review", "Jour... "Print-Electronic" <AUTHOR> Yousefi B Bahman Yousefi<AFFILIATION>... 2020.0 N Y 1.0
... ... ... ... ... ... ... ... ... ... ... ... ...
108956 7704896.0 10.1128/cmr.8.1.87-112.1995 Feline immunodeficiency virus: an interesting ... MED eng {"pubType": ["Research Support, Non-U.S. Gov't... "Print" <AUTHOR> Bendinelli M M Bendinelli<AFFILIATION... 1995.0 N N 208.0
108967 22696127.0 10.1007/s13244-012-0153-4 ECR 2012 Book of Abstracts - A - Postergraduat... MED eng {"pubType": ["calendar", "Journal Article"]} "Print" <AUTHOR><AFFILIATION>None 2012.0 N Y 2.0
108969 3368329.0 10.1093/nar/16.suppl.r315 Codon usage tabulated from the GenBank Genetic... MED eng {"pubType": ["Comparative Study", "Research Su... "Print" <AUTHOR> Aota S S Aota<AFFILIATION>National In... 1988.0 N N 90.0
108970 25412402.0 10.1002/14651858.cd000011.pub4 Interventions for enhancing medication adherence. MED eng {"pubType": ["Meta-Analysis", "Research Suppor... "Electronic" <AUTHOR> Nieuwlaat R Robby Nieuwlaat<AFFILIATI... 2014.0 N N 222.0
108971 21994484.0 10.1007/s13225-011-0088-y European species of Hypocrea part II: species ... MED eng {"pubType": ["research-article", "Journal Arti... "Print" <AUTHOR> Jaklitsch WM Walter M Jaklitsch<AFFIL... 2011.0 Y Y 44.0

107103 rows × 12 columns

  • Create a new variable with the number of authors for each paper
In [4]:
t = time.time()
final_table = final_table.assign(nb_aut = final_table.authors.map(lambda aut: len(re.findall('<AUTHOR>',aut))))
time.time() - t 
final_table[['pmid','nb_aut']]
Out[4]:
pmid nb_aut
0 32837036.0 1
1 32196410.0 4
2 32778421.0 4
3 32730205.0 7
4 32394467.0 6
... ... ...
108956 7704896.0 9
108967 22696127.0 1
108969 3368329.0 5
108970 25412402.0 16
108971 21994484.0 1

107103 rows × 2 columns

In [5]:
t = time.time()
final_table = final_table[final_table.nb_aut > 0]
time.time() - t
Out[5]:
0.024890422821044922
  • plot distribution for the log(number of authors +1)
In [6]:
sns.distplot(np.log(final_table.nb_aut +1), hist = False, kde = True,
                 kde_kws = {'linewidth': 3})
Out[6]:
<AxesSubplot:xlabel='nb_aut'>
  • How many papers contains 'deep learning' or 'machine learning' and 'neural network' (also with a 's' for neural networks) in their title ? Create a binary variable to save this information. What is the mean of authors for ML papers and non#ML papers ?

  • Transform has_data and oa into binary variable also, what is the share of ML paper that are oa

In [7]:
t = time.time()
final_table[['title']] = final_table.title.str.lower()
final_table = final_table.assign(
    ML = np.where(final_table.title.str.contains('deep learning|machine learning|neural networks?'),1,0),
    has_data = np.where(final_table.has_data == 'Y',1,0),
    oa = np.where(final_table.oa == 'Y',1,0))
time.time() - t

final_table
Out[7]:
pmid doi title source language pubtype pubmodel authors year_pub has_data oa cited nb_aut ML
0 32837036.0 10.1007/s12291-020-00919-0 the pathophysiology, diagnosis and treatment o... MED eng {"pubType": ["review-article", "Review", "Jour... "Print-Electronic" <AUTHOR> Das SK Subir Kumar Das<AFFILIATION>De... 2020.0 0 1 0.0 1 0
1 32196410.0 10.1080/22221751.2020.1746199 hypothesis for potential pathogenesis of sars-... MED eng {"pubType": ["review-article", "Journal Articl... "Print" <AUTHOR> Lin L Ling Lin<AFFILIATION>Department... 2020.0 0 1 61.0 4 0
2 32778421.0 10.1016/j.jiph.2020.07.011 the outbreak of the novel severe acute respira... MED eng {"pubType": ["review-article", "Review", "Jour... "Print-Electronic" <AUTHOR> Bchetnia M Mbarka Bchetnia<AFFILIATIO... 2020.0 0 1 0.0 4 0
3 32730205.0 10.1016/j.ijsu.2020.07.032 a comparative overview of covid-19, mers and s... MED eng {"pubType": ["review-article", "Review", "Jour... "Print-Electronic" <AUTHOR> Liu J Jie Liu<AFFILIATION>Department ... 2020.0 0 1 0.0 7 0
4 32394467.0 10.1002/jcp.29785 a global treatments for coronaviruses includin... MED eng {"pubType": ["review-article", "Review", "Jour... "Print-Electronic" <AUTHOR> Yousefi B Bahman Yousefi<AFFILIATION>... 2020.0 0 1 1.0 6 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
108956 7704896.0 10.1128/cmr.8.1.87-112.1995 feline immunodeficiency virus: an interesting ... MED eng {"pubType": ["Research Support, Non-U.S. Gov't... "Print" <AUTHOR> Bendinelli M M Bendinelli<AFFILIATION... 1995.0 0 0 208.0 9 0
108967 22696127.0 10.1007/s13244-012-0153-4 ecr 2012 book of abstracts - a - postergraduat... MED eng {"pubType": ["calendar", "Journal Article"]} "Print" <AUTHOR><AFFILIATION>None 2012.0 0 1 2.0 1 0
108969 3368329.0 10.1093/nar/16.suppl.r315 codon usage tabulated from the genbank genetic... MED eng {"pubType": ["Comparative Study", "Research Su... "Print" <AUTHOR> Aota S S Aota<AFFILIATION>National In... 1988.0 0 0 90.0 5 0
108970 25412402.0 10.1002/14651858.cd000011.pub4 interventions for enhancing medication adherence. MED eng {"pubType": ["Meta-Analysis", "Research Suppor... "Electronic" <AUTHOR> Nieuwlaat R Robby Nieuwlaat<AFFILIATI... 2014.0 0 0 222.0 16 0
108971 21994484.0 10.1007/s13225-011-0088-y european species of hypocrea part ii: species ... MED eng {"pubType": ["research-article", "Journal Arti... "Print" <AUTHOR> Jaklitsch WM Walter M Jaklitsch<AFFIL... 2011.0 1 1 44.0 1 0

107103 rows × 14 columns

In [8]:
np.sum(final_table.ML)
np.sum(final_table.oa)
np.sum(final_table.has_data)

final_table.groupby('ML').agg({'nb_aut':np.mean,
                               'oa':np.mean})
Out[8]:
nb_aut oa
ML
0 5.772364 0.726677
1 6.058559 0.801802
  • Clean up pubtype, for simplicity just get the first type
In [9]:
t = time.time()
final_table[['pub_type']] = final_table.pubtype.map(lambda pubtype: re.findall('\[\"(.*?)\"',pubtype)[0])
time.time() - t

final_table[['pubtype','pub_type']]
Out[9]:
pubtype pub_type
0 {"pubType": ["review-article", "Review", "Jour... review-article
1 {"pubType": ["review-article", "Journal Articl... review-article
2 {"pubType": ["review-article", "Review", "Jour... review-article
3 {"pubType": ["review-article", "Review", "Jour... review-article
4 {"pubType": ["review-article", "Review", "Jour... review-article
... ... ...
108956 {"pubType": ["Research Support, Non-U.S. Gov't... Research Support, Non-U.S. Gov't
108967 {"pubType": ["calendar", "Journal Article"]} calendar
108969 {"pubType": ["Comparative Study", "Research Su... Comparative Study
108970 {"pubType": ["Meta-Analysis", "Research Suppor... Meta-Analysis
108971 {"pubType": ["research-article", "Journal Arti... research-article

107103 rows × 2 columns

  • What is the pub type with the highest mean/sd of citation ? (use cited and the cleaned pub_type)
In [10]:
final_table.groupby('pub_type').agg({'cited':np.mean}).sort_values('cited',ascending = False)
final_table.groupby('pub_type').agg({'cited':lambda x : np.std(x,ddof = 1)}).sort_values('cited',ascending = False)
Out[10]:
cited
pub_type
Consensus Development Conference 392.044048
Research Support, N.I.H., Intramural 144.450609
Research Support, U.S. Gov't, P.H.S. 97.102551
editorial 90.848164
Clinical Conference 84.303766
... ...
Equivalence Trial NaN
Personal Narratives NaN
Twin Study NaN
Validation Studies NaN
obituary NaN

83 rows × 1 columns

  • Which are the most representative country by year ? You may want to separate rows for each authors to get all countries involved in the paper, in an authors have multiple affiliations, take the first one. Store it in an other tibble, keep only pmid and authors, get the country for each author from the loc_cities.csv.
In [11]:
countries = list(cities.country.str.lower().drop_duplicates())

countries_pd = final_table[['pmid', 'authors']]

t = time.time()
countries_pd = countries_pd.assign(
    authors=countries_pd.authors.str.split('<AUTHOR>')).explode('authors')
time.time() - t
countries_pd
Out[11]:
pmid authors
0 32837036.0
0 32837036.0 Das SK Subir Kumar Das<AFFILIATION>Department...
1 32196410.0
1 32196410.0 Lin L Ling Lin<AFFILIATION>Department of Infe...
1 32196410.0 Lu L Lianfeng Lu<AFFILIATION>Department of In...
... ... ...
108970 25412402.0 Jedraszewski D Dawn Jedraszewski<AFFILIATION>...
108970 25412402.0 Cotoi C Chris Cotoi<AFFILIATION>None
108970 25412402.0 Haynes RB R Brian Haynes<AFFILIATION>None
108971 21994484.0
108971 21994484.0 Jaklitsch WM Walter M Jaklitsch<AFFILIATION>D...

725404 rows × 2 columns

In [12]:
t = time.time()
countries_pd = countries_pd[(countries_pd.authors != '')
                                & (-countries_pd.authors.str.contains('<AFFILIATION>None'))]
time.time() - t


t = time.time()
countries_pd = countries_pd.assign(
    authors = countries_pd.authors.map(lambda aff: aff.split('<AFFILIATION>')[1])).drop_duplicates()
time.time() - t

countries_pd
Out[12]:
pmid authors
0 32837036.0 Department of Biochemistry, College of Medicin...
1 32196410.0 Department of Infectious Diseases, Peking Unio...
1 32196410.0 Department of Infectious Diseases, Peking Unio...
2 32778421.0 Université du Québec à Chicoutimi (UQAC), Dépa...
2 32778421.0 Centre de recherche, Institut universitaire de...
... ... ...
108952 18843651.0 Department of Anesthesiology, Pharmacology and...
108956 7704896.0 Department of Biomedicine, University of Pisa,...
108969 3368329.0 National Institute of Genetics, Mishima, Japan.
108970 25412402.0 Department of Clinical Epidemiology and Biosta...
108971 21994484.0 Department of Systematic and Evolutionary Bota...

291655 rows × 2 columns

In [13]:
t = time.time()
for abr, name in zip([' USA', ' UK', ' Korea'], ['United States', 'United Kingdom', 'South Korea']):
    countries_pd.authors = countries_pd.authors.map(lambda aff: re.sub(abr,name,aff))
    
countries_pd.authors = countries_pd.authors.str.lower()
time.time() - t
Out[13]:
1.5909476280212402
In [14]:
t = time.time()
countries_pd = countries_pd[countries_pd.authors.map(lambda aff: any([country in aff for country in countries]))]
countries_matched = countries_pd.authors.map(lambda aff: [country for country in countries if country in aff][0])
countries_pd = countries_pd.assign(country = countries_matched).drop('authors', axis = 1)
time.time() - t
Out[14]:
24.330344200134277
In [15]:
t = time.time()
countries_pd = countries_pd[countries_pd.country != ''].drop_duplicates()
time.time() - t
countries_pd

t = time.time()
countries_year = countries_pd.merge(final_table[['pmid','year_pub']], how = 'left', on = 'pmid')
time.time() - t
countries_year


countries_year = countries_year.groupby(['country', 'year_pub']).agg(n = ('pmid', len)).sort_values('n', ascending = False)
countries_year = countries_year.reset_index()
countries_year
Out[15]:
country year_pub n
0 united states 2020.0 11668.0
1 china 2020.0 7888.0
2 united kingdom 2020.0 6262.0
3 italy 2020.0 5910.0
4 india 2020.0 3297.0
... ... ... ...
1904 mexico 2003.0 1.0
1905 mexico 2005.0 1.0
1906 mexico 2010.0 1.0
1907 micronesia 2014.0 1.0
1908 lebanon 2014.0 1.0

1909 rows × 3 columns

  • Select the top 25 of countries involved in coronavirus research since 2001, plot the evolution on a bar chart with plot_ly
In [16]:
wide_countries = countries_year.pivot(index="year_pub", columns="country", values="n").reset_index()
wide_countries = wide_countries[wide_countries.year_pub > 2000]

countries_to_plot = list(wide_countries.fillna(0).drop('year_pub',axis = 1).apply(sum).sort_values()[-25:].index)

fig = px.bar(wide_countries, x="year_pub", y=countries_to_plot)
fig.update_yaxes(type="log")
fig.show()