An R-package to gather bibliographic data from PubMed.

The goal of pubmedR is to gather metadata about publications, grants and clinical trials from the PubMed database using the NCBI REST APIs.

http://github.com/massimoaria/pubmedR

Latest version: 0.0.4, 2025-02-15

by Massimo Aria

Full Professor in Social Statistics

PhD in Computational Statistics

Laboratory and Research Group STAD Statistics, Technology, Data Analysis

Department of Economics and Statistics

University of Naples Federico II

email [email protected]

http://www.massimoaria.com

Installation

You can install the development version of pubmedR from GitHub with:

install.packages("devtools")
devtools::install_github("massimoaria/pubmedR")

You can install the released version of pubmedR from CRAN with:

install.packages("pubmedR")

Load the package

library(pubmedR)

A brief example

Imagine we want to download a collection of metadata for journal articles that (1) use bibliometric approaches in their research, (2) were published in the past 20 years, and (3) are written in English.

The workflow mainly consists of four steps:

Write the query
Check the effectiveness of the query
Download the collection of document metadata
Convert the downloaded object into a “readable” and “usable” format

By default, access to the NCBI API system is free and does not require an “API key”. Without a key, NCBI limits users to 3 requests per second. Users who register for an “API key” can make up to 10 requests per second.
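
As a rough illustration of what these limits mean in practice, the sketch below converts them into a minimum pause between consecutive requests. The helper min_request_interval is hypothetical and not part of pubmedR; it only matters if you issue your own requests in a loop, and whether pubmedR already throttles internally is not stated here.

```r
# Hypothetical helper (not part of pubmedR): minimum pause, in seconds,
# between requests under the limits quoted above
# (3 per second without a key, 10 per second with one).
min_request_interval <- function(api_key = NULL) {
  requests_per_second <- if (is.null(api_key)) 3 else 10
  1 / requests_per_second
}

min_request_interval(NULL)            # pause needed without a key
min_request_interval("your API key")  # pause needed with a key
```

A custom loop could call Sys.sleep() with this value between requests.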

Obtaining a key is very simple: register for a “my ncbi account” (https://www.ncbi.nlm.nih.gov/account/), then click on a button in the “account settings page” (https://www.ncbi.nlm.nih.gov/account/settings/).

Once you have an API key, set the variable api_key to your key; otherwise, set api_key to NULL:

# if you have got an API key
api_key <- "your API key"

# if you haven't got an API key
api_key <- NULL
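
To avoid hard-coding the key in a script, one option is to read it from an environment variable and fall back to NULL when it is not set. This is a minimal sketch: the helper read_api_key and the variable name NCBI_API_KEY are illustrative assumptions, not something pubmedR defines.

```r
# Assumption: the key is stored in an environment variable called
# NCBI_API_KEY (an illustrative name; use whatever you set in .Renviron).
read_api_key <- function(var = "NCBI_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  # Return NULL when the variable is missing or empty
  if (nzchar(key)) key else NULL
}

api_key <- read_api_key()
```

The resulting api_key can then be passed to the functions used in the following steps.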

First step: Write a query

First, we define a query to submit to the NCBI PubMed system. For example, imagine we want to download a collection of journal articles that use bibliometric analyses, published in the last 20 years in English. Translated into the query language, this requires the following statements:

documents containing the word bibliometric and its variations in their title or abstract: “bibliometric*[Title/Abstract]”
documents are written in the English language: “english[LA]”
documents that are categorized as Journal Article: “Journal Article[PT]”
documents published from 2000 to 2020: “2000:2020[DP]”

Combining all these elements using the Boolean operator “AND”, we obtain the final query:

query <- "bibliometric*[Title/Abstract] AND english[LA] AND Journal Article[PT] AND 2000:2020[DP]"
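
The same string can also be assembled from its parts, which makes it easier to change a single condition. The sketch below uses only base R; the helper year_range and the name query_built are illustrative.

```r
# Hypothetical helper: publication-date condition for a range of years.
year_range <- function(from, to) {
  sprintf("%d:%d[DP]", as.integer(from), as.integer(to))
}

# Each condition from the list above, kept as a separate string.
conditions <- c(
  "bibliometric*[Title/Abstract]",
  "english[LA]",
  "Journal Article[PT]",
  year_range(2000, 2020)
)

# Join the conditions with the Boolean operator AND.
query_built <- paste(conditions, collapse = " AND ")
query_built
```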

Second step: Check the effectiveness of the query

Now, we want to know how many documents could be retrieved by our query.

To do that, we use the function pmQueryTotalCount:

res <- pmQueryTotalCount(query = query, api_key = api_key)

res$total_count

# [1] 2921

res$query_translation

# [1] "(bibliometric[Title/Abstract] OR bibliometrica[Title/Abstract] OR bibliometrical[Title/Abstract] OR bibliometrically[Title/Abstract] OR bibliometricas[Title/Abstract] OR bibliometrician[Title/Abstract] OR bibliometricians[Title/Abstract] OR bibliometricly[Title/Abstract] OR bibliometrico[Title/Abstract] OR bibliometricos[Title/Abstract] OR bibliometrics[Title/Abstract] OR bibliometrics'[Title/Abstract] OR bibliometricsmethod[Title/Abstract] OR bibliometricstrade[Title/Abstract]) AND english[LA] AND Journal Article[PT] AND 2000[PDAT] : 2020[PDAT]"

Third step: Download the collection of document metadata

At this point we can either refine the query or go on to download the collection, in whole or in part (by setting the limit argument lower than res$total_count).
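
For a partial download, the value of limit can be derived from the total count and a cap of your choosing. The helper choose_limit below is a hypothetical convenience, not a pubmedR function.

```r
# Hypothetical helper: number of records to request, capped at max_records.
choose_limit <- function(total_count, max_records = Inf) {
  as.integer(min(total_count, max_records))
}

choose_limit(2921)       # no cap: the whole collection
choose_limit(2921, 500)  # capped: only part of the collection
```

The result can be passed as the limit argument of pmApiRequest.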

Imagine we decide to download the whole collection, composed of 2921 documents:

D <- pmApiRequest(query = query, limit = res$total_count, api_key = api_key)

# Documents  200  of  2921 
# Documents  400  of  2921 
# Documents  600  of  2921 
# Documents  800  of  2921 
# Documents  1000  of  2921 
# Documents  1200  of  2921 
# Documents  1400  of  2921 
# Documents  1600  of  2921 
# Documents  1800  of  2921 
# Documents  2000  of  2921 
# Documents  2200  of  2921 
# Documents  2400  of  2921 
# Documents  2600  of  2921 
# Documents  2800  of  2921 
# Documents  2921  of  2921

The function pmApiRequest returns a list D composed of five objects:

“data”. It is the xml-structured list containing the bibliographic metadata collection downloaded from the PubMed database.
“query”. It is a character object containing the original query formulated by the user.
“query_translation”. It is a character object containing the query, translated by the NCBI Automatic Terms Translation system and submitted to the PubMed database.
“records_downloaded”. It is an integer object indicating the total number of records downloaded and stored in “data”.
“total_counts”. It is an integer object indicating the total number of records matching the query (stored in the “query_translation” object).
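
A quick sanity check after the download is to compare the number of records stored with the number expected. The sketch below runs on a mock list, so that it works without network access; the values are made up and download_is_complete is a hypothetical helper.

```r
# Mock of part of the list described above; values are illustrative only.
D_mock <- list(
  data = list(),  # placeholder for the xml-structured records
  query = "bibliometric*[Title/Abstract]",
  records_downloaded = 200L
)

# Hypothetical helper: TRUE when at least `expected` records were downloaded.
download_is_complete <- function(D, expected) {
  isTRUE(D$records_downloaded >= expected)
}

names(D_mock)
download_is_complete(D_mock, expected = 2921L)
download_is_complete(D_mock, expected = 200L)
```

With a real download, the expected value would typically be the limit that was requested.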

Fourth step: Convert the downloaded object into a “readable” and “usable” format

From the xml-structured object to a “classical” data frame

Finally, we transform the xml-structured object D into a data frame, with cases corresponding to documents and variables to Field Tags as used in the bibliometrix R package (https://CRAN.R-project.org/package=bibliometrix, https://bibliometrix.org/, https://github.com/massimoaria/bibliometrix).

M <- pmApi2df(D)

str(M)

# 'data.frame': 2918 obs. of  27 variables:
 # $ AU    : chr  "DU L;LUO S;LIU G;WANG H;ZHENG L;ZHANG Y" "DUAN L;ZHU G" "YANG C;WANG X;TANG X;BAO X;WANG R" "FERHATOGLU SY;YAPICI N" ...
 # $ AF    : chr  "DU, LIANG;LUO, SHANXIA;LIU, GUINA;WANG, HAO;ZHENG, LINGLI;ZHANG, YONGGANG" "DUAN, LI;ZHU, GANG" "YANG, CHENGXIAN;WANG, XUE;TANG, XIAOLI;BAO, XINJIE;WANG, RENZHI" "FERHATOGLU, S YıLMAZ;YAPICI, N" ...
 # $ TI    : chr  "THE 100 TOP-CITED STUDIES ABOUT PAIN AND DEPRESSION." "MAPPING THEME TRENDS AND KNOWLEDGE STRUCTURE OF MAGNETIC RESONANCE IMAGING STUDIES OF SCHIZOPHRENIA: A BIBLIOME"| __truncated__ "RESEARCH TRENDS OF STEM CELLS IN ISCHEMIC STROKE FROM 1999 TO 2018: A BIBLIOMETRIC ANALYSIS." "A BIBLIOMETRIC ANALYSIS OF THE ARTICLES FOCUSING ON THE SUBJECT OF BRAIN DEATH PUBLISHED IN SCIENTIFIC CITATION"| __truncated__ ...
 # $ SO    : chr  "FRONTIERS IN PSYCHOLOGY" "FRONTIERS IN PSYCHIATRY" "CLINICAL NEUROLOGY AND NEUROSURGERY" "TRANSPLANTATION PROCEEDINGS" ...
 # $ SO_CO : chr  "SWITZERLAND" "SWITZERLAND" "NETHERLANDS" "UNITED STATES" ...
 # $ LA    : chr  "ENG" "ENG" "ENG" "ENG" ...
 # $ DT    : chr  "JOURNAL ARTICLE" "JOURNAL ARTICLE" "JOURNAL ARTICLE" "JOURNAL ARTICLE" ...
 # $ DE    : chr  "BIBLIOMETRIC REVIEW;CITATION;CITATION ANALYSIS;DEPRESSION;PAIN;TOP-CITED" "BIBLIOMETRIC ANALYSIS;CO-OCCURRENCE ANALYSIS;MAGNETIC RESONANCE IMAGING;SCHIZOPHRENIA;SOCIAL NETWORK ANALYSIS;S"| __truncated__ "BIBLIOMETRICS;ISCHEMIC STROKE;PUBLICATIONS;STEM CELLS;VOSVIEWER" "" ...
 # $ ID    : chr  "" "" "" "" ...
 # $ MESH  : chr  "" "" "" "" ...
 # $ AB    : chr  "WITH THE ESTIMATED HIGH PREVALENCE IN THE POPULATION, THE TWO SYMPTOMS OF PAIN AND DEPRESSION THREATEN THE WELL"| __truncated__ "RECENTLY, MAGNETIC RESONANCE IMAGING (MRI) TECHNOLOGY HAS BEEN WIDELY USED TO QUANTITATIVELY ANALYZE BRAIN STRU"| __truncated__ "MANY STUDIES HAVE EVALUATED THE SAFETY AND EFFICACY OF STEM CELLS AS THERAPEUTIC AGENTS FOR ISCHEMIC STROKE. WE"| __truncated__ "ALTHOUGH THE TOPIC OF BRAIN DEATH (BD) HAS BEEN INCREASING IN POPULARITY CONSIDERABLY IN RECENT YEARS BY THE SN"| __truncated__ ...
 # $ C1    : chr  "DEPARTMENT OF PERIODICAL PRESS AND NATIONAL CLINICAL RESEARCH CENTER FOR GERIATRICS, WEST CHINA HOSPITAL, SICHU"| __truncated__ "DEPARTMENT OF PSYCHIATRY, THE FIRST AFFILIATED HOSPITAL OF CHINA MEDICAL UNIVERSITY, SHENYANG, CHINA.;DEPARTMEN"| __truncated__ "DEPARTMENT OF NEUROSURGERY, PEKING UNION MEDICAL COLLEGE HOSPITAL, PEKING UNION MEDICAL COLLEGE & CHINESE ACADE"| __truncated__ "DEPARTMENT OF ANESTHESIOLOGY AND REANIMATION, UNIVERSITY OF HEALTH SCIENCES DR. SIYAMI ERSEK TRAINING AND RESEA"| __truncated__ ...
 # $ CR    : chr  "NA" "NA" "NA" "NA" ...
 # $ TC    : num  0 0 0 0 0 0 0 0 0 0 ...
 # $ SN    : chr  "1664-1078" "1664-0640" "1872-6968" "1873-2623" ...
 # $ J9    : chr  "FRONT PSYCHOL" "FRONT PSYCHIATRY" "CLIN NEUROL NEUROSURG" "TRANSPLANT. PROC." ...
 # $ JI    : chr  "FRONT PSYCHOL" "FRONT PSYCHIATRY" "CLIN NEUROL NEUROSURG" "TRANSPLANT. PROC." ...
 # $ PY    : num  2019 2020 2020 2020 2020 ...
 # $ VL    : chr  "10" "11" "192" NA ...
 # $ DI    : chr  "10.3389/fpsyg.2019.03072" "10.3389/fpsyt.2020.00027" "10.1016/j.clineuro.2020.105740" "10.1016/j.transproceed.2020.01.034" ...
 # $ PG    : chr  "3072" "27" "105740" NA ...
 # $ UT    : chr  "32116876" "32116844" "32114325" "32111384" ...
 # $ PMID  : chr  "32116876" "32116844" "32114325" "32111384" ...
 # $ DB    : chr  "PUBMED" "PUBMED" "PUBMED" "PUBMED" ...
 # $ AU_UN : chr  "DEPARTMENT OF PERIODICAL PRESS AND NATIONAL CLINICAL RESEARCH CENTER FOR GERIATRICS, WEST CHINA HOSPITAL, SICHU"| __truncated__ "DEPARTMENT OF PSYCHIATRY, THE FIRST AFFILIATED HOSPITAL OF CHINA MEDICAL UNIVERSITY, SHENYANG, CHINA.;DEPARTMEN"| __truncated__ "DEPARTMENT OF NEUROSURGERY, PEKING UNION MEDICAL COLLEGE HOSPITAL, PEKING UNION MEDICAL COLLEGE & CHINESE ACADE"| __truncated__ "DEPARTMENT OF ANESTHESIOLOGY AND REANIMATION, UNIVERSITY OF HEALTH SCIENCES DR. SIYAMI ERSEK TRAINING AND RESEA"| __truncated__ ...
 # $ AU_CO : chr  "NA" "NA" "NA" "NA" ...
 # $ AU1_CO: chr  "NA" "NA" "NA" "NA" ...
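
As the output shows, multi-valued fields such as AU store their entries in one string separated by semicolons. The sketch below splits such a field with base R. It uses a two-row stand-in data frame whose values are copied from the output above, so it runs without downloading anything; M_small is an illustrative name.

```r
# Two rows copied from the str() output above; only AU and PY are kept.
M_small <- data.frame(
  AU = c("DU L;LUO S;LIU G;WANG H;ZHENG L;ZHANG Y", "DUAN L;ZHU G"),
  PY = c(2019, 2020),
  stringsAsFactors = FALSE
)

# Split each AU string into a character vector of author names.
authors <- strsplit(M_small$AU, ";", fixed = TRUE)

# Number of authors per document.
n_authors <- lengths(authors)

# Number of documents per author across the whole (small) collection.
author_freq <- sort(table(unlist(authors)), decreasing = TRUE)
```

The same pattern applies to the other semicolon-separated fields, such as DE or C1.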

An overview of the collection using bibliometrix

Now, we can use some bibliometrix functions to get an overview of the bibliographic collection.

bibliometrix is an R-tool for quantitative research in scientometrics and bibliometrics that includes all the main bibliometric methods of analysis (https://CRAN.R-project.org/package=bibliometrix, https://bibliometrix.org/, https://github.com/massimoaria/bibliometrix).

First, we install and load the bibliometrix package:

install.packages("bibliometrix")
library(bibliometrix)

Main information about the collection

Then, we add some metadata to the PubMed collection with convert2df, and we use the biblioAnalysis and summary functions to perform a descriptive analysis of the data frame:

M <- convert2df(D, dbsource = "pubmed", format = "api")

results <- biblioAnalysis(M)
summary(results)

# Main Information about data
# 
#  Documents                             2918 
#  Sources (Journals, Books, etc.)       1275 
#  Keywords Plus (ID)                    2245 
#  Author's Keywords (DE)                4212 
#  Period                                2000 - 2020 
#  Average citations per documents       0 
# 
#  Authors                               8854 
#  Author Appearances                    12928 
#  Authors of single-authored documents  229 
#  Authors of multi-authored documents   8625 
#  Single-authored documents             307 
# 
#  Documents per Author                  0.33 
#  Authors per Document                  3.03 
#  Co-Authors per Documents              4.43 
#  Collaboration Index                   3.31 
#  
#  Document types                     
#  BIOGRAPHY                         4 
#  CASE REPORTS                      2 
#  COMMENT                           8 
#  COMPARATIVE STUDY                 97 
#  EDITORIAL                         2 
#  ENGLISH ABSTRACT                  1 
#  EVALUATION STUDY                  19 
#  HISTORICAL ARTICLE                82 
#  INTRODUCTORY JOURNAL ARTICLE      2 
#  JOURNAL ARTICLE                   2694 
#  LETTER                            3 
#  REVIEW                            4 
#  
# 
# Annual Scientific Production
# 
#  Year    Articles
#     2000       10
#     2001        8
#     2002       10
#     2003       16
#     2004       18
#     2005       27
#     2006       37
#     2007       24
#     2008       43
#     2009       58
#     2010       73
#     2011       93
#     2012      121
#     2013      158
#     2014      172
#     2015      225
#     2016      254
#     2017      276
#     2018      380
#     2019      544
#     2020      159
# 
# Annual Percentage Growth Rate 14.83383 
# 
# 
# Most Productive Authors
# 
#    Authors        Articles Authors        Articles Fractionalized
# 1      SWEILEH WM       62     SWEILEH WM                   25.40
# 2      ZYOUD SH         59     ZYOUD SH                     18.74
# 3      AL-JABI SW       48     HO YS                        13.89
# 4      HO YS            34     AL-JABI SW                   13.00
# 5      YOON DY          27     HUH S                         9.33
# 6      SAWALHA AF       26     BORNMANN L                    9.29
# 7      WANG Y           26     SMITH DR                      9.00
# 8      ZHANG Y          24     ŞENEL E                       7.70
# 9      BORNMANN L       22     YEUNG AWK                     6.22
# 10     KHOSA F          22     SHAMIM T                      6.00
# 
# 
# Top manuscripts per citations
# 
#                                        Paper          TC TCperYear
# 1  DU L, 2019, FRONT PSYCHOL                           0         0
# 2  DUAN L, 2020, FRONT PSYCHIATRY                      0         0
# 3  YANG C, 2020, CLIN NEUROL NEUROSURG                 0         0
# 4  FERHATOGLU SY, 2020, TRANSPLANT. PROC.              0         0
# 5  CHEN L, 2020, PHYTOMEDICINE                         0         0
# 6  KUNZE KN, 2020, AM J SPORTS MED                     0         0
# 7  CUOCOLO R, 2020, INSIGHTS IMAGING                   0         0
# 8  WU M, 2020, J. MATERN. FETAL. NEONATAL. MED.        0         0
# 9  LEE IS, 2020, J PAIN RES                            0         0
# 10 SANT'ANNA FH, 2020, INT. J. SYST. EVOL. MICROBIOL.  0         0
# 
# 
# Corresponding Author's Countries
# 
#   Country Articles Freq  SCP MCP MCP_Ratio
# 1      NA     2918    1 2918   0         0
# 
# 
# SCP: Single Country Publications
# 
# MCP: Multiple Country Publications
# 
# 
# Total Citations per Country
# 
#   Country      Total Citations Average Article Citations
# 1           NA               0                         0
# 
# 
# Most Relevant Sources
# 
#                                                       Sources        Articles
# 1  PLOS ONE                                                               106
# 2  SCIENTOMETRICS                                                          67
# 3  WORLD NEUROSURGERY                                                      55
# 4  ENVIRONMENTAL SCIENCE AND POLLUTION RESEARCH INTERNATIONAL              36
# 5  INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH       34
# 6  MEDICINE                                                                31
# 7  NEURAL REGENERATION RESEARCH                                            29
# 8  BMJ OPEN                                                                26
# 9  JOURNAL OF THE MEDICAL LIBRARY ASSOCIATION : JMLA                       26
# 10 PEERJ                                                                   25
# 
# 
# Most Relevant Keywords
# 
#    Author Keywords (DE)      Articles Keywords-Plus (ID)     Articles
# 1      BIBLIOMETRICS              667  BIBLIOMETRICS             1545
# 2      BIBLIOMETRIC ANALYSIS      331  HUMANS                    1518
# 3      BIBLIOMETRIC               172  PERIODICALS AS TOPIC       592
# 4      CITATION ANALYSIS          123  BIOMEDICAL RESEARCH        483
# 5      H INDEX                     97  PUBLISHING                 419
# 6      PUBLICATIONS                84  JOURNAL IMPACT FACTOR      323
# 7      CITATIONS                   81  PUBLICATIONS               252
# 8      CITATION                    69  RESEARCH                   252
# 9      WEB OF SCIENCE              66  UNITED STATES              219
# 10     SCIENTOMETRICS              64  FEMALE                     174
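
Several of the summary figures above can be checked by hand from the counts in the same output. The formulas below are assumptions inferred from the printed numbers (a compound annual growth rate between the first and last year, and simple ratios of the reported counts); they are not taken from the bibliometrix source code.

```r
# Annual growth rate, assumed to be a compound rate between 2000 and 2020.
first_year <- 2000
last_year <- 2020
articles_first <- 10   # articles in 2000, from the table above
articles_last <- 159   # articles in 2020, from the table above
n_periods <- last_year - first_year
growth_rate <- ((articles_last / articles_first)^(1 / n_periods) - 1) * 100

# Ratios built from the counts reported under "Main Information about data".
documents <- 2918
authors <- 8854
author_appearances <- 12928
documents_per_author <- documents / authors
authors_per_document <- authors / documents
coauthors_per_document <- author_appearances / documents

round(growth_rate, 2)
round(c(documents_per_author, authors_per_document, coauthors_per_document), 2)
```

Under these assumptions the values agree with the reported growth rate and with the Documents per Author, Authors per Document and Co-Authors per Documents lines.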
