| Title: | Gathering Metadata About Publications, Grants, Clinical Trials from 'PubMed' Database |
|---|---|
| Description: | A set of tools to extract bibliographic content from 'PubMed' database using 'NCBI' REST API <https://www.ncbi.nlm.nih.gov/home/develop/api/>. It includes functions to search, download, and convert 'PubMed' bibliographic records into data frames compatible with the 'bibliometrix' package. Features include programmatic query building, batch downloading by PMID, citation enrichment via 'NCBI' E-Link, and robust error handling with automatic retry logic. |
| Authors: | Massimo Aria [aut, cre] (ORCID: <https://orcid.org/0000-0002-8517-9411>) |
| Maintainer: | Massimo Aria <[email protected]> |
| License: | GPL-3 |
| Version: | 1.0.2.9000 |
| Built: | 2026-06-04 06:54:55 UTC |
| Source: | https://github.com/massimoaria/pubmedr |
It converts PubMed data, downloaded using the Entrez API, into a data frame.

    pmApi2df(P, format = "bibliometrix")

| Argument | Description |
|---|---|
| P | is a list following the xml PubMed structure, downloaded using the function |
| format | is a character. If |
a dataframe containing bibliographic records.
To obtain free access to the NCBI API, please visit: https://pmc.ncbi.nlm.nih.gov/tools/developers/
For more information about how to write an NCBI search query, please visit: https://pubmed.ncbi.nlm.nih.gov/help/#search-tags

    # Example: Querying a collection of publications
    query <- "bibliometric*[Title/Abstract] AND english[LA] AND Journal Article[PT] AND 2000:2020[DP]"
    D <- pmApiRequest(query = query, limit = 100, api_key = NULL)
    M <- pmApi2df(D)

It gathers metadata about publications from the NCBI PubMed database.
The use of NCBI PubMed APIs is entirely free, and doesn't necessarily require an API key.
The function pmApiRequest queries NCBI PubMed using a query written in
the Entrez query language or built with the helper function pmQueryBuild.

    pmApiRequest(query, limit, api_key = NULL, batch_size = 200)

| Argument | Description |
|---|---|
| query | is a character. It contains a search query formulated using the Entrez query language. |
| limit | is numeric. It indicates the max number of records to download. |
| api_key | is a character. It contains a valid API key for the NCBI E-utilities. Default is NULL. |
| batch_size | is numeric. The number of records to download per API request. Default is 200. |
Official API documentation is https://www.ncbi.nlm.nih.gov/books/NBK25500/.
a list D composed of 5 objects:
| Element | Description |
|---|---|
| data | It is the xml-structured list containing the bibliographic metadata collection downloaded from the PubMed database. |
| query | It is a character object containing the original query formulated by the user. |
| query_translation | It is a character object containing the query, translated by the NCBI Automatic Terms Translation system and submitted to the PubMed database. |
| records_downloaded | It is an integer object indicating the total number of records downloaded and stored in "data". |
| total_count | It is an integer object indicating the total number of records matching the query (stored in the "query_translation" object). |
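Comparing records_downloaded with total_count shows whether limit cut the result set short. The sketch below uses a hand-built list that only mimics the two count fields described above; its values are hypothetical, not real PubMed output, and download_coverage is an illustrative helper rather than a pubmedR function.

```r
# Hypothetical stand-in for the list returned by pmApiRequest();
# only the two documented count fields are filled in.
D <- list(
  records_downloaded = 100L,
  total_count = 1342L
)

# Illustrative helper (not part of pubmedR): share of the matching
# records that were actually downloaded.
download_coverage <- function(D) {
  if (D$total_count == 0) return(NA_real_)
  D$records_downloaded / D$total_count
}

coverage <- download_coverage(D)

# Tell the user when the download is only a subset of the matches.
if (!is.na(coverage) && coverage < 1) {
  message(sprintf(
    "Downloaded %d of %d matching records; raise 'limit' to get the rest.",
    D$records_downloaded, D$total_count
  ))
}
```

With a real result, the same check can be applied directly to the list returned by pmApiRequest.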
To obtain free access to the NCBI API, please visit: https://pmc.ncbi.nlm.nih.gov/tools/developers/
For more information about how to write an NCBI search query, please visit: https://pubmed.ncbi.nlm.nih.gov/help/#search-tags

    query <- "bibliometric*[Title/Abstract] AND english[LA] AND Journal Article[PT] AND 2000:2020[DP]"
    D <- pmApiRequest(query = query, limit = 100, api_key = NULL)

It retrieves the PMIDs of articles that cite a given PubMed article, using the NCBI E-Link service (PubMed Cited by).

    pmCitedBy(pmid, api_key = NULL)

| Argument | Description |
|---|---|
| pmid | is a character or numeric. A single PubMed identifier (PMID). |
| api_key | is a character. It contains a valid API key for the NCBI E-utilities. Default is NULL. |
This function uses the NCBI E-Link endpoint with linkname "pubmed_pubmed_citedin" to find articles in PubMed that cite the given article.
Note: Citation data in PubMed is based on PubMed Central (PMC) and may not be as comprehensive as commercial citation databases (e.g. Web of Science, Scopus).
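For orientation, the request behind such a lookup can be sketched as an E-utilities URL. The base URL and the parameter names (dbfrom, linkname, id, api_key) follow the public E-utilities documentation; how pmCitedBy builds its request internally is not shown in this manual, so elink_url below is a hypothetical helper and only an approximation of what the package may send.

```r
# Illustrative helper (not part of pubmedR): build an E-Link URL
# for a single PMID and a given linkname.
elink_url <- function(pmid, linkname = "pubmed_pubmed_citedin", api_key = NULL) {
  base <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"
  params <- c(
    dbfrom = "pubmed",
    linkname = linkname,
    # format() avoids scientific notation for numeric PMIDs such as 100000
    id = format(pmid, scientific = FALSE, trim = TRUE)
  )
  # The API key is optional; append it only when supplied.
  if (!is.null(api_key)) params <- c(params, api_key = api_key)
  paste0(base, "?", paste(names(params), params, sep = "=", collapse = "&"))
}

url <- elink_url(25824007)
print(url)
```

The function only assembles a string and performs no network request.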
a list containing:
| Element | Description |
|---|---|
| pmid | The queried PMID. |
| cited_by | A character vector of PMIDs that cite the queried article. |
| count | The number of citing articles found. |

    # Find articles that cite PMID 25824007
    cites <- pmCitedBy(pmid = "25824007")
    cites$count
    cites$cited_by

A convenience wrapper that executes the full pubmedR workflow: query building, record count check, metadata download, conversion to data frame, and (optionally) citation enrichment via NCBI E-Link.

    pmCollect(
      query = NULL,
      terms = NULL,
      fields = "Title/Abstract",
      language = NULL,
      pub_type = NULL,
      date_range = NULL,
      mesh_terms = NULL,
      limit = 2000,
      enrich = FALSE,
      format = "bibliometrix",
      api_key = NULL,
      batch_size = 200,
      verbose = TRUE
    )

| Argument | Description |
|---|---|
| query | is a character. A PubMed search query in Entrez syntax. Alternatively, if |
| terms | is a character or character vector or NULL. Search terms passed to |
| fields | is a character or character vector. PubMed search tags used when building the query from |
| language | is a character or NULL. Language filter for query building. Default is NULL. |
| pub_type | is a character or NULL. Publication type filter for query building. Default is NULL. |
| date_range | is a character vector of length 2 or NULL. Date range in format |
| mesh_terms | is a character or character vector or NULL. MeSH terms for query building. Default is NULL. |
| limit | is numeric. Maximum number of records to download. Default is 2000. |
| enrich | is logical. If |
| format | is a character. Output format passed to |
| api_key | is a character or NULL. NCBI API key. Can also be set via the environment variable |
| batch_size | is numeric. Records per API request. Default is 200. |
| verbose | is logical. If |
This function chains together the core pubmedR functions in the recommended order:
1. Query: If terms is provided, builds the query with pmQueryBuild; otherwise uses the query string directly.
2. Count: Checks the total number of matching records with pmQueryTotalCount.
3. Download: Fetches metadata with pmApiRequest.
4. Convert: Transforms XML to a data frame with pmApi2df.
5. Enrich (optional): Adds citation data with pmEnrichCitations.
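The same chain can be written out by hand when more control over an intermediate step is needed. The sketch below assumes the package is installed under the name pubmedR and uses only the functions and arguments documented in this manual; collect_manually is a hypothetical wrapper, not the actual body of pmCollect, and calling it requires network access.

```r
# Illustrative re-creation of the documented workflow (not the package's code).
collect_manually <- function(terms, limit = 50, enrich = FALSE, api_key = NULL) {
  # 1. Query: build an Entrez query from the search terms
  query <- pubmedR::pmQueryBuild(terms = terms)

  # 2. Count: stop early when nothing matches
  n <- pubmedR::pmQueryTotalCount(query = query, api_key = api_key)$total_count
  if (n == 0) stop("The query matched no records: ", query)

  # 3. Download: never ask for more records than exist
  D <- pubmedR::pmApiRequest(query = query, limit = min(limit, n), api_key = api_key)

  # 4. Convert: XML-structured list to a bibliometrix-style data frame
  M <- pubmedR::pmApi2df(D, format = "bibliometrix")

  # 5. Enrich (optional): reuse D so the records are not fetched again
  if (enrich) M <- pubmedR::pmEnrichCitations(M, P = D, api_key = api_key)

  M
}
```

Defining the function does not contact NCBI; a call such as collect_manually("bibliometric*", limit = 10) does.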
a data frame containing bibliographic records, compatible with the
bibliometrix package when format = "bibliometrix".
pmQueryBuild, pmQueryTotalCount,
pmApiRequest, pmApi2df, pmEnrichCitations

    # Using a raw query string
    M <- pmCollect(
      query = "bibliometric*[Title/Abstract] AND english[LA] AND 2020:2024[DP]",
      limit = 50
    )

    # Using the query builder parameters
    M <- pmCollect(
      terms = "bibliometric*",
      language = "english",
      pub_type = "Journal Article",
      date_range = c("2020", "2024"),
      limit = 50
    )

    # With citation enrichment (slower, requires extra API calls)
    M <- pmCollect(
      terms = "bibliometric*",
      date_range = c("2023", "2024"),
      limit = 10,
      enrich = TRUE
    )

Adds cited references (CR field), reference counts (NR field), and
optionally citation counts (TC field) to a dataframe created by
pmApi2df.

    pmEnrichCitations(
      df,
      P = NULL,
      api_key = NULL,
      resolve_pmids = TRUE,
      only_multiple = FALSE,
      include_TC = TRUE,
      batch_size = 200
    )

| Argument | Description |
|---|---|
| df | is a dataframe. A bibliometric dataframe produced by |
| P | is the optional list returned by |
| api_key | is a character. It contains a valid API key for the NCBI E-utilities. Default is NULL. |
| resolve_pmids | logical. When |
| only_multiple | logical. When |
| include_TC | logical. When |
| batch_size | integer. Number of records per API call when fetching metadata. Defaults to 200 (NCBI's hard cap for efetch). |
Cited references are extracted from the article's PubMed XML
(<ReferenceList>). This is more reliable than the previous E-Link
pubmed_pubmed_refs approach, which only worked for articles
deposited in PMC. References whose XML carries an ArticleId
IdType="pubmed" are resolved to bibliographic metadata in
batched efetch requests so that CR matches the WoS
convention used by bibliometrix; references with only free-text citations
are kept verbatim (uppercased).
The input dataframe with updated CR (cited references),
NR (number of references), and TC (times cited, if
include_TC = TRUE) fields.
pmExtractReferences, pmCitedBy,
pmFetchById, pmApi2df

    query <- "bibliometric*[Title/Abstract] AND english[LA] AND Journal Article[PT] AND 2000:2020[DP]"
    D <- pmApiRequest(query = query, limit = 10, api_key = NULL)
    M <- pmApi2df(D)
    M <- pmEnrichCitations(M, P = D)  # avoid the extra fetch

Walks a result returned by pmApiRequest or
pmFetchById and pulls the <ReferenceList> block out of
every record. Returns one row per cited <Reference>, carrying the
source PMID, the free-text citation, and (when present) the cited PMID and
DOI parsed from <ArticleIdList>.

    pmExtractReferences(P)

| Argument | Description |
|---|---|
| P | A list following the PubMed XML structure produced by |
Reference data in PubMed XML is populated when the publisher submits a
<ReferenceList> block to NLM (which is now common, but not universal).
This function does not call any web API; it merely parses what is already
present in the XML. Use pmEnrichCitations to also resolve cited
PMIDs into structured WoS-style citation strings.
A data.frame with columns:
| Column | Description |
|---|---|
| source_pmid | The PMID of the article that cites the reference. |
| citation | The free-text <Citation> string from PubMed. |
| pmid | The PMID of the cited reference (if available). |
| doi | The DOI of the cited reference (if available). |
Returns an empty data.frame (with the same schema) if no references are found.
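Once the references are in this one-row-per-reference shape, per-article summaries need only base R. The sketch below works on a small hand-made data frame in the documented schema; the rows are hypothetical, and it assumes that a missing cited PMID or DOI is stored as NA, which this manual does not state.

```r
# Hypothetical rows mimicking the output schema of pmExtractReferences()
refs <- data.frame(
  source_pmid = c("111", "111", "222"),
  citation = c(
    "Doe J. Example Journal. 2020;1:1-10.",
    "Roe R. Sample Review. 2019;5:20-30.",
    "Poe E. Demo Letters. 2021;2:3-4."
  ),
  pmid = c("333", NA, "444"),
  doi = c("10.1000/example.1", NA, NA),
  stringsAsFactors = FALSE
)

# Number of cited references per source article
n_refs <- table(refs$source_pmid)

# Share of references per source article that carry a cited PMID
resolved_share <- tapply(!is.na(refs$pmid), refs$source_pmid, mean)

print(n_refs)
print(resolved_share)
```

A low resolved share means that most references of an article exist only as free-text citations.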
pmEnrichCitations, pmFetchById

    D <- pmFetchById("37289732")
    refs <- pmExtractReferences(D)
    head(refs)

It downloads metadata for a set of PubMed articles identified by their PMID (PubMed Identifier). This is useful for retrieving specific known articles, updating existing datasets, or downloading records identified through other sources.

    pmFetchById(pmids, api_key = NULL, batch_size = 200)

| Argument | Description |
|---|---|
| pmids | is a character or numeric vector. A vector of PubMed identifiers (PMIDs). |
| api_key | is a character. It contains a valid API key for the NCBI E-utilities. Default is NULL. |
| batch_size | is numeric. The number of records to download per API request. Default is 200. |
The function uses the NCBI E-utilities efetch endpoint to retrieve records directly
by their PMIDs, without requiring a search query. Records are downloaded in batches
to respect API rate limits.
The output is compatible with pmApi2df for conversion to a dataframe.
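The batching idea can be illustrated without any network access. The following sketch splits a vector of PMIDs into chunks of at most batch_size elements; split_into_batches is a hypothetical helper written for this illustration, and the package's own batching code may differ.

```r
# Illustrative helper (not part of pubmedR): split IDs into
# consecutive chunks of at most 'batch_size' elements.
split_into_batches <- function(pmids, batch_size = 200) {
  group <- ceiling(seq_along(pmids) / batch_size)
  split(pmids, group)
}

# 450 made-up identifiers, used only to show the chunk sizes
pmids <- as.character(seq_len(450))
batches <- split_into_batches(pmids, batch_size = 200)

# One element per API request that would be needed
print(lengths(batches))
```

Each element of batches would correspond to one request, so 450 identifiers with the default batch size translate into three requests.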
a list following the same structure as pmApiRequest output, containing:
| Element | Description |
|---|---|
| data | The xml-structured list containing the bibliographic metadata. |
| query | A character string describing the PMID-based query. |
| query_translation | Same as query for PMID-based searches. |
| records_downloaded | The total number of records downloaded. |
| total_count | The total number of PMIDs requested. |

    # Download specific articles by PMID
    pmids <- c("34813985", "34813456", "34812345")
    D <- pmFetchById(pmids = pmids)
    M <- pmApi2df(D)

It helps to build a valid PubMed search query using the Entrez query language, combining multiple search terms with Boolean operators.

    pmQueryBuild(
      terms = NULL,
      fields = "Title/Abstract",
      language = NULL,
      pub_type = NULL,
      date_range = NULL,
      mesh_terms = NULL,
      author = NULL,
      journal = NULL,
      operator = "AND"
    )

| Argument | Description |
|---|---|
| terms | is a character or character vector. Search terms to look for in title and abstract fields. |
| fields | is a character or character vector. PubMed search tags to apply. Default is "Title/Abstract". |
| language | is a character or NULL. Language filter (e.g. "english", "french"). Default is NULL. |
| pub_type | is a character or NULL. Publication type filter (e.g. "Journal Article", "Review", "Clinical Trial"). Default is NULL. |
| date_range | is a character vector of length 2 or NULL. Date range in format |
| mesh_terms | is a character or character vector or NULL. MeSH (Medical Subject Headings) terms. Default is NULL. |
| author | is a character or character vector or NULL. Author names. Default is NULL. |
| journal | is a character or character vector or NULL. Journal names or abbreviations. Default is NULL. |
| operator | is a character. Boolean operator to combine multiple |
The function constructs a query string compatible with NCBI's Entrez search system.
Multiple terms within the same parameter are combined with the specified operator,
while different parameters (terms, language, pub_type, etc.) are combined with AND.
For more information about PubMed search tags, visit: https://pubmed.ncbi.nlm.nih.gov/help/#search-tags
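The combination rule described above (the chosen operator within one parameter, AND across parameters) can be mimicked in a few lines of base R. This is an illustration only: sketch_query and combine_terms are hypothetical helpers, and the exact string that pmQueryBuild emits (quoting, parentheses, tag spelling) is not documented here and may differ.

```r
# Tag each value and join several values with the chosen operator.
combine_terms <- function(values, tag, operator = "AND") {
  tagged <- paste0(values, "[", tag, "]")
  if (length(tagged) > 1) {
    paste0("(", paste(tagged, collapse = paste0(" ", operator, " ")), ")")
  } else {
    tagged
  }
}

# Join the per-parameter pieces with AND, skipping unset parameters.
sketch_query <- function(terms, language = NULL, operator = "AND") {
  parts <- c(
    combine_terms(terms, "Title/Abstract", operator),
    if (!is.null(language)) combine_terms(language, "LA")
  )
  paste(parts, collapse = " AND ")
}

q <- sketch_query(c("machine learning", "deep learning"),
                  language = "english", operator = "OR")
print(q)
```

The two search terms are joined with OR inside one parenthesized group, while the language filter is attached with AND.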
a character string containing the formatted PubMed query.

    # Simple query
    q <- pmQueryBuild(terms = "bibliometrics", language = "english",
                      pub_type = "Journal Article", date_range = c("2000", "2023"))

    # Multiple terms
    q <- pmQueryBuild(terms = c("machine learning", "deep learning"),
                      operator = "OR", language = "english")

    # MeSH terms query
    q <- pmQueryBuild(mesh_terms = "COVID-19", pub_type = "Review",
                      date_range = c("2020", "2024"))

    # Author search
    q <- pmQueryBuild(terms = "bibliometrics", author = "Aria M")

It counts the number of documents that a query returns from the NCBI PubMed database.

    pmQueryTotalCount(query, api_key = NULL)

| Argument | Description |
|---|---|
| query | is a character. It contains a search query formulated using the Entrez query language. |
| api_key | is a character. It contains a valid API key for the NCBI E-utilities. Default is NULL. |
a list. It contains three objects:
| Element | Description |
|---|---|
| total_count | The total number of records returned by the query. |
| query_translation | The query translation by the NCBI Automatic Terms Translation system. |
| web_history | The web history object. The NCBI provides search history features, which are useful for dealing with large lists of IDs or repeated searches. |
To obtain free access to the NCBI API, please visit: https://pmc.ncbi.nlm.nih.gov/tools/developers/

    query <- "bibliometric*[Title/Abstract] AND english[LA] AND Journal Article[PT] AND 2000:2020[DP]"
    D <- pmQueryTotalCount(query = query, api_key = NULL)

It retrieves the PMIDs of articles that are cited by (referenced in) a given PubMed article, using the NCBI E-Link service.

    pmReferences(pmid, api_key = NULL)

| Argument | Description |
|---|---|
| pmid | is a character or numeric. A single PubMed identifier (PMID). |
| api_key | is a character. It contains a valid API key for the NCBI E-utilities. Default is NULL. |
This function uses the NCBI E-Link endpoint with linkname "pubmed_pubmed_refs" to find articles in PubMed that are referenced by the given article.
Note: Reference data is extracted from PubMed Central (PMC) full-text articles and is only available when the full text is deposited in PMC. Not all PubMed articles have reference data available.
a list containing:
| Element | Description |
|---|---|
| pmid | The queried PMID. |
| references | A character vector of PMIDs referenced by the queried article. |
| count | The number of references found. |
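Because E-Link reference data exists only for articles deposited in PMC, a lookup may return a count of zero even when the article's XML carries a reference list. One possible way to handle this, sketched under the assumption that the package is installed as pubmedR, is to fall back on pmFetchById and pmExtractReferences. The wrapper references_with_fallback is hypothetical, and it assumes that missing cited PMIDs appear as NA or empty strings.

```r
# Illustrative wrapper (not part of pubmedR); calling it requires network access.
references_with_fallback <- function(pmid, api_key = NULL) {
  # First try the E-Link based lookup
  linked <- pubmedR::pmReferences(pmid = pmid, api_key = api_key)
  if (linked$count > 0) return(linked$references)

  # Otherwise parse the <ReferenceList> block of the record itself
  D <- pubmedR::pmFetchById(pmids = pmid, api_key = api_key)
  refs <- pubmedR::pmExtractReferences(D)

  # Keep only references that carry a cited PMID
  cited <- as.character(refs$pmid)
  cited[!is.na(cited) & nzchar(cited)]
}
```

In both branches the result is a character vector of cited PMIDs, so downstream code does not need to know which source supplied it.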

    # Find references of PMID 25824007
    refs <- pmReferences(pmid = "25824007")
    refs$count
    refs$references
