| Title: | Text Analysis for All |
|---|---|
| Description: | An R 'shiny' app designed for diverse text analysis tasks, offering a wide range of methodologies tailored to Natural Language Processing (NLP) needs. It is a versatile, general-purpose tool for analyzing textual data. 'tall' features a comprehensive workflow, including data cleaning, preprocessing, statistical analysis, and visualization, all integrated for effective text analysis. |
| Authors: | Massimo Aria [aut, cre, cph] (ORCID: <https://orcid.org/0000-0002-8517-9411>), Maria Spano [aut] (ORCID: <https://orcid.org/0000-0002-3103-2342>), Luca D'Aniello [aut] (ORCID: <https://orcid.org/0000-0003-1019-9212>), Corrado Cuccurullo [ctb] (ORCID: <https://orcid.org/0000-0002-7401-8575>), Michelangelo Misuraca [ctb] (ORCID: <https://orcid.org/0000-0002-8794-966X>) |
| Maintainer: | Massimo Aria <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.0.0.9000 |
| Built: | 2026-05-15 09:06:27 UTC |
| Source: | https://github.com/massimoaria/tall |
This function calculates the IS (Absorption Index) from Morrone (1996) for all n-grams in the corpus. Only n-grams that start AND end with lexical words are considered.
calculate_ngram_is(dfTag, max_ngram = 5, term = "lemma", pos = c("NOUN", "ADJ", "ADV", "VERB"), min_freq = 1, min_IS_norm = 0)
dfTag |
A data frame with tagged text data containing columns: doc_id, sentence_id, token_id, lemma/token, upos |
max_ngram |
Maximum length of n-grams to generate (default: 5) |
term |
Character string indicating which column to use: "lemma" or "token" (default: "lemma") |
pos |
Character vector of POS tags considered lexical (default: c("NOUN", "ADJ", "ADV", "VERB")) |
min_freq |
Minimum frequency threshold for n-grams (default: 1) |
min_IS_norm |
Minimum normalized IS threshold for n-grams (default: 0) |
The IS index is calculated as:
IS = (sum_i 1/freq_i) × freq_ngram × n_lexical
where freq_i is the frequency of the i-th word in the n-gram, freq_ngram is the frequency of the n-gram, and n_lexical is the number of lexical words it contains.
IS_norm is the normalized version: IS_norm = IS / L^2, where L is the n-gram length.
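As a worked illustration of the formula above, the following base R sketch computes IS and IS_norm by hand for one bigram in a toy lemma vector. The toy data and all variable names are assumptions made for this example. It ignores sentence boundaries, treats both words as lexical, and does not call calculate_ngram_is(), so it is a check on the arithmetic rather than the package's implementation.

```r
# Toy lemma sequence (hypothetical data, for illustration only)
lemma <- c("machine", "learning", "be", "cool",
           "machine", "learning", "machine", "vision",
           "machine", "translation")

# Frequency of each single word in the toy corpus
word_freq <- table(lemma)

# All adjacent bigrams, then the frequency of the bigram of interest
bigrams <- paste(head(lemma, -1), tail(lemma, -1))
ngram_freq <- sum(bigrams == "machine learning")   # 2 occurrences

words <- c("machine", "learning")
n_lexical <- 2            # assumption: both words carry a lexical POS tag
L <- length(words)        # n-gram length

# IS = (sum 1/freq_i) * freq_ngram * n_lexical
IS <- sum(1 / as.numeric(word_freq[words])) * ngram_freq * n_lexical
# IS_norm = IS / L^2
IS_norm <- IS / L^2

print(c(IS = IS, IS_norm = IS_norm))
# (1/4 + 1/2) * 2 * 2 = 3, and 3 / 2^2 = 0.75
```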
OPTIMIZATION: Only n-grams that start AND end with lexical words (as defined by the 'pos' parameter) are generated, significantly reducing computation time.
A tibble with columns: ngram, n_length, ngram_freq, n_lexical, IS, IS_norm
## Not run:
IS <- calculate_ngram_is(dfTag, max_ngram = 4, term = "lemma", min_freq = 2)
head(IS)
## End(Not run)
Compute Syntactic Complexity Metrics per Document (C++ backend)
compute_syntactic_complexity(doc_id, sent_id, token_id, head_token_id, dep_rel, upos)
doc_id |
Character vector of document IDs |
sent_id |
Integer vector of sentence IDs |
token_id |
Integer vector of token IDs |
head_token_id |
Integer vector of head token IDs |
dep_rel |
Character vector of dependency relations |
upos |
Character vector of universal POS tags |
A data.frame with one row per document and complexity metrics
Extract Noun Phrases via Dependency Parsing (C++ backend)
extract_noun_phrases(sent_id, token_id, head_token_id, dep_rel, upos, terms, ngram_max = 5L, max_gap = 3L)
sent_id |
Integer vector of sentence IDs |
token_id |
Integer vector of token IDs |
head_token_id |
Integer vector of head token IDs |
dep_rel |
Character vector of dependency relations |
upos |
Character vector of universal POS tags |
terms |
Character vector of term values (lemma or token) |
ngram_max |
Maximum phrase length (default 5) |
max_gap |
Maximum gap between token positions (default 3) |
A data.frame with columns keyword and ngram
Extract SVO (Subject-Verb-Object) Triplets via Dependency Parsing (C++ backend)
extract_svo_triplets(sent_id, token_id, head_token_id, dep_rel, upos, terms)
sent_id |
Integer vector of sentence IDs |
token_id |
Integer vector of token IDs |
head_token_id |
Integer vector of head token IDs |
dep_rel |
Character vector of dependency relations |
upos |
Character vector of universal POS tags |
terms |
Character vector of term values (lemma or token) |
A data.frame with columns subject, verb, object, rel_type
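To make the idea of an SVO triplet concrete, here is a minimal base R sketch on a single toy sentence. It is not the package's C++ backend: the toy parse, the use of the Universal Dependencies labels nsubj and obj, and every variable name are assumptions for illustration. The rel_type column is left out because its values are not described here.

```r
# Toy dependency parse of "Ahab hunts the whale" (hypothetical data)
toy <- data.frame(
  sent_id       = c(1L, 1L, 1L, 1L),
  token_id      = 1:4,
  head_token_id = c(2L, 0L, 4L, 2L),
  dep_rel       = c("nsubj", "root", "det", "obj"),
  upos          = c("PROPN", "VERB", "DET", "NOUN"),
  terms         = c("Ahab", "hunt", "the", "whale"),
  stringsAsFactors = FALSE
)

# Subjects and objects point to their governing verb via head_token_id
subj <- toy[toy$dep_rel == "nsubj", c("sent_id", "head_token_id", "terms")]
names(subj) <- c("sent_id", "verb_id", "subject")

obj <- toy[toy$dep_rel == "obj", c("sent_id", "head_token_id", "terms")]
names(obj) <- c("sent_id", "verb_id", "object")

verbs <- toy[toy$upos == "VERB", c("sent_id", "token_id", "terms")]
names(verbs) <- c("sent_id", "verb_id", "verb")

# Join subject, verb and object on the shared sentence and verb ids
svo <- merge(merge(subj, verbs, by = c("sent_id", "verb_id")),
             obj, by = c("sent_id", "verb_id"))
svo <- svo[, c("subject", "verb", "object")]
print(svo)
```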
This dataset contains the lemmatized version of the first 10 chapters of the novel Moby-Dick by Herman Melville. The data is structured as a dataframe with multiple linguistic annotations.
data(mobydick)
A dataframe with multiple rows and 26 columns:
Character: Unique document identifier
Integer: Paragraph index within the document
Integer: Sentence index within the paragraph
Character: Original sentence text
Integer: Start position of the token in the sentence
Integer: End position of the token in the sentence
Integer: Unique term identifier
Integer: Token index in the sentence
Character: Original token (word)
Character: Lemmatized form of the token
Character: Universal POS tag
Character: Language-specific POS tag
Character: Morphological features
Integer: Head token in dependency tree
Character: Dependency relation label
Character: Enhanced dependency relations
Character: Additional information
Character: Folder containing the document
Character: The word used to separate the chapters in the original book
Character: Source file name
Logical: Whether the document is selected
Logical: Whether POS was selected
Character: Highlighted sentence
Logical: Whether the document was manually selected
Logical: Whether hapax legomena were removed
Logical: Whether single-character words were removed
Character: Lemmatized form without multi-word units
Extracted and processed from the text of Moby-Dick by Herman Melville.
data(mobydick)
head(mobydick)
Complete optimized workflow for multiword detection and processing. Uses C++ functions and data.table for maximum performance.
process_multiwords_fast(x2, stats, term = c("lemma", "token"))
x2 |
Data frame with token information |
stats |
Data frame with multiword statistics (keyword, ngram columns) |
term |
Type of term to process: "lemma" or "token" |
This function replaces the original switch block with an optimized version that uses:
C++ functions for text recoding
Vectorized operations instead of multiple mutate calls
Pre-computed lookups to avoid repeated joins
Data frame with columns: doc_id, term_id, multiword, upos_multiword, ngram
## Not run:
result <- process_multiwords_fast(dfTag, multiword_stats, term = "lemma")
## End(Not run)
Segment clustering based on the Reinert method - Simple clustering
reinert(x, k = 10, term = "token", segment_size = 40, min_segment_size = 3, min_split_members = 5, cc_test = 0.3, tsj = 3)
x |
tall data frame of documents |
k |
maximum number of clusters to compute |
term |
indicates the type of form: "lemma" or "token". Default value is term = "token". |
segment_size |
number of forms per document. Default value is segment_size = 40 |
min_segment_size |
minimum number of forms per document. Default value is min_segment_size = 3 |
min_split_members |
minimum number of segments in a cluster. Default value is min_split_members = 5 |
cc_test |
contingency coefficient value for feature selection |
tsj |
minimum frequency value for feature selection |
See the references for original articles on the method. Special thanks to the authors of the rainette package (https://github.com/juba/rainette) for inspiring the coding approach used in this function.
The result is a list with classes hclust and reinert_tall.
Reinert M, Une methode de classification descendante hierarchique: application à l'analyse lexicale par contexte, Cahiers de l'analyse des donnees, Volume 8, Numéro 2, 1983. https://www.numdam.org/item/?id=CAD_1983__8_2_187_0
Reinert M., Alceste une méthodologie d'analyse des données textuelles et une application: Aurelia De Gerard De Nerval, Bulletin de Methodologie Sociologique, Volume 26, Numero 1, 1990. doi:10.1177/075910639002600103
Barnier J., Privé F., rainette: The Reinert Method for Textual Data Clustering, 2023, doi:10.32614/CRAN.package.rainette
data(mobydick)
res <- reinert(
  x = mobydick,
  k = 10,
  term = "token",
  segment_size = 40,
  min_segment_size = 5,
  min_split_members = 10,
  cc_test = 0.3,
  tsj = 3
)
This function creates a horizontal bar plot to visualize the most significant terms for each cluster, based on their Chi-squared statistics.
reinPlot(terms, nPlot = 10)
terms |
A data frame containing terms and their associated statistics, such as Chi-squared values,
generated by the
|
nPlot |
Integer. The number of top terms to plot for each sign ( |
The function organizes the input data by Chi-squared values and selects the top terms for each sign. The plot uses different colors for positive and negative terms, with hover tooltips providing detailed information.
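The selection step described above can be sketched in base R as follows. The column names term, chi_square and sign, and the toy values, are assumptions for this example. The real input comes from term_per_cluster() and may be laid out differently, and the plotting itself is not reproduced here.

```r
# Toy term statistics for one cluster (hypothetical values and column names)
terms <- data.frame(
  term       = c("whale", "sea", "ship", "land", "bed", "inn"),
  chi_square = c(25.1, 12.4, 8.3, 15.0, 9.2, 4.1),
  sign       = c("positive", "positive", "positive",
                 "negative", "negative", "negative"),
  stringsAsFactors = FALSE
)

nPlot <- 2  # number of top terms to keep for each sign

# Order by sign, then by decreasing Chi-squared value
ordered <- terms[order(terms$sign, -terms$chi_square), ]

# Keep the first nPlot rows within each sign group
top <- do.call(rbind, lapply(split(ordered, ordered$sign), head, n = nPlot))
print(top$term)
```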
An interactive horizontal bar plot (using plotly) displaying the top terms for each cluster. The plot includes:
Bars representing the Chi-squared values of terms.
Hover information displaying the term and its Chi-squared value.
## Not run:
data(mobydick)
res <- reinert(
  x = mobydick,
  k = 10,
  term = "token",
  segment_size = 40,
  min_segment_size = 5,
  min_split_members = 10,
  cc_test = 0.3,
  tsj = 3
)
tc <- term_per_cluster(res, cutree = NULL, k = 1, negative = FALSE)
fig <- reinPlot(tc$terms, nPlot = 10)
## End(Not run)
This function summarizes the results of the Reinert clustering algorithm, including the most frequent documents and significant terms for each cluster.
The input is the result returned by the term_per_cluster function.
reinSummary(tc, n = 10)
tc |
A list returned by the
|
n |
Integer. The number of top terms (based on Chi-squared value) to include in the summary for each cluster and sign. Default is 10. |
This function performs the following steps:
Extracts the most frequent document for each cluster.
Summarizes the number of documents per cluster.
Selects the top n terms for each cluster, separated by positive and negative signs.
Combines the terms and segment information into a final summary table.
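Two of the steps listed above, picking the top n terms per cluster and finding the most frequent document per cluster, can be illustrated with a small base R sketch. The toy data frames, their column names, and the "; " separator are assumptions for this example and are not taken from reinSummary() itself.

```r
# Toy inputs (hypothetical values and column names)
terms <- data.frame(
  cluster    = c(1, 1, 1, 2, 2, 2),
  term       = c("whale", "sea", "ship", "bed", "inn", "landlord"),
  chi_square = c(25.1, 12.4, 8.3, 15.0, 9.2, 4.1),
  stringsAsFactors = FALSE
)
segments <- data.frame(
  cluster = c(1, 1, 1, 2, 2),
  doc_id  = c("ch1", "ch1", "ch3", "ch2", "ch2"),
  stringsAsFactors = FALSE
)

n <- 2  # number of top terms per cluster

# Top n terms per cluster by Chi-squared value, pasted into one string
top_terms <- vapply(split(terms, terms$cluster), function(d) {
  paste(head(d$term[order(-d$chi_square)], n), collapse = "; ")
}, character(1))

# Document id that occurs most often among each cluster's segments
top_doc <- vapply(split(segments$doc_id, segments$cluster), function(ids) {
  names(which.max(table(ids)))
}, character(1))

toy_summary <- data.frame(
  cluster = names(top_terms),
  top_terms = unname(top_terms),
  most_frequent_document = unname(top_doc[names(top_terms)]),
  stringsAsFactors = FALSE
)
print(toy_summary)
```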
A data frame summarizing the clustering results. The table includes:
cluster: The cluster ID.
Positive terms: The top n positive terms for each cluster, concatenated into a single string.
Negative terms: The top n negative terms for each cluster, concatenated into a single string.
Most frequent document: The document ID that appears most frequently in each cluster.
N. of Documents per Cluster: The number of documents in each cluster.
data(mobydick)
res <- reinert(
  x = mobydick,
  k = 10,
  term = "token",
  segment_size = 40,
  min_segment_size = 5,
  min_split_members = 10,
  cc_test = 0.3,
  tsj = 3
)
tc <- term_per_cluster(res, cutree = NULL, k = 1:10, negative = FALSE)
S <- reinSummary(tc, n = 10)
head(S, 10)
tall performs text analysis for all.
tall(host = "127.0.0.1", port = NULL, launch.browser = TRUE, maxUploadSize = 1000)
host |
The IPv4 address that the application should listen on. Defaults to the shiny.host option, if set, or "127.0.0.1" if not. |
port |
The TCP port that the application should listen on. If the port is not specified and the shiny.port option is set (with options(shiny.port = XX)), that port is used. Otherwise, a random port is used. |
launch.browser |
If true, the system's default web browser will be launched automatically after the app is started. Defaults to true in interactive sessions only. The value of this parameter can also be a function to call with the application's URL. |
maxUploadSize |
Integer. The maximum upload file size, in megabytes. Default value is 1000. |
No return value, called for side effects.
This function processes the results of a document clustering algorithm based on the Reinert method. It computes the terms and their significance for each cluster, as well as the associated document segments.
term_per_cluster(res, cutree = NULL, k = 1, negative = TRUE)
res |
A list containing the results of the Reinert clustering algorithm. Must include at least |
cutree |
A custom cutree structure. If |
k |
A vector of integers specifying the clusters to analyze. Default is |
negative |
Logical. If |
The function integrates document-term matrix rows for missing segments, calculates term statistics for each cluster,
and filters terms based on their significance. Terms can be excluded based on their significance (signExcluded).
A list with the following components:
terms |
A data frame of significant terms for each cluster. Columns include:
|
segments |
A data frame of document segments associated with each cluster. Columns include:
|
data(mobydick)
res <- reinert(
  x = mobydick,
  k = 10,
  term = "token",
  segment_size = 40,
  min_segment_size = 5,
  min_split_members = 10,
  cc_test = 0.3,
  tsj = 3
)
tc <- term_per_cluster(res, cutree = NULL, k = 1:10, negative = FALSE)
head(tc$segments, 10)
head(tc$terms, 10)
Efficiently recodes text values using C++ hash tables. This is a drop-in
replacement for txt_recode but significantly faster for large vectors.
txt_recode_fast(x, from = c(), to = c(), na.rm = FALSE)
x |
A character vector to recode |
from |
A character vector with values of |
to |
A character vector with values you want to use to recode to |
na.rm |
Logical, if set to TRUE, will put all values of |
This function uses C++ hash tables for O(1) lookup time, making it much faster than the pure R implementation, especially for large datasets.
Performance improvement: ~50-100x faster than the pure R txt_recode
for vectors with 100K+ elements.
A character vector of the same length as x where values
matching from are replaced by corresponding values in to
x <- c("NOUN", "VERB", "NOUN", "ADV")
txt_recode_fast(x,
  from = c("VERB", "ADV"),
  to = c("conjugated verb", "adverb")
)
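The replacement semantics of the example above can be sketched in base R with match(). recode_ref is a hypothetical helper written for this illustration, not part of the package. It assumes that values of x not listed in from are left unchanged, and it does not reproduce the na.rm option or the C++ hash table.

```r
# Hypothetical base R reference for the recoding semantics
recode_ref <- function(x, from, to) {
  stopifnot(length(from) == length(to))
  idx <- match(x, from)          # position of each x value in `from`, NA if absent
  out <- x
  hit <- !is.na(idx)
  out[hit] <- to[idx[hit]]       # replace only the matched values
  out
}

x <- c("NOUN", "VERB", "NOUN", "ADV")
recoded <- recode_ref(x,
  from = c("VERB", "ADV"),
  to = c("conjugated verb", "adverb")
)
print(recoded)
# "NOUN" "conjugated verb" "NOUN" "adverb"
```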
Efficiently combines consecutive tokens into multiword expressions using C++. This function scans text sequentially to identify and merge n-gram patterns.
txt_recode_ngram_fast(x, compound, ngram, sep = " ")
x |
Character vector of tokens (e.g., lemmas or tokens) |
compound |
Character vector of multiword expressions to match |
ngram |
Integer vector indicating the length of each compound |
sep |
String separator to use when joining tokens (default: " ") |
When a multiword match is found:
The first position gets the combined multiword expression
Subsequent positions that were merged are set to NA
The function checks n-grams from longest to shortest to prioritize longer matches.
Performance: ~80-150x faster than pure R implementation for typical text data.
Character vector where matched n-grams are combined and subsequent tokens (that were merged) are set to NA
tokens <- c("machine", "learning", "is", "cool", "machine", "learning")
compounds <- c("machine learning")
ngrams <- c(2)
txt_recode_ngram_fast(tokens, compounds, ngrams, " ")
# Returns: c("machine learning", NA, "is", "cool", "machine learning", NA)
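The merging behaviour described for this function (longest match first, merged positions set to NA) can be sketched in plain R. recode_ngram_ref is a hypothetical reference helper written for this illustration. The single left-to-right scan is an assumption about how matches are resolved and may differ from the C++ implementation in edge cases such as overlapping compounds or NA tokens.

```r
# Hypothetical pure R reference for n-gram merging
recode_ngram_ref <- function(x, compound, ngram, sep = " ") {
  out <- x
  n <- length(x)
  sizes <- sort(unique(ngram), decreasing = TRUE)  # try longest n-grams first
  i <- 1L
  while (i <= n) {
    matched <- FALSE
    for (k in sizes) {
      if (i + k - 1L <= n) {
        cand <- paste(x[i:(i + k - 1L)], collapse = sep)
        if (cand %in% compound[ngram == k]) {
          out[i] <- cand                               # first position gets the compound
          if (k > 1L) out[(i + 1L):(i + k - 1L)] <- NA_character_  # merged positions
          i <- i + k                                   # skip past the merged tokens
          matched <- TRUE
          break
        }
      }
    }
    if (!matched) i <- i + 1L
  }
  out
}

tokens <- c("machine", "learning", "is", "cool", "machine", "learning")
merged <- recode_ngram_ref(tokens, compound = c("machine learning"), ngram = c(2))
print(merged)
# "machine learning" NA "is" "cool" "machine learning" NA
```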