Package: textTools 0.1.0

textTools: Functions for Text Cleansing and Text Analysis

A framework for text cleansing and analysis. Conveniently prepare and process large amounts of text for analysis. Includes various metrics for word counts/frequencies that scale efficiently. Quickly analyze large amounts of text data using a text.table (a data.table created with one word (or unit of text analysis) per row, similar to the tidytext format). Offers flexibility to efficiently work with text data stored in vectors as well as text data formatted as a text.table.
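The "one word per row" layout described above can be sketched with data.table alone. This is only an illustration of the idea, not the package's own implementation: it deliberately avoids calling textTools functions, whose signatures are not shown on this page, and the names `docs`, `doc_id`, `text`, and `word` are hypothetical.

```r
library(data.table)

# Hypothetical input: one document per row.
docs <- data.table(
  doc_id = c(1L, 2L),
  text   = c("the cat saw the dog", "the dog sat down")
)

# Split each string on whitespace and emit one row per word,
# keeping the document id as the grouping column.
words <- docs[, .(word = unlist(strsplit(text, "\\s+"))), by = doc_id]

# Word frequencies per document, computed by group.
freq <- words[, .N, by = .(doc_id, word)]

print(words)
print(freq)
```

Because every word is its own row, grouped counts and filters become ordinary data.table operations, which is presumably what lets this format scale to large amounts of text.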

Authors: Timothy Conwell


# Install 'textTools' in R:

install.packages('textTools', repos = c('https://tconwell.r-universe.dev', 'https://cloud.r-project.org'))

On CRAN:

This package does not link to any GitHub/GitLab/R-Forge repository. No issue tracker or development information is available.

1.00 score, 4 scripts, 218 downloads, 47 exports, 1 dependency

Last updated 4 years ago from: 83bcb2e07b. Checks: 9 OK. Indexed: yes.

Target	Result	Latest binary
Doc / Vignettes	OK	Mar 17 2025
R-4.5-win	OK	Mar 17 2025
R-4.5-mac	OK	Mar 17 2025
R-4.5-linux	OK	Mar 17 2025
R-4.4-win	OK	Mar 17 2025
R-4.4-mac	OK	Mar 17 2025
R-4.4-linux	OK	Mar 17 2025
R-4.3-win	OK	Mar 17 2025
R-4.3-mac	OK	Mar 17 2025

Exports: as.text.table flag_words l_pos label_parts_of_speech ngrams pos regex_paragraph regex_sentence regex_word rm_frequent_words rm_infrequent_words rm_long_words rm_no_overlap rm_overlap rm_parts_of_speech rm_regexp_match rm_short_words rm_words sampleStr stopwords str_any_match str_count_intersect str_count_jaccard_similarity str_count_match str_count_nomatch str_count_positional_match str_count_positional_nomatch str_count_setdiff str_counts str_dt_col_combine str_extract_match str_extract_nomatch str_extract_positional_match str_extract_positional_nomatch str_rm_blank_space str_rm_long_words str_rm_non_alphanumeric str_rm_non_printable str_rm_numbers str_rm_punctuation str_rm_regexp_match str_rm_short_words str_rm_words str_rm_words_by_length str_stopwords_by_part_of_speech str_tolower str_weighted_count_match

Dependencies: data.table

Help page	Topics
Convert a data.table column of character vectors into a column with one row per word, grouped by a grouping column. Optionally splits a column of strings into vectors of constituents.	as.text.table
Flag rows in a text.table with specific words	flag_words
Parts of speech for English words from the Moby Project.	l_pos
Add a column with the parts of speech for each word in a text.table	label_parts_of_speech
Create n-grams	ngrams
Parts of speech for English words from the Moby Project.	pos
Regular expression that might be used to split strings of text into component paragraphs.	regex_paragraph
Regular expression that might be used to split strings of text into component sentences.	regex_sentence
Regular expression that might be used to split strings of text into component words.	regex_word
Delete rows in a text.table where the number of identical records within a group is more than a certain threshold	rm_frequent_words
Delete rows in a text.table where the number of identical records within a group is less than a certain threshold	rm_infrequent_words
Delete rows in a text.table where the word has more than a maximum number of characters	rm_long_words
Delete rows in a text.table where the records within a group are not also found in other groups (overlapping records)	rm_no_overlap
Delete rows in a text.table where the records within a group are also found in other groups (overlapping records)	rm_overlap
Delete rows in a text.table where the word has a certain part of speech	rm_parts_of_speech
Delete rows in a text.table where the record has a certain pattern indicated by a regular expression	rm_regexp_match
Delete rows in a text.table where the word has fewer than a minimum number of characters	rm_short_words
Remove rows from a text.table with specific words	rm_words
Generates (pseudo)random strings of the specified character length	sampleStr
Vector of lowercase English stop words.	stopwords
Detect if there are any words in a vector also found in another vector.	str_any_match
Count the intersecting words in a vector that are found in another vector (only counts unique words).	str_count_intersect
Calculates the intersect divided by union of two vectors of words.	str_count_jaccard_similarity
Count the words in a vector that are found in another vector.	str_count_match
Count the words in a vector that are not found in another vector.	str_count_nomatch
Count words from a vector that are found in the same position in another vector.	str_count_positional_match
Count words from a vector that are not found in the same position in another vector.	str_count_positional_nomatch
Count the words in a vector that don't intersect with another vector (only counts unique words).	str_count_setdiff
Create a list of a vector of unique words found in x and a vector of the counts of each word in x.	str_counts
Combine columns of a data.table into a list in a new column, wraps list(unlist(c(...)))	str_dt_col_combine
Extract words from a vector that are found in another vector.	str_extract_match
Extract words from a vector that are not found in another vector.	str_extract_nomatch
Extract words from a vector that are found in the same position in another vector.	str_extract_positional_match
Extract words from a vector that are not found in the same position in another vector.	str_extract_positional_nomatch
Remove and replace excess white space from strings.	str_rm_blank_space
Remove words from a vector that have more than a maximum number of characters.	str_rm_long_words
Remove and replace non-alphanumeric characters from strings.	str_rm_non_alphanumeric
Remove and replace non-printable characters from strings.	str_rm_non_printable
Remove and replace numbers from strings.	str_rm_numbers
Remove and replace punctuation from strings.	str_rm_punctuation
Remove words from a vector that match a regular expression.	str_rm_regexp_match
Remove words from a vector that don't have a minimum number of characters.	str_rm_short_words
Remove words from a vector that are found in another vector of words.	str_rm_words
Remove words from a vector based on the number of characters in each word.	str_rm_words_by_length
Create a vector of English words associated with particular parts of speech.	str_stopwords_by_part_of_speech
Calls base::tolower(), which converts letters to lowercase. Only included to point out that base::tolower exists and should be used directly.	str_tolower
Weighted count of the words in a vector that are found in another vector.	str_weighted_count_match
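
Several of the vector-level descriptions above (unique intersect count, Jaccard similarity, match count, positional match) can be illustrated in base R. The sketch below follows the wording of the help topics only. It is not the package's code, the variable names are hypothetical, and comparing positions over the shorter of the two lengths is an assumption, since the page does not say how unequal lengths are handled.

```r
x <- c("the", "cat", "sat", "the")
y <- c("the", "dog", "sat")

# Unique words shared by both vectors (duplicates ignored).
n_intersect <- length(intersect(x, y))

# Jaccard similarity: size of the intersection divided by size of the union.
jaccard <- n_intersect / length(union(x, y))

# Words in x that also occur anywhere in y (duplicates in x are counted).
n_match <- sum(x %in% y)

# Words in x found at the same position in y.
# Assumption: only the first min(length(x), length(y)) positions are compared.
n <- min(length(x), length(y))
n_pos <- sum(x[seq_len(n)] == y[seq_len(n)])

print(c(intersect = n_intersect, jaccard = jaccard,
        match = n_match, positional = n_pos))
```

Here the shared unique words are "the" and "sat", the union has four words, so the Jaccard similarity is 0.5. The match count is 3 because "the" appears twice in `x`.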
