Title: | Functions for Text Cleansing and Text Analysis |
---|---|
Description: | A framework for text cleansing and analysis. Conveniently prepare and process large amounts of text for analysis. Includes various metrics for word counts/frequencies that scale efficiently. Quickly analyze large amounts of text data using a text.table (a data.table created with one word (or unit of text analysis) per row, similar to the tidytext format). Offers flexibility to efficiently work with text data stored in vectors as well as text data formatted as a text.table. |
Authors: | Timothy Conwell |
Maintainer: | Timothy Conwell |
License: | GPL (>= 2) |
Version: | 0.1.0 |
Built: | 2025-02-15 02:37:45 UTC |
Source: | https://github.com/cran/textTools |
Convert a data.table column of character vectors into a column with one row per word, grouped by a grouping column. Optionally splits a column of strings into vectors of constituents.
as.text.table(x, text, split = NULL, group_by = NULL)
x |
A data.table. |
text |
A string, the name of the column in x containing text to un-nest. |
split |
A string with a pattern to split the text in text column into constituent parts. |
group_by |
A vector of column names to group by. Doesn't work if the group by column is a list column. |
A data.table, text column un-nested to one row per word.
as.text.table( x = as.data.table( list( col1 = c( "a", "b" ), col2 = c( tolower("The dog is nice because it picked up the newspaper."), tolower("The dog is extremely nice because it does the dishes.") ) ) ), text = "col2", split = " " )
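For intuition, the un-nesting step can be approximated in base R. This is only a minimal sketch, not the package's implementation: the helper name `unnest_words` is hypothetical, and it returns a plain data.frame rather than a text.table.

```r
# Hypothetical helper: one row per word, keeping the grouping value of each input row.
unnest_words <- function(groups, text, split = " ") {
  pieces <- strsplit(text, split)           # list of word vectors, one per input row
  data.frame(
    group = rep(groups, lengths(pieces)),   # repeat each group label once per word
    word = unlist(pieces),
    stringsAsFactors = FALSE
  )
}

words <- unnest_words(
  groups = c("a", "b"),
  text = c("the dog is nice", "the dog does the dishes")
)
nrow(words)  # 4 + 5 = 9 rows
```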
Flag rows in a text.table with specific words
flag_words(x, text, flag = "flag", words)
x |
A text.table created by as.text.table(). |
text |
A string, the name of the column in x to check for words to flag. |
flag |
A string, the name of the column created with the flag indicator. |
words |
A vector of words to flag in x. |
A text.table, with rows marked with a 1 if the words in those rows are in the vector of words to flag, otherwise 0.
flag_words( as.text.table( x = as.data.table( list( col1 = c( "a", "b" ), col2 = c( tolower("The dog is nice because it picked up the newspaper."), tolower("The dog is extremely nice because it does the dishes.") ) ) ), text = "col2", split = " " ), text = "col2", flag = "is_stopword", words = stopwords )
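The flagging logic amounts to a membership test. The sketch below uses base R on a plain character vector; `stop_list` is a small stand-in for the package's stopwords vector, and the variable names are illustrative only.

```r
words <- c("the", "dog", "is", "nice")
stop_list <- c("the", "is")   # stand-in for a stop word list (assumption)

# 1 where the word is in the list, 0 otherwise
is_stopword <- as.integer(words %in% stop_list)   # 1 0 1 0

# Dropping the flagged rows instead of marking them
kept <- words[is_stopword == 0]                   # "dog" "nice"
```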
Parts of speech for English words/phrases from the Moby Project by Grady Ward. Words with non-ASCII characters have been removed. One row per word.
l_pos
A data.table with 227519 rows and 3 variables:
Lowercase English word or phrase
Lowercase English part of speech, grouped by word into a vector if a word has multiple parts of speech.
TRUE if the word record has a space (contains multiple words), else FALSE.
https://archive.org/details/mobypartofspeech03203gut
Add a column with the parts of speech for each word in a text.table
label_parts_of_speech(x, text)
x |
A text.table created by as.text.table(). |
text |
A string, the name of the column in x to label the parts of speech. |
A text.table, with columns added for the matching part of speech and for flagging if the part of speech is for a multi-word phrase.
label_parts_of_speech( as.text.table( x = as.data.table( list( col1 = c( "a", "b" ), col2 = c( tolower("The dog is nice because it picked up the newspaper."), tolower("The dog is extremely nice because it does the dishes.") ) ) ), text = "col2", split = " " ), text = "col2" )
Create n-grams
ngrams( x, text, group_by = c(), count_col_name = "count", n, ngram_prefix = NULL )
x |
A text.table created by as.text.table(). |
text |
A string, the name of the column in x to build n-grams with. |
group_by |
A vector of column names to group by. Doesn't work if the group by column is a list column. |
count_col_name |
A string, the name of the output column containing the number of times each base record appears in the group. |
n |
An integer, the number of grams to make. |
ngram_prefix |
A string, a prefix to add to the output n-gram columns. |
A text.table, with columns added for n-grams (the word, the count, and percent of the time the gram follows the word).
ngrams( as.text.table( x = as.data.table( list( col1 = c( "a", "b" ), col2 = c( tolower("The dog is nice because it picked up the newspaper."), tolower("The dog is extremely nice because it does the dishes.") ) ) ), text = "col2", split = " " ), text = "col2", group_by = "col1", n = 2 )
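As a rough illustration of what a 2-gram is, the following base R sketch pairs each word with the word that follows it and counts the pairs. The helper `bigrams` is hypothetical and works on a plain character vector; it does not reproduce the grouping or percentage columns described above.

```r
# Hypothetical helper: paste each word together with its successor.
bigrams <- function(words) {
  n <- length(words)
  if (n < 2) return(character(0))   # fewer than two words: no bigrams
  paste(words[-n], words[-1])
}

words <- c("the", "dog", "is", "nice", "the", "dog", "does", "dishes")
counts <- table(bigrams(words))
counts[["the dog"]]  # 2
```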
Parts of speech for English words/phrases from the Moby Project by Grady Ward. Words with non-ASCII characters have been removed. One row per word and part of speech combination.
pos
A data.table with 246690 rows and 3 variables:
Lowercase English word or phrase
Lowercase English part of speech, one per row
TRUE if the word record has a space (contains multiple words), else FALSE.
https://archive.org/details/mobypartofspeech03203gut
"\n", A regular expression to split strings when encountering a new line.
regex_paragraph
A string
"[.?!]\s", A regular expression to split strings when encountering a common end of sentence punctuation pattern.
regex_sentence
A string
" ", A regular expression to split strings when encountering a space.
regex_word
A string
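The three patterns can be tried out with base `strsplit()`. In this sketch the patterns are typed as R string literals (note the doubled backslash in the sentence pattern), on the assumption that they mirror the documented values.

```r
text <- "The dog is nice. It does the dishes! Does it?\nYes."

# New line -> paragraphs
paragraphs <- strsplit(text, "\n")[[1]]

# End-of-sentence punctuation followed by whitespace -> sentences
sentences <- strsplit(paragraphs[1], "[.?!]\\s")[[1]]

# Single space -> words
words <- strsplit(sentences[1], " ")[[1]]
```

The punctuation that triggers a sentence split is consumed by the split, while a final mark with no whitespace after it stays attached to the last sentence.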
Delete rows in a text.table where the number of identical records within a group is more than a certain threshold
rm_frequent_words( x, text, count_col_name = NULL, group_by = c(), max_count, max_count_is_ratio = FALSE, total_count_col = NULL )
x |
A text.table created by as.text.table(). |
text |
A string, the name of the column in x used to determine deletion of rows based on the term frequency. |
count_col_name |
A string, the name to assign to the new column containing the count of each word. If NULL, does not return the counts. |
group_by |
A vector of column names to group by. Doesn't work if the group by column is a list column. |
max_count |
A number, the maximum number of times a word can occur to keep. |
max_count_is_ratio |
TRUE/FALSE, if TRUE, implies the value passed to max_count should be considered a ratio. |
total_count_col |
Name of the column containing the denominator (likely total count of records within a group) to use to calculate the ratio of a word count vs total if max_count_is_ratio is TRUE. |
A text.table, with rows having a duplicate count over a certain threshold deleted.
rm_frequent_words( as.text.table( x = as.data.table( list( col1 = c( "a", "b" ), col2 = c( tolower("The dog is nice because it picked up the newspaper."), tolower("The dog is extremely nice because it does the dishes.") ) ) ), text = "col2", split = " " ), text = "col2", count_col_name = "count", max_count = 1 )
Delete rows in a text.table where the number of identical records within a group is less than a certain threshold
rm_infrequent_words( x, text, count_col_name = NULL, group_by = c(), min_count, min_count_is_ratio = FALSE, total_count_col = NULL )
x |
A text.table created by as.text.table(). |
text |
A string, the name of the column in x used to determine deletion of rows based on the term frequency. |
count_col_name |
A string, the name to assign to the new column containing the count of each word. If NULL, does not return the counts. |
group_by |
A vector of column names to group by. Doesn't work if the group by column is a list column. |
min_count |
A number, the minimum number of times a word must occur to keep. |
min_count_is_ratio |
TRUE/FALSE, if TRUE, implies the value passed to min_count should be considered a ratio. |
total_count_col |
Name of the column containing the denominator (likely total count of records within a group) to use to calculate the ratio of a word count vs total if min_count_is_ratio is TRUE. |
A text.table, with rows having a duplicate count of less than a certain threshold deleted.
rm_infrequent_words( as.text.table( x = as.data.table( list( col1 = c( "a", "b" ), col2 = c( tolower("The dog is nice because it picked up the newspaper."), tolower("The dog is extremely nice because it does the dishes.") ) ) ), text = "col2", split = " " ), text = "col2", count_col_name = "count", min_count = 4 )
rm_infrequent_words( as.text.table( x = as.data.table( list( col1 = c( "a", "b" ), col2 = c( tolower("The dog is nice because it picked up the newspaper and it is the nice kind of dog."), tolower("The dog is extremely nice because it does the dishes and it is cool.") ) ) ), text = "col2", split = " " ), text = "col2", count_col_name = "count", group_by = "col1", min_count = 2 )
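The frequency thresholds used by rm_frequent_words() and rm_infrequent_words() can be pictured as a per-group word count followed by a filter. Below is a minimal base R sketch; the helper `keep_frequent` is hypothetical, operates on plain vectors, and assumes words never contain a tab character.

```r
# Hypothetical helper: TRUE for rows whose word occurs at least
# min_count times within its group.
keep_frequent <- function(group, word, min_count) {
  key <- paste(group, word, sep = "\t")   # one key per (group, word) pair
  n <- as.vector(table(key)[key])         # count of each row's key
  n >= min_count
}

group <- c("a", "a", "a", "b", "b")
word  <- c("dog", "dog", "nice", "dog", "cool")
kept <- word[keep_frequent(group, word, min_count = 2)]  # "dog" "dog"
```

Flipping the comparison to `n <= max_count` would give the corresponding filter for overly frequent words.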
Delete rows in a text.table where the word has more than a maximum number of characters
rm_long_words(x, text, max_char_length)
x |
A text.table created by as.text.table(). |
text |
A string, the name of the column in x used to determine deletion of rows based on the number of characters. |
max_char_length |
A number, the maximum number of characters allowed to not delete a row. |
A text.table, with rows having more than a certain number of characters deleted.
rm_long_words( as.text.table( x = as.data.table( list( col1 = c( "a", "b" ), col2 = c( tolower("The dog is nice because it picked up the newspaper."), tolower("The dog is extremely nice because it does the dishes.") ) ) ), text = "col2", split = " " ), text = "col2", max_char_length = 4 )
Delete rows in a text.table where the records within a group are not also found in other groups (overlapping records)
rm_no_overlap(x, text, group_by = c())
x |
A text.table created by as.text.table(). |
text |
A string, the name of the column in x to determine deletion of rows based on the lack of presence of overlapping records. |
group_by |
A vector of column names to group by. Doesn't work if the group by column is a list column. |
A text.table, with rows not having records found in multiple groups (overlapping records) deleted.
rm_no_overlap( as.text.table( x = as.data.table( list( col1 = c( "a", "b" ), col2 = c( tolower("The dog is nice because it picked up the newspaper."), tolower("The dog is extremely nice because it does the dishes.") ) ) ), text = "col2", split = " " ), text = "col2", group_by = "col1" )
Delete rows in a text.table where the records within a group are also found in other groups (overlapping records)
rm_overlap(x, text, group_by = c())
x |
A text.table created by as.text.table(). |
text |
A string, the name of the column in x to determine deletion of rows based on the presence of overlapping records. |
group_by |
A vector of column names to group by. Doesn't work if the group by column is a list column. |
A text.table, with rows having records found in multiple groups (overlapping records) deleted.
rm_overlap( as.text.table( x = as.data.table( list( col1 = c( "a", "b" ), col2 = c( tolower("The dog is nice because it picked up the newspaper."), tolower("The dog is extremely nice because it does the dishes.") ) ) ), text = "col2", split = " " ), text = "col2", group_by = "col1" )
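The notion of an overlapping record, used by both rm_no_overlap() and rm_overlap(), can be sketched in base R as "a word that appears in more than one group". The helper `in_multiple_groups` is hypothetical and works on plain vectors rather than a text.table.

```r
# Hypothetical helper: TRUE for rows whose word appears in more than one group.
in_multiple_groups <- function(group, word) {
  n_groups <- tapply(group, word, function(g) length(unique(g)))
  as.vector(n_groups[word] > 1)
}

group <- c("a", "a", "b", "b")
word  <- c("dog", "newspaper", "dog", "dishes")
overlap <- in_multiple_groups(group, word)

shared   <- word[overlap]    # "dog" "dog": what remains after dropping non-overlapping rows
distinct <- word[!overlap]   # "newspaper" "dishes": what remains after dropping overlapping rows
```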
Delete rows in a text.table where the word has a certain part of speech
rm_parts_of_speech( x, text, pos_delete = c("adjective", "adverb", "conjunction", "definite article", "interjection", "noun", "noun phrase", "plural", "preposition", "pronoun", "verb (intransitive)", "verb (transitive)", "verb (usu participle)") )
x |
A text.table created by as.text.table(). |
text |
A string, the name of the column in x used to determine deletion of rows based on the part of speech. |
pos_delete |
A vector of parts of speech to delete. At least one of the following: 'adjective', 'adverb', 'conjunction', 'definite article', 'interjection', 'noun', 'noun phrase', 'plural', 'preposition', 'pronoun', 'verb (intransitive)', 'verb (transitive)', 'verb (usu participle)' |
A text.table, with rows matching a part of speech deleted.
rm_parts_of_speech( as.text.table( x = as.data.table( list( col1 = c( "a", "b" ), col2 = c( tolower("The dog is nice because it picked up the newspaper."), tolower("The dog is extremely nice because it does the dishes.") ) ) ), text = "col2", split = " " ), text = "col2", pos_delete = "conjunction" )
Delete rows in a text.table where the record has a certain pattern indicated by a regular expression
rm_regexp_match(x, text, pattern)
x |
A text.table created by as.text.table(). |
text |
A string, the name of the column in x used to determine deletion of rows based on the regular expression. |
pattern |
A regular expression, gets passed to grepl(). |
A text.table, with rows having a certain pattern indicated by a regular expression deleted.
rm_regexp_match( as.text.table( x = as.data.table( list( col1 = c( "a", "b" ), col2 = c( tolower("The dog is nice because it picked up the newspaper."), tolower("The dog is extremely nice because it does the dishes.") ) ) ), text = "col2", split = " " ), text = "col2", pattern = "do" )
Delete rows in a text.table where the word has less than a minimum number of characters
rm_short_words(x, text, min_char_length)
x |
A text.table created by as.text.table(). |
text |
A string, the name of the column in x used to determine deletion of rows based on the number of characters. |
min_char_length |
A number, the minimum number of characters required to not delete a row. |
A text.table, with rows having less than a certain number of characters deleted.
rm_short_words( as.text.table( x = as.data.table( list( col1 = c( "a", "b" ), col2 = c( tolower("The dog is nice because it picked up the newspaper."), tolower("The dog is extremely nice because it does the dishes.") ) ) ), text = "col2", split = " " ), text = "col2", min_char_length = 4 )
Remove rows from a text.table with specific words
rm_words(x, text, words = stopwords)
x |
A text.table created by as.text.table(). |
text |
A string, the name of the column in x to check for words to delete. |
words |
A vector of words to delete from x. |
A text.table, with rows deleted if the words in those rows are in the vector of words to delete.
rm_words( as.text.table( x = as.data.table( list( col1 = c( "a", "b" ), col2 = c( tolower("The dog is nice because it picked up the newspaper."), tolower("The dog is extremely nice because it does the dishes.") ) ) ), text = "col2", split = " " ), text = "col2" )
Generates (pseudo)random strings of the specified char length
sampleStr(char)
char |
An integer, the number of chars to include in the output string. |
A string.
sampleStr(10)
Unique lowercase English stop words from 3 lexicons combined into one vector. Combines snowball, onix, and SMART lists of stopwords.
stopwords
A vector of 728 unique English stop words in lowercase
http://snowball.tartarus.org/algorithms/english/stop.txt
http://www.lextek.com/manuals/onix/stopwords1.html
http://www.lextek.com/manuals/onix/stopwords2.html
Detect if there are any words in a vector also found in another vector.
str_any_match(x, y)
x |
A vector of words. |
y |
A vector of words to test against. |
TRUE/FALSE, TRUE if any words in x are also in y
str_any_match( x = c("a", "dog", "went", "to", "the", "store"), y = c("the") )
str_any_match( x = c("a", "dog", "went", "to", "the", "store"), y = c("apple") )
Count the intersecting words in a vector that are found in another vector (only counts unique words).
str_count_intersect(x, y)
x |
A vector of words. |
y |
A vector of words to test against. |
A number, the count of unique words in x also in y
str_count_intersect( x = c("a", "dog", "went", "to", "the", "store"), y = c("dog", "to", "store") )
Calculates the intersect divided by union of two vectors of words.
str_count_jaccard_similarity(x, y)
x |
A vector of words. |
y |
A vector of words to test against. |
A number, the intersect divided by union of two vectors of words.
str_count_jaccard_similarity( x = c("a", "dog", "went", "to", "the", "store"), y = c("this", "dog", "went", "to", "the", "store") )
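The intersect-over-union ratio described above can be written directly with base R set functions. This is a sketch of the definition rather than the package's code; the function name `jaccard` is hypothetical.

```r
# Jaccard similarity: size of the intersection divided by size of the union.
# intersect() and union() both work on unique elements.
jaccard <- function(x, y) {
  length(intersect(x, y)) / length(union(x, y))
}

x <- c("a", "dog", "went", "to", "the", "store")
y <- c("this", "dog", "went", "to", "the", "store")
jaccard(x, y)  # 5 shared words / 7 distinct words
```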
Count the words in a vector that are found in another vector.
str_count_match(x, y, ratio = FALSE)
x |
A vector of words. |
y |
A vector of words to test against. |
ratio |
TRUE/FALSE, if TRUE, returns the number of words in x with a match in y divided by the number of words in x. |
A number, the count of words in x also in y
str_count_match( x = c("a", "dog", "went", "to", "the", "store"), y = c("dog", "to", "store") )
str_count_match( x = c("a", "dog", "went", "to", "the", "store"), y = c("dog", "to", "store"), ratio = TRUE )
Count the words in a vector that are not found in another vector.
str_count_nomatch(x, y, ratio = FALSE)
x |
A vector of words. |
y |
A vector of words to test against. |
ratio |
TRUE/FALSE, if TRUE, returns the number of words in x without a match in y divided by the number of words in x. |
A number, the count of words in x and not in y
str_count_nomatch( x = c("a", "dog", "went", "to", "the", "store"), y = c("dog", "to", "store") )
str_count_nomatch( x = c("a", "dog", "went", "to", "the", "store"), y = c("dog", "store"), ratio = TRUE )
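The match and no-match counts, and their ratio variants, reduce to a membership test with `%in%`. The following base R sketch shows the arithmetic on the example vectors; it illustrates the definitions and is not the package's implementation.

```r
x <- c("a", "dog", "went", "to", "the", "store")
y <- c("dog", "to", "store")

matched <- x %in% y        # one TRUE/FALSE per word in x

n_match   <- sum(matched)  # 3 words of x are in y
r_match   <- mean(matched) # 3 / 6 = 0.5, the ratio form
n_nomatch <- sum(!matched) # 3 words of x are not in y
```

Unlike the intersect and setdiff counts, this counts every occurrence in x, so repeated words are counted more than once.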
Count words from a vector that are found in the same position in another vector.
str_count_positional_match(x, y, ratio = FALSE)
x |
A vector of words. |
y |
A vector of words to test against. |
ratio |
TRUE/FALSE, if TRUE, returns the number of words in x with a positional match in y divided by the number of words in x. |
A count of the words in x with matches in the same position in y.
str_count_positional_match( x = c("a", "dog", "went", "to", "the", "store"), y = c("this", "dog", "ran", "from", "the", "store") )
Count words from a vector that are not found in the same position in another vector.
str_count_positional_nomatch(x, y, ratio = FALSE)
x |
A vector of words. |
y |
A vector of words to test against. |
ratio |
TRUE/FALSE, if TRUE, returns the number of words in x without a positional match in y divided by the number of words in x. |
A count of the words in x without matches in the same position in y.
str_count_positional_nomatch( x = c("a", "cool", "dog", "went", "to", "the", "store"), y = c("a", "dog", "ran", "from", "the", "store") )
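A positional match compares the two vectors element by element. The base R sketch below uses a hypothetical helper, `positional_match`, and makes one assumption that the text above does not confirm: positions in x beyond the end of y are treated as non-matches.

```r
# Hypothetical helper: TRUE where x and y hold the same word at the same index.
# Assumption: positions of x past the end of y count as "no match".
positional_match <- function(x, y) {
  n <- min(length(x), length(y))
  hit <- logical(length(x))                 # all FALSE to start
  idx <- seq_len(n)
  hit[idx] <- x[idx] == y[idx]
  hit
}

x <- c("a", "dog", "went", "to", "the", "store")
y <- c("this", "dog", "ran", "from", "the", "store")
hit <- positional_match(x, y)

sum(hit)    # 3 positional matches
sum(!hit)   # 3 positional non-matches
x[hit]      # "dog" "the" "store"
```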
Count the words in a vector that don't intersect with another vector (only counts unique words).
str_count_setdiff(x, y)
x |
A vector of words. |
y |
A vector of words to test against. |
A number, the count of unique words in x not also in y
str_count_setdiff( x = c("a", "dog", "dog", "went", "to", "the", "store"), y = c("dog", "to", "store") )
Create a list of a vector of unique words found in x and a vector of the counts of each word in x.
str_counts(x)
x |
A vector of words. |
A list, position one is a vector of unique and sorted words from x, position two is a vector of the counts for each word.
str_counts( x = c("a", "dog", "went", "to", "the", "store", "and", "a", "dog", "went", "to", "another", "store") )
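A words-plus-counts list of this shape can be built from base `table()`. In this sketch the helper name `word_counts` and the element names `words` and `counts` are hypothetical; the package may name or order its list elements differently.

```r
# Hypothetical helper: unique words (in table() order) and how often each occurs.
word_counts <- function(x) {
  tab <- table(x)
  list(words = names(tab), counts = as.vector(tab))
}

wc <- word_counts(c("a", "dog", "went", "to", "the", "store",
                    "and", "a", "dog", "went", "to", "another", "store"))
wc$counts[wc$words == "dog"]  # 2
```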
Combine columns of a data.table into a list in a new column, wraps list(unlist(c(...)))
str_dt_col_combine(...)
... |
Unquoted column names of a data.table. |
A list, with columns combined into a vector if grouped properly
as.data.table( list( col1 = c( "a", "b" ), col2 = c( tolower("The dog is nice because it picked up the newspaper."), tolower("The dog is extremely nice because it does the dishes.") ), col3 = c( "test 1", "test 2" ) ) )[, col4 := .(str_dt_col_combine(col2, col3)), col1]
Extract words from a vector that are found in another vector.
str_extract_match(x, y)
x |
A vector of words. |
y |
A vector of words to test against. |
x, with the words not found in y removed.
str_extract_match( x = c("a", "dog", "went", "to", "the", "store"), y = c("dog", "to", "store") )
Extract words from a vector that are not found in another vector.
str_extract_nomatch(x, y)
x |
A vector of words. |
y |
A vector of words to test against. |
x, with the words found in y removed.
str_extract_nomatch( x = c("a", "dog", "went", "to", "the", "store"), y = c("dog", "to", "store") )
Extract words from a vector that are found in the same position in another vector.
str_extract_positional_match(x, y)
x |
A vector of words. |
y |
A vector of words to test against. |
x, with the words without matches in the same position in y removed.
str_extract_positional_match( x = c("a", "dog", "went", "to", "the", "store"), y = c("this", "dog", "ran", "from", "the", "store") )
Extract words from a vector that are not found in the same position in another vector.
str_extract_positional_nomatch(x, y)
x |
A vector of words. |
y |
A vector of words to test against. |
x, with the words with matches found in the same position in y removed.
str_extract_positional_nomatch( x = c("a", "dog", "went", "to", "the", "store"), y = c("a", "crazy", "dog", "ran", "from", "the", "store") )
Remove and replace excess white space from strings.
str_rm_blank_space(x, replacement = " ")
x |
A vector or string. |
replacement |
A string to replace the blank space with, defaults to " ", which replaces excess space with a single space. |
x, with extra white space removed/replaced.
str_rm_blank_space(c("this is a test. ", " will it work? "))
Remove words from a vector that have more than a maximum number of characters.
str_rm_long_words(x, max_char_length)
x |
A vector of words. |
max_char_length |
An integer, the maximum number of characters a word can have to not be removed. |
x, with the words longer than max_char_length removed.
str_rm_long_words( x = c("a", "dog", "went", "to", "the", "store"), max_char_length = 2 )
Remove and replace non-alphanumeric characters from strings.
str_rm_non_alphanumeric(x, replacement = " ")
x |
A vector or string. |
replacement |
A string to replace the non-alphanumeric characters with, defaults to " ". |
x, with non-alphanumeric (A-z, 0-9) characters removed/replaced.
str_rm_non_alphanumeric(c("test 67890 * % $ "))
Remove and replace non-printable characters from strings.
str_rm_non_printable(x, replacement = " ")
x |
A vector or string. |
replacement |
A string to replace the non-printable characters with, defaults to " ". |
x, with non-printable characters removed/replaced.
str_rm_non_printable(c("test \n\n\n67890 * % $ "))
Remove and replace numbers from strings.
str_rm_numbers(x, replacement = "")
x |
A vector or string. |
replacement |
A string to replace the numbers with, defaults to "". |
x, with numbers 0-9 removed/replaced.
str_rm_numbers(c("test 1a234b5", "test 67890"))
Remove and replace punctuation from strings.
str_rm_punctuation(x, replacement = "")
x |
A vector or string. |
replacement |
A string to replace the punctuation with, defaults to "". |
x, with punctuation removed/replaced.
str_rm_punctuation(c("wait, is this is a test?", "Tests: . ! ?"))
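The string-cleaning helpers above are naturally chained: strip punctuation, strip digits, then collapse the leftover white space. A rough base R equivalent using `gsub()` is sketched below. The regular expressions are assumptions chosen for illustration; the package's own patterns may differ.

```r
x <- "Wait,  is this   test 123?"

# Remove punctuation (POSIX character class; assumed pattern)
no_punct <- gsub("[[:punct:]]", "", x)

# Remove digits 0-9
no_digits <- gsub("[0-9]", "", no_punct)

# Collapse runs of white space to one space, then trim the ends
tidy <- trimws(gsub("\\s+", " ", no_digits))
tidy  # "Wait is this test"
```

Collapsing white space last is a deliberate ordering: removing punctuation and digits can itself leave double spaces behind.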
Remove words from a vector that match a regular expression.
str_rm_regexp_match(x, pattern)
x |
A vector of words. |
pattern |
A regular expression. |
x, with the words matching the regular expression removed.
str_rm_regexp_match( x = c("a", "dog", "went", "to", "the", "store"), pattern = "to" )
Remove words from a vector that don't have a minimum number of characters.
str_rm_short_words(x, min_char_length)
x |
A vector of words. |
min_char_length |
An integer, the minimum number of characters a word can have to not be removed. |
x, with the words shorter than min_char_length removed.
str_rm_short_words( x = c("a", "dog", "went", "to", "the", "store"), min_char_length = 3 )
Remove words from a vector of words found in another vector of words.
str_rm_words(x, y = stopwords)
x |
A vector of words. |
y |
A vector of words to delete from x, defaults to English stop words. |
x, with the words found in y removed.
str_rm_words( x = c("a", "dog", "went", "to", "the", "store"), y = stopwords )
str_rm_words( x = c("a", "dog", "went", "to", "the", "store"), y = c("dog", "store") )
Remove words from a vector based on the number of characters in each word.
str_rm_words_by_length(x, min_char_length = 0, max_char_length = Inf)
x |
A vector of words. |
min_char_length |
An integer, the minimum number of characters a word can have to not be removed. |
max_char_length |
An integer, the maximum number of characters a word can have to not be removed. |
x, with the words shorter than min_char_length or longer than max_char_length removed.
str_rm_words_by_length( x = c("a", "dog", "went", "to", "the", "store"), min_char_length = 3 )
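Length-based filtering comes down to `nchar()` and a pair of inclusive bounds. The base R sketch below mirrors the documented defaults; the helper name `keep_by_length` is hypothetical.

```r
# Hypothetical helper: keep words whose length lies within [min, max], inclusive.
keep_by_length <- function(x, min_char_length = 0, max_char_length = Inf) {
  n <- nchar(x)
  x[n >= min_char_length & n <= max_char_length]
}

x <- c("a", "dog", "went", "to", "the", "store")
long_enough  <- keep_by_length(x, min_char_length = 3)  # "dog" "went" "the" "store"
short_enough <- keep_by_length(x, max_char_length = 2)  # "a" "to"
```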
Create a vector of English words associated with particular parts of speech.
str_stopwords_by_part_of_speech( parts = c("adjective", "adverb", "conjunction", "definite article", "interjection", "noun", "noun phrase", "plural", "preposition", "pronoun", "verb (intransitive)", "verb (transitive)", "verb (usu participle)"), include_multi_word = FALSE )
parts |
A vector, at least one of the following: 'adjective', 'adverb', 'conjunction', 'definite article', 'interjection', 'noun', 'noun phrase', 'plural', 'preposition', 'pronoun', 'verb (intransitive)', 'verb (transitive)', 'verb (usu participle)' |
include_multi_word |
TRUE/FALSE, if TRUE, includes records from the pos data.table where multi_word == TRUE, otherwise excludes these records. |
A vector of words matching the part of speech shown in the data.table pos.
str_stopwords_by_part_of_speech('adjective')
Calls base::tolower(), which converts letters to lowercase. Only included to point out that base::tolower exists and should be used directly.
str_tolower(x)
x |
A vector or string. |
x, converted to lowercase.
str_tolower(c("ALLCAPS", "Some capS"))
Weighted count of the words in a vector that are found in another vector.
str_weighted_count_match(x, y)
x |
A list of words and counts created by str_counts(x). |
y |
A list of words and counts created by str_counts(y). |
A number, the count of words in x also in y scaled by the number of times each word appears in x and y. If a word appears 3 times in x and 2 times in y, the result is 6, assuming no other words match.
str_weighted_count_match( x = str_counts(c("a", "dog", "dog", "went", "to", "the", "store")), y = str_counts(c("dog", "dog", "dog")) )
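The weighting described above (3 occurrences in x times 2 in y gives 6) is a sum of count products over the shared words. The base R sketch below computes it from raw word vectors for brevity, whereas the documented function takes lists produced by str_counts(). The helper name `weighted_match` is hypothetical.

```r
# Hypothetical helper: for every word present in both vectors, multiply its
# count in x by its count in y, then add the products up.
weighted_match <- function(x, y) {
  tx <- table(x)
  ty <- table(y)
  shared <- intersect(names(tx), names(ty))
  sum(as.vector(tx[shared]) * as.vector(ty[shared]))
}

weighted_match(
  x = c("a", "dog", "dog", "went", "to", "the", "store"),
  y = c("dog", "dog", "dog")
)  # "dog": 2 * 3 = 6
```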