Package 'textTools'

Title: Functions for Text Cleansing and Text Analysis
Description: A framework for text cleansing and analysis. Conveniently prepare and process large amounts of text for analysis. Includes various metrics for word counts/frequencies that scale efficiently. Quickly analyze large amounts of text data using a text.table (a data.table created with one word (or unit of text analysis) per row, similar to the tidytext format). Offers flexibility to efficiently work with text data stored in vectors as well as text data formatted as a text.table.
Authors: Timothy Conwell
Maintainer: Timothy Conwell <[email protected]>
License: GPL (>= 2)
Version: 0.1.0
Built: 2025-02-15 02:37:45 UTC
Source: https://github.com/cran/textTools

Help Index


Convert a data.table column of character vectors into a column with one row per word, grouped by a grouping column. Optionally splits a column of strings into vectors of constituents.

Description

Convert a data.table column of character vectors into a column with one row per word, grouped by a grouping column. Optionally splits a column of strings into vectors of constituents.

Usage

as.text.table(x, text, split = NULL, group_by = NULL)

Arguments

x

A data.table.

text

A string, the name of the column in x containing text to un-nest.

split

A string with a pattern to split the text in text column into constituent parts.

group_by

A vector of column names to group by. Doesn't work if the group by column is a list column.

Value

A data.table, text column un-nested to one row per word.

Examples

as.text.table(
  x = as.data.table(
    list(
      col1 = c(
        "a",
        "b"
      ),
      col2 = c(
        tolower("The dog is nice because it picked up the newspaper."),
        tolower("The dog is extremely nice because it does the dishes.")
      )
    )
  ),
  text = "col2",
  split = " "
)
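
The un-nesting step can be illustrated without the package. The following is a minimal base R sketch of the idea (split each string, then repeat the grouping value once per word); it uses a plain data.frame rather than a data.table and is not the textTools implementation. The object names docs, words, and long are illustrative only.

```r
# Base R sketch of un-nesting text to one word per row.
# Not the textTools implementation; names here are illustrative.
docs <- data.frame(
  col1 = c("a", "b"),
  col2 = tolower(c(
    "The dog is nice because it picked up the newspaper.",
    "The dog is extremely nice because it does the dishes."
  )),
  stringsAsFactors = FALSE
)

# Split each string into its constituent words on a literal space.
words <- strsplit(docs$col2, " ", fixed = TRUE)

# Repeat each group label once per word, then flatten to one word per row.
long <- data.frame(
  col1 = rep(docs$col1, times = lengths(words)),
  col2 = unlist(words),
  stringsAsFactors = FALSE
)

nrow(long)  # 20 rows: 10 words in each sentence
```

Note that punctuation stays attached to the words ("newspaper.") because only the space is used as a separator.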

Flag rows in a text.table with specific words

Description

Flag rows in a text.table with specific words

Usage

flag_words(x, text, flag = "flag", words)

Arguments

x

A text.table created by as.text.table().

text

A string, the name of the column in x to check for words to flag.

flag

A string, the name of the column created with the flag indicator.

words

A vector of words to flag in x.

Value

A text.table, with rows marked with a 1 if the words in those rows are in the vector of words to flag, otherwise 0.

Examples

flag_words(
as.text.table(
  x = as.data.table(
    list(
      col1 = c(
        "a",
        "b"
      ),
      col2 = c(
        tolower("The dog is nice because it picked up the newspaper."),
        tolower("The dog is extremely nice because it does the dishes.")
      )
    )
  ),
  text = "col2",
  split = " "
),
text = "col2",
flag = "is_stopword",
words = stopwords
)
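
The 1/0 indicator described under Value amounts to a membership test. A minimal base R sketch, assuming the flag is computed by exact matching against the word vector (flag_list below is a hypothetical stand-in for a stop word vector, not the package's stopwords data):

```r
# Sketch of a 1/0 flag column built from a membership test.
# flag_list is a hypothetical stand-in for a stop word vector.
words <- c("the", "dog", "is", "nice")
flag_list <- c("the", "is")

flagged <- data.frame(
  word = words,
  is_stopword = as.integer(words %in% flag_list),  # 1 if flagged, else 0
  stringsAsFactors = FALSE
)

flagged$is_stopword  # 1 0 1 0
```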

Parts of speech for English words from the Moby Project.

Description

Parts of speech for English words/phrases from the Moby Project by Grady Ward. Words with non-ASCII characters have been removed. One row per word.

Usage

l_pos

Format

A data.table with 227519 rows and 3 variables:

word

Lowercase English word or phrase

pos

Lowercase English part of speech, grouped by word into a vector if a word has multiple parts of speech.

multi_word

TRUE if the word record has a space (contains multiple words), else FALSE.

Source

https://archive.org/details/mobypartofspeech03203gut


Add a column with the parts of speech for each word in a text.table

Description

Add a column with the parts of speech for each word in a text.table

Usage

label_parts_of_speech(x, text)

Arguments

x

A text.table created by as.text.table().

text

A string, the name of the column in x to label the parts of speech.

Value

A text.table, with columns added for the matching part of speech and for flagging if the part of speech is for a multi-word phrase.

Examples

label_parts_of_speech(
as.text.table(
  x = as.data.table(
    list(
      col1 = c(
        "a",
        "b"
      ),
      col2 = c(
        tolower("The dog is nice because it picked up the newspaper."),
        tolower("The dog is extremely nice because it does the dishes.")
      )
    )
  ),
  text = "col2",
  split = " "
),
text = "col2"
)

Create n-grams

Description

Create n-grams

Usage

ngrams(
  x,
  text,
  group_by = c(),
  count_col_name = "count",
  n,
  ngram_prefix = NULL
)

Arguments

x

A text.table created by as.text.table().

text

A string, the name of the column in x to build n-grams with.

group_by

A vector of column names to group by. Doesn't work if the group by column is a list column.

count_col_name

A string, the name of the output column containing the number of times each base record appears in the group.

n

An integer, the number of grams to make.

ngram_prefix

A string, a prefix to add to the output n-gram columns.

Value

A text.table, with columns added for n-grams (the word, the count, and percent of the time the gram follows the word).

Examples

ngrams(
as.text.table(
  x = as.data.table(
    list(
      col1 = c(
        "a",
        "b"
      ),
      col2 = c(
        tolower("The dog is nice because it picked up the newspaper."),
        tolower("The dog is extremely nice because it does the dishes.")
      )
    )
  ),
  text = "col2",
  split = " "
),
text = "col2",
group_by = "col1",
n = 2
)
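
To make the Value description concrete, here is a base R sketch of the n = 2 case for a single group: pair each word with the word that follows it, count the pairs, and compute how often each follower appears after a given word. The column names gram_2 and percent are assumptions for illustration and may differ from the columns ngrams() actually creates.

```r
# Base R sketch of bigram counts for one group of words.
# Column names (gram_2, percent) are illustrative assumptions.
words <- c("the", "dog", "is", "nice", "the", "dog", "is", "cool")

# Pair every word with the word that follows it.
bigrams <- data.frame(
  word = head(words, -1),
  gram_2 = tail(words, -1),
  stringsAsFactors = FALSE
)

# Count each (word, following word) pair.
counts <- aggregate(
  count ~ word + gram_2,
  data = transform(bigrams, count = 1L),
  FUN = sum
)

# Share of the time each follower appears after a given word.
counts$percent <- counts$count / ave(counts$count, counts$word, FUN = sum)
```

In this input "is" is followed once by "nice" and once by "cool", so each of those pairs gets a share of 0.5, while "the" is always followed by "dog".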

Parts of speech for English words from the Moby Project.

Description

Parts of speech for English words/phrases from the Moby Project by Grady Ward. Words with non-ASCII characters have been removed. One row per combination of word and part of speech.

Usage

pos

Format

A data.table with 246690 rows and 3 variables:

word

Lowercase English word or phrase

pos

Lowercase English part of speech, one per row

multi_word

TRUE if the word record has a space (contains multiple words), else FALSE.

Source

https://archive.org/details/mobypartofspeech03203gut


Regular expression that might be used to split strings of text into component paragraphs.

Description

"\n", A regular expression to split strings when encountering a new line.

Usage

regex_paragraph

Format

A string


Regular expression that might be used to split strings of text into component sentences.

Description

"[.?!]\s", A regular expression to split strings when encountering a common end of sentence punctuation pattern.

Usage

regex_sentence

Format

A string


Regular expression that might be used to split strings of text into component words.

Description

" ", A regular expression to split strings when encountering a space.

Usage

regex_word

Format

A string
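
The three patterns above can be tried directly with base::strsplit(). The sketch below writes the patterns as literals rather than using the package objects, assuming they hold the values documented above; in R source the sentence pattern needs a doubled backslash.

```r
# Splitting text with the three documented patterns, written as literals.
text <- "The dog is nice. It does the dishes!\nDoes it work? Yes."

# Paragraphs: split on a new line.
paragraphs <- strsplit(text, "\n")[[1]]

# Sentences: split on end-of-sentence punctuation followed by whitespace.
sentences <- strsplit(paragraphs[1], "[.?!]\\s")[[1]]

# Words: split on a single space.
words <- strsplit(sentences[1], " ")[[1]]
```

The sentence pattern consumes the punctuation mark it splits on, and punctuation that is not followed by whitespace (such as the final "!" of the first paragraph) stays attached to the last sentence.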


Delete rows in a text.table where the number of identical records within a group is more than a certain threshold

Description

Delete rows in a text.table where the number of identical records within a group is more than a certain threshold

Usage

rm_frequent_words(
  x,
  text,
  count_col_name = NULL,
  group_by = c(),
  max_count,
  max_count_is_ratio = FALSE,
  total_count_col = NULL
)

Arguments

x

A text.table created by as.text.table().

text

A string, the name of the column in x used to determine deletion of rows based on the term frequency.

count_col_name

A string, the name to assign to the new column containing the count of each word. If NULL, does not return the counts.

group_by

A vector of column names to group by. Doesn't work if the group by column is a list column.

max_count

A number, the maximum number of times a word can occur and still be kept.

max_count_is_ratio

TRUE/FALSE, if TRUE, implies the value passed to max_count should be considered a ratio.

total_count_col

Name of the column containing the denominator (likely total count of records within a group) to use to calculate the ratio of a word count vs total if max_count_is_ratio is TRUE.

Value

A text.table, with rows having a duplicate count over a certain threshold deleted.

Examples

rm_frequent_words(
as.text.table(
  x = as.data.table(
    list(
      col1 = c(
        "a",
        "b"
      ),
      col2 = c(
        tolower("The dog is nice because it picked up the newspaper."),
        tolower("The dog is extremely nice because it does the dishes.")
      )
    )
  ),
  text = "col2",
  split = " "
),
text = "col2",
count_col_name = "count",
max_count = 1
)

Delete rows in a text.table where the number of identical records within a group is less than a certain threshold

Description

Delete rows in a text.table where the number of identical records within a group is less than a certain threshold

Usage

rm_infrequent_words(
  x,
  text,
  count_col_name = NULL,
  group_by = c(),
  min_count,
  min_count_is_ratio = FALSE,
  total_count_col = NULL
)

Arguments

x

A text.table created by as.text.table().

text

A string, the name of the column in x used to determine deletion of rows based on the term frequency.

count_col_name

A string, the name to assign to the new column containing the count of each word. If NULL, does not return the counts.

group_by

A vector of column names to group by. Doesn't work if the group by column is a list column.

min_count

A number, the minimum number of times a word must occur to be kept.

min_count_is_ratio

TRUE/FALSE, if TRUE, implies the value passed to min_count should be considered a ratio.

total_count_col

Name of the column containing the denominator (likely total count of records within a group) to use to calculate the ratio of a word count vs total if min_count_is_ratio is TRUE.

Value

A text.table, with rows having a duplicate count of less than a certain threshold deleted.

Examples

rm_infrequent_words(
as.text.table(
  x = as.data.table(
    list(
      col1 = c(
        "a",
        "b"
      ),
      col2 = c(
        tolower("The dog is nice because it picked up the newspaper."),
        tolower("The dog is extremely nice because it does the dishes.")
      )
    )
  ),
  text = "col2",
  split = " "
),
text = "col2",
count_col_name = "count",
min_count = 4
)

rm_infrequent_words(
as.text.table(
  x = as.data.table(
    list(
      col1 = c(
        "a",
        "b"
      ),
      col2 = c(
        tolower(paste(
          "The dog is nice because it picked up the",
          "newspaper and it is the nice kind of dog."
        )),
        tolower(paste(
          "The dog is extremely nice because it does the dishes",
          "and it is cool."
        ))
      )
    )
  ),
  text = "col2",
  split = " "
),
text = "col2",
count_col_name = "count",
group_by = "col1",
min_count = 2
)
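
The frequency filters (rm_frequent_words() and rm_infrequent_words()) both rest on counting identical records within a group. A base R sketch of that counting step, assuming the minimum threshold is inclusive (a word occurring exactly min_count times is kept); the data and object names are illustrative, and this is not the package's data.table code.

```r
# Base R sketch: count identical words within each group, then filter.
tt <- data.frame(
  col1 = c("a", "a", "a", "b", "b", "b"),
  col2 = c("the", "dog", "the", "the", "dog", "dog"),
  stringsAsFactors = FALSE
)

# Number of identical records within each (group, word) combination.
tt$count <- ave(seq_along(tt$col2), tt$col1, tt$col2, FUN = length)

# Keep words that occur at least min_count times in their group
# (inclusive threshold is an assumption).
min_count <- 2
kept <- tt[tt$count >= min_count, ]
```

Here "the" survives in group "a" and "dog" survives in group "b", because the counts are computed per group rather than over the whole table.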

Delete rows in a text.table where the word has more than a maximum number of characters

Description

Delete rows in a text.table where the word has more than a maximum number of characters

Usage

rm_long_words(x, text, max_char_length)

Arguments

x

A text.table created by as.text.table().

text

A string, the name of the column in x used to determine deletion of rows based on the number of characters.

max_char_length

A number, the maximum number of characters a word can have for its row to be kept.

Value

A text.table, with rows having more than a certain number of characters deleted.

Examples

rm_long_words(
as.text.table(
  x = as.data.table(
    list(
      col1 = c(
        "a",
        "b"
      ),
      col2 = c(
        tolower("The dog is nice because it picked up the newspaper."),
        tolower("The dog is extremely nice because it does the dishes.")
      )
    )
  ),
  text = "col2",
  split = " "
),
text = "col2",
max_char_length = 4
)

Delete rows in a text.table where the records within a group are not also found in other groups (overlapping records)

Description

Delete rows in a text.table where the records within a group are not also found in other groups (overlapping records)

Usage

rm_no_overlap(x, text, group_by = c())

Arguments

x

A text.table created by as.text.table().

text

A string, the name of the column in x to determine deletion of rows based on the lack of presence of overlapping records.

group_by

A vector of column names to group by. Doesn't work if the group by column is a list column.

Value

A text.table, with rows not having records found in multiple groups (overlapping records) deleted.

Examples

rm_no_overlap(
as.text.table(
  x = as.data.table(
    list(
      col1 = c(
        "a",
        "b"
      ),
      col2 = c(
        tolower("The dog is nice because it picked up the newspaper."),
        tolower("The dog is extremely nice because it does the dishes.")
      )
    )
  ),
  text = "col2",
  split = " "
),
text = "col2",
group_by = "col1"
)

Delete rows in a text.table where the records within a group are also found in other groups (overlapping records)

Description

Delete rows in a text.table where the records within a group are also found in other groups (overlapping records)

Usage

rm_overlap(x, text, group_by = c())

Arguments

x

A text.table created by as.text.table().

text

A string, the name of the column in x to determine deletion of rows based on the presence of overlapping records.

group_by

A vector of column names to group by. Doesn't work if the group by column is a list column.

Value

A text.table, with rows having records found in multiple groups (overlapping records) deleted.

Examples

rm_overlap(
as.text.table(
  x = as.data.table(
    list(
      col1 = c(
        "a",
        "b"
      ),
      col2 = c(
        tolower("The dog is nice because it picked up the newspaper."),
        tolower("The dog is extremely nice because it does the dishes.")
      )
    )
  ),
  text = "col2",
  split = " "
),
text = "col2",
group_by = "col1"
)
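
The notion of an overlapping record used by rm_no_overlap() and rm_overlap() can be sketched in base R: a word overlaps if it appears in more than one group. This is an illustration of the idea under that assumption, not the package's implementation; the data and object names are made up for the example.

```r
# Base R sketch: find words that appear in more than one group.
tt <- data.frame(
  col1 = c("a", "a", "a", "b", "b", "b"),
  col2 = c("the", "dog", "newspaper.", "the", "dog", "dishes."),
  stringsAsFactors = FALSE
)

# For each word, count the distinct groups it appears in.
groups_per_word <- tapply(tt$col1, tt$col2, function(g) length(unique(g)))

# TRUE for rows whose word is found in more than one group.
overlapping <- tt$col2 %in% names(groups_per_word)[groups_per_word > 1]

with_overlap <- tt[overlapping, ]      # rows left after removing non-overlapping words
without_overlap <- tt[!overlapping, ]  # rows left after removing overlapping words
```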

Delete rows in a text.table where the word has a certain part of speech

Description

Delete rows in a text.table where the word has a certain part of speech

Usage

rm_parts_of_speech(
  x,
  text,
  pos_delete = c("adjective", "adverb", "conjunction", "definite article",
    "interjection", "noun", "noun phrase", "plural", "preposition", "pronoun",
    "verb (intransitive)", "verb (transitive)", "verb (usu participle)")
)

Arguments

x

A text.table created by as.text.table().

text

A string, the name of the column in x used to determine deletion of rows based on the part of speech.

pos_delete

A vector of parts of speech to delete. At least one of the following: 'adjective', 'adverb', 'conjunction', 'definite article', 'interjection', 'noun', 'noun phrase', 'plural', 'preposition', 'pronoun', 'verb (intransitive)', 'verb (transitive)', 'verb (usu participle)'

Value

A text.table, with rows matching a part of speech deleted.

Examples

rm_parts_of_speech(
as.text.table(
  x = as.data.table(
    list(
      col1 = c(
        "a",
        "b"
      ),
      col2 = c(
        tolower("The dog is nice because it picked up the newspaper."),
        tolower("The dog is extremely nice because it does the dishes.")
      )
    )
  ),
  text = "col2",
  split = " "
),
text = "col2",
pos_delete = "conjunction"
)

Delete rows in a text.table where the record has a certain pattern indicated by a regular expression

Description

Delete rows in a text.table where the record has a certain pattern indicated by a regular expression

Usage

rm_regexp_match(x, text, pattern)

Arguments

x

A text.table created by as.text.table().

text

A string, the name of the column in x used to determine deletion of rows based on the regular expression.

pattern

A regular expression, gets passed to grepl().

Value

A text.table, with rows having a certain pattern indicated by a regular expression deleted.

Examples

rm_regexp_match(
as.text.table(
  x = as.data.table(
    list(
      col1 = c(
        "a",
        "b"
      ),
      col2 = c(
        tolower("The dog is nice because it picked up the newspaper."),
        tolower("The dog is extremely nice because it does the dishes.")
      )
    )
  ),
  text = "col2",
  split = " "
),
text = "col2",
pattern = "do"
)

Delete rows in a text.table where the word has less than a minimum number of characters

Description

Delete rows in a text.table where the word has less than a minimum number of characters

Usage

rm_short_words(x, text, min_char_length)

Arguments

x

A text.table created by as.text.table().

text

A string, the name of the column in x used to determine deletion of rows based on the number of characters.

min_char_length

A number, the minimum number of characters a word must have for its row to be kept.

Value

A text.table, with rows having less than a certain number of characters deleted.

Examples

rm_short_words(
as.text.table(
  x = as.data.table(
    list(
      col1 = c(
        "a",
        "b"
      ),
      col2 = c(
        tolower("The dog is nice because it picked up the newspaper."),
        tolower("The dog is extremely nice because it does the dishes.")
      )
    )
  ),
  text = "col2",
  split = " "
),
text = "col2",
min_char_length = 4
)

Remove rows from a text.table with specific words

Description

Remove rows from a text.table with specific words

Usage

rm_words(x, text, words = stopwords)

Arguments

x

A text.table created by as.text.table().

text

A string, the name of the column in x to check for words to delete.

words

A vector of words to delete from x.

Value

A text.table, with rows deleted if the words in those rows are in the vector of words to delete.

Examples

rm_words(
as.text.table(
  x = as.data.table(
    list(
      col1 = c(
        "a",
        "b"
      ),
      col2 = c(
        tolower("The dog is nice because it picked up the newspaper."),
        tolower("The dog is extremely nice because it does the dishes.")
      )
    )
  ),
  text = "col2",
  split = " "
),
text = "col2"
)

Generates (pseudo)random strings of the specified char length

Description

Generates (pseudo)random strings of the specified char length

Usage

sampleStr(char)

Arguments

char

An integer, the number of characters to include in the output string.

Value

A string.

Examples

sampleStr(10)

Vector of lowercase English stop words.

Description

Unique lowercase English stop words from 3 lexicons combined into one vector. Combines snowball, onix, and SMART lists of stopwords.

Usage

stopwords

Format

A vector of 728 unique English stop words in lowercase

Source

http://snowball.tartarus.org/algorithms/english/stop.txt

http://www.lextek.com/manuals/onix/stopwords1.html

http://www.lextek.com/manuals/onix/stopwords2.html


Detect if there are any words in a vector also found in another vector.

Description

Detect if there are any words in a vector also found in another vector.

Usage

str_any_match(x, y)

Arguments

x

A vector of words.

y

A vector of words to test against.

Value

TRUE/FALSE, TRUE if any words in x are also in y

Examples

str_any_match(
x = c("a", "dog", "went", "to", "the", "store"),
y = c("the")
)
str_any_match(
x = c("a", "dog", "went", "to", "the", "store"),
y = c("apple")
)

Count the intersecting words in a vector that are found in another vector (only counts unique words).

Description

Count the intersecting words in a vector that are found in another vector (only counts unique words).

Usage

str_count_intersect(x, y)

Arguments

x

A vector of words.

y

A vector of words to test against.

Value

A number, the count of unique words in x also in y

Examples

str_count_intersect(
x = c("a", "dog", "went", "to", "the", "store"),
y = c("dog", "to", "store")
)

Calculates the intersect divided by union of two vectors of words.

Description

Calculates the intersect divided by union of two vectors of words.

Usage

str_count_jaccard_similarity(x, y)

Arguments

x

A vector of words.

y

A vector of words to test against.

Value

A number, the intersect divided by union of two vectors of words.

Examples

str_count_jaccard_similarity(
x = c("a", "dog", "went", "to", "the", "store"),
y = c("this", "dog", "went", "to", "the", "store")
)
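
The intersect-over-union calculation can be reproduced with base R set functions. This sketch follows the definition in the Description and is offered as a cross-check, not as the package's code:

```r
# Jaccard similarity of two word vectors with base R set operations.
x <- c("a", "dog", "went", "to", "the", "store")
y <- c("this", "dog", "went", "to", "the", "store")

# intersect() and union() both work on unique values.
jaccard <- length(intersect(x, y)) / length(union(x, y))

jaccard  # 5 shared words out of 7 distinct words, i.e. 5 / 7
```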

Count the words in a vector that are found in another vector.

Description

Count the words in a vector that are found in another vector.

Usage

str_count_match(x, y, ratio = FALSE)

Arguments

x

A vector of words.

y

A vector of words to test against.

ratio

TRUE/FALSE, if TRUE, returns the number of words in x with a match in y divided by the number of words in x.

Value

A number, the count of words in x also in y

Examples

str_count_match(
x = c("a", "dog", "went", "to", "the", "store"),
y = c("dog", "to", "store")
)
str_count_match(
x = c("a", "dog", "went", "to", "the", "store"),
y = c("dog", "to", "store"),
ratio = TRUE
)
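
The difference between this count and the unique-word count of str_count_intersect() shows up once x contains repeated words. A base R sketch, assuming str_count_match() counts every element of x that has a match in y (inferred from the contrast with "only counts unique words"):

```r
# Element-wise match count versus unique-word intersect count.
x <- c("a", "dog", "dog", "went", "to", "the", "store")
y <- c("dog", "to", "store")

n_match <- sum(x %in% y)                # 4: "dog" is counted twice
n_intersect <- length(intersect(x, y))  # 3: each shared word counted once
match_ratio <- n_match / length(x)      # 4 / 7, the ratio = TRUE analogue
```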

Count the words in a vector that are not found in another vector.

Description

Count the words in a vector that are not found in another vector.

Usage

str_count_nomatch(x, y, ratio = FALSE)

Arguments

x

A vector of words.

y

A vector of words to test against.

ratio

TRUE/FALSE, if TRUE, returns the number of words in x without a match in y divided by the number of words in x.

Value

A number, the count of words in x and not in y

Examples

str_count_nomatch(
x = c("a", "dog", "went", "to", "the", "store"),
y = c("dog", "to", "store")
)
str_count_nomatch(
x = c("a", "dog", "went", "to", "the", "store"),
y = c("dog", "store"),
ratio = TRUE
)

Count words from a vector that are found in the same position in another vector.

Description

Count words from a vector that are found in the same position in another vector.

Usage

str_count_positional_match(x, y, ratio = FALSE)

Arguments

x

A vector of words.

y

A vector of words to test against.

ratio

TRUE/FALSE, if TRUE, returns the number of words in x with a positional match in y divided by the number of words in x.

Value

A count of the words in x with matches in the same position in y.

Examples

str_count_positional_match(
x = c("a", "dog", "went", "to", "the", "store"),
y = c("this", "dog", "ran", "from", "the", "store")
)
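
A positional match compares the two vectors element by element. The base R sketch below compares only the positions both vectors share; how the package treats vectors of unequal length is not stated here, so that truncation is an assumption.

```r
# Base R sketch of positional matching.
x <- c("a", "dog", "went", "to", "the", "store")
y <- c("this", "dog", "ran", "from", "the", "store")

# Compare only positions present in both vectors (an assumption).
n <- min(length(x), length(y))
same_position <- x[seq_len(n)] == y[seq_len(n)]

n_positional <- sum(same_position)                 # 3
matched_words <- x[seq_len(n)][same_position]      # "dog" "the" "store"
```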

Count words from a vector that are not found in the same position in another vector.

Description

Count words from a vector that are not found in the same position in another vector.

Usage

str_count_positional_nomatch(x, y, ratio = FALSE)

Arguments

x

A vector of words.

y

A vector of words to test against.

ratio

TRUE/FALSE, if TRUE, returns the number of words in x without a positional match in y divided by the number of words in x.

Value

A count of the words in x without matches in the same position in y.

Examples

str_count_positional_nomatch(
x = c("a", "cool", "dog", "went", "to", "the", "store"),
y = c("a", "dog", "ran", "from", "the", "store")
)

Count the words in a vector that don't intersect with another vector (only counts unique words).

Description

Count the words in a vector that don't intersect with another vector (only counts unique words).

Usage

str_count_setdiff(x, y)

Arguments

x

A vector of words.

y

A vector of words to test against.

Value

A number, the count of unique words in x not also in y

Examples

str_count_setdiff(
x = c("a", "dog", "dog", "went", "to", "the", "store"),
y = c("dog", "to", "store")
)

Create a list of a vector of unique words found in x and a vector of the counts of each word in x.

Description

Create a list of a vector of unique words found in x and a vector of the counts of each word in x.

Usage

str_counts(x)

Arguments

x

A vector of words.

Value

A list, position one is a vector of unique and sorted words from x, position two is a vector of the counts for each word.

Examples

str_counts(
x = c("a", "dog", "went", "to", "the", "store", "and", "a", "dog", "went", "to", "another", "store")
)
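
The returned structure (sorted unique words in position one, their counts in position two) can be approximated with base::table(). This is a sketch of the shape of the result, not the package's implementation:

```r
# Base R sketch of a words-and-counts list.
x <- c("a", "dog", "went", "to", "the", "store", "and",
       "a", "dog", "went", "to", "another", "store")

tab <- table(x)  # table() sorts the unique values

counts <- list(
  names(tab),       # position one: unique, sorted words
  as.integer(tab)   # position two: count for each word
)
```

For this input the words are "a", "and", "another", "dog", "store", "the", "to", "went", with counts 2, 1, 1, 2, 2, 1, 2, 2.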

Combine columns of a data.table into a list in a new column, wraps list(unlist(c(...)))

Description

Combine columns of a data.table into a list in a new column, wraps list(unlist(c(...)))

Usage

str_dt_col_combine(...)

Arguments

...

Unquoted column names of a data.table.

Value

A list, with columns combined into a vector if grouped properly

Examples

as.data.table(
list(
  col1 = c(
    "a",
    "b"
  ),
  col2 = c(
    tolower("The dog is nice because it picked up the newspaper."),
    tolower("The dog is extremely nice because it does the dishes.")
  ),
  col3 = c(
    "test 1",
    "test 2"
  )
)
)[, col4 := .(str_dt_col_combine(col2, col3)), col1]

Extract words from a vector that are found in another vector.

Description

Extract words from a vector that are found in another vector.

Usage

str_extract_match(x, y)

Arguments

x

A vector of words.

y

A vector of words to test against.

Value

x, with the words not found in y removed.

Examples

str_extract_match(
x = c("a", "dog", "went", "to", "the", "store"),
y = c("dog", "to", "store")
)

Extract words from a vector that are not found in another vector.

Description

Extract words from a vector that are not found in another vector.

Usage

str_extract_nomatch(x, y)

Arguments

x

A vector of words.

y

A vector of words to test against.

Value

x, with the words found in y removed.

Examples

str_extract_nomatch(
x = c("a", "dog", "went", "to", "the", "store"),
y = c("dog", "to", "store")
)

Extract words from a vector that are found in the same position in another vector.

Description

Extract words from a vector that are found in the same position in another vector.

Usage

str_extract_positional_match(x, y)

Arguments

x

A vector of words.

y

A vector of words to test against.

Value

x, with the words without matches in the same position in y removed.

Examples

str_extract_positional_match(
x = c("a", "dog", "went", "to", "the", "store"),
y = c("this", "dog", "ran", "from", "the", "store")
)

Extract words from a vector that are not found in the same position in another vector.

Description

Extract words from a vector that are not found in the same position in another vector.

Usage

str_extract_positional_nomatch(x, y)

Arguments

x

A vector of words.

y

A vector of words to test against.

Value

x, with the words with matches found in the same position in y removed.

Examples

str_extract_positional_nomatch(
x = c("a", "dog", "went", "to", "the", "store"),
y = c("a", "crazy", "dog", "ran", "from", "the", "store")
)

Remove and replace excess white space from strings.

Description

Remove and replace excess white space from strings.

Usage

str_rm_blank_space(x, replacement = " ")

Arguments

x

A vector or string.

replacement

A string to replace the blank space with, defaults to " ", which replaces excess space with a single space.

Value

x, with extra white space removed/replaced.

Examples

str_rm_blank_space(c("this     is   a test.  ", "    will it    work? "))

Remove words from a vector that have more than a maximum number of characters.

Description

Remove words from a vector that have more than a maximum number of characters.

Usage

str_rm_long_words(x, max_char_length)

Arguments

x

A vector of words.

max_char_length

An integer, the maximum number of characters a word can have to not be removed.

Value

x, with the words that have more than max_char_length characters removed.

Examples

str_rm_long_words(
x = c("a", "dog", "went", "to", "the", "store"),
max_char_length = 2
)

Remove and replace non-alphanumeric characters from strings.

Description

Remove and replace non-alphanumeric characters from strings.

Usage

str_rm_non_alphanumeric(x, replacement = " ")

Arguments

x

A vector or string.

replacement

A string to replace the non-alphanumeric characters with, defaults to " ".

Value

x, with non-alphanumeric (A-z, 0-9) characters removed/replaced.

Examples

str_rm_non_alphanumeric(c("test 67890 * % $ "))

Remove and replace non-printable characters from strings.

Description

Remove and replace non-printable characters from strings.

Usage

str_rm_non_printable(x, replacement = " ")

Arguments

x

A vector or string.

replacement

A string to replace the non-printable characters with, defaults to " ".

Value

x, with non-printable characters removed/replaced.

Examples

str_rm_non_printable(c("test \n\n\n67890 * % $ "))

Remove and replace numbers from strings.

Description

Remove and replace numbers from strings.

Usage

str_rm_numbers(x, replacement = "")

Arguments

x

A vector or string.

replacement

A string to replace the numbers with, defaults to "".

Value

x, with numbers 0-9 removed/replaced.

Examples

str_rm_numbers(c("test 1a234b5", "test 67890"))

Remove and replace punctuation from strings.

Description

Remove and replace punctuation from strings.

Usage

str_rm_punctuation(x, replacement = "")

Arguments

x

A vector or string.

replacement

A string to replace the punctuation with, defaults to "".

Value

x, with punctuation removed/replaced.

Examples

str_rm_punctuation(c("wait, is this is a test?", "Tests: . ! ?"))
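
The character-level cleaners above (blank space, non-alphanumeric, non-printable, numbers, punctuation) are typically chained. The base R sketch below shows one such chain using gsub() with POSIX character classes; the exact patterns the package uses are not documented here, so these patterns are assumptions chosen for illustration.

```r
# Base R sketch of a cleaning chain; patterns are illustrative assumptions.
x <- c("wait, is this is a test?", "Tests: . ! ?")

# Remove punctuation.
no_punct <- gsub("[[:punct:]]", "", x)

# Collapse runs of white space into a single space.
squeezed <- gsub("\\s+", " ", no_punct)

# Drop leading and trailing white space.
cleaned <- trimws(squeezed)

cleaned  # "wait is this is a test" "Tests"
```

Removing punctuation before collapsing white space matters: deleting the marks in "Tests: . ! ?" leaves a run of spaces that the second step then reduces.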

Remove words from a vector that match a regular expression.

Description

Remove words from a vector that match a regular expression.

Usage

str_rm_regexp_match(x, pattern)

Arguments

x

A vector of words.

pattern

A regular expression.

Value

x, with the words matching the regular expression removed.

Examples

str_rm_regexp_match(
x = c("a", "dog", "went", "to", "the", "store"),
pattern = "to"
)

Remove words from a vector that don't have a minimum number of characters.

Description

Remove words from a vector that don't have a minimum number of characters.

Usage

str_rm_short_words(x, min_char_length)

Arguments

x

A vector of words.

min_char_length

An integer, the minimum number of characters a word must have to not be removed.

Value

x, with the words that have fewer than min_char_length characters removed.

Examples

str_rm_short_words(
x = c("a", "dog", "went", "to", "the", "store"),
min_char_length = 3
)

Remove words from a vector of words found in another vector of words.

Description

Remove words from a vector of words found in another vector of words.

Usage

str_rm_words(x, y = stopwords)

Arguments

x

A vector of words.

y

A vector of words to delete from x, defaults to English stop words.

Value

x, with the words found in y removed.

Examples

str_rm_words(
x = c("a", "dog", "went", "to", "the", "store"),
y = stopwords
)

str_rm_words(
x = c("a", "dog", "went", "to", "the", "store"),
y = c("dog", "store")
)

Remove words from a vector based on the number of characters in each word.

Description

Remove words from a vector based on the number of characters in each word.

Usage

str_rm_words_by_length(x, min_char_length = 0, max_char_length = Inf)

Arguments

x

A vector of words.

min_char_length

An integer, the minimum number of characters a word must have to not be removed.

max_char_length

An integer, the maximum number of characters a word can have to not be removed.

Value

x, with the words that have fewer than min_char_length or more than max_char_length characters removed.

Examples

str_rm_words_by_length(
x = c("a", "dog", "went", "to", "the", "store"),
min_char_length = 3
)
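
The length filter corresponds to a pair of inclusive bounds on nchar(). A base R sketch using the documented defaults as the bounds (the variable names are illustrative):

```r
# Base R sketch of filtering words by character length.
x <- c("a", "dog", "went", "to", "the", "store")
min_char_length <- 3
max_char_length <- Inf

n_char <- nchar(x)

# Keep words whose length lies within both bounds, inclusive.
kept <- x[n_char >= min_char_length & n_char <= max_char_length]

kept  # "dog" "went" "the" "store"
```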

Create a vector of English words associated with particular parts of speech.

Description

Create a vector of English words associated with particular parts of speech.

Usage

str_stopwords_by_part_of_speech(
  parts = c("adjective", "adverb", "conjunction", "definite article", "interjection",
    "noun", "noun phrase", "plural", "preposition", "pronoun", "verb (intransitive)",
    "verb (transitive)", "verb (usu participle)"),
  include_multi_word = FALSE
)

Arguments

parts

A vector, at least one of the following: 'adjective', 'adverb', 'conjunction', 'definite article', 'interjection', 'noun', 'noun phrase', 'plural', 'preposition', 'pronoun', 'verb (intransitive)', 'verb (transitive)', 'verb (usu participle)'

include_multi_word

TRUE/FALSE, if TRUE, includes records from the pos data.table where multi_word == TRUE, otherwise excludes these records.

Value

A vector of words matching the part of speech shown in the data.table pos.

Examples

str_stopwords_by_part_of_speech('adjective')

Calls base::tolower(), which converts letters to lowercase. Only included to point out that base::tolower exists and should be used directly.

Description

Calls base::tolower(), which converts letters to lowercase. Only included to point out that base::tolower exists and should be used directly.

Usage

str_tolower(x)

Arguments

x

A vector or string.

Value

x, converted to lowercase.

Examples

str_tolower(c("ALLCAPS", "Some capS"))

Weighted count of the words in a vector that are found in another vector.

Description

Weighted count of the words in a vector that are found in another vector.

Usage

str_weighted_count_match(x, y)

Arguments

x

A list of words and counts created by str_counts(x).

y

A list of words and counts created by str_counts(y).

Value

A number, the count of words in x also in y scaled by the number of times each word appears in x and y. If a word appears 3 times in x and 2 times in y, the result is 6, assuming no other words match.

Examples

str_weighted_count_match(
x = str_counts(c("a", "dog", "dog", "went", "to", "the", "store")),
y = str_counts(c("dog", "dog", "dog"))
)
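
Following the Value description, the weighted count can be read as the sum, over words shared by x and y, of the count in x times the count in y. The base R sketch below implements that reading with table(); it is an interpretation of the documentation, not the package's code.

```r
# Base R sketch of a weighted match count:
# sum over shared words of (count in x) * (count in y).
x <- c("a", "dog", "dog", "went", "to", "the", "store")
y <- c("dog", "dog", "dog")

tx <- table(x)
ty <- table(y)

# Words present in both vectors.
shared <- intersect(names(tx), names(ty))

weighted <- sum(as.integer(tx[shared]) * as.integer(ty[shared]))

weighted  # "dog" appears 2 times in x and 3 times in y: 2 * 3 = 6
```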