Rename values in a vector based on a wordlist

This function provides an interface to forcats::fct_recode(), forcats::fct_explicit_na(), and forcats::fct_relevel() so that the wordlist of replacements can be supplied as a data frame.

clean_spelling(
  x = character(),
  wordlist = data.frame(),
  from = 1,
  to = 2,
  quiet = FALSE,
  warn_default = TRUE,
  anchor_regex = TRUE
)

Arguments

x	a character or factor vector
wordlist	a matrix or data frame defining mis-spelled words or keys in one column (`from`) and replacement values (`to`) in another column. There are keywords that can be appended to the `from` column for addressing default values and missing data.
from	a column name or position defining words or keys to be replaced
to	a column name or position defining replacement values
quiet	a `logical` indicating if warnings should be issued if no replacement is made; if `TRUE`, these warnings will be disabled. Defaults to `FALSE`.
warn_default	a `logical`. When a `.default` keyword is set and `warn_default = TRUE`, a warning will be issued listing the variables that were changed to the default value. This can be used to update your wordlist.
anchor_regex	a `logical`. When `TRUE` (default), any regex within the keyword

Value

a vector of the same type as x with mis-spelled labels cleaned. Note that factors will be arranged by the order presented in the data wordlist; other levels will appear afterwards.
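The level ordering described above can be illustrated with base R alone. This is only a sketch of the resulting order, not the package's code; `wordlist_order` is a hypothetical stand-in for the replacement values in the order they appear in the wordlist.

```r
# A factor whose levels default to alphabetical order
f <- factor(c("b", "foobar", "a", "unknown", "foobar"))
levels(f)
#> [1] "a"       "b"       "foobar"  "unknown"

# Hypothetical order of the replacement values in a wordlist
wordlist_order <- c("unknown", "foobar")

# Wordlist values first, all remaining levels afterwards
new_levels <- c(wordlist_order, setdiff(levels(f), wordlist_order))
f2 <- factor(f, levels = new_levels)
levels(f2)
#> [1] "unknown" "foobar"  "a"       "b"
```

The values themselves are unchanged; only the order of the levels differs.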

Details

Keys (`from` column)

The from column of the wordlist will contain the keys that you want to match in your current data set. These are expected to match exactly with the exception of three reserved keywords that start with a full stop:

.regex [pattern]: will replace anything matching [pattern]. This is executed before any other replacements are made. The [pattern] should be an unquoted, valid, PERL-flavored regular expression. Any whitespace padding the regular expression is discarded.
.missing: replaces any missing values (see NOTE)
.default: replaces ALL values that are not defined in the wordlist and are not missing.
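To make the interplay of the three keywords concrete, here is a minimal base R sketch that mirrors the documented behaviour for character vectors. It is an assumption-laden illustration, not the function's actual implementation: `sketch_recode()` and its arguments are hypothetical, it ignores factors and the `anchor_regex` argument, and it emits no warnings.

```r
# Hypothetical helper that only mimics the documented keyword semantics
sketch_recode <- function(x, keys, values) {
  out         <- x
  was_missing <- is.na(x)
  touched     <- was_missing  # values already handled by some rule

  # 1. `.regex` keys run before any other replacement;
  #    whitespace padding the pattern is discarded
  is_rx <- grepl("^\\.regex ", keys)
  for (i in which(is_rx)) {
    pattern  <- trimws(sub("^\\.regex ", "", keys[i]))
    hit      <- !was_missing & grepl(pattern, out, perl = TRUE)
    out[hit] <- gsub(pattern, values[i], out[hit], perl = TRUE)
    touched  <- touched | hit
  }

  # 2. ordinary keys must match exactly
  plain    <- which(!is_rx & !keys %in% c(".missing", ".default"))
  idx      <- match(out, keys[plain])
  hit      <- !is.na(idx)
  out[hit] <- values[plain][idx[hit]]
  touched  <- touched | hit

  # 3. `.default` catches everything not defined above and not missing
  d <- match(".default", keys)
  if (!is.na(d)) out[!touched] <- values[d]

  # 4. `.missing` replaces values that were missing in the input
  m <- match(".missing", keys)
  if (!is.na(m)) out[was_missing] <- values[m]

  out
}

keys   <- c(".regex f[ou][^m].+?r$", "typo", ".missing", ".default")
values <- c("foobar", "fixed", "missing", "other")
sketch_recode(c("a", "foubar", "typo", NA), keys, values)
#> [1] "other"   "foobar"  "fixed"   "missing"
```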

Values (`to` column)

The values will replace their respective keys exactly as they are presented.

There is currently one recognised keyword that can be placed in the to column of your wordlist:

.na: Replace keys with missing data. When used in combination with the .missing keyword (in the from column), it can allow you to differentiate between explicit and implicit missing data.

Note

If there are any missing values in the from column (keys), then they are automatically converted to the character "NA" with a warning. If you want to target missing data with your wordlist, use the .missing keyword. The .regex keyword uses gsub() with the perl = TRUE option for replacement.
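As a small illustration of the replacement call mentioned in the note, the pattern used in the Examples can be tried directly with gsub(). The input vector `x` is made up for this sketch.

```r
# Made-up input: three misspellings and one unrelated word
x <- c("foubar", "foobr", "fubar", "fumble")

# Same call style as noted above: gsub() with PERL-flavored regex
cleaned <- gsub("f[ou][^m].+?r$", "foobar", x, perl = TRUE)
cleaned
#> [1] "foobar" "foobar" "foobar" "fumble"
```

"fumble" is left alone because its third character is "m", which `[^m]` excludes.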

Examples


corrections <- data.frame(
  bad = c("foubar", "foobr", "fubar", "unknown", ".missing"),
  good = c("foobar", "foobar", "foobar", ".na", "missing"),
  stringsAsFactors = FALSE
)
corrections
#>        bad    good
#> 1   foubar  foobar
#> 2    foobr  foobar
#> 3    fubar  foobar
#> 4  unknown     .na
#> 5 .missing missing

# create some fake data
my_data <- c(letters[1:5], sample(corrections$bad[-5], 10, replace = TRUE))
my_data[sample(6:15, 2)] <- NA  # with missing elements

clean_spelling(my_data, corrections)
#>  [1] "a"       "b"       "c"       "d"       "e"       "missing" "foobar" 
#>  [8] NA        "foobar"  "foobar"  NA        "foobar"  "foobar"  "missing"
#> [15] "foobar" 

# You can use regular expressions to simplify your list
corrections <- data.frame(
  bad =  c(".regex f[ou][^m].+?r$", "unknown", ".missing"),
  good = c("foobar",                ".na",     "missing"),
  stringsAsFactors = FALSE
)

# You can also set a default value
corrections_with_default <- rbind(corrections, c(bad = ".default", good = "unknown"))
corrections_with_default
#>                     bad    good
#> 1 .regex f[ou][^m].+?r$  foobar
#> 2               unknown     .na
#> 3              .missing missing
#> 4              .default unknown

# a warning will be issued about the data that were converted
clean_spelling(my_data, corrections_with_default)
#> Warning: 'a', 'b', 'c', 'd', 'e' were changed to the default value ('unknown')
#>  [1] "unknown" "unknown" "unknown" "unknown" "unknown" "missing" "foobar" 
#>  [8] NA        "foobar"  "foobar"  NA        "foobar"  "foobar"  "missing"
#> [15] "foobar" 

# use warn_default = FALSE if you are absolutely sure you don't want the warning.
clean_spelling(my_data, corrections_with_default, warn_default = FALSE)
#>  [1] "unknown" "unknown" "unknown" "unknown" "unknown" "missing" "foobar" 
#>  [8] NA        "foobar"  "foobar"  NA        "foobar"  "foobar"  "missing"
#> [15] "foobar" 

# The function will give you a warning if the wordlist does not
# match the data
clean_spelling(letters, corrections)
#>  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
#> [20] "t" "u" "v" "w" "x" "y" "z"

# This can be used for translating survey output

words <- data.frame(
  option_code = c(".regex ^[yY][eE]?[sS]?",
                  ".regex ^[nN][oO]?",
                  ".regex ^[uU][nN]?[kK]?",
                  ".missing"),
  option_name = c("Yes", "No", ".na", "Missing"),
  stringsAsFactors = FALSE
)
clean_spelling(c("Y", "Y", NA, "No", "U", "UNK", "N"), words)
#> [1] "Yes"     "Yes"     "Missing" "No"      NA        NA        "No"
