This function provides an interface for forcats::fct_recode(), forcats::fct_explicit_na(), and forcats::fct_relevel() in such a way that a data wordlist can be imported from a data frame.
clean_spelling(
  x = character(),
  wordlist = data.frame(),
  from = 1,
  to = 2,
  quiet = FALSE,
  warn_default = TRUE,
  anchor_regex = TRUE
)
Argument | Description |
---|---|
x | a character or factor vector |
wordlist | a matrix or data frame defining mis-spelled words or keys in one column (from) and replacement values in another column (to) |
from | a column name or position defining words or keys to be replaced |
to | a column name or position defining replacement values |
quiet | a |
warn_default | a |
anchor_regex | a |
a vector of the same type as x with mis-spelled labels cleaned.
Note that factors will be arranged by the order presented in the data
wordlist; other levels will appear afterwards.
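The ordering rule above can be illustrated with a minimal base R sketch. It does not call the package; the vector `wordlist_order` is a hypothetical stand-in for the order in which replacement values appear in a wordlist.

```r
# Hypothetical data: a factor whose levels start out in alphabetical order
x <- factor(c("b", "c", "a", "d"))

# Hypothetical order of values as they would appear in a wordlist
wordlist_order <- c("c", "a")

# Levels named in the wordlist come first; the remaining levels follow
new_levels  <- c(wordlist_order, setdiff(levels(x), wordlist_order))
x_reordered <- factor(x, levels = new_levels)

levels(x_reordered)
```

Only the level order changes here; the values stored in the vector stay the same.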
Keys (from column)
The from column of the wordlist will contain the keys that you want to match in your current data set. These are expected to match exactly, with the exception of three reserved keywords that start with a full stop:
.regex [pattern]: will replace anything matching [pattern]. This is executed before any other replacements are made. The [pattern] should be an unquoted, valid, PERL-flavored regular expression. Any whitespace padding the regular expression is discarded.
.missing: replaces any missing values (see NOTE)
.default: replaces ALL values that are not defined in the wordlist and are not missing.
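As a rough illustration of how the three keywords interact, the following base R sketch applies them in the order described above. The function `apply_keys()` is hypothetical and written only for this example; it is not the package's implementation and ignores options such as anchor_regex.

```r
# Hypothetical helper, for illustration only
apply_keys <- function(x, from, to) {
  out      <- x
  matched  <- rep(FALSE, length(x))
  is_regex <- startsWith(from, ".regex ")
  is_plain <- !is_regex & !(from %in% c(".missing", ".default"))

  # 1. .regex keys run first; whitespace padding the pattern is discarded
  for (i in which(is_regex)) {
    pattern  <- trimws(sub("^\\.regex", "", from[i]))
    hit      <- !is.na(out) & grepl(pattern, out, perl = TRUE)
    out[hit] <- gsub(pattern, to[i], out[hit], perl = TRUE)
    matched  <- matched | hit
  }

  # 2. Plain keys must match the original value exactly
  idx        <- match(x, from[is_plain])
  exact      <- !is.na(idx) & !matched
  out[exact] <- to[is_plain][idx[exact]]
  matched    <- matched | exact

  # 3. .missing replaces values that were missing in the input
  if (".missing" %in% from) {
    out[is.na(x)] <- to[from == ".missing"][1]
  }

  # 4. .default replaces everything else that is not missing
  if (".default" %in% from) {
    out[!matched & !is.na(x)] <- to[from == ".default"][1]
  }
  out
}

x    <- c("foubar", "fubar", "unknown", NA, "a")
from <- c(".regex  f[ou][^m].+?r$ ", "unknown", ".missing", ".default")
to   <- c("foobar", "not known", "missing", "other")
result <- apply_keys(x, from, to)
result
```

The sketch tracks which elements were already matched so that the .default replacement only touches values no other key has claimed.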
The values in the to column replace their respective keys exactly as they are presented. There is currently one recognised keyword that can be placed in the to column of your wordlist:
.na: replaces keys with missing data. When used in combination with the .missing keyword (in column 1), it allows you to differentiate between explicit and implicit missing data.
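The distinction between explicit and implicit missing data can be shown with a small base R sketch. It assumes a hypothetical wordlist in which the key "unknown" maps to .na and the key .missing maps to "missing"; the code mimics that effect by hand rather than calling the package.

```r
x <- c("yes", "unknown", NA, "no")

# Values that were already NA in the input are the implicit missing data
was_missing <- is.na(x)

out <- x
# Effect of a ".na" value: the explicit code "unknown" becomes NA
out[!was_missing & x == "unknown"] <- NA
# Effect of a ".missing" key: values that were NA on input get a label
out[was_missing] <- "missing"

out
```

After recoding, NA marks responses that were explicitly coded as unknown, while "missing" marks entries that had no value at all.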
If there are any missing values in the from column (keys), then they are automatically converted to the character "NA" with a warning. If you want to target missing data with your wordlist, use the .missing keyword. The .regex keyword uses gsub() with the perl = TRUE option for replacement.
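Both points in this note can be sketched in base R. The vector `keys` and the pattern are made up for illustration, and the conversion shown only mimics the documented behaviour (without emitting the warning); it is not the package's own code.

```r
# A hypothetical from column that contains a missing key
keys <- c("foubar", NA, ".missing")

# Mimic the documented conversion: the missing key becomes the string "NA"
keys_chr <- keys
keys_chr[is.na(keys_chr)] <- "NA"
keys_chr

# gsub() with perl = TRUE accepts PERL-flavored syntax such as lookaheads
fixed <- gsub("fo+(?=bar)", "foo", c("fobar", "fooobar", "fob"), perl = TRUE)
fixed
```

Because the converted key is the literal string "NA", it would only match values spelled "NA" in the data, which is why the .missing keyword is the way to target true missing values.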
matchmaker::match_vec(), which this function wraps, and matchmaker::match_df() for an implementation that acts across multiple variables in a data frame.
corrections <- data.frame(
  bad = c("foubar", "foobr", "fubar", "unknown", ".missing"),
  good = c("foobar", "foobar", "foobar", ".na", "missing"),
  stringsAsFactors = FALSE
)
corrections
#>        bad    good
#> 1   foubar  foobar
#> 2    foobr  foobar
#> 3    fubar  foobar
#> 4  unknown     .na
#> 5 .missing missing

# create some fake data
my_data <- c(letters[1:5], sample(corrections$bad[-5], 10, replace = TRUE))
my_data[sample(6:15, 2)] <- NA # with missing elements
clean_spelling(my_data, corrections)
#>  [1] "a"       "b"       "c"       "d"       "e"       "missing" "foobar" 
#>  [8] NA        "foobar"  "foobar"  NA        "foobar"  "foobar"  "missing"
#> [15] "foobar" 

# You can use regular expressions to simplify your list
corrections <- data.frame(
  bad = c(".regex f[ou][^m].+?r$", "unknown", ".missing"),
  good = c("foobar", ".na", "missing"),
  stringsAsFactors = FALSE
)

# You can also set a default value
corrections_with_default <- rbind(corrections, c(bad = ".default", good = "unknown"))
corrections_with_default
#>                     bad    good
#> 1 .regex f[ou][^m].+?r$  foobar
#> 2               unknown     .na
#> 3              .missing missing
#> 4              .default unknown

# a warning will be issued about the data that were converted
clean_spelling(my_data, corrections_with_default)
#> Warning: 'a', 'b', 'c', 'd', 'e' were changed to the default value ('unknown')
#>  [1] "unknown" "unknown" "unknown" "unknown" "unknown" "missing" "foobar" 
#>  [8] NA        "foobar"  "foobar"  NA        "foobar"  "foobar"  "missing"
#> [15] "foobar" 

# use warn_default = FALSE if you are absolutely sure you don't want the warning
clean_spelling(my_data, corrections_with_default, warn_default = FALSE)
#>  [1] "unknown" "unknown" "unknown" "unknown" "unknown" "missing" "foobar" 
#>  [8] NA        "foobar"  "foobar"  NA        "foobar"  "foobar"  "missing"
#> [15] "foobar" 

# The function will give you a warning if the wordlist does not
# match the data
clean_spelling(letters, corrections)
#>  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
#> [20] "t" "u" "v" "w" "x" "y" "z"

# This can be used for translating survey output
words <- data.frame(
  option_code = c(".regex ^[yY][eE]?[sS]?", ".regex ^[nN][oO]?", ".regex ^[uU][nN]?[kK]?", ".missing"),
  option_name = c("Yes", "No", ".na", "Missing"),
  stringsAsFactors = FALSE
)
clean_spelling(c("Y", "Y", NA, "No", "U", "UNK", "N"), words)
#> [1] "Yes"     "Yes"     "Missing" "No"      NA        NA        "No"     