This function allows you to clean your data according to pre-defined rules encapsulated in either a data frame or list of data frames. It has application for addressing mis-spellings and recoding variables (e.g. from electronic survey data).

clean_variable_spelling(
  x = data.frame(),
  wordlists = list(),
  from = 1,
  to = 2,
  spelling_vars = 3,
  sort_by = NULL,
  classes = NULL,
  warn = FALSE
)

Arguments

x

a data.frame

wordlists

a data frame or named list of data frames with at least two columns defining the word list to be used. If this is a data frame, a third column must be present to split the wordlists by column in x (see spelling_vars).

from

a column name or position defining words or keys to be replaced

to

a column name or position defining replacement values

spelling_vars

character or integer. If wordlists is a data frame, then this column in defines the columns in x corresponding to each section of the wordlists data frame. This defaults to 3, indicating the third column is to be used.

sort_by

a character the column to be used for sorting the values in each data frame. If the incoming variables are factors, this determines how the resulting factors will be sorted.

classes

a vector of class definitions for each of the columns. If this is not provided, the classes will be read from the columns themselves. Practically, this is used in clean_data() to mark columns as protected.

warn

if TRUE, warnings and errors from clean_spelling() will be shown as a single warning. Defaults to FALSE, which shows nothing.

Value

a data frame with re-defined data based on the dictionary

Details

By default, this applies the function clean_spelling() to all columns specified by the column names listed in spelling_vars, or, if a global dictionary is used, this includes all character and factor columns as well.

spelling_vars

Spelling variables within wordlists represent keys that you want to match to column names in x (the data set). These are expected to match exactly with the exception of two reserved keywords that starts with a full stop:

  • .regex [pattern]: any column whose name is matched by [pattern]. The [pattern] should be an unquoted, valid, PERL-flavored regular expression.

  • .global: any column (see Section Global wordlists)

Global wordlists

A global wordlist is a set of definitions applied to all valid columns of x indiscriminantly.

  • .global spelling_var: If you want to apply a set of definitions to all valid columns in addition to specified columns, then you can include a .global group in the spelling_var column of your wordlists data frame. This is useful for setting up a dictionary of common spelling errors. NOTE: specific variable definitions will override global defintions. For example: if you have a column for cardinal directions and a definiton for N = North, then the global variable N = no will not override that. See Example.

  • spelling_vars = NULL: If you want your data frame to be applied to all character/factor columns indiscriminantly, then setting spelling_vars = NULL will use that wordlist globally.

Note

This function will only parse character and factor columns to protect numeric and Date columns from conversion to character.

See also

matchmaker::match_df(), which this function wraps.

Examples

# Read in dictionary and coded date examples -------------------- wordlist <- read.csv(linelist_example("spelling-dictionary.csv"), stringsAsFactors = FALSE) dat <- read.csv(linelist_example("coded-data.csv"), stringsAsFactors = FALSE) dat$date <- as.Date(dat$date) # Clean spelling based on wordlist ------------------------------ wordlist # show the wordlist
#> options values grp orders #> 1 y Yes readmission 1 #> 2 n No readmission 2 #> 3 u Unknown readmission 3 #> 4 .missing Missing readmission 4 #> 5 0 Yes treated 1 #> 6 1 No treated 2 #> 7 .missing Missing treated 3 #> 8 1 Facility 1 facility 1 #> 9 2 Facility 2 facility 2 #> 10 3 Facility 3 facility 3 #> 11 4 Facility 4 facility 4 #> 12 5 Facility 5 facility 5 #> 13 6 Facility 6 facility 6 #> 14 7 Facility 7 facility 7 #> 15 8 Facility 8 facility 8 #> 16 9 Facility 9 facility 9 #> 17 10 Facility 10 facility 10 #> 18 .default Unknown facility 11 #> 19 0 0-9 age_group 1 #> 20 10 10-19 age_group 2 #> 21 20 20-29 age_group 3 #> 22 30 30-39 age_group 4 #> 23 40 40-49 age_group 5 #> 24 50 50+ age_group 6 #> 25 high High .regex ^lab_result_ 1 #> 26 norm Normal .regex ^lab_result_ 2 #> 27 inc Inconclusive .regex ^lab_result_ 3 #> 28 y yes .global Inf #> 29 n no .global Inf #> 30 u unknown .global Inf #> 31 unk unknown .global Inf #> 32 oui yes .global Inf #> 33 .missing missing .global Inf
head(dat) # show the data
#> id date readmission treated facility age_group lab_result_01 #> 1 ef267c 2019-07-08 <NA> 0 C 10 unk #> 2 e80a37 2019-07-07 y 0 3 10 inc #> 3 b72883 2019-07-07 y 1 8 30 inc #> 4 c9ee86 2019-07-09 n 1 4 40 inc #> 5 40bc7a 2019-07-12 n 1 6 0 norm #> 6 46566e 2019-07-14 y NA B 50 unk #> lab_result_02 lab_result_03 has_symptoms followup #> 1 high inc <NA> u #> 2 unk norm y oui #> 3 norm inc oui #> 4 inc unk y oui #> 5 unk norm <NA> n #> 6 unk inc <NA> <NA>
res1 <- clean_variable_spelling(dat, wordlists = wordlist, from = "options", to = "values", spelling_vars = "grp") head(res1)
#> id date readmission treated facility age_group lab_result_01 #> 1 ef267c 2019-07-08 Missing Yes Unknown 10-19 unknown #> 2 e80a37 2019-07-07 Yes Yes Facility 3 10-19 Inconclusive #> 3 b72883 2019-07-07 Yes No Facility 8 30-39 Inconclusive #> 4 c9ee86 2019-07-09 No No Facility 4 40-49 Inconclusive #> 5 40bc7a 2019-07-12 No No Facility 6 0-9 Normal #> 6 46566e 2019-07-14 Yes Missing Unknown 50+ unknown #> lab_result_02 lab_result_03 has_symptoms followup #> 1 High Inconclusive missing unknown #> 2 unknown Normal yes yes #> 3 Normal Inconclusive missing yes #> 4 Inconclusive unknown yes yes #> 5 unknown Normal missing no #> 6 unknown Inconclusive missing missing
# You can ensure the order of the factors are correct by specifying # a column that defines order. dat[] <- lapply(dat, as.factor) as.list(head(dat))
#> $id #> [1] ef267c e80a37 b72883 c9ee86 40bc7a 46566e #> 50 Levels: 00220d 0ab27b 0f4fd2 0ff1e1 1513bb 174713 1793eb 25be8f ... fa5389 #> #> $date #> [1] 2019-07-08 2019-07-07 2019-07-07 2019-07-09 2019-07-12 2019-07-14 #> 10 Levels: 2019-07-07 2019-07-08 2019-07-09 2019-07-10 ... 2019-07-16 #> #> $readmission #> [1] <NA> y y n n y #> Levels: n u y #> #> $treated #> [1] 0 0 1 1 1 <NA> #> Levels: 0 1 #> #> $facility #> [1] C 3 8 4 6 B #> Levels: 1 10 2 3 4 5 6 7 8 9 A B C #> #> $age_group #> [1] 10 10 30 40 0 50 #> Levels: 0 10 20 30 40 50 #> #> $lab_result_01 #> [1] unk inc inc inc norm unk #> Levels: high inc norm unk #> #> $lab_result_02 #> [1] high unk norm inc unk unk #> Levels: high inc norm unk #> #> $lab_result_03 #> [1] inc norm inc unk norm inc #> Levels: high inc norm unk #> #> $has_symptoms #> [1] <NA> y y <NA> <NA> #> Levels: n oui u unk y #> #> $followup #> [1] u oui oui oui n <NA> #> Levels: n oui u unk y #>
res2 <- clean_variable_spelling(dat, wordlists = wordlist, from = "options", to = "values", spelling_vars = "grp", sort_by = "orders") head(res2)
#> id date readmission treated facility age_group lab_result_01 #> 1 ef267c 2019-07-08 Missing Yes Unknown 10-19 unknown #> 2 e80a37 2019-07-07 Yes Yes Facility 3 10-19 Inconclusive #> 3 b72883 2019-07-07 Yes No Facility 8 30-39 Inconclusive #> 4 c9ee86 2019-07-09 No No Facility 4 40-49 Inconclusive #> 5 40bc7a 2019-07-12 No No Facility 6 0-9 Normal #> 6 46566e 2019-07-14 Yes Missing Unknown 50+ unknown #> lab_result_02 lab_result_03 has_symptoms followup #> 1 High Inconclusive missing unknown #> 2 unknown Normal yes yes #> 3 Normal Inconclusive missing yes #> 4 Inconclusive unknown yes yes #> 5 unknown Normal missing no #> 6 unknown Inconclusive missing missing
as.list(head(res2))
#> $id #> [1] ef267c e80a37 b72883 c9ee86 40bc7a 46566e #> 50 Levels: 00220d 0ab27b 0f4fd2 0ff1e1 1513bb 174713 1793eb 25be8f ... fa5389 #> #> $date #> [1] 2019-07-08 2019-07-07 2019-07-07 2019-07-09 2019-07-12 2019-07-14 #> 10 Levels: 2019-07-07 2019-07-08 2019-07-09 2019-07-10 ... 2019-07-16 #> #> $readmission #> [1] Missing Yes Yes No No Yes #> Levels: Yes No Unknown Missing #> #> $treated #> [1] Yes Yes No No No Missing #> Levels: Yes No Missing #> #> $facility #> [1] Unknown Facility 3 Facility 8 Facility 4 Facility 6 Unknown #> 11 Levels: Facility 1 Facility 2 Facility 3 Facility 4 ... Unknown #> #> $age_group #> [1] 10-19 10-19 30-39 40-49 0-9 50+ #> Levels: 0-9 10-19 20-29 30-39 40-49 50+ #> #> $lab_result_01 #> [1] unknown Inconclusive Inconclusive Inconclusive Normal #> [6] unknown #> Levels: High Normal Inconclusive unknown #> #> $lab_result_02 #> [1] High unknown Normal Inconclusive unknown #> [6] unknown #> Levels: High Normal Inconclusive unknown #> #> $lab_result_03 #> [1] Inconclusive Normal Inconclusive unknown Normal #> [6] Inconclusive #> Levels: High Normal Inconclusive unknown #> #> $has_symptoms #> [1] missing yes missing yes missing missing #> Levels: yes no unknown missing #> #> $followup #> [1] unknown yes yes yes no missing #> Levels: yes no unknown missing #>