This function allows you to clean your data according to pre-defined rules encapsulated in either a data frame or list of data frames. It has application for addressing mis-spellings and recoding variables (e.g. from electronic survey data).

match_df(
  x = data.frame(),
  dictionary = list(),
  from = 1,
  to = 2,
  by = 3,
  order = NULL,
  warn = FALSE
)

Arguments

x

a character or factor vector

dictionary

a data frame or named list of data frames with at least two columns defining the word list to be used. If this is a data frame, a third column must be present to split the dictionary by column in x (see by).

from

a column name or position defining words or keys to be replaced

to

a column name or position defining replacement values

by

character or integer. If dictionary is a data frame, then this column in defines the columns in x corresponding to each section of the dictionary data frame. This defaults to 3, indicating the third column is to be used.

order

a character the column to be used for sorting the values in each data frame. If the incoming variables are factors, this determines how the resulting factors will be sorted.

warn

if TRUE, warnings and errors from match_vec() will be shown as a single warning. Defaults to FALSE, which shows nothing.

Value

a data frame with re-defined data based on the dictionary

Details

By default, this applies the function match_vec() to all columns specified by the column names listed in by, or, if a global dictionary is used, this includes all character and factor columns as well.

by column

Spelling variables within dictionary represent keys that you want to match to column names in x (the data set). These are expected to match exactly with the exception of two reserved keywords that starts with a full stop:

  • .regex [pattern]: any column whose name is matched by [pattern]. The [pattern] should be an unquoted, valid, PERL-flavored regular expression.

  • .global: any column (see Section Global dictionary)

Global dictionary

A global dictionary is a set of definitions applied to all valid columns of x indiscriminantly.

  • .global keyword in by: If you want to apply a set of definitions to all valid columns in addition to specified columns, then you can include a .global group in the by column of your dictionary data frame. This is useful for setting up a dictionary of common spelling errors. NOTE: specific variable definitions will override global defintions. For example: if you have a column for cardinal directions and a definiton for N = North, then the global variable N = no will not override that. See Example.

  • by = NULL: If you want your data frame to be applied to all character/factor columns indiscriminantly, then setting by = NULL will use that dictionary globally.

See also

match_vec(), which this function wraps.

Examples

# Read in dictionary and coded date examples -------------------- dict <- read.csv(matchmaker_example("spelling-dictionary.csv"), stringsAsFactors = FALSE) dat <- read.csv(matchmaker_example("coded-data.csv"), stringsAsFactors = FALSE) dat$date <- as.Date(dat$date) # Clean spelling based on dictionary ----------------------------- dict # show the dict
#> options values grp orders #> 1 y Yes readmission 1 #> 2 n No readmission 2 #> 3 u Unknown readmission 3 #> 4 .missing Missing readmission 4 #> 5 0 Yes treated 1 #> 6 1 No treated 2 #> 7 .missing Missing treated 3 #> 8 1 Facility 1 facility 1 #> 9 2 Facility 2 facility 2 #> 10 3 Facility 3 facility 3 #> 11 4 Facility 4 facility 4 #> 12 5 Facility 5 facility 5 #> 13 6 Facility 6 facility 6 #> 14 7 Facility 7 facility 7 #> 15 8 Facility 8 facility 8 #> 16 9 Facility 9 facility 9 #> 17 10 Facility 10 facility 10 #> 18 .default Unknown facility 11 #> 19 0 0-9 age_group 1 #> 20 10 10-19 age_group 2 #> 21 20 20-29 age_group 3 #> 22 30 30-39 age_group 4 #> 23 40 40-49 age_group 5 #> 24 50 50+ age_group 6 #> 25 high High .regex ^lab_result_ 1 #> 26 norm Normal .regex ^lab_result_ 2 #> 27 inc Inconclusive .regex ^lab_result_ 3 #> 28 y yes .global Inf #> 29 n no .global Inf #> 30 u unknown .global Inf #> 31 unk unknown .global Inf #> 32 oui yes .global Inf #> 33 .missing missing .global Inf
head(dat) # show the data
#> id date readmission treated facility age_group lab_result_01 #> 1 ef267c 2019-07-08 <NA> 0 C 10 unk #> 2 e80a37 2019-07-07 y 0 3 10 inc #> 3 b72883 2019-07-07 y 1 8 30 inc #> 4 c9ee86 2019-07-09 n 1 4 40 inc #> 5 40bc7a 2019-07-12 n 1 6 0 norm #> 6 46566e 2019-07-14 y NA B 50 unk #> lab_result_02 lab_result_03 has_symptoms followup #> 1 high inc <NA> u #> 2 unk norm y oui #> 3 norm inc oui #> 4 inc unk y oui #> 5 unk norm <NA> n #> 6 unk inc <NA> <NA>
res1 <- match_df(dat, dictionary = dict, from = "options", to = "values", by = "grp") head(res1)
#> id date readmission treated facility age_group lab_result_01 #> 1 ef267c 2019-07-08 Missing Yes Unknown 10-19 unknown #> 2 e80a37 2019-07-07 Yes Yes Facility 3 10-19 Inconclusive #> 3 b72883 2019-07-07 Yes No Facility 8 30-39 Inconclusive #> 4 c9ee86 2019-07-09 No No Facility 4 40-49 Inconclusive #> 5 40bc7a 2019-07-12 No No Facility 6 0-9 Normal #> 6 46566e 2019-07-14 Yes Missing Unknown 50+ unknown #> lab_result_02 lab_result_03 has_symptoms followup #> 1 High Inconclusive missing unknown #> 2 unknown Normal yes yes #> 3 Normal Inconclusive missing yes #> 4 Inconclusive unknown yes yes #> 5 unknown Normal missing no #> 6 unknown Inconclusive missing missing
# Show warnings/errors from each column -------------------------- # Internally, the `match_vec()` function can be quite noisy with warnings for # various reasons. Thus, by default, the `match_df()` function will keep # these quiet, but you can have them printed to your console if you use the # warn = TRUE option: res1 <- match_df(dat, dictionary = dict, from = "options", to = "values", by = "grp", warn = TRUE)
#>
#> ── Warnings were found in the following columns ──
#>
#> ● age_group
#> 1. ⚠ None of the variables in `age_group` were found in the global #> dictionary. Did you use the correct dictionary?
#> ● facility
#> 1. ⚠ None of the variables in `facility` were found in the global #> dictionary. Did you use the correct dictionary?
#> 2. ⚠ 'A', 'B', 'C' were changed to the default value ('Unknown')
#> ● readmission
#> 1. ⚠ None of the variables in `readmission` were found in the global #> dictionary. Did you use the correct dictionary?
#> ● treated
#> 1. ⚠ None of the variables in `treated` were found in the global #> dictionary. Did you use the correct dictionary?
#> ● id
#> 1. ⚠ None of the variables in `id` were found in `dict`. Did you use the #> correct dictionary?
head(res1)
#> id date readmission treated facility age_group lab_result_01 #> 1 ef267c 2019-07-08 Missing Yes Unknown 10-19 unknown #> 2 e80a37 2019-07-07 Yes Yes Facility 3 10-19 Inconclusive #> 3 b72883 2019-07-07 Yes No Facility 8 30-39 Inconclusive #> 4 c9ee86 2019-07-09 No No Facility 4 40-49 Inconclusive #> 5 40bc7a 2019-07-12 No No Facility 6 0-9 Normal #> 6 46566e 2019-07-14 Yes Missing Unknown 50+ unknown #> lab_result_02 lab_result_03 has_symptoms followup #> 1 High Inconclusive missing unknown #> 2 unknown Normal yes yes #> 3 Normal Inconclusive missing yes #> 4 Inconclusive unknown yes yes #> 5 unknown Normal missing no #> 6 unknown Inconclusive missing missing
# You can ensure the order of the factors are correct by specifying # a column that defines order. dat[] <- lapply(dat, as.factor) as.list(head(dat))
#> $id #> [1] ef267c e80a37 b72883 c9ee86 40bc7a 46566e #> 50 Levels: 00220d 0ab27b 0f4fd2 0ff1e1 1513bb 174713 1793eb 25be8f ... fa5389 #> #> $date #> [1] 2019-07-08 2019-07-07 2019-07-07 2019-07-09 2019-07-12 2019-07-14 #> 10 Levels: 2019-07-07 2019-07-08 2019-07-09 2019-07-10 ... 2019-07-16 #> #> $readmission #> [1] <NA> y y n n y #> Levels: n u y #> #> $treated #> [1] 0 0 1 1 1 <NA> #> Levels: 0 1 #> #> $facility #> [1] C 3 8 4 6 B #> Levels: 1 10 2 3 4 5 6 7 8 9 A B C #> #> $age_group #> [1] 10 10 30 40 0 50 #> Levels: 0 10 20 30 40 50 #> #> $lab_result_01 #> [1] unk inc inc inc norm unk #> Levels: high inc norm unk #> #> $lab_result_02 #> [1] high unk norm inc unk unk #> Levels: high inc norm unk #> #> $lab_result_03 #> [1] inc norm inc unk norm inc #> Levels: high inc norm unk #> #> $has_symptoms #> [1] <NA> y y <NA> <NA> #> Levels: n oui u unk y #> #> $followup #> [1] u oui oui oui n <NA> #> Levels: n oui u unk y #>
res2 <- match_df(dat, dictionary = dict, from = "options", to = "values", by = "grp", order = "orders") head(res2)
#> id date readmission treated facility age_group lab_result_01 #> 1 ef267c 2019-07-08 Missing Yes Unknown 10-19 unknown #> 2 e80a37 2019-07-07 Yes Yes Facility 3 10-19 Inconclusive #> 3 b72883 2019-07-07 Yes No Facility 8 30-39 Inconclusive #> 4 c9ee86 2019-07-09 No No Facility 4 40-49 Inconclusive #> 5 40bc7a 2019-07-12 No No Facility 6 0-9 Normal #> 6 46566e 2019-07-14 Yes Missing Unknown 50+ unknown #> lab_result_02 lab_result_03 has_symptoms followup #> 1 High Inconclusive missing unknown #> 2 unknown Normal yes yes #> 3 Normal Inconclusive missing yes #> 4 Inconclusive unknown yes yes #> 5 unknown Normal missing no #> 6 unknown Inconclusive missing missing
as.list(head(res2))
#> $id #> [1] ef267c e80a37 b72883 c9ee86 40bc7a 46566e #> 50 Levels: 00220d 0ab27b 0f4fd2 0ff1e1 1513bb 174713 1793eb 25be8f ... fa5389 #> #> $date #> [1] 2019-07-08 2019-07-07 2019-07-07 2019-07-09 2019-07-12 2019-07-14 #> 10 Levels: 2019-07-07 2019-07-08 2019-07-09 2019-07-10 ... 2019-07-16 #> #> $readmission #> [1] Missing Yes Yes No No Yes #> Levels: Yes No Unknown Missing #> #> $treated #> [1] Yes Yes No No No Missing #> Levels: Yes No Missing #> #> $facility #> [1] Unknown Facility 3 Facility 8 Facility 4 Facility 6 Unknown #> 11 Levels: Facility 1 Facility 2 Facility 3 Facility 4 ... Unknown #> #> $age_group #> [1] 10-19 10-19 30-39 40-49 0-9 50+ #> Levels: 0-9 10-19 20-29 30-39 40-49 50+ #> #> $lab_result_01 #> [1] unknown Inconclusive Inconclusive Inconclusive Normal #> [6] unknown #> Levels: High Normal Inconclusive unknown #> #> $lab_result_02 #> [1] High unknown Normal Inconclusive unknown #> [6] unknown #> Levels: High Normal Inconclusive unknown #> #> $lab_result_03 #> [1] Inconclusive Normal Inconclusive unknown Normal #> [6] Inconclusive #> Levels: High Normal Inconclusive unknown #> #> $has_symptoms #> [1] missing yes missing yes missing missing #> Levels: yes no unknown missing #> #> $followup #> [1] unknown yes yes yes no missing #> Levels: yes no unknown missing #>