R/match_df.R
match_df.Rd
This function allows you to clean your data according to pre-defined rules encapsulated in either a data frame or list of data frames. It has application for addressing mis-spellings and recoding variables (e.g. from electronic survey data).
match_df( x = data.frame(), dictionary = list(), from = 1, to = 2, by = 3, order = NULL, warn = FALSE )
x | a character or factor vector |
---|---|
dictionary | a data frame or named list of data frames with at least two
columns defining the word list to be used. If this is a data frame, a third
column must be present to split the dictionary by column in |
from | a column name or position defining words or keys to be replaced |
to | a column name or position defining replacement values |
by | character or integer. If |
order | a character the column to be used for sorting the values in each data frame. If the incoming variables are factors, this determines how the resulting factors will be sorted. |
warn | if |
a data frame with re-defined data based on the dictionary
By default, this applies the function match_vec()
to all
columns specified by the column names listed in by
, or, if a
global dictionary is used, this includes all character
and factor
columns as well.
by
columnSpelling variables within dictionary
represent keys that you want to match
to column names in x
(the data set). These are expected to match exactly
with the exception of two reserved keywords that starts with a full stop:
.regex [pattern]
: any column whose name is matched by [pattern]
. The
[pattern]
should be an unquoted, valid, PERL-flavored regular expression.
.global
: any column (see Section Global dictionary)
A global dictionary is a set of definitions applied to all valid columns of
x
indiscriminantly.
.global keyword in by
: If you want to apply a set of definitions to
all valid columns in addition to specified columns, then you can include a
.global
group in the by
column of your dictionary
data frame. This is
useful for setting up a dictionary of common spelling errors. NOTE:
specific variable definitions will override global defintions. For
example: if you have a column for cardinal directions and a definiton for
N = North
, then the global variable N = no
will not override that. See
Example.
by = NULL
: If you want your data frame to be applied to
all character/factor columns indiscriminantly, then setting
by = NULL
will use that dictionary globally.
match_vec()
, which this function wraps.
# Read in dictionary and coded date examples -------------------- dict <- read.csv(matchmaker_example("spelling-dictionary.csv"), stringsAsFactors = FALSE) dat <- read.csv(matchmaker_example("coded-data.csv"), stringsAsFactors = FALSE) dat$date <- as.Date(dat$date) # Clean spelling based on dictionary ----------------------------- dict # show the dict#> options values grp orders #> 1 y Yes readmission 1 #> 2 n No readmission 2 #> 3 u Unknown readmission 3 #> 4 .missing Missing readmission 4 #> 5 0 Yes treated 1 #> 6 1 No treated 2 #> 7 .missing Missing treated 3 #> 8 1 Facility 1 facility 1 #> 9 2 Facility 2 facility 2 #> 10 3 Facility 3 facility 3 #> 11 4 Facility 4 facility 4 #> 12 5 Facility 5 facility 5 #> 13 6 Facility 6 facility 6 #> 14 7 Facility 7 facility 7 #> 15 8 Facility 8 facility 8 #> 16 9 Facility 9 facility 9 #> 17 10 Facility 10 facility 10 #> 18 .default Unknown facility 11 #> 19 0 0-9 age_group 1 #> 20 10 10-19 age_group 2 #> 21 20 20-29 age_group 3 #> 22 30 30-39 age_group 4 #> 23 40 40-49 age_group 5 #> 24 50 50+ age_group 6 #> 25 high High .regex ^lab_result_ 1 #> 26 norm Normal .regex ^lab_result_ 2 #> 27 inc Inconclusive .regex ^lab_result_ 3 #> 28 y yes .global Inf #> 29 n no .global Inf #> 30 u unknown .global Inf #> 31 unk unknown .global Inf #> 32 oui yes .global Inf #> 33 .missing missing .global Inf#> id date readmission treated facility age_group lab_result_01 #> 1 ef267c 2019-07-08 <NA> 0 C 10 unk #> 2 e80a37 2019-07-07 y 0 3 10 inc #> 3 b72883 2019-07-07 y 1 8 30 inc #> 4 c9ee86 2019-07-09 n 1 4 40 inc #> 5 40bc7a 2019-07-12 n 1 6 0 norm #> 6 46566e 2019-07-14 y NA B 50 unk #> lab_result_02 lab_result_03 has_symptoms followup #> 1 high inc <NA> u #> 2 unk norm y oui #> 3 norm inc oui #> 4 inc unk y oui #> 5 unk norm <NA> n #> 6 unk inc <NA> <NA>#> id date readmission treated facility age_group lab_result_01 #> 1 ef267c 2019-07-08 Missing Yes Unknown 10-19 unknown #> 2 e80a37 2019-07-07 Yes Yes Facility 3 10-19 Inconclusive #> 3 b72883 2019-07-07 Yes No Facility 8 30-39 Inconclusive #> 4 c9ee86 2019-07-09 No No Facility 4 40-49 Inconclusive #> 5 40bc7a 2019-07-12 No No Facility 6 0-9 Normal #> 6 46566e 2019-07-14 Yes Missing Unknown 50+ unknown #> lab_result_02 lab_result_03 has_symptoms followup #> 1 High Inconclusive missing unknown #> 2 unknown Normal yes yes #> 3 Normal Inconclusive missing yes #> 4 Inconclusive unknown yes yes #> 5 unknown Normal missing no #> 6 unknown Inconclusive missing missing# Show warnings/errors from each column -------------------------- # Internally, the `match_vec()` function can be quite noisy with warnings for # various reasons. Thus, by default, the `match_df()` function will keep # these quiet, but you can have them printed to your console if you use the # warn = TRUE option: res1 <- match_df(dat, dictionary = dict, from = "options", to = "values", by = "grp", warn = TRUE)#>#>#>#>#> ⚠ None of the variables in `age_group` were found in the global #> dictionary. Did you use the correct dictionary?#>#> ⚠ None of the variables in `facility` were found in the global #> dictionary. Did you use the correct dictionary?#> ⚠ 'A', 'B', 'C' were changed to the default value ('Unknown')#>#> ⚠ None of the variables in `readmission` were found in the global #> dictionary. Did you use the correct dictionary?#>#> ⚠ None of the variables in `treated` were found in the global #> dictionary. Did you use the correct dictionary?#>#> ⚠ None of the variables in `id` were found in `dict`. Did you use the #> correct dictionary?head(res1)#> id date readmission treated facility age_group lab_result_01 #> 1 ef267c 2019-07-08 Missing Yes Unknown 10-19 unknown #> 2 e80a37 2019-07-07 Yes Yes Facility 3 10-19 Inconclusive #> 3 b72883 2019-07-07 Yes No Facility 8 30-39 Inconclusive #> 4 c9ee86 2019-07-09 No No Facility 4 40-49 Inconclusive #> 5 40bc7a 2019-07-12 No No Facility 6 0-9 Normal #> 6 46566e 2019-07-14 Yes Missing Unknown 50+ unknown #> lab_result_02 lab_result_03 has_symptoms followup #> 1 High Inconclusive missing unknown #> 2 unknown Normal yes yes #> 3 Normal Inconclusive missing yes #> 4 Inconclusive unknown yes yes #> 5 unknown Normal missing no #> 6 unknown Inconclusive missing missing# You can ensure the order of the factors are correct by specifying # a column that defines order. dat[] <- lapply(dat, as.factor) as.list(head(dat))#> $id #> [1] ef267c e80a37 b72883 c9ee86 40bc7a 46566e #> 50 Levels: 00220d 0ab27b 0f4fd2 0ff1e1 1513bb 174713 1793eb 25be8f ... fa5389 #> #> $date #> [1] 2019-07-08 2019-07-07 2019-07-07 2019-07-09 2019-07-12 2019-07-14 #> 10 Levels: 2019-07-07 2019-07-08 2019-07-09 2019-07-10 ... 2019-07-16 #> #> $readmission #> [1] <NA> y y n n y #> Levels: n u y #> #> $treated #> [1] 0 0 1 1 1 <NA> #> Levels: 0 1 #> #> $facility #> [1] C 3 8 4 6 B #> Levels: 1 10 2 3 4 5 6 7 8 9 A B C #> #> $age_group #> [1] 10 10 30 40 0 50 #> Levels: 0 10 20 30 40 50 #> #> $lab_result_01 #> [1] unk inc inc inc norm unk #> Levels: high inc norm unk #> #> $lab_result_02 #> [1] high unk norm inc unk unk #> Levels: high inc norm unk #> #> $lab_result_03 #> [1] inc norm inc unk norm inc #> Levels: high inc norm unk #> #> $has_symptoms #> [1] <NA> y y <NA> <NA> #> Levels: n oui u unk y #> #> $followup #> [1] u oui oui oui n <NA> #> Levels: n oui u unk y #>res2 <- match_df(dat, dictionary = dict, from = "options", to = "values", by = "grp", order = "orders") head(res2)#> id date readmission treated facility age_group lab_result_01 #> 1 ef267c 2019-07-08 Missing Yes Unknown 10-19 unknown #> 2 e80a37 2019-07-07 Yes Yes Facility 3 10-19 Inconclusive #> 3 b72883 2019-07-07 Yes No Facility 8 30-39 Inconclusive #> 4 c9ee86 2019-07-09 No No Facility 4 40-49 Inconclusive #> 5 40bc7a 2019-07-12 No No Facility 6 0-9 Normal #> 6 46566e 2019-07-14 Yes Missing Unknown 50+ unknown #> lab_result_02 lab_result_03 has_symptoms followup #> 1 High Inconclusive missing unknown #> 2 unknown Normal yes yes #> 3 Normal Inconclusive missing yes #> 4 Inconclusive unknown yes yes #> 5 unknown Normal missing no #> 6 unknown Inconclusive missing missing#> $id #> [1] ef267c e80a37 b72883 c9ee86 40bc7a 46566e #> 50 Levels: 00220d 0ab27b 0f4fd2 0ff1e1 1513bb 174713 1793eb 25be8f ... fa5389 #> #> $date #> [1] 2019-07-08 2019-07-07 2019-07-07 2019-07-09 2019-07-12 2019-07-14 #> 10 Levels: 2019-07-07 2019-07-08 2019-07-09 2019-07-10 ... 2019-07-16 #> #> $readmission #> [1] Missing Yes Yes No No Yes #> Levels: Yes No Unknown Missing #> #> $treated #> [1] Yes Yes No No No Missing #> Levels: Yes No Missing #> #> $facility #> [1] Unknown Facility 3 Facility 8 Facility 4 Facility 6 Unknown #> 11 Levels: Facility 1 Facility 2 Facility 3 Facility 4 ... Unknown #> #> $age_group #> [1] 10-19 10-19 30-39 40-49 0-9 50+ #> Levels: 0-9 10-19 20-29 30-39 40-49 50+ #> #> $lab_result_01 #> [1] unknown Inconclusive Inconclusive Inconclusive Normal #> [6] unknown #> Levels: High Normal Inconclusive unknown #> #> $lab_result_02 #> [1] High unknown Normal Inconclusive unknown #> [6] unknown #> Levels: High Normal Inconclusive unknown #> #> $lab_result_03 #> [1] Inconclusive Normal Inconclusive unknown Normal #> [6] Inconclusive #> Levels: High Normal Inconclusive unknown #> #> $has_symptoms #> [1] missing yes missing yes missing missing #> Levels: yes no unknown missing #> #> $followup #> [1] unknown yes yes yes no missing #> Levels: yes no unknown missing #>