The goal of {matchmaker} is to provide dictionary-based cleaning for R users in a simple and intuitive manner built on the {forcats} package. Some of the features of this package include:
The matchmaker package has two user-facing functions that perform dictionary-based cleaning:
- `match_vec()` will translate the values in a single vector
- `match_df()` will translate values in all specified columns of a data frame

Each of these functions has four mandatory options:

- `x`: your data. This will be a vector or data frame depending on the function.
- `dictionary`: a data frame with at least two columns specifying keys and values to modify
- `from`: a character or number specifying which column contains the keys
- `to`: a character or number specifying which column contains the values

Mostly, users will be working with `match_df()` to transform values across specific columns. A typical workflow would be to read in the data, read in the dictionary, and then clean the data with `match_df()`:
```r
library("matchmaker")

# Read in data set
dat <- read.csv(matchmaker_example("coded-data.csv"),
  stringsAsFactors = FALSE
)
dat$date <- as.Date(dat$date)

# Read in dictionary
dict <- read.csv(matchmaker_example("spelling-dictionary.csv"),
  stringsAsFactors = FALSE
)
```

This is the top of our data set, generated for example purposes:
| id | date | readmission | treated | facility | age_group | lab_result_01 | lab_result_02 | lab_result_03 | has_symptoms | followup | 
|---|---|---|---|---|---|---|---|---|---|---|
| ef267c | 2019-07-08 | NA | 0 | C | 10 | unk | high | inc | NA | u | 
| e80a37 | 2019-07-07 | y | 0 | 3 | 10 | inc | unk | norm | y | oui | 
| b72883 | 2019-07-07 | y | 1 | 8 | 30 | inc | norm | inc | | oui |
| c9ee86 | 2019-07-09 | n | 1 | 4 | 40 | inc | inc | unk | y | oui | 
| 40bc7a | 2019-07-12 | n | 1 | 6 | 0 | norm | unk | norm | NA | n | 
| 46566e | 2019-07-14 | y | NA | B | 50 | unk | unk | inc | NA | NA | 
The dictionary looks like this:
| options | values | grp | orders | 
|---|---|---|---|
| y | Yes | readmission | 1 | 
| n | No | readmission | 2 | 
| u | Unknown | readmission | 3 | 
| .missing | Missing | readmission | 4 | 
| 0 | Yes | treated | 1 | 
| 1 | No | treated | 2 | 
| .missing | Missing | treated | 3 | 
| 1 | Facility 1 | facility | 1 | 
| 2 | Facility 2 | facility | 2 | 
| 3 | Facility 3 | facility | 3 | 
| 4 | Facility 4 | facility | 4 | 
| 5 | Facility 5 | facility | 5 | 
| 6 | Facility 6 | facility | 6 | 
| 7 | Facility 7 | facility | 7 | 
| 8 | Facility 8 | facility | 8 | 
| 9 | Facility 9 | facility | 9 | 
| 10 | Facility 10 | facility | 10 | 
| .default | Unknown | facility | 11 | 
| 0 | 0-9 | age_group | 1 | 
| 10 | 10-19 | age_group | 2 | 
| 20 | 20-29 | age_group | 3 | 
| 30 | 30-39 | age_group | 4 | 
| 40 | 40-49 | age_group | 5 | 
| 50 | 50+ | age_group | 6 | 
| high | High | .regex ^lab_result_ | 1 | 
| norm | Normal | .regex ^lab_result_ | 2 | 
| inc | Inconclusive | .regex ^lab_result_ | 3 | 
| y | yes | .global | Inf | 
| n | no | .global | Inf | 
| u | unknown | .global | Inf | 
| unk | unknown | .global | Inf | 
| oui | yes | .global | Inf | 
| .missing | missing | .global | Inf | 
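The `grp` column decides which data columns each dictionary row applies to: a plain name such as `facility` targets that one column, a `.regex` entry targets every column whose name matches the pattern, and `.global` rows act as a fallback across columns. The base R sketch below only illustrates that selection logic on a hand-made stand-in for a few dictionary rows. The `targets_for()` helper is hypothetical, and this is not a description of how {matchmaker} is implemented internally.

```r
# Hand-made stand-in for a few rows of the dictionary above
dict <- data.frame(
  options = c("y", "1", "high", "norm", "oui"),
  values  = c("Yes", "Facility 1", "High", "Normal", "yes"),
  grp     = c("readmission", "facility", ".regex ^lab_result_",
              ".regex ^lab_result_", ".global"),
  stringsAsFactors = FALSE
)

# Column names taken from the example data set
data_columns <- c("id", "readmission", "facility",
                  "lab_result_01", "lab_result_02", "followup")

# Hypothetical helper: which data columns does one `grp` entry select?
targets_for <- function(grp, columns) {
  if (grp == ".global") {
    return(columns)                        # assumption: candidate for every column
  }
  if (startsWith(grp, ".regex ")) {
    pattern <- sub("^\\.regex ", "", grp)  # strip the keyword, keep the pattern
    return(grep(pattern, columns, value = TRUE))
  }
  intersect(grp, columns)                  # plain name: exact match only
}

groups <- unique(dict$grp)
targets <- lapply(groups, targets_for, columns = data_columns)
names(targets) <- groups
str(targets)
```

Here the single `.regex ^lab_result_` group selects both `lab_result_01` and `lab_result_02`, which is why the dictionary needs only one set of rows for all lab result columns.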
```r
# Clean spelling based on dictionary -----------------------------
cleaned <- match_df(dat,
  dictionary = dict,
  from = "options",
  to = "values",
  by = "grp"
)
head(cleaned)
#>       id       date readmission treated    facility age_group
#> 1 ef267c 2019-07-08     Missing     Yes     Unknown     10-19
#> 2 e80a37 2019-07-07         Yes     Yes Facility  3     10-19
#> 3 b72883 2019-07-07         Yes      No Facility  8     30-39
#> 4 c9ee86 2019-07-09          No      No Facility  4     40-49
#> 5 40bc7a 2019-07-12          No      No Facility  6       0-9
#> 6 46566e 2019-07-14         Yes Missing     Unknown       50+
#>   lab_result_01 lab_result_02 lab_result_03 has_symptoms followup
#> 1       unknown          High  Inconclusive      missing  unknown
#> 2  Inconclusive       unknown        Normal          yes      yes
#> 3  Inconclusive        Normal  Inconclusive      missing      yes
#> 4  Inconclusive  Inconclusive       unknown          yes      yes
#> 5        Normal       unknown        Normal      missing       no
#> 6       unknown       unknown  Inconclusive      missing  missing
```
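For a single vector, `match_vec()` takes the same `dictionary`, `from`, and `to` arguments but has no `by` argument, so the dictionary passed in should hold only the rows meant for that vector. A minimal sketch, assuming a small dictionary typed in by hand from the `readmission` rows of the table above; the object names and the input vector are illustrative, not part of the package:

```r
library("matchmaker")

# Dictionary rows for one variable, copied from the readmission group above
readmission_dict <- data.frame(
  options = c("y", "n", "u", ".missing"),
  values  = c("Yes", "No", "Unknown", "Missing"),
  stringsAsFactors = FALSE
)

# Raw codes as they might appear in a single column (illustrative values)
readmission <- c("y", "n", NA, "u", "y")

# Translate the keys in `options` to the labels in `values`
cleaned_readmission <- match_vec(readmission,
  dictionary = readmission_dict,
  from = "options",
  to = "values"
)
cleaned_readmission
```

As in the data frame example, the `.missing` key is what allows the `NA` entry to receive a label of its own.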