The goal of {matchmaker} is to provide dictionary-based cleaning for R users in a simple and intuitive manner built on the {forcats} package. Some of the features of this package include:

  • preservation of factor orders
  • ability to specify explicit and implicit missing values
  • option to replace by fuzzy matching (regular expressions, anchored by default)
  • optional variable selection by fuzzy matching

Installation

You can install {matchmaker} from CRAN:

install.packages("matchmaker")

Example

The matchmaker package has two user-facing functions that perform dictionary-based cleaning:

  • match_vec() will translate the values in a single vector
  • match_df() will translate values in all specified columns of a data frame

Each of these functions have four manditory options:

  • x: your data. This will be a vector or data frame depending on the function.
  • dictionary: This is a data frame with at least two columns specifying keys and values to modify
  • from: a character or number specifying which column contains the keys
  • to: a character or number specifying which column contains the values

Mostly, users will be working with match_df() to transform values across specific columns. A typical workflow would be to:

  1. construct your dictionary in a spreadsheet program based on your data
  2. read in your data and dictionary to data frames in R
  3. match!

Data

This is the top of our data set, generated for example purposes

id date readmission treated facility age_group lab_result_01 lab_result_02 lab_result_03 has_symptoms followup
ef267c 2019-07-08 NA 0 C 10 unk high inc NA u
e80a37 2019-07-07 y 0 3 10 inc unk norm y oui
b72883 2019-07-07 y 1 8 30 inc norm inc oui
c9ee86 2019-07-09 n 1 4 40 inc inc unk y oui
40bc7a 2019-07-12 n 1 6 0 norm unk norm NA n
46566e 2019-07-14 y NA B 50 unk unk inc NA NA

Dictionary

The dictionary looks like this:

options values grp orders
y Yes readmission 1
n No readmission 2
u Unknown readmission 3
.missing Missing readmission 4
0 Yes treated 1
1 No treated 2
.missing Missing treated 3
1 Facility 1 facility 1
2 Facility 2 facility 2
3 Facility 3 facility 3
4 Facility 4 facility 4
5 Facility 5 facility 5
6 Facility 6 facility 6
7 Facility 7 facility 7
8 Facility 8 facility 8
9 Facility 9 facility 9
10 Facility 10 facility 10
.default Unknown facility 11
0 0-9 age_group 1
10 10-19 age_group 2
20 20-29 age_group 3
30 30-39 age_group 4
40 40-49 age_group 5
50 50+ age_group 6
high High .regex ^lab_result_ 1
norm Normal .regex ^lab_result_ 2
inc Inconclusive .regex ^lab_result_ 3
y yes .global Inf
n no .global Inf
u unknown .global Inf
unk unknown .global Inf
oui yes .global Inf
.missing missing .global Inf