Clean a data.frame

This function applies several cleaning procedures to an input data.frame, by standardising variable names, labels used categorical variables (characters of factors), and setting dates to Date objects. Optionally, an intelligent date search can be used on character strings to extract dates from various formats mixed with other text. See details for more information.

clean_data(
  x,
  sep = "_",
  force_Date = TRUE,
  guess_dates = FALSE,
  error_tolerance = 0.5,
  wordlists = NULL,
  spelling_vars = 3,
  sort_by = NULL,
  warn_spelling = FALSE,
  protect = FALSE,
  ...
)

Arguments

x	a `data.frame`
sep	The separator used between words, and defaults to the underscore `_`.
force_Date	a `logical` or `integer` vector indicating the columns . If `logical`, indicating if `POSIXct` and `POSIXlt` objects should be converted to `Date` objects; defaults to `TRUE`; you should use this if your dates are only precise to the day (i.e. no time information within days).
guess_dates	a `logical` or `integer` vector indicating which columns should be guessed , assuming these columns store character strings or `factors`; this feature is experimental; see `guess_dates()` for more information.
error_tolerance	a number between 0 and 1 indicating the proportion of entries which cannot be identified as dates to be tolerated; if this proportion is exceeded, the original vector is returned, and a message is issued; defaults to 0.1 (10 percent)
wordlists	a data frame or named list of data frames with at least two columns defining the word list to be used. If this is a data frame, a third column must be present to split the wordlists by column in `x` (see `spelling_vars`).
spelling_vars	character or integer. If `wordlists` is a data frame, then this column in defines the columns in `x` corresponding to each section of the `wordlists` data frame. This defaults to `3`, indicating the third column is to be used.
sort_by	a character the column to be used for sorting the values in each data frame. If the incoming variables are factors, this determines how the resulting factors will be sorted.
warn_spelling	if `TRUE`, errors and warnings from `clean_spelling()` will be aggregated and presented for each column that issues them. The default value is `FALSE`, which means that all errors and warnings will be ignored.
protect	a logical or numeric vector defining the columns to protect from any manipulation. Note: columns in `protect` will override any columns in either `force_Date` or `guess_dates`.
...	further arguments passed on to `guess_dates()`

Value

A data.frame with standardised labels for characters and factors.

Note

Creating your wordlist

When creating the wordlist for clean_variable_spelling(), it's important to remember that the data will first be cleaned with clean_variable_labels(), which will remove any capitalisation, accents, and replace all punctuation and spaces with "_".

Examples


## make toy data
toy_data <- messy_data(20)

## show data
toy_data
#>       'ID Date of Onset. DisCharge.. GENDER_  Épi.Case_définition
#> 1  lvijbf     2018-01-08  18/01/2018   FEMALE           Confirmed
#> 2  glzvne     2018-01-06  16/01/2018   Female           confirmed
#> 3  gslqeq     2018-01-07  17/01/2018     MALE          not a case
#> 4  mfaamr     2018-01-05  15/01/2018     male         suspected  
#> 5  ewkadg     2018-01-07  17/01/2018     MALE           suspected
#> 6  wxbhis     2018-01-10  20/01/2018   female           suspected
#> 7  ifmmyu     2018-01-06  16/01/2018   Female         suspected  
#> 8  cmrwtk     2018-01-06  16/01/2018   female            PROBABLE
#> 9  dbirsl     2018-01-09  19/01/2018     male          Not.a.Case
#> 10 sqjlmp     2018-01-03  13/01/2018     Male          not a case
#> 11 gudacg     2018-01-08  18/01/2018     MALE         suspected  
#> 12 qflgkj     2018-01-06  16/01/2018     male           suspected
#> 13 hvqlto     2018-01-06  16/01/2018     MALE          not a case
#> 14 rlvuzj     2018-01-03  13/01/2018   female          Not.a.Case
#> 15 cdmfnb     2018-01-07  17/01/2018   FEMALE          not a case
#> 16 yesrej     2018-01-05  15/01/2018     MALE           suspected
#> 17 ldsxty     2018-01-03  13/01/2018     male          Not.a.Case
#> 18 qwtxjx     2018-01-04  14/01/2018     male          Not.a.Case
#> 19 wiyjis     2018-01-10  20/01/2018   female          not a case
#> 20 ysoqow     2018-01-07  17/01/2018   Female          Not.a.Case
#>           messy/dates       lat      lon
#> 1                <NA> 11.582811 48.26727
#> 2          01-12-2001 12.582206 46.19974
#> 3              female 13.704558 46.68366
#> 4              female 16.489427 47.76145
#> 5                male  9.714188 46.85687
#> 6     // 24//12//1989 11.441429 47.21552
#> 7              female 16.078899 47.17223
#> 8          2018 10 19 14.400111 48.94434
#> 9          2018-10-18 13.235185 47.36031
#> 10         01-12-2001 12.432755 45.82377
#> 11    // 24//12//1989 13.995818 47.78286
#> 12         2018 10 19 16.219886 48.57991
#> 13         2018 10 19 16.722272 49.95717
#> 14             female 13.772498 47.04740
#> 15 that's 24/12/1989! 10.062713 46.13261
#> 16               male 14.435050 47.03233
#> 17               <NA> 13.263820 48.79245
#> 18         01-12-2001  9.268153 49.81503
#> 19         01-12-2001 11.158673 47.06303
#> 20    // 24//12//1989 15.408058 48.86130


## clean variable names, store in new object, show results
clean_data <- clean_data(toy_data, guess_dates = TRUE, error_tolerance = 0.1)
clean_data
#>        id date_of_onset  discharge gender epi_case_definition       messy_dates
#> 1  lvijbf    2018-01-08 2018-01-18 female           confirmed              <NA>
#> 2  glzvne    2018-01-06 2018-01-16 female           confirmed        01_12_2001
#> 3  gslqeq    2018-01-07 2018-01-17   male          not_a_case            female
#> 4  mfaamr    2018-01-05 2018-01-15   male           suspected            female
#> 5  ewkadg    2018-01-07 2018-01-17   male           suspected              male
#> 6  wxbhis    2018-01-10 2018-01-20 female           suspected        24_12_1989
#> 7  ifmmyu    2018-01-06 2018-01-16 female           suspected            female
#> 8  cmrwtk    2018-01-06 2018-01-16 female            probable        2018_10_19
#> 9  dbirsl    2018-01-09 2018-01-19   male          not_a_case        2018_10_18
#> 10 sqjlmp    2018-01-03 2018-01-13   male          not_a_case        01_12_2001
#> 11 gudacg    2018-01-08 2018-01-18   male           suspected        24_12_1989
#> 12 qflgkj    2018-01-06 2018-01-16   male           suspected        2018_10_19
#> 13 hvqlto    2018-01-06 2018-01-16   male          not_a_case        2018_10_19
#> 14 rlvuzj    2018-01-03 2018-01-13 female          not_a_case            female
#> 15 cdmfnb    2018-01-07 2018-01-17 female          not_a_case that_s_24_12_1989
#> 16 yesrej    2018-01-05 2018-01-15   male           suspected              male
#> 17 ldsxty    2018-01-03 2018-01-13   male          not_a_case              <NA>
#> 18 qwtxjx    2018-01-04 2018-01-14   male          not_a_case        01_12_2001
#> 19 wiyjis    2018-01-10 2018-01-20 female          not_a_case        01_12_2001
#> 20 ysoqow    2018-01-07 2018-01-17 female          not_a_case        24_12_1989
#>          lat      lon
#> 1  11.582811 48.26727
#> 2  12.582206 46.19974
#> 3  13.704558 46.68366
#> 4  16.489427 47.76145
#> 5   9.714188 46.85687
#> 6  11.441429 47.21552
#> 7  16.078899 47.17223
#> 8  14.400111 48.94434
#> 9  13.235185 47.36031
#> 10 12.432755 45.82377
#> 11 13.995818 47.78286
#> 12 16.219886 48.57991
#> 13 16.722272 49.95717
#> 14 13.772498 47.04740
#> 15 10.062713 46.13261
#> 16 14.435050 47.03233
#> 17 13.263820 48.79245
#> 18  9.268153 49.81503
#> 19 11.158673 47.06303
#> 20 15.408058 48.86130

clean_data2 <- clean_data(toy_data, guess_dates = TRUE, error_tolerance = 0.8)
clean_data2
#>        id date_of_onset  discharge gender epi_case_definition messy_dates
#> 1  lvijbf    2018-01-08 2018-01-18 female           confirmed        <NA>
#> 2  glzvne    2018-01-06 2018-01-16 female           confirmed  2001-12-01
#> 3  gslqeq    2018-01-07 2018-01-17   male          not_a_case        <NA>
#> 4  mfaamr    2018-01-05 2018-01-15   male           suspected        <NA>
#> 5  ewkadg    2018-01-07 2018-01-17   male           suspected        <NA>
#> 6  wxbhis    2018-01-10 2018-01-20 female           suspected  1989-12-24
#> 7  ifmmyu    2018-01-06 2018-01-16 female           suspected        <NA>
#> 8  cmrwtk    2018-01-06 2018-01-16 female            probable  2018-10-19
#> 9  dbirsl    2018-01-09 2018-01-19   male          not_a_case  2018-10-18
#> 10 sqjlmp    2018-01-03 2018-01-13   male          not_a_case  2001-12-01
#> 11 gudacg    2018-01-08 2018-01-18   male           suspected  1989-12-24
#> 12 qflgkj    2018-01-06 2018-01-16   male           suspected  2018-10-19
#> 13 hvqlto    2018-01-06 2018-01-16   male          not_a_case  2018-10-19
#> 14 rlvuzj    2018-01-03 2018-01-13 female          not_a_case        <NA>
#> 15 cdmfnb    2018-01-07 2018-01-17 female          not_a_case  1989-12-24
#> 16 yesrej    2018-01-05 2018-01-15   male           suspected        <NA>
#> 17 ldsxty    2018-01-03 2018-01-13   male          not_a_case        <NA>
#> 18 qwtxjx    2018-01-04 2018-01-14   male          not_a_case  2001-12-01
#> 19 wiyjis    2018-01-10 2018-01-20 female          not_a_case  2001-12-01
#> 20 ysoqow    2018-01-07 2018-01-17 female          not_a_case  1989-12-24
#>          lat      lon
#> 1  11.582811 48.26727
#> 2  12.582206 46.19974
#> 3  13.704558 46.68366
#> 4  16.489427 47.76145
#> 5   9.714188 46.85687
#> 6  11.441429 47.21552
#> 7  16.078899 47.17223
#> 8  14.400111 48.94434
#> 9  13.235185 47.36031
#> 10 12.432755 45.82377
#> 11 13.995818 47.78286
#> 12 16.219886 48.57991
#> 13 16.722272 49.95717
#> 14 13.772498 47.04740
#> 15 10.062713 46.13261
#> 16 14.435050 47.03233
#> 17 13.263820 48.79245
#> 18  9.268153 49.81503
#> 19 11.158673 47.06303
#> 20 15.408058 48.86130

## clean variable names, but keep our "messy/dates" column
to_protect <- names(toy_data) %in% "messy/dates"
clean_data3 <- clean_data(toy_data,
                          guess_dates = TRUE,
                          error_tolerance = 0.8,
                          protect = to_protect
                         )
clean_data3
#>        id date_of_onset  discharge gender epi_case_definition
#> 1  lvijbf    2018-01-08 2018-01-18 female           confirmed
#> 2  glzvne    2018-01-06 2018-01-16 female           confirmed
#> 3  gslqeq    2018-01-07 2018-01-17   male          not_a_case
#> 4  mfaamr    2018-01-05 2018-01-15   male           suspected
#> 5  ewkadg    2018-01-07 2018-01-17   male           suspected
#> 6  wxbhis    2018-01-10 2018-01-20 female           suspected
#> 7  ifmmyu    2018-01-06 2018-01-16 female           suspected
#> 8  cmrwtk    2018-01-06 2018-01-16 female            probable
#> 9  dbirsl    2018-01-09 2018-01-19   male          not_a_case
#> 10 sqjlmp    2018-01-03 2018-01-13   male          not_a_case
#> 11 gudacg    2018-01-08 2018-01-18   male           suspected
#> 12 qflgkj    2018-01-06 2018-01-16   male           suspected
#> 13 hvqlto    2018-01-06 2018-01-16   male          not_a_case
#> 14 rlvuzj    2018-01-03 2018-01-13 female          not_a_case
#> 15 cdmfnb    2018-01-07 2018-01-17 female          not_a_case
#> 16 yesrej    2018-01-05 2018-01-15   male           suspected
#> 17 ldsxty    2018-01-03 2018-01-13   male          not_a_case
#> 18 qwtxjx    2018-01-04 2018-01-14   male          not_a_case
#> 19 wiyjis    2018-01-10 2018-01-20 female          not_a_case
#> 20 ysoqow    2018-01-07 2018-01-17 female          not_a_case
#>           messy/dates       lat      lon
#> 1                <NA> 11.582811 48.26727
#> 2          01-12-2001 12.582206 46.19974
#> 3              female 13.704558 46.68366
#> 4              female 16.489427 47.76145
#> 5                male  9.714188 46.85687
#> 6     // 24//12//1989 11.441429 47.21552
#> 7              female 16.078899 47.17223
#> 8          2018 10 19 14.400111 48.94434
#> 9          2018-10-18 13.235185 47.36031
#> 10         01-12-2001 12.432755 45.82377
#> 11    // 24//12//1989 13.995818 47.78286
#> 12         2018 10 19 16.219886 48.57991
#> 13         2018 10 19 16.722272 49.95717
#> 14             female 13.772498 47.04740
#> 15 that's 24/12/1989! 10.062713 46.13261
#> 16               male 14.435050 47.03233
#> 17               <NA> 13.263820 48.79245
#> 18         01-12-2001  9.268153 49.81503
#> 19         01-12-2001 11.158673 47.06303
#> 20    // 24//12//1989 15.408058 48.86130

## Using a wordlist  -------------------------------

# location data with mis-spellings, French, and English.
messy_locations <- c("hopsital", "h\u00f4pital", "hospital",
                     "m\u00e9dical", "clinic",
                     "feild", "field")
toy_data$location <- factor(sample(messy_locations, 20, replace = TRUE))

# show data 
toy_data$location
#>  [1] hopsital clinic   hopsital clinic   hôpital  field    clinic   hospital
#>  [9] field    hôpital  hospital hopsital field    hopsital hopsital hôpital 
#> [17] hôpital  hospital clinic   hôpital 
#> Levels: clinic field hôpital hopsital hospital


# add a wordlist
wordlist <- data.frame(
  from  = c("hopsital", "hopital",  "medical", "feild"),
  to    = c("hospital", "hospital", "clinic",  "field"),
  variables = rep("location", 4),
  stringsAsFactors = FALSE
)

clean_data4 <- clean_data(toy_data,
                          wordlists     = wordlist,
                          spelling_vars = "variables"
                         )
clean_data4
#>        id date_of_onset  discharge gender epi_case_definition       messy_dates
#> 1  lvijbf    2018-01-08 18_01_2018 female           confirmed              <NA>
#> 2  glzvne    2018-01-06 16_01_2018 female           confirmed        01_12_2001
#> 3  gslqeq    2018-01-07 17_01_2018   male          not_a_case            female
#> 4  mfaamr    2018-01-05 15_01_2018   male           suspected            female
#> 5  ewkadg    2018-01-07 17_01_2018   male           suspected              male
#> 6  wxbhis    2018-01-10 20_01_2018 female           suspected        24_12_1989
#> 7  ifmmyu    2018-01-06 16_01_2018 female           suspected            female
#> 8  cmrwtk    2018-01-06 16_01_2018 female            probable        2018_10_19
#> 9  dbirsl    2018-01-09 19_01_2018   male          not_a_case        2018_10_18
#> 10 sqjlmp    2018-01-03 13_01_2018   male          not_a_case        01_12_2001
#> 11 gudacg    2018-01-08 18_01_2018   male           suspected        24_12_1989
#> 12 qflgkj    2018-01-06 16_01_2018   male           suspected        2018_10_19
#> 13 hvqlto    2018-01-06 16_01_2018   male          not_a_case        2018_10_19
#> 14 rlvuzj    2018-01-03 13_01_2018 female          not_a_case            female
#> 15 cdmfnb    2018-01-07 17_01_2018 female          not_a_case that_s_24_12_1989
#> 16 yesrej    2018-01-05 15_01_2018   male           suspected              male
#> 17 ldsxty    2018-01-03 13_01_2018   male          not_a_case              <NA>
#> 18 qwtxjx    2018-01-04 14_01_2018   male          not_a_case        01_12_2001
#> 19 wiyjis    2018-01-10 20_01_2018 female          not_a_case        01_12_2001
#> 20 ysoqow    2018-01-07 17_01_2018 female          not_a_case        24_12_1989
#>          lat      lon location
#> 1  11.582811 48.26727 hospital
#> 2  12.582206 46.19974   clinic
#> 3  13.704558 46.68366 hospital
#> 4  16.489427 47.76145   clinic
#> 5   9.714188 46.85687 hospital
#> 6  11.441429 47.21552    field
#> 7  16.078899 47.17223   clinic
#> 8  14.400111 48.94434 hospital
#> 9  13.235185 47.36031    field
#> 10 12.432755 45.82377 hospital
#> 11 13.995818 47.78286 hospital
#> 12 16.219886 48.57991 hospital
#> 13 16.722272 49.95717    field
#> 14 13.772498 47.04740 hospital
#> 15 10.062713 46.13261 hospital
#> 16 14.435050 47.03233 hospital
#> 17 13.263820 48.79245 hospital
#> 18  9.268153 49.81503 hospital
#> 19 11.158673 47.06303   clinic
#> 20 15.408058 48.86130 hospital
clean_data4$location
#>  [1] hospital clinic   hospital clinic   hospital field    clinic   hospital
#>  [9] field    hospital hospital hospital field    hospital hospital hospital
#> [17] hospital hospital clinic   hospital
#> Levels: hospital clinic field

Arguments

Value

Note

Creating your wordlist

See also

Examples

Contents

Author