This function applies several cleaning procedures to an input data.frame, by standardising variable names, labels used categorical variables (characters of factors), and setting dates to Date objects. Optionally, an intelligent date search can be used on character strings to extract dates from various formats mixed with other text. See details for more information.

clean_data(
  x,
  sep = "_",
  force_Date = TRUE,
  guess_dates = FALSE,
  error_tolerance = 0.5,
  wordlists = NULL,
  spelling_vars = 3,
  sort_by = NULL,
  warn_spelling = FALSE,
  protect = FALSE,
  ...
)

Arguments

x

a data.frame

sep

The separator used between words, and defaults to the underscore _.

force_Date

a logical or integer vector indicating the columns . If logical, indicating if POSIXct and POSIXlt objects should be converted to Date objects; defaults to TRUE; you should use this if your dates are only precise to the day (i.e. no time information within days).

guess_dates

a logical or integer vector indicating which columns should be guessed , assuming these columns store character strings or factors; this feature is experimental; see guess_dates() for more information.

error_tolerance

a number between 0 and 1 indicating the proportion of entries which cannot be identified as dates to be tolerated; if this proportion is exceeded, the original vector is returned, and a message is issued; defaults to 0.1 (10 percent)

wordlists

a data frame or named list of data frames with at least two columns defining the word list to be used. If this is a data frame, a third column must be present to split the wordlists by column in x (see spelling_vars).

spelling_vars

character or integer. If wordlists is a data frame, then this column in defines the columns in x corresponding to each section of the wordlists data frame. This defaults to 3, indicating the third column is to be used.

sort_by

a character the column to be used for sorting the values in each data frame. If the incoming variables are factors, this determines how the resulting factors will be sorted.

warn_spelling

if TRUE, errors and warnings from clean_spelling() will be aggregated and presented for each column that issues them. The default value is FALSE, which means that all errors and warnings will be ignored.

protect

a logical or numeric vector defining the columns to protect from any manipulation. Note: columns in protect will override any columns in either force_Date or guess_dates.

...

further arguments passed on to guess_dates()

Value

A data.frame with standardised labels for characters and factors.

Note

Creating your wordlist

When creating the wordlist for clean_variable_spelling(), it's important to remember that the data will first be cleaned with clean_variable_labels(), which will remove any capitalisation, accents, and replace all punctuation and spaces with "_".

See also

This function wraps three other functions: clean_variable_names() - to handle variable names, clean_variables() - to handle character/factor variables, clean_dates() - to handle dates.

Examples

## make toy data toy_data <- messy_data(20) ## show data toy_data
#> 'ID Date of Onset. DisCharge.. GENDER_ Épi.Case_définition #> 1 lvijbf 2018-01-08 18/01/2018 FEMALE Confirmed #> 2 glzvne 2018-01-06 16/01/2018 Female confirmed #> 3 gslqeq 2018-01-07 17/01/2018 MALE not a case #> 4 mfaamr 2018-01-05 15/01/2018 male suspected #> 5 ewkadg 2018-01-07 17/01/2018 MALE suspected #> 6 wxbhis 2018-01-10 20/01/2018 female suspected #> 7 ifmmyu 2018-01-06 16/01/2018 Female suspected #> 8 cmrwtk 2018-01-06 16/01/2018 female PROBABLE #> 9 dbirsl 2018-01-09 19/01/2018 male Not.a.Case #> 10 sqjlmp 2018-01-03 13/01/2018 Male not a case #> 11 gudacg 2018-01-08 18/01/2018 MALE suspected #> 12 qflgkj 2018-01-06 16/01/2018 male suspected #> 13 hvqlto 2018-01-06 16/01/2018 MALE not a case #> 14 rlvuzj 2018-01-03 13/01/2018 female Not.a.Case #> 15 cdmfnb 2018-01-07 17/01/2018 FEMALE not a case #> 16 yesrej 2018-01-05 15/01/2018 MALE suspected #> 17 ldsxty 2018-01-03 13/01/2018 male Not.a.Case #> 18 qwtxjx 2018-01-04 14/01/2018 male Not.a.Case #> 19 wiyjis 2018-01-10 20/01/2018 female not a case #> 20 ysoqow 2018-01-07 17/01/2018 Female Not.a.Case #> messy/dates lat lon #> 1 <NA> 11.582811 48.26727 #> 2 01-12-2001 12.582206 46.19974 #> 3 female 13.704558 46.68366 #> 4 female 16.489427 47.76145 #> 5 male 9.714188 46.85687 #> 6 // 24//12//1989 11.441429 47.21552 #> 7 female 16.078899 47.17223 #> 8 2018 10 19 14.400111 48.94434 #> 9 2018-10-18 13.235185 47.36031 #> 10 01-12-2001 12.432755 45.82377 #> 11 // 24//12//1989 13.995818 47.78286 #> 12 2018 10 19 16.219886 48.57991 #> 13 2018 10 19 16.722272 49.95717 #> 14 female 13.772498 47.04740 #> 15 that's 24/12/1989! 10.062713 46.13261 #> 16 male 14.435050 47.03233 #> 17 <NA> 13.263820 48.79245 #> 18 01-12-2001 9.268153 49.81503 #> 19 01-12-2001 11.158673 47.06303 #> 20 // 24//12//1989 15.408058 48.86130
## clean variable names, store in new object, show results clean_data <- clean_data(toy_data, guess_dates = TRUE, error_tolerance = 0.1) clean_data
#> id date_of_onset discharge gender epi_case_definition messy_dates #> 1 lvijbf 2018-01-08 2018-01-18 female confirmed <NA> #> 2 glzvne 2018-01-06 2018-01-16 female confirmed 01_12_2001 #> 3 gslqeq 2018-01-07 2018-01-17 male not_a_case female #> 4 mfaamr 2018-01-05 2018-01-15 male suspected female #> 5 ewkadg 2018-01-07 2018-01-17 male suspected male #> 6 wxbhis 2018-01-10 2018-01-20 female suspected 24_12_1989 #> 7 ifmmyu 2018-01-06 2018-01-16 female suspected female #> 8 cmrwtk 2018-01-06 2018-01-16 female probable 2018_10_19 #> 9 dbirsl 2018-01-09 2018-01-19 male not_a_case 2018_10_18 #> 10 sqjlmp 2018-01-03 2018-01-13 male not_a_case 01_12_2001 #> 11 gudacg 2018-01-08 2018-01-18 male suspected 24_12_1989 #> 12 qflgkj 2018-01-06 2018-01-16 male suspected 2018_10_19 #> 13 hvqlto 2018-01-06 2018-01-16 male not_a_case 2018_10_19 #> 14 rlvuzj 2018-01-03 2018-01-13 female not_a_case female #> 15 cdmfnb 2018-01-07 2018-01-17 female not_a_case that_s_24_12_1989 #> 16 yesrej 2018-01-05 2018-01-15 male suspected male #> 17 ldsxty 2018-01-03 2018-01-13 male not_a_case <NA> #> 18 qwtxjx 2018-01-04 2018-01-14 male not_a_case 01_12_2001 #> 19 wiyjis 2018-01-10 2018-01-20 female not_a_case 01_12_2001 #> 20 ysoqow 2018-01-07 2018-01-17 female not_a_case 24_12_1989 #> lat lon #> 1 11.582811 48.26727 #> 2 12.582206 46.19974 #> 3 13.704558 46.68366 #> 4 16.489427 47.76145 #> 5 9.714188 46.85687 #> 6 11.441429 47.21552 #> 7 16.078899 47.17223 #> 8 14.400111 48.94434 #> 9 13.235185 47.36031 #> 10 12.432755 45.82377 #> 11 13.995818 47.78286 #> 12 16.219886 48.57991 #> 13 16.722272 49.95717 #> 14 13.772498 47.04740 #> 15 10.062713 46.13261 #> 16 14.435050 47.03233 #> 17 13.263820 48.79245 #> 18 9.268153 49.81503 #> 19 11.158673 47.06303 #> 20 15.408058 48.86130
clean_data2 <- clean_data(toy_data, guess_dates = TRUE, error_tolerance = 0.8) clean_data2
#> id date_of_onset discharge gender epi_case_definition messy_dates #> 1 lvijbf 2018-01-08 2018-01-18 female confirmed <NA> #> 2 glzvne 2018-01-06 2018-01-16 female confirmed 2001-12-01 #> 3 gslqeq 2018-01-07 2018-01-17 male not_a_case <NA> #> 4 mfaamr 2018-01-05 2018-01-15 male suspected <NA> #> 5 ewkadg 2018-01-07 2018-01-17 male suspected <NA> #> 6 wxbhis 2018-01-10 2018-01-20 female suspected 1989-12-24 #> 7 ifmmyu 2018-01-06 2018-01-16 female suspected <NA> #> 8 cmrwtk 2018-01-06 2018-01-16 female probable 2018-10-19 #> 9 dbirsl 2018-01-09 2018-01-19 male not_a_case 2018-10-18 #> 10 sqjlmp 2018-01-03 2018-01-13 male not_a_case 2001-12-01 #> 11 gudacg 2018-01-08 2018-01-18 male suspected 1989-12-24 #> 12 qflgkj 2018-01-06 2018-01-16 male suspected 2018-10-19 #> 13 hvqlto 2018-01-06 2018-01-16 male not_a_case 2018-10-19 #> 14 rlvuzj 2018-01-03 2018-01-13 female not_a_case <NA> #> 15 cdmfnb 2018-01-07 2018-01-17 female not_a_case 1989-12-24 #> 16 yesrej 2018-01-05 2018-01-15 male suspected <NA> #> 17 ldsxty 2018-01-03 2018-01-13 male not_a_case <NA> #> 18 qwtxjx 2018-01-04 2018-01-14 male not_a_case 2001-12-01 #> 19 wiyjis 2018-01-10 2018-01-20 female not_a_case 2001-12-01 #> 20 ysoqow 2018-01-07 2018-01-17 female not_a_case 1989-12-24 #> lat lon #> 1 11.582811 48.26727 #> 2 12.582206 46.19974 #> 3 13.704558 46.68366 #> 4 16.489427 47.76145 #> 5 9.714188 46.85687 #> 6 11.441429 47.21552 #> 7 16.078899 47.17223 #> 8 14.400111 48.94434 #> 9 13.235185 47.36031 #> 10 12.432755 45.82377 #> 11 13.995818 47.78286 #> 12 16.219886 48.57991 #> 13 16.722272 49.95717 #> 14 13.772498 47.04740 #> 15 10.062713 46.13261 #> 16 14.435050 47.03233 #> 17 13.263820 48.79245 #> 18 9.268153 49.81503 #> 19 11.158673 47.06303 #> 20 15.408058 48.86130
## clean variable names, but keep our "messy/dates" column to_protect <- names(toy_data) %in% "messy/dates" clean_data3 <- clean_data(toy_data, guess_dates = TRUE, error_tolerance = 0.8, protect = to_protect ) clean_data3
#> id date_of_onset discharge gender epi_case_definition #> 1 lvijbf 2018-01-08 2018-01-18 female confirmed #> 2 glzvne 2018-01-06 2018-01-16 female confirmed #> 3 gslqeq 2018-01-07 2018-01-17 male not_a_case #> 4 mfaamr 2018-01-05 2018-01-15 male suspected #> 5 ewkadg 2018-01-07 2018-01-17 male suspected #> 6 wxbhis 2018-01-10 2018-01-20 female suspected #> 7 ifmmyu 2018-01-06 2018-01-16 female suspected #> 8 cmrwtk 2018-01-06 2018-01-16 female probable #> 9 dbirsl 2018-01-09 2018-01-19 male not_a_case #> 10 sqjlmp 2018-01-03 2018-01-13 male not_a_case #> 11 gudacg 2018-01-08 2018-01-18 male suspected #> 12 qflgkj 2018-01-06 2018-01-16 male suspected #> 13 hvqlto 2018-01-06 2018-01-16 male not_a_case #> 14 rlvuzj 2018-01-03 2018-01-13 female not_a_case #> 15 cdmfnb 2018-01-07 2018-01-17 female not_a_case #> 16 yesrej 2018-01-05 2018-01-15 male suspected #> 17 ldsxty 2018-01-03 2018-01-13 male not_a_case #> 18 qwtxjx 2018-01-04 2018-01-14 male not_a_case #> 19 wiyjis 2018-01-10 2018-01-20 female not_a_case #> 20 ysoqow 2018-01-07 2018-01-17 female not_a_case #> messy/dates lat lon #> 1 <NA> 11.582811 48.26727 #> 2 01-12-2001 12.582206 46.19974 #> 3 female 13.704558 46.68366 #> 4 female 16.489427 47.76145 #> 5 male 9.714188 46.85687 #> 6 // 24//12//1989 11.441429 47.21552 #> 7 female 16.078899 47.17223 #> 8 2018 10 19 14.400111 48.94434 #> 9 2018-10-18 13.235185 47.36031 #> 10 01-12-2001 12.432755 45.82377 #> 11 // 24//12//1989 13.995818 47.78286 #> 12 2018 10 19 16.219886 48.57991 #> 13 2018 10 19 16.722272 49.95717 #> 14 female 13.772498 47.04740 #> 15 that's 24/12/1989! 10.062713 46.13261 #> 16 male 14.435050 47.03233 #> 17 <NA> 13.263820 48.79245 #> 18 01-12-2001 9.268153 49.81503 #> 19 01-12-2001 11.158673 47.06303 #> 20 // 24//12//1989 15.408058 48.86130
## Using a wordlist ------------------------------- # location data with mis-spellings, French, and English. messy_locations <- c("hopsital", "h\u00f4pital", "hospital", "m\u00e9dical", "clinic", "feild", "field") toy_data$location <- factor(sample(messy_locations, 20, replace = TRUE)) # show data toy_data$location
#> [1] hopsital clinic hopsital clinic hôpital field clinic hospital #> [9] field hôpital hospital hopsital field hopsital hopsital hôpital #> [17] hôpital hospital clinic hôpital #> Levels: clinic field hôpital hopsital hospital
# add a wordlist wordlist <- data.frame( from = c("hopsital", "hopital", "medical", "feild"), to = c("hospital", "hospital", "clinic", "field"), variables = rep("location", 4), stringsAsFactors = FALSE ) clean_data4 <- clean_data(toy_data, wordlists = wordlist, spelling_vars = "variables" ) clean_data4
#> id date_of_onset discharge gender epi_case_definition messy_dates #> 1 lvijbf 2018-01-08 18_01_2018 female confirmed <NA> #> 2 glzvne 2018-01-06 16_01_2018 female confirmed 01_12_2001 #> 3 gslqeq 2018-01-07 17_01_2018 male not_a_case female #> 4 mfaamr 2018-01-05 15_01_2018 male suspected female #> 5 ewkadg 2018-01-07 17_01_2018 male suspected male #> 6 wxbhis 2018-01-10 20_01_2018 female suspected 24_12_1989 #> 7 ifmmyu 2018-01-06 16_01_2018 female suspected female #> 8 cmrwtk 2018-01-06 16_01_2018 female probable 2018_10_19 #> 9 dbirsl 2018-01-09 19_01_2018 male not_a_case 2018_10_18 #> 10 sqjlmp 2018-01-03 13_01_2018 male not_a_case 01_12_2001 #> 11 gudacg 2018-01-08 18_01_2018 male suspected 24_12_1989 #> 12 qflgkj 2018-01-06 16_01_2018 male suspected 2018_10_19 #> 13 hvqlto 2018-01-06 16_01_2018 male not_a_case 2018_10_19 #> 14 rlvuzj 2018-01-03 13_01_2018 female not_a_case female #> 15 cdmfnb 2018-01-07 17_01_2018 female not_a_case that_s_24_12_1989 #> 16 yesrej 2018-01-05 15_01_2018 male suspected male #> 17 ldsxty 2018-01-03 13_01_2018 male not_a_case <NA> #> 18 qwtxjx 2018-01-04 14_01_2018 male not_a_case 01_12_2001 #> 19 wiyjis 2018-01-10 20_01_2018 female not_a_case 01_12_2001 #> 20 ysoqow 2018-01-07 17_01_2018 female not_a_case 24_12_1989 #> lat lon location #> 1 11.582811 48.26727 hospital #> 2 12.582206 46.19974 clinic #> 3 13.704558 46.68366 hospital #> 4 16.489427 47.76145 clinic #> 5 9.714188 46.85687 hospital #> 6 11.441429 47.21552 field #> 7 16.078899 47.17223 clinic #> 8 14.400111 48.94434 hospital #> 9 13.235185 47.36031 field #> 10 12.432755 45.82377 hospital #> 11 13.995818 47.78286 hospital #> 12 16.219886 48.57991 hospital #> 13 16.722272 49.95717 field #> 14 13.772498 47.04740 hospital #> 15 10.062713 46.13261 hospital #> 16 14.435050 47.03233 hospital #> 17 13.263820 48.79245 hospital #> 18 9.268153 49.81503 hospital #> 19 11.158673 47.06303 clinic #> 20 15.408058 48.86130 hospital
clean_data4$location
#> [1] hospital clinic hospital clinic hospital field clinic hospital #> [9] field hospital hospital hospital field hospital hospital hospital #> [17] hospital hospital clinic hospital #> Levels: hospital clinic field