This function detects variables of a data.frame that effectively represent
dates and converts them to Date objects. When variables are character strings
or factors, the function tries to convert dates using various pre-defined
formats (see details). For each variable, the most common date format is
detected automatically, and dates that do not follow it are set to NA (i.e.
missing). A tolerance threshold (error_tolerance) limits the proportion of
entries that may fail to convert to a date. By default, the tolerance is set
to 0.5, meaning that up to 50% of the entries in a given variable may fail to
convert. If the proportion of errors is higher, the variable is assumed not to
be a date and is left untouched.
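The tolerance rule described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: `convert_if_date()` is a hypothetical helper, and the three candidate formats are chosen for the example rather than taken from the package's pre-defined list.

```r
# Hypothetical helper (not part of the package): convert a vector to Date using
# the most common of a few candidate formats, unless too many entries fail.
convert_if_date <- function(x, error_tolerance,
                            formats = c("%Y-%m-%d", "%d/%m/%Y", "%d %m %Y")) {
  original <- x
  x <- as.character(x)  # also handles factors

  # Parse the whole vector once per candidate format
  parsed <- lapply(formats, function(f) as.Date(x, format = f))

  # Keep the format that converts the most entries ("most common format")
  n_ok <- vapply(parsed, function(d) sum(!is.na(d)), integer(1))
  best <- parsed[[which.max(n_ok)]]

  # Proportion of non-missing entries that could not be converted
  n_present <- sum(!is.na(x))
  if (n_present == 0) {
    return(original)
  }
  prop_errors <- sum(is.na(best) & !is.na(x)) / n_present

  if (prop_errors > error_tolerance) {
    return(original)  # too many errors: assume this is not a date variable
  }
  best  # entries not matching the chosen format are NA
}

mostly_dates <- c("10/01/2018", "08/01/2018", "male", "04/01/2018", "05/01/2018")
mostly_text  <- c("fever", "bleeding", "fever", "05/01/2018")

converted <- convert_if_date(mostly_dates, error_tolerance = 0.25)
untouched <- convert_if_date(mostly_text, error_tolerance = 0.25)
```

Here one entry in five fails for `mostly_dates`, which is within the tolerance of 0.25, so the vector is converted and "male" becomes NA. Three entries in four fail for `mostly_text`, so that vector is returned unchanged.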
clean_dates(
  x,
  force_Date = TRUE,
  guess_dates = TRUE,
  error_tolerance = 0.5,
  ...,
  classes = NULL
)
Argument | Description |
---|---|
x | a |
force_Date | a |
guess_dates | a |
error_tolerance | a number between 0 and 1 indicating the proportion of entries that may fail to be identified as dates; if this proportion is exceeded, the original vector is returned and a message is issued; defaults to 0.5 (50 percent) |
... | further arguments passed on to |
classes | a vector of class definitions for each of the columns. If this is not provided, the classes will be read from the columns themselves. Practically, this is used in |
A data.frame with standardised dates.
guess_dates() to extract dates from a messy input vector
## make toy data
onsets <- as.POSIXct("2018-01-01", tz = "UTC")
onsets <- seq(onsets, by = "1 day", length.out = 10)
onsets <- sample(onsets, 20, replace = TRUE)
onsets2 <- format(as.Date(onsets), "%d/%m/%Y")
onsets3 <- format(as.Date(onsets), "%d %m %Y")
outcomes <- onsets + 1e7
admissions <- onsets + 86400 + sample(86400, 20)
admissions[1:5] <- NA
discharges <- admissions + (86400 * sample(5, 20, replace = TRUE)) + sample(86400, 20)
onset_with_errors <- onsets2
onset_with_errors[c(1,20)] <- c("male", "confirmed")
mixed_info <- onsets3
mixed_info[1:10] <- sample(c("bleeding", "fever"), 10, replace = TRUE)
gender <- sample(c("male", "female"), 20, replace = TRUE)
case_type <- c("confirmed", "probable", "suspected", "not a case")
case <- sample(case_type, 20, replace = TRUE)
toy_data <- data.frame("Date of Onset." = onsets,
                       "onset 2" = onsets2,
                       "ONSET 3" = onsets3,
                       "onset_4" = onset_with_errors,
                       "date admission" = admissions,
                       "DATE.of.DISCHARGE" = discharges,
                       "GENDER_ " = gender,
                       "Épi.Case_définition" = case,
                       "date of admission" = admissions,
                       "Date-of_discharge" = discharges,
                       "extra" = mixed_info,
                       stringsAsFactors = FALSE,
                       check.names = FALSE)

## show data
toy_data
#>    Date of Onset.    onset 2    ONSET 3    onset_4      date admission
#> 1      2018-01-10 10/01/2018 10 01 2018       male                <NA>
#> 2      2018-01-08 08/01/2018 08 01 2018 08/01/2018                <NA>
#> 3      2018-01-02 02/01/2018 02 01 2018 02/01/2018                <NA>
#> 4      2018-01-04 04/01/2018 04 01 2018 04/01/2018                <NA>
#> 5      2018-01-05 05/01/2018 05 01 2018 05/01/2018                <NA>
#> 6      2018-01-07 07/01/2018 07 01 2018 07/01/2018 2018-01-08 18:20:09
#> 7      2018-01-04 04/01/2018 04 01 2018 04/01/2018 2018-01-05 09:34:39
#> 8      2018-01-04 04/01/2018 04 01 2018 04/01/2018 2018-01-05 20:40:49
#> 9      2018-01-04 04/01/2018 04 01 2018 04/01/2018 2018-01-05 04:47:37
#> 10     2018-01-03 03/01/2018 03 01 2018 03/01/2018 2018-01-04 18:33:36
#> 11     2018-01-05 05/01/2018 05 01 2018 05/01/2018 2018-01-06 20:28:07
#> 12     2018-01-06 06/01/2018 06 01 2018 06/01/2018 2018-01-07 14:16:32
#> 13     2018-01-05 05/01/2018 05 01 2018 05/01/2018 2018-01-06 01:51:22
#> 14     2018-01-03 03/01/2018 03 01 2018 03/01/2018 2018-01-04 12:51:18
#> 15     2018-01-04 04/01/2018 04 01 2018 04/01/2018 2018-01-05 09:38:50
#> 16     2018-01-09 09/01/2018 09 01 2018 09/01/2018 2018-01-10 12:57:26
#> 17     2018-01-05 05/01/2018 05 01 2018 05/01/2018 2018-01-06 19:32:14
#> 18     2018-01-06 06/01/2018 06 01 2018 06/01/2018 2018-01-07 15:02:59
#> 19     2018-01-07 07/01/2018 07 01 2018 07/01/2018 2018-01-08 02:53:43
#> 20     2018-01-07 07/01/2018 07 01 2018  confirmed 2018-01-08 11:31:40
#>      DATE.of.DISCHARGE GENDER_  Épi.Case_définition   date of admission
#> 1                 <NA>     male            probable                <NA>
#> 2                 <NA>     male            probable                <NA>
#> 3                 <NA>   female           suspected                <NA>
#> 4                 <NA>   female           suspected                <NA>
#> 5                 <NA>   female          not a case                <NA>
#> 6  2018-01-14 16:57:18   female           suspected 2018-01-08 18:20:09
#> 7  2018-01-08 05:01:34     male           confirmed 2018-01-05 09:34:39
#> 8  2018-01-08 09:42:15   female          not a case 2018-01-05 20:40:49
#> 9  2018-01-07 07:13:17     male          not a case 2018-01-05 04:47:37
#> 10 2018-01-06 10:50:14     male           confirmed 2018-01-04 18:33:36
#> 11 2018-01-08 19:21:05     male           confirmed 2018-01-06 20:28:07
#> 12 2018-01-08 16:07:12     male          not a case 2018-01-07 14:16:32
#> 13 2018-01-07 13:45:23     male           suspected 2018-01-06 01:51:22
#> 14 2018-01-08 13:20:23   female           confirmed 2018-01-04 12:51:18
#> 15 2018-01-07 01:11:50     male            probable 2018-01-05 09:38:50
#> 16 2018-01-12 10:54:32   female           confirmed 2018-01-10 12:57:26
#> 17 2018-01-09 00:18:16   female           confirmed 2018-01-06 19:32:14
#> 18 2018-01-10 23:02:40     male          not a case 2018-01-07 15:02:59
#> 19 2018-01-09 04:19:17   female          not a case 2018-01-08 02:53:43
#> 20 2018-01-13 00:47:03   female            probable 2018-01-08 11:31:40
#>      Date-of_discharge      extra
#> 1                 <NA>      fever
#> 2                 <NA>   bleeding
#> 3                 <NA>      fever
#> 4                 <NA>   bleeding
#> 5                 <NA>   bleeding
#> 6  2018-01-14 16:57:18      fever
#> 7  2018-01-08 05:01:34   bleeding
#> 8  2018-01-08 09:42:15   bleeding
#> 9  2018-01-07 07:13:17      fever
#> 10 2018-01-06 10:50:14      fever
#> 11 2018-01-08 19:21:05 05 01 2018
#> 12 2018-01-08 16:07:12 06 01 2018
#> 13 2018-01-07 13:45:23 05 01 2018
#> 14 2018-01-08 13:20:23 03 01 2018
#> 15 2018-01-07 01:11:50 04 01 2018
#> 16 2018-01-12 10:54:32 09 01 2018
#> 17 2018-01-09 00:18:16 05 01 2018
#> 18 2018-01-10 23:02:40 06 01 2018
#> 19 2018-01-09 04:19:17 07 01 2018
#> 20 2018-01-13 00:47:03 07 01 2018

str(toy_data)
#> 'data.frame': 20 obs. of 11 variables:
#>  $ Date of Onset.     : POSIXct, format: "2018-01-10" "2018-01-08" ...
#>  $ onset 2            : chr "10/01/2018" "08/01/2018" "02/01/2018" "04/01/2018" ...
#>  $ ONSET 3            : chr "10 01 2018" "08 01 2018" "02 01 2018" "04 01 2018" ...
#>  $ onset_4            : chr "male" "08/01/2018" "02/01/2018" "04/01/2018" ...
#>  $ date admission     : POSIXct, format: NA NA ...
#>  $ DATE.of.DISCHARGE  : POSIXct, format: NA NA ...
#>  $ GENDER_            : chr "male" "male" "female" "female" ...
#>  $ Épi.Case_définition: chr "probable" "probable" "suspected" "suspected" ...
#>  $ date of admission  : POSIXct, format: NA NA ...
#>  $ Date-of_discharge  : POSIXct, format: NA NA ...
#>  $ extra              : chr "fever" "bleeding" "fever" "bleeding" ...

## clean variable names, store in new object, show results
clean_data <- clean_variable_names(toy_data)
#> Warning: Some variable names were duplicated after cleaning and had suffixes attached:
#>
#> Date-of_discharge -> date_of_discharge_1

clean_data1 <- clean_dates(clean_data, first_date = "2018-01-01")
clean_data1
#>    date_of_onset    onset_2    onset_3    onset_4 date_admission
#> 1     2018-01-10 2018-01-10 2018-01-10       <NA>           <NA>
#> 2     2018-01-08 2018-01-08 2018-01-08 2018-01-08           <NA>
#> 3     2018-01-02 2018-01-02 2018-01-02 2018-01-02           <NA>
#> 4     2018-01-04 2018-01-04 2018-01-04 2018-01-04           <NA>
#> 5     2018-01-05 2018-01-05 2018-01-05 2018-01-05           <NA>
#> 6     2018-01-07 2018-01-07 2018-01-07 2018-01-07     2018-01-08
#> 7     2018-01-04 2018-01-04 2018-01-04 2018-01-04     2018-01-05
#> 8     2018-01-04 2018-01-04 2018-01-04 2018-01-04     2018-01-05
#> 9     2018-01-04 2018-01-04 2018-01-04 2018-01-04     2018-01-05
#> 10    2018-01-03 2018-01-03 2018-01-03 2018-01-03     2018-01-04
#> 11    2018-01-05 2018-01-05 2018-01-05 2018-01-05     2018-01-06
#> 12    2018-01-06 2018-01-06 2018-01-06 2018-01-06     2018-01-07
#> 13    2018-01-05 2018-01-05 2018-01-05 2018-01-05     2018-01-06
#> 14    2018-01-03 2018-01-03 2018-01-03 2018-01-03     2018-01-04
#> 15    2018-01-04 2018-01-04 2018-01-04 2018-01-04     2018-01-05
#> 16    2018-01-09 2018-01-09 2018-01-09 2018-01-09     2018-01-10
#> 17    2018-01-05 2018-01-05 2018-01-05 2018-01-05     2018-01-06
#> 18    2018-01-06 2018-01-06 2018-01-06 2018-01-06     2018-01-07
#> 19    2018-01-07 2018-01-07 2018-01-07 2018-01-07     2018-01-08
#> 20    2018-01-07 2018-01-07 2018-01-07       <NA>     2018-01-08
#>    date_of_discharge gender epi_case_definition date_of_admission
#> 1               <NA>   male            probable              <NA>
#> 2               <NA>   male            probable              <NA>
#> 3               <NA> female           suspected              <NA>
#> 4               <NA> female           suspected              <NA>
#> 5               <NA> female          not a case              <NA>
#> 6         2018-01-14 female           suspected        2018-01-08
#> 7         2018-01-08   male           confirmed        2018-01-05
#> 8         2018-01-08 female          not a case        2018-01-05
#> 9         2018-01-07   male          not a case        2018-01-05
#> 10        2018-01-06   male           confirmed        2018-01-04
#> 11        2018-01-08   male           confirmed        2018-01-06
#> 12        2018-01-08   male          not a case        2018-01-07
#> 13        2018-01-07   male           suspected        2018-01-06
#> 14        2018-01-08 female           confirmed        2018-01-04
#> 15        2018-01-07   male            probable        2018-01-05
#> 16        2018-01-12 female           confirmed        2018-01-10
#> 17        2018-01-09 female           confirmed        2018-01-06
#> 18        2018-01-10   male          not a case        2018-01-07
#> 19        2018-01-09 female          not a case        2018-01-08
#> 20        2018-01-13 female            probable        2018-01-08
#>    date_of_discharge_1      extra
#> 1                 <NA>       <NA>
#> 2                 <NA>       <NA>
#> 3                 <NA>       <NA>
#> 4                 <NA>       <NA>
#> 5                 <NA>       <NA>
#> 6           2018-01-14       <NA>
#> 7           2018-01-08       <NA>
#> 8           2018-01-08       <NA>
#> 9           2018-01-07       <NA>
#> 10          2018-01-06       <NA>
#> 11          2018-01-08 2018-01-05
#> 12          2018-01-08 2018-01-06
#> 13          2018-01-07 2018-01-05
#> 14          2018-01-08 2018-01-03
#> 15          2018-01-07 2018-01-04
#> 16          2018-01-12 2018-01-09
#> 17          2018-01-09 2018-01-05
#> 18          2018-01-10 2018-01-06
#> 19          2018-01-09 2018-01-07
#> 20          2018-01-13 2018-01-07

## Only clean the columns that have the words "date" or "admission" in them
the_date_cols <- grep("(date|admission)", names(clean_data))
the_date_cols
#> [1]  1  5  6  9 10

clean_data2 <- clean_dates(clean_data,
                           first_date = "2018-01-01",
                           force_Date = the_date_cols,
                           guess_dates = the_date_cols)
clean_data2
#>    date_of_onset    onset_2    onset_3    onset_4 date_admission
#> 1     2018-01-10 10/01/2018 10 01 2018       male           <NA>
#> 2     2018-01-08 08/01/2018 08 01 2018 08/01/2018           <NA>
#> 3     2018-01-02 02/01/2018 02 01 2018 02/01/2018           <NA>
#> 4     2018-01-04 04/01/2018 04 01 2018 04/01/2018           <NA>
#> 5     2018-01-05 05/01/2018 05 01 2018 05/01/2018           <NA>
#> 6     2018-01-07 07/01/2018 07 01 2018 07/01/2018     2018-01-08
#> 7     2018-01-04 04/01/2018 04 01 2018 04/01/2018     2018-01-05
#> 8     2018-01-04 04/01/2018 04 01 2018 04/01/2018     2018-01-05
#> 9     2018-01-04 04/01/2018 04 01 2018 04/01/2018     2018-01-05
#> 10    2018-01-03 03/01/2018 03 01 2018 03/01/2018     2018-01-04
#> 11    2018-01-05 05/01/2018 05 01 2018 05/01/2018     2018-01-06
#> 12    2018-01-06 06/01/2018 06 01 2018 06/01/2018     2018-01-07
#> 13    2018-01-05 05/01/2018 05 01 2018 05/01/2018     2018-01-06
#> 14    2018-01-03 03/01/2018 03 01 2018 03/01/2018     2018-01-04
#> 15    2018-01-04 04/01/2018 04 01 2018 04/01/2018     2018-01-05
#> 16    2018-01-09 09/01/2018 09 01 2018 09/01/2018     2018-01-10
#> 17    2018-01-05 05/01/2018 05 01 2018 05/01/2018     2018-01-06
#> 18    2018-01-06 06/01/2018 06 01 2018 06/01/2018     2018-01-07
#> 19    2018-01-07 07/01/2018 07 01 2018 07/01/2018     2018-01-08
#> 20    2018-01-07 07/01/2018 07 01 2018  confirmed     2018-01-08
#>    date_of_discharge gender epi_case_definition date_of_admission
#> 1               <NA>   male            probable              <NA>
#> 2               <NA>   male            probable              <NA>
#> 3               <NA> female           suspected              <NA>
#> 4               <NA> female           suspected              <NA>
#> 5               <NA> female          not a case              <NA>
#> 6         2018-01-14 female           suspected        2018-01-08
#> 7         2018-01-08   male           confirmed        2018-01-05
#> 8         2018-01-08 female          not a case        2018-01-05
#> 9         2018-01-07   male          not a case        2018-01-05
#> 10        2018-01-06   male           confirmed        2018-01-04
#> 11        2018-01-08   male           confirmed        2018-01-06
#> 12        2018-01-08   male          not a case        2018-01-07
#> 13        2018-01-07   male           suspected        2018-01-06
#> 14        2018-01-08 female           confirmed        2018-01-04
#> 15        2018-01-07   male            probable        2018-01-05
#> 16        2018-01-12 female           confirmed        2018-01-10
#> 17        2018-01-09 female           confirmed        2018-01-06
#> 18        2018-01-10   male          not a case        2018-01-07
#> 19        2018-01-09 female          not a case        2018-01-08
#> 20        2018-01-13 female            probable        2018-01-08
#>    date_of_discharge_1      extra
#> 1                 <NA>      fever
#> 2                 <NA>   bleeding
#> 3                 <NA>      fever
#> 4                 <NA>   bleeding
#> 5                 <NA>   bleeding
#> 6           2018-01-14      fever
#> 7           2018-01-08   bleeding
#> 8           2018-01-08   bleeding
#> 9           2018-01-07      fever
#> 10          2018-01-06      fever
#> 11          2018-01-08 05 01 2018
#> 12          2018-01-08 06 01 2018
#> 13          2018-01-07 05 01 2018
#> 14          2018-01-08 03 01 2018
#> 15          2018-01-07 04 01 2018
#> 16          2018-01-12 09 01 2018
#> 17          2018-01-09 05 01 2018
#> 18          2018-01-10 06 01 2018
#> 19          2018-01-09 07 01 2018
#> 20          2018-01-13 07 01 2018

str(clean_data2)
#> 'data.frame': 20 obs. of 11 variables:
#>  $ date_of_onset      : Date, format: "2018-01-10" "2018-01-08" ...
#>  $ onset_2            : chr "10/01/2018" "08/01/2018" "02/01/2018" "04/01/2018" ...
#>  $ onset_3            : chr "10 01 2018" "08 01 2018" "02 01 2018" "04 01 2018" ...
#>  $ onset_4            : chr "male" "08/01/2018" "02/01/2018" "04/01/2018" ...
#>  $ date_admission     : Date, format: NA NA ...
#>  $ date_of_discharge  : Date, format: NA NA ...
#>  $ gender             : chr "male" "male" "female" "female" ...
#>  $ epi_case_definition: chr "probable" "probable" "suspected" "suspected" ...
#>  $ date_of_admission  : Date, format: NA NA ...
#>  $ date_of_discharge_1: Date, format: NA NA ...
#>  $ extra              : chr "fever" "bleeding" "fever" "bleeding" ...
#>  - attr(*, "comment")= Named chr "Date of Onset." "onset 2" "ONSET 3" "onset_4" ...
#>   ..- attr(*, "names")= chr "date_of_onset" "onset_2" "onset_3" "onset_4" ...

## A more complex example: clean date and admissions, but avoid the discharge
## column, since the timestamp is important
the_date_cols <- grepl("(date|admission)", names(clean_data))
discharge <- grepl("discharge", names(clean_data))

## set names so that these are easier to track
names(the_date_cols) <- names(clean_data) -> names(discharge)

the_date_cols # columns we want
#>       date_of_onset             onset_2             onset_3             onset_4
#>                TRUE               FALSE               FALSE               FALSE
#>      date_admission   date_of_discharge              gender epi_case_definition
#>                TRUE                TRUE               FALSE               FALSE
#>   date_of_admission date_of_discharge_1               extra
#>                TRUE                TRUE               FALSE

!discharge # columns that are not the discharge columns ("!" means "not")
#>       date_of_onset             onset_2             onset_3             onset_4
#>                TRUE                TRUE                TRUE                TRUE
#>      date_admission   date_of_discharge              gender epi_case_definition
#>                TRUE               FALSE                TRUE                TRUE
#>   date_of_admission date_of_discharge_1               extra
#>                TRUE               FALSE                TRUE

to_keep <- the_date_cols & !discharge # removing the discharge column
clean_data3 <- clean_dates(clean_data,
                           first_date = "2018-01-01",
                           force_Date = to_keep,
                           guess_dates = to_keep)
clean_data3
#>    date_of_onset    onset_2    onset_3    onset_4 date_admission
#> 1     2018-01-10 10/01/2018 10 01 2018       male           <NA>
#> 2     2018-01-08 08/01/2018 08 01 2018 08/01/2018           <NA>
#> 3     2018-01-02 02/01/2018 02 01 2018 02/01/2018           <NA>
#> 4     2018-01-04 04/01/2018 04 01 2018 04/01/2018           <NA>
#> 5     2018-01-05 05/01/2018 05 01 2018 05/01/2018           <NA>
#> 6     2018-01-07 07/01/2018 07 01 2018 07/01/2018     2018-01-08
#> 7     2018-01-04 04/01/2018 04 01 2018 04/01/2018     2018-01-05
#> 8     2018-01-04 04/01/2018 04 01 2018 04/01/2018     2018-01-05
#> 9     2018-01-04 04/01/2018 04 01 2018 04/01/2018     2018-01-05
#> 10    2018-01-03 03/01/2018 03 01 2018 03/01/2018     2018-01-04
#> 11    2018-01-05 05/01/2018 05 01 2018 05/01/2018     2018-01-06
#> 12    2018-01-06 06/01/2018 06 01 2018 06/01/2018     2018-01-07
#> 13    2018-01-05 05/01/2018 05 01 2018 05/01/2018     2018-01-06
#> 14    2018-01-03 03/01/2018 03 01 2018 03/01/2018     2018-01-04
#> 15    2018-01-04 04/01/2018 04 01 2018 04/01/2018     2018-01-05
#> 16    2018-01-09 09/01/2018 09 01 2018 09/01/2018     2018-01-10
#> 17    2018-01-05 05/01/2018 05 01 2018 05/01/2018     2018-01-06
#> 18    2018-01-06 06/01/2018 06 01 2018 06/01/2018     2018-01-07
#> 19    2018-01-07 07/01/2018 07 01 2018 07/01/2018     2018-01-08
#> 20    2018-01-07 07/01/2018 07 01 2018  confirmed     2018-01-08
#>      date_of_discharge gender epi_case_definition date_of_admission
#> 1                 <NA>   male            probable              <NA>
#> 2                 <NA>   male            probable              <NA>
#> 3                 <NA> female           suspected              <NA>
#> 4                 <NA> female           suspected              <NA>
#> 5                 <NA> female          not a case              <NA>
#> 6  2018-01-14 16:57:18 female           suspected        2018-01-08
#> 7  2018-01-08 05:01:34   male           confirmed        2018-01-05
#> 8  2018-01-08 09:42:15 female          not a case        2018-01-05
#> 9  2018-01-07 07:13:17   male          not a case        2018-01-05
#> 10 2018-01-06 10:50:14   male           confirmed        2018-01-04
#> 11 2018-01-08 19:21:05   male           confirmed        2018-01-06
#> 12 2018-01-08 16:07:12   male          not a case        2018-01-07
#> 13 2018-01-07 13:45:23   male           suspected        2018-01-06
#> 14 2018-01-08 13:20:23 female           confirmed        2018-01-04
#> 15 2018-01-07 01:11:50   male            probable        2018-01-05
#> 16 2018-01-12 10:54:32 female           confirmed        2018-01-10
#> 17 2018-01-09 00:18:16 female           confirmed        2018-01-06
#> 18 2018-01-10 23:02:40   male          not a case        2018-01-07
#> 19 2018-01-09 04:19:17 female          not a case        2018-01-08
#> 20 2018-01-13 00:47:03 female            probable        2018-01-08
#>    date_of_discharge_1      extra
#> 1                 <NA>      fever
#> 2                 <NA>   bleeding
#> 3                 <NA>      fever
#> 4                 <NA>   bleeding
#> 5                 <NA>   bleeding
#> 6  2018-01-14 16:57:18      fever
#> 7  2018-01-08 05:01:34   bleeding
#> 8  2018-01-08 09:42:15   bleeding
#> 9  2018-01-07 07:13:17      fever
#> 10 2018-01-06 10:50:14      fever
#> 11 2018-01-08 19:21:05 05 01 2018
#> 12 2018-01-08 16:07:12 06 01 2018
#> 13 2018-01-07 13:45:23 05 01 2018
#> 14 2018-01-08 13:20:23 03 01 2018
#> 15 2018-01-07 01:11:50 04 01 2018
#> 16 2018-01-12 10:54:32 09 01 2018
#> 17 2018-01-09 00:18:16 05 01 2018
#> 18 2018-01-10 23:02:40 06 01 2018
#> 19 2018-01-09 04:19:17 07 01 2018
#> 20 2018-01-13 00:47:03 07 01 2018

str(clean_data3)
#> 'data.frame': 20 obs. of 11 variables:
#>  $ date_of_onset      : Date, format: "2018-01-10" "2018-01-08" ...
#>  $ onset_2            : chr "10/01/2018" "08/01/2018" "02/01/2018" "04/01/2018" ...
#>  $ onset_3            : chr "10 01 2018" "08 01 2018" "02 01 2018" "04 01 2018" ...
#>  $ onset_4            : chr "male" "08/01/2018" "02/01/2018" "04/01/2018" ...
#>  $ date_admission     : Date, format: NA NA ...
#>  $ date_of_discharge  : POSIXct, format: NA NA ...
#>  $ gender             : chr "male" "male" "female" "female" ...
#>  $ epi_case_definition: chr "probable" "probable" "suspected" "suspected" ...
#>  $ date_of_admission  : Date, format: NA NA ...
#>  $ date_of_discharge_1: POSIXct, format: NA NA ...
#>  $ extra              : chr "fever" "bleeding" "fever" "bleeding" ...
#>  - attr(*, "comment")= Named chr "Date of Onset." "onset 2" "ONSET 3" "onset_4" ...
#>   ..- attr(*, "names")= chr "date_of_onset" "onset_2" "onset_3" "onset_4" ...
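
The examples use str() to check which columns ended up as Date and which kept their POSIXct timestamps. A more compact check in base R might look like the sketch below. The data frame `df` is a small hypothetical stand-in, not the toy data from the examples, and nothing here calls the package itself.

```r
# Hypothetical stand-in for a cleaned data frame: one Date column, one
# POSIXct column whose timestamp was deliberately kept, one character column.
df <- data.frame(
  onset     = as.Date(c("2018-01-10", "2018-01-08")),
  discharge = as.POSIXct(c("2018-01-14 16:57:18", "2018-01-08 05:01:34"),
                         tz = "UTC"),
  gender    = c("male", "female"),
  stringsAsFactors = FALSE
)

# One logical flag per column, named after the columns
is_date     <- vapply(df, inherits, logical(1), what = "Date")
is_datetime <- vapply(df, inherits, logical(1), what = "POSIXct")

date_cols     <- names(df)[is_date]      # columns stored as Date
datetime_cols <- names(df)[is_datetime]  # columns that still carry a time
```

The resulting logical vectors have the same shape as the `to_keep` selector used in the last example, so they can be combined with `&` and `!` in the same way.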