This function is a generic with methods for factor and character objects. It lists all unique values in the input, ranks them from most to least frequent, and keeps the top n values. All other values are replaced by the chosen replacement. Optionally, the user can specify a subset of the input data to define dominant values. Under the hood, this uses forcats::fct_lump() and forcats::fct_recode().
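The description above can be illustrated with a minimal base-R sketch. The helper `keep_top_sketch()` is hypothetical and is not part of the package; it does not use forcats, emits no tie warnings, and simply breaks ties by order of first appearance. It is only meant to show the "rank by frequency, keep the top n, replace the rest" idea, including the optional subset.

```r
# Hypothetical helper (an assumption for illustration, not the package code)
keep_top_sketch <- function(x, n, replacement = "other", subset = NULL) {
  # Values used to decide which ones are dominant (all of x by default)
  reference <- if (is.null(subset)) x else x[subset]
  # Count each unique value, with levels in order of first appearance
  counts <- table(factor(reference, levels = unique(reference)))
  # order() leaves ties in their original order, so the first-seen value wins
  ranked <- names(counts)[order(-as.integer(counts))]
  keep <- head(ranked, n)
  # Replace every value that is not among the n most frequent
  out <- as.character(x)
  out[!out %in% keep] <- replacement
  out
}

x_demo <- c("a", "a", "a", "b", "b", "c")
res_all <- keep_top_sketch(x_demo, n = 1)
res_sub <- keep_top_sketch(x_demo, n = 1, subset = 4:6)
print(res_all)
print(res_sub)
```

With `subset = 4:6`, only the last three elements are counted, so "b" becomes the dominant value even though "a" is the most frequent value overall.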
top_values(x, n, ...)

# S3 method for default
top_values(x, n, ...)

# S3 method for factor
top_values(
  x,
  n,
  replacement = "other",
  subset = NULL,
  ties_method = "first",
  ...
)

# S3 method for character
top_values(
  x,
  n,
  replacement = "other",
  subset = NULL,
  ties_method = "first",
  ...
)
Argument | Description |
---|---|
x | a |
n | the number of levels or values to keep |
... | further arguments passed to |
replacement | a single value to replace the less frequent values with |
subset | a |
ties_method | how to deal with ties when ranking factor levels, which is passed on to |
This function is an opinionated wrapper around forcats::fct_lump() with the following changes:
- characters are not auto-converted to factor
- the ties method defaults to "first" instead of "min"
- if n = nlevels(x) - 1, then the one remaining level is still converted to the value of replacement (forcats will assume you didn't want to convert that level)
- it is possible to use NA as the replacement
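To see why the choice between "first" and "min" matters, here is a small sketch using base R's `rank()`, whose `ties.method` argument accepts "first", "last", and "min" among others. Using `rank()` here is an assumption made purely for illustration; the sketch does not claim to reproduce how `top_values()` ranks values internally. The counts are those of the vector `c("a", "b", "a", "b", "c")`, where "a" and "b" are tied.

```r
# Frequencies with a tie between a and b
counts <- c(a = 2, b = 2, c = 1)

# Rank from most to least frequent under three tie-breaking rules
rank_first <- rank(-counts, ties.method = "first")  # a = 1, b = 2, c = 3
rank_last  <- rank(-counts, ties.method = "last")   # a = 2, b = 1, c = 3
rank_min   <- rank(-counts, ties.method = "min")    # a = 1, b = 1, c = 3

# Values that survive when only the top-ranked one is kept (n = 1)
kept_first <- names(counts)[rank_first <= 1]
kept_last  <- names(counts)[rank_last <= 1]
kept_min   <- names(counts)[rank_min <= 1]
print(kept_first)
print(kept_last)
print(kept_min)
```

Under "first" and "last" exactly one value is kept, whereas under "min" both tied values share rank 1 and both pass the cutoff.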
#> x
#>  g  i  a  d  h  e  f  b  c  j
#> 15 14 11 11 10  9  9  7  7  7

## keep top values
top_values(x, 2) # top 2
#>   [1] "other" "other" "other" "other" "i"     "other" "other" "other" "other"
#>  [10] "g"     "other" "other" "i"     "g"     "other" "other" "i"     "i"
#>  [19] "other" "other" "i"     "g"     "other" "other" "other" "other" "other"
#>  [28] "other" "other" "other" "other" "other" "other" "other" "g"     "other"
#>  [37] "i"     "i"     "other" "g"     "i"     "other" "i"     "other" "other"
#>  [46] "other" "g"     "other" "other" "other" "g"     "other" "other" "other"
#>  [55] "other" "other" "g"     "other" "other" "other" "other" "i"     "g"
#>  [64] "other" "g"     "i"     "g"     "other" "other" "g"     "other" "other"
#>  [73] "other" "i"     "other" "other" "g"     "i"     "other" "g"     "other"
#>  [82] "other" "other" "other" "other" "other" "other" "g"     "other" "i"
#>  [91] "other" "other" "other" "other" "other" "other" "other" "other" "other"
#> [100] "other"

top_values(x, 2, NA) # top 2, replace with NA
#>  [1] NA  NA  NA  NA  "i" NA  NA  NA  NA  "g" NA  NA  "i" "g" NA  NA  "i" "i"
#> [19] NA  NA  "i" "g" NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  "g" NA
#> [37] "i" "i" NA  "g" "i" NA  "i" NA  NA  NA  "g" NA  NA  NA  "g" NA  NA  NA
#> [55] NA  NA  "g" NA  NA  NA  NA  "i" "g" NA  "g" "i" "g" NA  NA  "g" NA  NA
#> [73] NA  "i" NA  NA  "g" "i" NA  "g" NA  NA  NA  NA  NA  NA  NA  "g" NA  "i"
#> [91] NA  NA  NA  NA  NA  NA  NA  NA  NA  NA

top_values(x, 0) # extreme case, keep nothing
#>   [1] "other" "other" "other" "other" "other" "other" "other" "other" "other"
#>  [10] "other" "other" "other" "other" "other" "other" "other" "other" "other"
#>  [19] "other" "other" "other" "other" "other" "other" "other" "other" "other"
#>  [28] "other" "other" "other" "other" "other" "other" "other" "other" "other"
#>  [37] "other" "other" "other" "other" "other" "other" "other" "other" "other"
#>  [46] "other" "other" "other" "other" "other" "other" "other" "other" "other"
#>  [55] "other" "other" "other" "other" "other" "other" "other" "other" "other"
#>  [64] "other" "other" "other" "other" "other" "other" "other" "other" "other"
#>  [73] "other" "other" "other" "other" "other" "other" "other" "other" "other"
#>  [82] "other" "other" "other" "other" "other" "other" "other" "other" "other"
#>  [91] "other" "other" "other" "other" "other" "other" "other" "other" "other"
#> [100] "other"

## dealing with ties
x <- c("a", "b", "a", "b", "c")

## in the case of a tie (a, b), the first value is ranked higher than the
## others
top_values(x, n = 1)
#> Warning: a tie among values (a, b) was broken by choosing the first value
#> [1] "a"     "other" "a"     "other" "other"

## here, the ties are ranked in reverse order, so b comes before a
top_values(x, n = 1, ties_method = "last")
#> Warning: a tie among values (a, b) was broken by choosing the last value
#> [1] "other" "b"     "other" "b"     "other"

## top_values differs from forcats::fct_lump in that if n is one less than
## the number of unique values, it will still force the last value to be
## "other"
forcats::fct_lump(x, n = 2)
#> [1] a b a b c
#> Levels: a b c
top_values(x, n = 2)
#> [1] "a"     "b"     "a"     "b"     "other"

## If there is a tie for the last level, then it will drop the level
## depending on the ties_method
# replace "d" with other
top_values(c(x, "d"), n = 3)
#> Warning: a tie among values (c, d) was broken by choosing the first value
#> [1] "a"     "b"     "a"     "b"     "c"     "other"
# replace "c" with other
top_values(c(x, "d"), n = 3, ties_method = "last")
#> Warning: a tie among values (c, d) was broken by choosing the last value
#> [1] "a"     "b"     "a"     "b"     "other" "d"

## using subset to define the dominant values
x <- c("a", "a", "a", "b", "b", "c")
x
#> [1] "a" "a" "a" "b" "b" "c"
top_values(x, n = 1, subset = 4:6)
#> [1] "other" "other" "other" "b"     "b"     "other"
top_values(x, n = 2, subset = 4:6)
#> [1] "other" "other" "other" "b"     "b"     "c"
top_values(x, n = 1, subset = -1)
#> Warning: a tie among values (a, b) was broken by choosing the first value
#> [1] "a"     "a"     "a"     "other" "other" "other"
top_values(x, n = 1, subset = -1, ties_method = "last")
#> Warning: a tie among values (a, b) was broken by choosing the last value
#> [1] "other" "other" "other" "b"     "b"     "other"