I was using dplyr::filter() to subset a data.frame and was having trouble getting the result I wanted. I had a data.frame along the lines of:
library(dplyr)
df <- tibble::tribble(
  ~x, ~y, ~z,
   1,  2,  3,
   4,  5,  6,
  NA,  8,  9
)
I wanted to drop the rows where x equals 4:
df %>%
  filter(x != 4)
## # A tibble: 1 x 3
## x y z
## <dbl> <dbl> <dbl>
## 1 1 2 3
But in this case the third row, which has an NA value for x, was also dropped.
The documentation for dplyr::filter() explains that:
To be retained, the row must produce a value of TRUE for all conditions. Note
that when a condition evaluates to NA the row will be dropped, unlike base
subsetting with [.
So in my case the last row of df was not kept because the condition evaluates to NA rather than TRUE:
NA != 4
## [1] NA
To get the result I wanted, I needed a condition that evaluates to TRUE for the NA row, along the lines of:
!(NA %in% 4)
## [1] TRUE
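A small sketch of why this works: %in% is defined in terms of match(), which reports "no match" instead of propagating NA, so the result is always TRUE or FALSE. The vector x below is only an illustration that mirrors the x column of df.

```r
# Illustrative vector mirroring the x column of df
x <- c(1, 4, NA)

# A comparison propagates the missing value
neq <- x != 4                             # TRUE FALSE NA

# %in% treats NA as "not found in the set" and returns FALSE for it
isin <- x %in% 4                          # FALSE TRUE FALSE

# This is how %in% is defined in base R
via_match <- match(x, 4, nomatch = 0L) > 0L

# Negating gives a condition that is TRUE for the NA element
keep <- !(x %in% 4)                       # TRUE FALSE TRUE
```

One caveat: because match() can match NA to NA, x %in% NA is TRUE for the missing element, so this approach only keeps NA rows when NA is not itself in the set being excluded.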
And now my call to dplyr::filter() returns the result I was looking for:
df %>%
  filter(!(x %in% 4))
## # A tibble: 2 x 3
## x y z
## <dbl> <dbl> <dbl>
## 1 1 2 3
## 2 NA 8 9
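An alternative that states the intent more explicitly is to test for missing values directly. This is only a sketch of another option, not something the dplyr documentation prescribes; the name kept is mine.

```r
library(dplyr)

df <- tibble::tribble(
  ~x, ~y, ~z,
   1,  2,  3,
   4,  5,  6,
  NA,  8,  9
)

# Keep a row when x is missing OR x differs from 4.
# For the third row the condition is TRUE | NA, which evaluates to TRUE,
# so filter() retains it.
kept <- df %>%
  filter(is.na(x) | x != 4)
```

This is a little more verbose than the %in% version, but it makes the handling of NA visible to whoever reads the code later.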
The documentation made me curious about subsetting in base R. This also produced results that were initially counterintuitive to me:
df[df$x != 4, ]
## # A tibble: 2 x 3
## x y z
## <dbl> <dbl> <dbl>
## 1 1 2 3
## 2 NA NA NA
A row was returned in place of the third row, but it was not the original row: the values for y and z were set to NA as well. The documentation for [ notes that:
rows containing an NA produce an NA in the result
But logic similar to what I used in dplyr::filter() works here as well:
df[!(df$x %in% 4), ]
## # A tibble: 2 x 3
## x y z
## <dbl> <dbl> <dbl>
## 1 1 2 3
## 2 NA 8 9
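For comparison, here is a sketch of a few other base R idioms and how they treat the NA row. I use a plain data.frame here instead of the tibble so the example needs no packages; the object names are just for illustration.

```r
# Same data as above, as a base data.frame
df <- data.frame(x = c(1, 4, NA), y = c(2, 5, 8), z = c(3, 6, 9))

# which() returns only the positions that are TRUE, so the NA is discarded
# and the third row is dropped (the same behaviour as dplyr::filter())
dropped_na <- df[which(df$x != 4), ]

# subset() also treats an NA condition as FALSE and drops the row
also_dropped <- subset(df, x != 4)

# Testing for NA explicitly keeps the third row with its original values
kept_na <- df[is.na(df$x) | df$x != 4, ]
```

Which of these is appropriate depends on whether rows with a missing x should count as "not equal to 4" or be excluded altogether.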