Problem
I think the package will be incomplete until we find a way to express groups of characters. Here's a challenge to express email pattern matching in `rx`:
Challenges
First of all, I don't know of a way to express a single "word" character (alnum + `_`). We used `rx_word` to denote `\\w+`, and perhaps it should have been `rx_word_char() %>% rx_one_or_more()`.
rx_char <- function(.data = NULL, value = NULL) {
  if (missing(value))
    return(paste0(.data, "\\w"))
  paste0(.data, sanitize(value))
}
I also extended `rx_count` to accept a range of counts:
rx_count <- function(.data = NULL, n = 1) {
  if (length(n) > 1) {
    n[is.na(n)] <- ""
    return(paste0(.data, "{", n[1], ",", n[length(n)], "}"))
  }
  paste0(.data, "{", n, "}")
}
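As a quick check of the range handling, the snippet below repeats the definition above and calls it with a single count, a closed range, and an open-ended range. Passing `NA` as a missing upper bound is my reading of the `is.na()` branch, not a documented feature.

```r
# Same definition as above, repeated so the snippet runs on its own.
rx_count <- function(.data = NULL, n = 1) {
  if (length(n) > 1) {
    n[is.na(n)] <- ""   # an NA bound becomes an empty string
    return(paste0(.data, "{", n[1], ",", n[length(n)], "}"))
  }
  paste0(.data, "{", n, "}")
}

exact  <- rx_count("a", 3)          # "a{3}"
closed <- rx_count("a", 2:6)        # "a{2,6}"
open   <- rx_count("a", c(2, NA))   # "a{2,}"
```

Only the first and last elements of `n` are used, so `2:6` and `c(2, 6)` give the same quantifier.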
Also, we don't have a way to express word boundaries (`\\b`), and it might be useful to denote them. We could call this function `rx_word_edge`; in the code below it is defined as `rx_word_start`, with `rx_word_end` as an alias:
rx_word_start <- function(.data = NULL) {
  paste0(.data, "\\b")
}
rx_word_end <- rx_word_start
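To illustrate why one function can serve both ends, here is a small sketch using the two definitions above. Since `\\b` matches at either edge of a word, the start and end helpers emit the same token. The word `cat` and the test strings are only illustrative.

```r
# Definitions repeated from above so the snippet is self-contained.
rx_word_start <- function(.data = NULL) paste0(.data, "\\b")
rx_word_end <- rx_word_start

# Build \bcat\b without pipes: boundary, literal text, boundary.
pattern <- rx_word_end(paste0(rx_word_start(), "cat"))

# Only the standalone word "cat" should match.
hits <- grepl(pattern, c("cat", "a cat sat", "concat", "cats"), perl = TRUE)
# hits is TRUE, TRUE, FALSE, FALSE
```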
Finally, our biggest problem is that there's no way to express groups of characters other than through `rx_any_of()`. But if we pass other `rx` expressions to it, the values will be sanitized twice, meaning that we will get four backslashes before each symbol instead of two.
# this function is exactly like rx_any_of() but without sanitization
rx_group <- function(.data = NULL, value) {
  paste0(.data, "[", value, "]")
}
Solution
Here's what it looks like when we put all pieces together:
x <- rx_word_start() %>%
  rx_group(
    rx() %>%
      rx_char() %>%
      rx_char(".%+-")
  ) %>%
  rx_one_or_more() %>%
  rx_char("@") %>%
  rx_group(
    rx() %>%
      rx_char() %>%
      rx_char(".-")
  ) %>%
  rx_one_or_more() %>%
  rx_char(".") %>%
  rx_alpha() %>%
  rx_count(2:6) %>%
  rx_word_end()
x
#> [1] "\\b[\\w\\.%\\+-]+@[\\w\\.-]+\\.[[:alpha:]]{2,6}\\b"
txt <- "This text contains email [email protected] and [email protected]. The latter is no longer valid."
regmatches(txt, gregexpr(x, txt, perl = TRUE))
#> [[1]]
#> [1] "[email protected]" "[email protected]"
stringr::str_extract_all(txt, x)
#> [[1]]
#> [1] "[email protected]" "[email protected]"
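For a check that does not depend on the package, the resulting pattern can be tried in base R directly. This is a sketch: the pattern string is copied from the printed output above, and the addresses are hypothetical placeholders, not the ones from the original text.

```r
# Pattern copied verbatim from the printed value of x above.
x <- "\\b[\\w\\.%\\+-]+@[\\w\\.-]+\\.[[:alpha:]]{2,6}\\b"

# Hypothetical sample text; the last address has no top-level domain.
txt <- "Write to jane.doe@example.com or admin@mail.example.org, not to jane@localhost."

found <- regmatches(txt, gregexpr(x, txt, perl = TRUE))[[1]]
# found holds "jane.doe@example.com" and "admin@mail.example.org"
```

`perl = TRUE` is needed here because `\\w` appears inside a bracket expression.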
The code works but I don't like it.
- The constructor `rx()` looks redundant (I believe there's a way to get rid of it entirely using a specialized class, see below).
- It is not very clear what `rx_one_or_more()` is referring to. I wonder if all functions should have a `rep` argument with default option `one` and options `some`/`any`, in addition to what `rx_count` does today.
- Should `rx_char()` without arguments be called `rx_wordchar`?
- Should `rx_char()` with arguments be called `rx_literal()` or `rx_plain`?
- We should be very explicit about sanitization of arguments. To the extent that we should just mention: "input will be sanitized".
- `rx_group` is an artificial construct, a duplicate of `rx_any_of` but without sanitization. Here I see a couple of solutions.
a. Allow "nested pipes" (as I have done above). Create an S3 class and use it to detect when the `value` argument is not a plain character but an `rx_string`. Input of this class does not need to be sanitized, because it has been sanitized at creation.
b. Do not allow "nested pipes". Instead define `rx_any_of()` to have `...` and allow multiple arguments mixing functions and characters. Then a hypothetical pipe would look like this:
rx_word_edge() %>%
  rx_any_of(rx_wordchar(), ".%+-", rep = "some") %>%
  rx_literal("@") %>%
  rx_any_of(rx_wordchar(), ".-", rep = "some") %>%
  rx_literal(".") %>%
  rx_alpha(rep = 2:6) %>%
  rx_word_edge()
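The two options can be combined, and the sketch below shows roughly how. It is not the package's implementation: `rx_string()`, `sanitize_sketch()` and `rx_any_of_sketch()` are hypothetical names, and `sanitize_sketch()` is only a simplified stand-in for the real `sanitize()`. The `rep` argument is left out to keep it short.

```r
# Hypothetical constructor: tags a string as an already-sanitized fragment.
rx_string <- function(x) structure(x, class = c("rx_string", "character"))

# Simplified stand-in for sanitize(): escapes a few regex metacharacters.
sanitize_sketch <- function(x) {
  gsub("([.+*?|(){}^$\\[\\]\\\\])", "\\\\\\1", x, perl = TRUE)
}

# Hypothetical word-character function that returns a tagged fragment.
rx_wordchar <- function(.data = NULL) rx_string(paste0(.data, "\\w"))

# Variadic character class: plain strings are sanitized once,
# rx_string inputs are trusted and pasted in unchanged.
rx_any_of_sketch <- function(.data = NULL, ...) {
  parts <- lapply(list(...), function(p) {
    if (inherits(p, "rx_string")) unclass(p) else sanitize_sketch(p)
  })
  rx_string(paste0(.data, "[", paste0(unlist(parts), collapse = ""), "]"))
}

# .data is passed explicitly because no pipe is used here.
cls <- rx_any_of_sketch(NULL, rx_wordchar(), ".%+-")
unclass(cls)      # "[\\w\\.%\\+-]"

# Without the class tag, "\\w" is treated as raw text and escaped again.
twice <- rx_any_of_sketch(NULL, "\\w")
unclass(twice)    # "[\\\\w]"
```

The class check is what removes the need for a separate `rx_group()`: the same function handles both raw characters and fragments built by other `rx_*` calls.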
It's a lot to digest, but somehow everything relates to one particular problem. Happy to split this up once we identify the issues worth tackling.