verbalexpressions / rverbalexpressions

:speech_balloon: Create regular expressions easily

Home Page: https://rverbalexpressions.netlify.com/

License: Other

R 100.00%

r regex verbal-expressions regular-expressions

rverbalexpressions's Introduction

RVerbalExpressions

The goal of RVerbalExpressions is to make it easier to construct regular expressions using grammar and functionality inspired by VerbalExpressions. Usage of %>% is encouraged to build expressions in a chain-like fashion.

Installation

Install the released version of RVerbalExpressions from CRAN:

install.packages("RVerbalExpressions")

Or install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("VerbalExpressions/RVerbalExpressions")

Example

This is a basic example which shows you how to build a regular expression:

library(RVerbalExpressions)

# construct an expression
x <- rx_start_of_line() %>% 
  rx_find('http') %>% 
  rx_maybe('s') %>% 
  rx_find('://') %>% 
  rx_maybe('www.') %>% 
  rx_anything_but(' ') %>% 
  rx_end_of_line()

# print the expression
x
#> [1] "^(http)(s)?(\\://)(www\\.)?([^ ]*)$"

# test for a match
grepl(x, "https://www.google.com")
#> [1] TRUE
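
Because the printed expression is an ordinary character string, it can be passed to any base R regex function. A minimal sketch, reusing the pattern printed above as a literal so that it runs without the package; the extra test URLs are made up for illustration:

```r
# Pattern copied from the printed output above
x <- "^(http)(s)?(\\://)(www\\.)?([^ ]*)$"

# Hypothetical inputs: two that should match, two that should not
urls <- c(
  "https://www.google.com",
  "http://example.com",
  "ftp://example.com",
  "https://example.com/a b"
)

# grepl() is vectorized over the strings being tested
matches <- grepl(x, urls)
urls[matches]
#> [1] "https://www.google.com" "http://example.com"
```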

Other Implementations

You can see an up-to-date list of all ports on VerbalExpressions.github.io.

Additionally, there are two R packages that try to solve the same problem. I encourage you to check these out:

rex by @kevinushey
rebus by @richierocks

Contributing

If you find any issues, typos, etc., please file an issue or submit a PR. All contributions are welcome!

rverbalexpressions's Issues

Organizing package documentation

I have been thinking how to organize package documentation. We basically have a few "groups" of functions that may make sense to be introduced together (at least in pkgdown):

Single-character functions

These are functions that return one character and do not require any "wrappers"

rx_alpha_num
rx_br and rx_line_break
rx_digit
rx_something
rx_space
rx_tab
rx_whitespace
rx_word_char and rx_word (with default rep="some" argument).

Character "sets"

These functions output ranges or "sets" of characters wrapped in [ ], for which we have no single-character expression. This matters when nesting them into the supersets below, where the outer [ ] needs to be "peeled off". From the user's standpoint they may be no different from single-character functions.

rx_alphanum
rx_alpha
rx_lower and rx_upper
rx_punctuation
rx_range

"Appenders"

These functions take .data argument and simply append something to it, thus modifying the behavior of previously appended function(s).

rx_capture_groups
rx_count
rx_end_of_line and rx_start_of_line
rx_one_or_more and rx_none_or_more
rx_with_any_case

"Expression-wrappers"

These functions let the user specify a sequence of characters, all of which must be matched in the string.

rx_avoid and rx_seek
rx_find (and rx_literal, which I now dropped)
rx_maybe (which is rx_find with rep argument set to "maybe")
rx_or (which might need a bit of extra work, see #16 and thus will be out of this category)

"Superset functions"

These functions specify a list of mutually exclusive symbols/expressions, only one of which should be matched to the string.

rx_one_of
rx_anything_but and rx_something_but
rx_either_of (eventually) will be moved here as well, if we decide to keep it.

I find this grouping helpful when reasoning about the functionality our package covers.

There are a few functions I dropped:
rx_any_of (duplicate of rx_one_of)
rx_digits (too little advantage compared to rx_digit(rep=n))
rx_literal (duplicate of rx_find)
rx_not (duplicate of rx_avoid_suffix)
rx_new has been moved to utils.R

Integration tests

We need more end-to-end examples: emails, URLs, SSNs, license plates, etc. Let's collect links to those here and implement them later.
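
One possible shape for such end-to-end checks, sketched in base R so that it runs without the package: each case pairs a pattern with strings that should and should not match. The SSN-style pattern is written by hand as a stand-in for an expression that would eventually be built with rx_ functions; the helper name check_pattern and the sample strings are hypothetical.

```r
# Hypothetical helper: TRUE when every `good` string matches and no `bad` string does
check_pattern <- function(pattern, good, bad) {
  all(grepl(pattern, good, perl = TRUE)) &&
    !any(grepl(pattern, bad, perl = TRUE))
}

# Hand-written stand-in for an rx-built expression:
# three digits, two digits, four digits, separated by hyphens
ssn <- "^\\d{3}-\\d{2}-\\d{4}$"

ssn_ok <- check_pattern(
  ssn,
  good = c("123-45-6789", "000-00-0000"),
  bad  = c("123456789", "12-345-6789", "123-45-678a")
)
ssn_ok
#> [1] TRUE
```

Keeping the good and bad examples next to each pattern would make it easy to move the same cases into a test framework later.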

Consider crayon dependency for fancy output

When expressions get large, it might be nice to visualize them more clearly. With crayon we could add a function rx_pretty() that would adjust the text style, text color, and background color of either the match characters (default) or special characters.

# highlights match or special characters
rx_pretty <- function(.data, txt_style = "bold", txt_col = "black", 
                      bg_col = "white", inverse = FALSE) {
  esc <- c(".", "|", "*", "?", "+", "(", ")", "{", "}", "^", "$", "\\", ":", "=", "[", "]")
  values <- strsplit(.data, "")[[1]]
  
  if(inverse) {
    esc_idx <- which(values %in% esc)
  } else {
    esc_idx <- which(!values %in% esc)
  }
  
  txt <- crayon::make_style(txt_col)
  bg <- crayon::make_style(bg_col, bg = TRUE)
  fancy <- crayon::combine_styles(txt, bg)
  
  cat(replace(values, esc_idx, crayon::style(fancy(values[esc_idx]), as = txt_style)), sep = "")
}

f <- rx() %>% 
  rx_start_of_line() %>% 
  rx_find('http') %>% 
  rx_maybe('s') %>% 
  rx_find('://') %>% 
  rx_maybe('www.') %>% 
  rx_anything_but(' ') %>% 
  rx_end_of_line()

Default behavior:

f %>% 
  rx_pretty()

Inverse behavior:

f %>% 
  rx_pretty(inverse = TRUE)

Custom behavior:

f %>% 
  rx_pretty(
    txt_style = "italic", 
    txt_col = "white", 
    bg_col = "blue"
    )

Is it possible to use strans to learn RVerbalExpressions from examples?

https://github.com/Inventitech/strans

This is a handy command-line tool I stumbled upon while browsing AppImages. You can give it a few examples, and it then uses a technique from the Microsoft PROSE framework to automatically infer regex rules. The example below extracts file formats, but it can do a lot more than that.

Wouldn't it be cool if there was an R package using this to autogenerate human-readable R regex code?

ls | strans -before Viper_Browser-50-x86_64.AppImage -after AppImage --describe

let columnName = "0" in let x = ChooseInput(vs, columnName) in SubStr(x, PosPair(RegexPositionRelative(x, RegexPair("Dot", "ε"), 1), RegexPositionRelative(x, RegexPair("ε", "Line Separator"), -1)))

Python application:
https://docs.microsoft.com/en-us/python/api/overview/azure/prose/intro?view=prose-py-latest

Method dispatch for `rx_string`

Basically boils down to detecting that .data is not of rx_string class and acting as though first argument is the value argument.
Reference: http://adv-r.had.co.nz/S3.html

UPDATED: sanitize now has method dispatch, so we can simply write

## unexported function for sanitizing arguments
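sanitize_args <- function(...) {
  # no extra arguments supplied: nothing to append
  if (missing(...)) return(NULL)
  # sanitize each argument, then glue the pieces into one string
  res <- sapply(list(...), sanitize)
  Reduce(paste0, res)
}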

is.rx_string <- function(x){
  inherits(x, "rx_string")
}

# class constructor - also unexported function. 
rx <- function(x){
  if(is.rx_string(x)) return(x)
  class(x) <- c("rx_string", class(x)) 
  x
}

rx_literal <- function(.data, ...) {
  UseMethod("rx_literal", .data)
}

rx_literal.character <- function(.data, ...){
  res <- paste0(sanitize(.data), sanitize_args(...))
  rx(res)
}

rx_literal.rx_string <- function(.data, ...) {
  res <- paste0(.data, sanitize_args(...))
  rx(res)
}

Now you don't need a constructor. The function works both in a chain and standalone:

rx_literal("?@")
#> [1] "\\?@"
#> attr(,"class")
#> [1] "rx_string" "character"

rx_literal("?") %>% rx_literal("@")
#> [1] "\\?@"
#> attr(,"class")
#> [1] "rx_string" "character"

Hadley says we should also implement a few essential methods. We should rethink all of our functions with vectorization in mind.

When implementing a vector class, you should implement these methods: length, [, [<-, [[, [[<-, c. (If [ is implemented rev, head, and tail should all work).
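
A minimal sketch of what two of those methods might look like for rx_string, assuming a constructor along the lines of the one proposed above. The method bodies are illustrative and are not the package's implementation.

```r
# Constructor along the lines of the one proposed above
rx <- function(x) {
  if (inherits(x, "rx_string")) return(x)
  class(x) <- c("rx_string", class(x))
  x
}

# Subsetting: delegate to the default method, then make sure the class is set
`[.rx_string` <- function(x, i, ...) {
  rx(NextMethod())
}

# Concatenation: strip the classes, combine, and re-wrap
c.rx_string <- function(...) {
  rx(unlist(lapply(list(...), unclass)))
}

parts    <- c(rx("^a"), rx("b$"))
first    <- parts[1]
reversed <- rev(parts)  # rev() subsets with `[`, so the class survives

inherits(first, "rx_string")
#> [1] TRUE
inherits(reversed, "rx_string")
#> [1] TRUE
```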

Rename package to `rx`

Awesome short name and a hex with blue and red pill. Tagline: "Rx is a painkiller for regex"

available::available("rx")
#> -- rx --------------------------------------------------------------------------
#> Name valid: ✔
#> Available on CRAN: ✔ 
#> Available on Bioconductor: ✔
#> Available on GitHub:  ✔ 
#> Abbreviations: http://www.abbreviations.com/rx
#> Wikipedia: https://en.wikipedia.org/wiki/rx
#> Wiktionary: https://en.wiktionary.org/wiki/rx
#> Urban Dictionary:
#>   (noun)from the symbol meaning "[prescription]" this [seeks] to label  someone very annoying that can only be taken in [small doses] at set  periods of time.
#>   http://rx.urbanup.com/915152
#> Sentiment:???

Btw, vx is also an available name, if you decide to go for that one. But I really like the "painkiller" message.

Add print method for rx_string class

I would like to hide the classes from printing. I'm planning on adding the following to utils.R, ref.

print.rx_string <- function(x, ...){
  cat(paste(strwrap(x), collapse = "\n"), "\n", sep = "")
}
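
A short sketch of the effect, using a simplified stand-in for the constructor. Two things are worth noting: cat() writes the raw string, so escapes appear with a single backslash rather than the doubled form that the default print shows, and strwrap() is documented to collapse runs of whitespace, so a pattern containing consecutive spaces might be displayed differently from how it is stored.

```r
# Simplified stand-in for the class constructor
rx <- function(x) {
  class(x) <- c("rx_string", class(x))
  x
}

# Print method as proposed above
print.rx_string <- function(x, ...) {
  cat(paste(strwrap(x), collapse = "\n"), "\n", sep = "")
}

x <- rx("\\?@")

# Capture what print() writes: the raw pattern, without the class attribute
printed <- capture.output(print(x))
nchar(printed)
#> [1] 3
```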

rx_not documentation

Would be nice if the description for the rx_not() function also had a sentence like this:

See also the more descriptive functions rx_avoid_prefix() and rx_avoid_suffix().

Creating a set

Without using rx_range(), how can I get an rx like [A-Z0-9] ? I tried many ideas and failed. For example:

R> rx_uppercase()
[1] "[A-Z]"
R> rx_digit()
[1] "\\d"
R> rx_either_of(NULL, rx_uppercase(), rx_digit())
[1] "(\\[A-Z\\]|\\\\d)"

This doesn't work because each part is sanitized.
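
One way this could work is a helper that "peels off" the outer brackets of each piece and merges the contents into a single set, without sanitizing. The sketch below is base R only; rx_set_sketch is a hypothetical name and does not exist in the package.

```r
# Hypothetical helper: merge character-class fragments into one set
rx_set_sketch <- function(...) {
  parts <- c(...)
  # strip one pair of outer square brackets where present
  bracketed <- grepl("^\\[.*\\]$", parts)
  inner <- ifelse(bracketed, substr(parts, 2, nchar(parts) - 1), parts)
  paste0("[", paste0(inner, collapse = ""), "]")
}

set <- rx_set_sketch("[A-Z]", "[0-9]")
set
#> [1] "[A-Z0-9]"

# One or more characters from the merged set
grepl(paste0("^", set, "+$"), c("AB12", "ab12"), perl = TRUE)
#> [1]  TRUE FALSE
```

Pieces without brackets, such as "\\d", are passed through unchanged, which relies on the regex engine accepting that shorthand inside a set (PCRE does, hence perl = TRUE).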

rx_raw ?

Just an idea. Might be nice to have a function called rx_raw which allows user to explicitly specify a part of the rx. For example:

rx_alpha() %>% rx_raw("{8}")
"[A-z]{8}"

Yes, I know about rx_multiple. This is just an example when I know what the rx should look like, but I can't get the rx functions to bend to my will. Other possible names could be rx_expert, rx_as_is.
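
A minimal sketch of the idea, assuming the function would simply append its argument without sanitizing it. The name rx_raw_sketch is hypothetical, and the "[A-z]" input is taken from the expected output above.

```r
# Hypothetical: append `value` to the expression exactly as given
rx_raw_sketch <- function(.data = NULL, value) {
  paste0(.data, value)
}

x <- rx_raw_sketch("[A-z]", "{8}")
x
#> [1] "[A-z]{8}"

# Exactly eight letters match
grepl(paste0("^", x, "$"), "abcdefgh", perl = TRUE)
#> [1] TRUE
```

Since nothing is escaped, the caller is responsible for passing a valid regex fragment.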

Add lookarounds

Add ways to express lookarounds. This was brought up by @dmi3kno and he mentioned a pretty intuitive way using step_ahead() and step_back()

Source: https://twitter.com/dmi3k/status/1103401979152355328
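
A rough sketch of how such helpers could be written in base R. The names step_ahead and step_back come from the suggestion above; the signatures, the quote_literal helper, and the use of PCRE's \Q...\E quoting are assumptions made for illustration. Lookarounds require perl = TRUE in base R.

```r
# Quote a literal for PCRE with \Q...\E (assumes `value` does not contain "\\E")
quote_literal <- function(value) paste0("\\Q", value, "\\E")

# Positive lookahead: the text that follows must be `value`
step_ahead <- function(.data = NULL, value) {
  paste0(.data, "(?=", quote_literal(value), ")")
}

# Positive lookbehind: the text just before must be `value`
step_back <- function(.data = NULL, value) {
  paste0(.data, "(?<=", quote_literal(value), ")")
}

# Match the text between parentheses without including them
x <- step_back(value = "(")
x <- paste0(x, ".*?")   # anything, lazily
x <- step_ahead(x, ")")

txt <- "f(x) and g(y)"
found <- regmatches(txt, gregexpr(x, txt, perl = TRUE))[[1]]
found
#> [1] "x" "y"
```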

Consider a prefix

Consider a prefix for auto-populating functions, similar to what tidy packages do. For example, the str_ prefix from stringr.

Maybe:

rve_ for R Verbal Expressions?
match_?
vex_ for Verbal Expression?

https://twitter.com/hadleywickham/status/1101928543485943809
https://twitter.com/danmaclean/status/1101937358289715201

Add quick testing function

I suggest we add a vectorized testing function to save ourselves some typing and allow piping into it:

rx_test <- function(x, txt){
  regmatches(txt, gregexpr(x, txt, perl = TRUE))
}
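
A quick usage sketch of the proposed function with a hand-written pattern and made-up input. The definition is repeated so that the block runs on its own.

```r
# Definition repeated from the proposal above
rx_test <- function(x, txt) {
  regmatches(txt, gregexpr(x, txt, perl = TRUE))
}

# One list element per input string; no match yields character(0)
res <- rx_test("\\d+", c("a1b22", "none"))
res
#> [[1]]
#> [1] "1"  "22"
#>
#> [[2]]
#> character(0)
```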

Consider a constructor

Would be nice to have something that "starts" the verbal expression; most likely the name of the function would be whatever prefix is decided on. So if rx_ is the prefix, we would do something like:

rx() %>% 
  rx_seek_prefix("(") %>% 
  rx_anything_lazy() %>% 
  rx_seek_suffix(")")

Instead of:

rx_seek_prefix(value = "(") %>% 
  rx_anything_lazy() %>% 
  rx_seek_suffix(")")

Syntax for rx_or()

Right now we have rx_or implementation which compares .data and value

##### Do not run
rx() %>% 
  rx_find("a") %>%
  rx_or("b") # or at best rx_or(rx_find("b"))

In the comments you mentioned:

##### Do not run
  # Not sure if I like this. I would prefer:
  # find(value = "foo") %>%
  #   or() %>%
  #   find("bar")
  # Rather than having to nest a rule inside of or(), maybe use glue?

Might the solution be similar to how now (in dev branch) we organized rx_one_of():

###### Do not run
rx() %>%
  rx_find("gr") %>%
  either_of(rx_find("a"), rx_find("e")) %>%
  rx_find("y")

In a sense, this is rx_one_of with (?:a|b) instead of [ab] and limited to two arguments only. I actually believe nothing prevents us from allowing more arguments, if we go down this route. I think going this route will add consistency to the package.
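
A sketch of the variadic version, assuming the alternatives arrive already sanitized. The function name rx_either_of_sketch is hypothetical, and paste0() stands in for the surrounding rx_find() calls so that the block runs without the package.

```r
# Hypothetical: join any number of alternatives into a non-capturing group
rx_either_of_sketch <- function(.data = NULL, ...) {
  alternatives <- paste(c(...), collapse = "|")
  paste0(.data, "(?:", alternatives, ")")
}

x <- rx_either_of_sketch("gr", "a", "e", "o")
x <- paste0(x, "y")
x
#> [1] "gr(?:a|e|o)y"

grepl(x, c("gray", "grey", "groy", "gruy"), perl = TRUE)
#> [1]  TRUE  TRUE  TRUE FALSE
```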

Character sets

Problem

I think the package will be incomplete until we find a way to express groups of characters. Here's a challenge to express email pattern matching in rx:

Challenges

First of all, I don't know of a way to express a single "word" character (alnum + _). We used rx_word to denote \\w+ and perhaps it should have been rx_word_char() %>% rx_one_or_more().

rx_char <- function(.data = NULL, value=NULL) {
  if(missing(value))
    return(paste0(.data, "\\w"))
  paste0(.data, sanitize(value))
}

I also extended rx_count to cases of ranges of input

rx_count <- function(.data = NULL, n = 1) {
  if(length(n)>1){
    n[is.na(n)]<-""
    return(paste0(.data, "{", n[1], "," , n[length(n)], "}"))
  }
  paste0(.data, "{", n,"}")
}
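
For illustration, here is how the extended function behaves for a single count, a range, and an open-ended range. The definition is repeated so that the block runs on its own.

```r
# Definition repeated from above
rx_count <- function(.data = NULL, n = 1) {
  if (length(n) > 1) {
    n[is.na(n)] <- ""
    return(paste0(.data, "{", n[1], ",", n[length(n)], "}"))
  }
  paste0(.data, "{", n, "}")
}

rx_count("a", 3)         # exact count
#> [1] "a{3}"
rx_count("a", 2:6)       # only the first and last elements are used
#> [1] "a{2,6}"
rx_count("a", c(2, NA))  # NA leaves the upper bound open
#> [1] "a{2,}"
```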

We also don't have a way to express word boundaries (\\b), and it might be useful to denote them. We shall call this function rx_word_edge; the code below defines it under the names rx_word_start and rx_word_end:

rx_word_start <- function(.data = NULL){
  paste0(.data, "\\b")
}

rx_word_end <- rx_word_start

Finally, our biggest problem is that there's no way to express groups of characters, other than through rx_any_of(), but if we pass other rx expressions, values will be sanitized twice, meaning that we will get four backslashes before each symbol instead of two.

# this function is exactly like rx_any_of() but without sanitization
rx_group <- function(.data = NULL, value) {
  paste0(.data, "[", value, "]")
}

Solution

Here's what it looks like when we put all pieces together:

x <- rx_word_start() %>% 
  rx_group(
    rx() %>% 
      rx_char() %>% 
      rx_char(".%+-")
  ) %>%
  rx_one_or_more() %>% 
  rx_char("@") %>% 
  rx_group(
    rx() %>% 
      rx_char() %>% 
      rx_char(".-")
  ) %>% 
  rx_one_or_more() %>% 
  rx_char(".") %>% 
  rx_alpha() %>% 
  rx_count(2:6) %>% 
  rx_word_end()
x
#> [1] "\\b[\\w\\.%\\+-]+@[\\w\\.-]+\\.[[:alpha:]]{2,6}\\b"

txt <- "This text contains email [email protected] and [email protected]. The latter is no longer valid."
regmatches(txt, gregexpr(x, txt, perl = TRUE))
#> [[1]]
#> [1] "[email protected]" "[email protected]"  
stringr::str_extract_all(txt, x)
#> [[1]]
#> [1] "[email protected]" "[email protected]"

The code works but I don't like it.

The constructor rx looks redundant (I believe there's a way to get rid of it entirely using a specialized class, see below).
It is not very clear what rx_one_or_more() is referring to. I wonder if all functions should have a rep argument with default option one and options some/any, in addition to what rx_count does today.
Should rx_char() without arguments be called rx_wordchar?
Should rx_char() with arguments be called rx_literal() or rx_plain?
We should be very explicit about sanitization of arguments, to the extent that we should just mention: "input will be sanitized".
rx_group is an artificial construct, a duplicate of rx_any_of, but without sanitization. Here I see a couple of solutions.
a. Allow "nested pipes" (as I have done above). Create an S3 class and use it to detect when the value argument is not a plain character but an rx_string. Input of this class does not need to be sanitized, because it was sanitized at creation.
b. Do not allow "nested pipes". Instead define rx_any_of() to have ... and allow multiple arguments mixing functions and characters. Then a hypothetical pipe would look like this:

rx_word_edge() %>% 
  rx_any_of(rx_wordchar(), ".%+-", rep="some") %>%
  rx_literal("@") %>% 
  rx_any_of(rx_wordchar(), ".-", rep="some") %>% 
  rx_literal(".") %>% 
  rx_alpha(rep=2:6) %>% 
  rx_word_edge()
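
Option (a) could be sketched with S3 dispatch on the sanitizing step, so that input already marked as rx_string is left alone. Everything here is an assumption for illustration: sanitize_sketch is a hypothetical stand-in for the package's sanitize(), and escaping every punctuation character is a simplification.

```r
sanitize_sketch <- function(x) UseMethod("sanitize_sketch")

# Plain character input: put a backslash before every punctuation character
sanitize_sketch.default <- function(x) {
  gsub("([[:punct:]])", "\\\\\\1", x, perl = TRUE)
}

# Input already marked as rx_string was sanitized at creation: return it as is
sanitize_sketch.rx_string <- function(x) x

once  <- sanitize_sketch(".")   # escaped once
twice <- sanitize_sketch(once)  # the escape itself gets escaped again
nchar(c(once, twice))
#> [1] 2 4

# Marking the result with the class prevents the second round of escaping
marked <- structure(once, class = c("rx_string", "character"))
kept   <- sanitize_sketch(marked)
identical(unclass(kept), once)
#> [1] TRUE
```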

It's a lot to digest, but somehow everything is related to one particular problem. Happy to split the issue once we identify the issues worth tackling.

Rename then

Come up with an alternative name for then because the pipe is often interpreted as then. Maybe followed_by or exactly? Or possibly omit entirely.

https://twitter.com/hadleywickham/status/1101928543485943809
https://twitter.com/PStrafo/status/1101929427523461120
https://twitter.com/romain_francois/status/1102204638575636480

Character class helpers

Should we add generic character class helpers:

# rx_digit() # done
rx_alnum()
rx_alpha()
rx_lowercase()
rx_uppercase()
rx_space()
rx_punctuation() 
rx_whitespace()
rx_non_whitespace()
rx_tab()
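
A sketch of how a few of these helpers might be generated from POSIX character classes. The factory make_class_helper and the exact patterns are assumptions; the package's real helpers may emit different patterns (for example, "[A-Z]" for uppercase).

```r
# Hypothetical factory: returns a helper that appends a fixed character class
make_class_helper <- function(class_pattern) {
  function(.data = NULL) paste0(.data, class_pattern)
}

rx_alnum       <- make_class_helper("[[:alnum:]]")
rx_lowercase   <- make_class_helper("[[:lower:]]")
rx_uppercase   <- make_class_helper("[[:upper:]]")
rx_punctuation <- make_class_helper("[[:punct:]]")

# Nested calls stand in for a pipe chain:
# one uppercase letter, one lowercase letter, one punctuation character
x <- rx_punctuation(rx_lowercase(rx_uppercase()))
x
#> [1] "[[:upper:]][[:lower:]][[:punct:]]"

grepl(paste0("^", x, "$"), c("Ab!", "ab!"))
#> [1]  TRUE FALSE
```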