Create regular expressions easily

Home Page: https://rverbalexpressions.netlify.com/

License: Other

RVerbalExpressions

The goal of RVerbalExpressions is to make it easier to construct regular expressions using grammar and functionality inspired by VerbalExpressions. Usage of %>% is encouraged to build expressions in a chain-like fashion.

Installation

Install the released version of RVerbalExpressions from CRAN:

install.packages("RVerbalExpressions")

Or install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("VerbalExpressions/RVerbalExpressions")

Example

This is a basic example which shows you how to build a regular expression:

library(RVerbalExpressions)

# construct an expression
x <- rx_start_of_line() %>% 
  rx_find('http') %>% 
  rx_maybe('s') %>% 
  rx_find('://') %>% 
  rx_maybe('www.') %>% 
  rx_anything_but(' ') %>% 
  rx_end_of_line()

# print the expression
x
#> [1] "^(http)(s)?(\\://)(www\\.)?([^ ]*)$"

# test for a match
grepl(x, "https://www.google.com")
#> [1] TRUE
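
Because the result is an ordinary regular-expression string, it can be checked against several inputs at once with base R. The sketch below hard-codes the pattern printed above instead of rebuilding it, so it runs without the package. The sample URLs and the choice of perl = TRUE are assumptions made for illustration.

```r
# Pattern copied from the printed output above (not rebuilt with the package).
pattern <- "^(http)(s)?(\\://)(www\\.)?([^ ]*)$"

# Illustrative inputs; the last two should fail (wrong scheme, embedded space).
urls <- c(
  "https://www.google.com",
  "http://example.org",
  "ftp://example.org",
  "https://www.google.com/a b"
)

# grepl() is vectorized over its second argument.
matches <- grepl(pattern, urls, perl = TRUE)
print(matches)
```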

Other Implementations

You can see an up-to-date list of all ports on VerbalExpressions.github.io.

Additionally, there are two R packages that try to solve the same problem. I encourage you to check these out:

  1. rex by @kevinushey
  2. rebus by @richierocks

Contributing

If you find any issues, typos, etc., please file an issue or submit a PR. All contributions are welcome!

rverbalexpressions's People

Contributors

tylerlittlefield

rverbalexpressions's Issues

Organizing package documentation

I have been thinking about how to organize the package documentation. We basically have a few "groups" of functions that may make sense to introduce together (at least in pkgdown):

Single-character functions

These are functions that return one character and do not require any "wrappers".

  • rx_alpha_num
  • rx_br and rx_line_break
  • rx_digit
  • rx_something
  • rx_space
  • rx_tab
  • rx_whitespace
  • rx_word_char and rx_word (with the default argument rep="some").

Character "sets"

These functions output ranges or "sets" of characters wrapped in [ ], which cannot be expressed with a single character. This matters when "nesting" them into the supersets below, where the "outer" set of [ ] needs to be "peeled off". From the user's standpoint they may not be any different from single-character functions.

  • rx_alphanum
  • rx_alpha
  • rx_lower and rx_upper
  • rx_punctuation
  • rx_range
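
The "peeling off" step can be sketched in base R. Both helpers below, peel_set() and combine_sets(), are hypothetical names invented for this illustration and are not part of the package; the sketch only handles one outer pair of brackets per argument.

```r
# Hypothetical helper: remove one outer pair of [ ] if present.
peel_set <- function(x) sub("^\\[(.*)\\]$", "\\1", x)

# Hypothetical helper: merge several sets or shorthand classes into one set.
combine_sets <- function(...) {
  inner <- vapply(list(...), peel_set, character(1))
  paste0("[", paste0(inner, collapse = ""), "]")
}

# "[A-Z]" loses its brackets; "\\d" has none and is kept as-is.
upper_or_digit <- combine_sets("[A-Z]", "\\d")
print(upper_or_digit)
print(grepl(upper_or_digit, c("A", "7", "a"), perl = TRUE))
```

Without the peeling step, the nested brackets would end up inside the outer set and change its meaning.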

"Appenders"

These functions take a .data argument and simply append something to it, thus modifying the behavior of the previously appended function(s).

  • rx_capture_groups
  • rx_count
  • rx_end_of_line and rx_start_of_line
  • rx_one_or_more and rx_none_or_more
  • rx_with_any_case

"Expression-wrappers"

These functions let the user specify a sequence of characters, all of which must be matched in the string.

  • rx_avoid and rx_seek
  • rx_find (and rx_literal, which I now dropped)
  • rx_maybe (which is rx_find with rep argument set to "maybe")
  • rx_or (which might need a bit of extra work, see #16 and thus will be out of this category)

"Superset functions"

These functions specify a list of mutually exclusive symbols/expressions, only one of which should be matched to the string.

  • rx_one_of
  • rx_anything_but and rx_something_but
  • rx_either_of will eventually be moved here as well, if we decide to keep it.

I find this grouping helpful when reasoning about the functionality our package covers.

There are a few functions I dropped:

  • rx_any_of (duplicate of rx_one_of)
  • rx_digits (too little advantage compared to rx_digit(rep=n))
  • rx_literal (duplicate of rx_find)
  • rx_not (duplicate of rx_avoid_suffix)

rx_new has been moved to utils.R.

Integration tests

We need more end-to-end examples: emails, URLs, SSNs, license plates, etc. Let's collect links to those here and implement them later.

Consider crayon dependency for fancy output

When expressions get large, it might be nice to visualize them in a way that makes their structure clearer. With crayon we could add a function rx_pretty() that would adjust the text style, text color, and background color of either the match characters (default) or the special characters.

# highlights match or special characters
rx_pretty <- function(.data, txt_style = "bold", txt_col = "black", 
                      bg_col = "white", inverse = FALSE) {
  esc <- c(".", "|", "*", "?", "+", "(", ")", "{", "}", "^", "$", "\\", ":", "=", "[", "]")
  values <- strsplit(.data, "")[[1]]
  
  if(inverse) {
    esc_idx <- which(values %in% esc)
  } else {
    esc_idx <- which(!values %in% esc)
  }
  
  txt <- crayon::make_style(txt_col)
  bg <- crayon::make_style(bg_col, bg = TRUE)
  fancy <- crayon::combine_styles(txt, bg)
  
  cat(replace(values, esc_idx, crayon::style(fancy(values[esc_idx]), as = txt_style)), sep = "")
}

f <- rx() %>% 
  rx_start_of_line() %>% 
  rx_find('http') %>% 
  rx_maybe('s') %>% 
  rx_find('://') %>% 
  rx_maybe('www.') %>% 
  rx_anything_but(' ') %>% 
  rx_end_of_line()

Default behavior:

f %>% 
  rx_pretty()

Inverse behavior:

f %>% 
  rx_pretty(inverse = TRUE)

Custom behavior:

f %>% 
  rx_pretty(
    txt_style = "italic", 
    txt_col = "white", 
    bg_col = "blue"
    )
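
For comparison, the same idea can be sketched without any dependency by writing ANSI escape codes directly. This is a swapped-in technique and not the crayon-based proposal above: rx_pretty_ansi() is a hypothetical name, it only supports bold, and it assumes a terminal that understands ANSI codes.

```r
# Dependency-free sketch: wrap selected characters in ANSI "bold" codes.
rx_pretty_ansi <- function(x, inverse = FALSE) {
  special <- c(".", "|", "*", "?", "+", "(", ")", "{", "}", "^", "$", "\\",
               ":", "=", "[", "]")
  chars <- strsplit(x, "")[[1]]
  is_special <- chars %in% special
  # Default highlights the match characters; inverse highlights the specials.
  idx <- if (inverse) is_special else !is_special
  # \033[1m switches bold on, \033[22m switches it off again.
  chars[idx] <- paste0("\033[1m", chars[idx], "\033[22m")
  paste(chars, collapse = "")
}

styled <- rx_pretty_ansi("^(http)(s)?", inverse = TRUE)
cat(styled, "\n")
```

Unlike the version above, this one returns the styled string instead of printing it, which makes it easier to test.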

Is it possible to use strans to learn RVerbalExpressions from examples?

https://github.com/Inventitech/strans

This is a handy command-line tool I stumbled upon while browsing AppImages. You give it a few examples, and it uses a technique from the Microsoft PROSE framework to automatically infer regex rules. The example below extracts file formats, but it can do a lot more than that.

Wouldn't it be cool if there was an R package using this to autogenerate human-readable R regex code?

ls | strans -before Viper_Browser-50-x86_64.AppImage -after AppImage --describe

let columnName = "0" in let x = ChooseInput(vs, columnName) in SubStr(x, PosPair(RegexPositionRelative(x, RegexPair("Dot", "ε"), 1), RegexPositionRelative(x, RegexPair("ε", "Line Separator"), -1)))

Python application:
https://docs.microsoft.com/en-us/python/api/overview/azure/prose/intro?view=prose-py-latest

Method dispatch for `rx_string`

This basically boils down to detecting that .data is not of class rx_string and then treating the first argument as the value argument.
Reference: http://adv-r.had.co.nz/S3.html

UPDATED: sanitize now has method dispatch, so we can simply write:

## unexported function for sanitizing arguments
sanitize_args <- function(...){
  if (missing(...)) return(NULL)
  res <- sapply(list(...), sanitize)
  Reduce(paste0, res)
}

is.rx_string <- function(x){
  inherits(x, "rx_string")
}

# class constructor - also unexported function. 
rx <- function(x){
  if(is.rx_string(x)) return(x)
  class(x) <- c("rx_string", class(x)) 
  x
}

rx_literal <- function(.data, ...) {
  UseMethod("rx_literal", .data)
}

rx_literal.character <- function(.data, ...){
  res <- paste0(sanitize(.data), sanitize_args(...))
  rx(res)
}

rx_literal.rx_string <- function(.data, ...) {
  res <- paste0(.data, sanitize_args(...))
  rx(res)
}

Now you don't need a constructor. The function works both in a chain and standalone:

rx_literal("?@")
#> [1] "\\?@"
#> attr(,"class")
#> [1] "rx_string" "character"

rx_literal("?") %>% rx_literal("@")
#> [1] "\\?@"
#> attr(,"class")
#> [1] "rx_string" "character"

Hadley says we should also implement a few essential methods. We should rethink all of our functions with vectorization in mind.

When implementing a vector class, you should implement these methods: length, [, [<-, [[, [[<-, c. (If [ is implemented rev, head, and tail should all work).
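
A minimal sketch of two of those methods, `[` and c, is given below. The constructor mirrors the rx() above but adds a default argument, and none of this is the package's actual implementation; the remaining methods (`[<-`, `[[`, `[[<-`) are left out.

```r
# Constructor in the spirit of the rx() shown above (a sketch, not package code).
rx <- function(x = character()) {
  if (inherits(x, "rx_string")) return(x)
  structure(x, class = c("rx_string", class(x)))
}

# Subsetting keeps the class, so rev(), head() and tail() keep it as well.
`[.rx_string` <- function(x, i) rx(unclass(x)[i])

# Combining drops each piece to plain character, then re-applies the class.
c.rx_string <- function(...) rx(unlist(lapply(list(...), unclass)))

x <- c(rx("^a"), rx("b$"), "c")
print(class(x[2]))
print(class(rev(x)))
print(length(x))
```

length() needs no method here because the object is still a character vector underneath.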

Rename package to `rx`

Awesome short name and a hex with blue and red pill. Tagline: "Rx is a painkiller for regex"

available::available("rx")
#> -- rx --------------------------------------------------------------------------
#> Name valid: ✔
#> Available on CRAN: ✔ 
#> Available on Bioconductor: ✔
#> Available on GitHub:  ✔ 
#> Abbreviations: http://www.abbreviations.com/rx
#> Wikipedia: https://en.wikipedia.org/wiki/rx
#> Wiktionary: https://en.wiktionary.org/wiki/rx
#> Urban Dictionary:
#>   (noun)from the symbol meaning "[prescription]" this [seeks] to label  someone very annoying that can only be taken in [small doses] at set  periods of time.
#>   http://rx.urbanup.com/915152
#> Sentiment:???

Btw, vx is also an available name, if you decide to go for that one. But I really like the "painkiller" message.

Add print method for rx_string class

I would like to hide the classes from printing. I'm planning on adding the following to utils.R:

print.rx_string <- function(x, ...){
  # print only the expression text, without the class attribute
  cat(paste(strwrap(x), collapse = "\n"), "\n", sep = "")
  invisible(x)
}

rx_not documentation

Would be nice if the description for the rx_not() function also had a sentence like this:

See also the more descriptive functions rx_avoid_prefix() and rx_avoid_suffix().

Creating a set

Without using rx_range(), how can I get an rx like [A-Z0-9] ? I tried many ideas and failed. For example:

R> rx_uppercase()
[1] "[A-Z]"
R> rx_digit()
[1] "\\d"
R> rx_either_of(NULL, rx_uppercase(), rx_digit())
[1] "(\\[A-Z\\]|\\\\d)"

This doesn't work because each part is sanitized.
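
The effect can be reproduced in base R. Both functions below are hypothetical stand-ins written for this illustration: sanitize_sketch() approximates the package's escaping step and either_of_sketch() approximates rx_either_of(), with a sanitize switch that the real function is not known to have.

```r
# Hypothetical stand-in for the package's sanitize(): backslash-escape specials.
sanitize_sketch <- function(x) {
  gsub("([.|*?+(){}^$\\\\:=\\[\\]])", "\\\\\\1", x, perl = TRUE)
}

# Hypothetical alternation builder; `sanitize` toggles the escaping step.
either_of_sketch <- function(..., sanitize = TRUE) {
  parts <- c(...)
  if (sanitize) parts <- sanitize_sketch(parts)
  paste0("(", paste(parts, collapse = "|"), ")")
}

escaped <- either_of_sketch("[A-Z]", "\\d")
as_is <- either_of_sketch("[A-Z]", "\\d", sanitize = FALSE)

# The escaped version only matches the literal text "[A-Z]" (or "\d").
print(grepl(escaped, c("A", "7", "[A-Z]"), perl = TRUE))
print(grepl(as_is, c("A", "7", "[A-Z]"), perl = TRUE))
```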

rx_raw ?

Just an idea. It might be nice to have a function called rx_raw which allows the user to explicitly specify a part of the rx. For example:

rx_alpha() %>% rx_raw("{8}")
"[A-z]{8}"

Yes, I know about rx_multiple. This is just an example of a case where I know what the rx should look like, but I can't get the rx functions to bend to my will. Other possible names could be rx_expert, rx_as_is.
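
A minimal sketch of the idea, assuming the function simply appends its argument without any escaping. rx_raw_sketch() is a hypothetical name, and "[[:alpha:]]" stands in for whatever rx_alpha() returns.

```r
# Hypothetical rx_raw(): append `value` unchanged, with no escaping.
rx_raw_sketch <- function(.data = NULL, value) {
  paste0(.data, value)
}

# "[[:alpha:]]" is an assumed stand-in for the output of rx_alpha().
x <- rx_raw_sketch("[[:alpha:]]", "{8}")
print(x)

# Anchor the pattern so that only strings of exactly eight letters match.
print(grepl(paste0("^", x, "$"), c("abcdefgh", "abc")))
```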

Add quick testing function

I suggest we add a vectorized testing function to save ourselves some typing and allow piping into it:

rx_test <- function(x, txt){
  regmatches(txt, gregexpr(x, txt, perl = TRUE))
}
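
A short usage sketch, assuming the definition above. The pattern and input strings are made up for illustration.

```r
# rx_test() as proposed above.
rx_test <- function(x, txt) {
  regmatches(txt, gregexpr(x, txt, perl = TRUE))
}

# One list element per input string, holding every match found in it.
res <- rx_test("\\d+", c("a1b22", "none"))
print(res)
```

Strings without a match yield an empty character vector, so the result always has the same length as txt.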

Consider a constructor

It would be nice to have something that "starts" the verbal expression; most likely the name of the function would be whatever prefix is decided on. So if rx_ is the prefix, we would do something like:

rx() %>% 
  rx_seek_prefix("(") %>% 
  rx_anything_lazy() %>% 
  rx_seek_suffix(")")

Instead of:

rx_seek_prefix(value = "(") %>% 
  rx_anything_lazy() %>% 
  rx_seek_suffix(")")

Syntax for rx_or()

Right now we have an rx_or implementation which compares .data and value:

##### Do not run
rx() %>% 
  rx_find("a") %>%
  rx_or("b") # or at best rx_or(rx_find("b"))

In the comments you mentioned:

##### Do not run
  # Not sure if I like this. I would prefer:
  # find(value = "foo") %>%
  #   or() %>%
  #   find("bar")
  # Rather than having to nest a rule inside of or(), maybe use glue?

Might the solution be similar to how we have now (in the dev branch) organized rx_one_of()?

###### Do not run
rx() %>%
  rx_find("gr") %>%
  rx_either_of(rx_find("a"), rx_find("e")) %>%
  rx_find("y")

In a sense, this is rx_one_of with (?:a|b) instead of [ab] and limited to two arguments only. I actually believe nothing prevents us from allowing more arguments, if we go down this route. I think going this route will add consistency to the package.
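
A rough sketch of that route, assuming the alternatives arrive as ready-made pattern fragments and are joined inside a non-capturing group. rx_either_of_sketch() is a hypothetical name and does no sanitization.

```r
# Hypothetical variadic alternation: any number of alternatives, joined by "|".
rx_either_of_sketch <- function(.data = NULL, ...) {
  paste0(.data, "(?:", paste(c(...), collapse = "|"), ")")
}

# Equivalent of the "gr" + (a or e) + "y" chain above, built by hand.
x <- paste0(rx_either_of_sketch("gr", "a", "e"), "y")
print(x)
print(grepl(x, c("gray", "grey", "groy"), perl = TRUE))
```

Because the alternatives are collected through ..., a third option is just one more argument.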

Character sets

Problem

I think the package will be incomplete until we find a way to express groups of characters. Here's a challenge to express email pattern matching in rx:

[Image: regex-example]

Challenges

First of all, I don't know of a way to express a single "word" character (alnum + _). We used rx_word to denote \\w+, and perhaps it should have been rx_word_char() %>% rx_one_or_more().

rx_char <- function(.data = NULL, value=NULL) {
  if(missing(value))
    return(paste0(.data, "\\w"))
  paste0(.data, sanitize(value))
}

I also extended rx_count to accept a range as input:

rx_count <- function(.data = NULL, n = 1) {
  if(length(n)>1){
    n[is.na(n)]<-""
    return(paste0(.data, "{", n[1], "," , n[length(n)], "}"))
  }
  paste0(.data, "{", n,"}")
}
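
To make the three cases concrete, the sketch below repeats the definition above so that it runs on its own and calls it with a single count, a closed range, and an open-ended range. The open-ended call with NA is my reading of the is.na() branch, not documented behavior.

```r
# Copy of the extended rx_count() from above.
rx_count <- function(.data = NULL, n = 1) {
  if (length(n) > 1) {
    n[is.na(n)] <- ""
    return(paste0(.data, "{", n[1], ",", n[length(n)], "}"))
  }
  paste0(.data, "{", n, "}")
}

exact <- rx_count("a", 3)              # exactly three
closed <- rx_count("a", 2:6)           # between two and six
open_ended <- rx_count("a", c(2, NA))  # NA becomes an empty upper bound

print(c(exact, closed, open_ended))
```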

Next, we don't have a way to express word boundaries (\\b), and it might be useful to denote them. We could call this function rx_word_edge; for now it is written as rx_word_start, with rx_word_end as an alias:

rx_word_start <- function(.data = NULL){
  paste0(.data, "\\b")
}

rx_word_end <- rx_word_start
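
A quick check of what the boundary buys us, using a hypothetical rx_word_edge_sketch() with the same body as rx_word_start above. The word "cat" and the test strings are made up.

```r
# Same body as rx_word_start above, under the proposed single name.
rx_word_edge_sketch <- function(.data = NULL) {
  paste0(.data, "\\b")
}

# Build \bcat\b by hand: boundary, literal text, boundary.
x <- paste0(rx_word_edge_sketch(), "cat")
x <- rx_word_edge_sketch(x)
print(x)

# "concat" contains "cat" but not as a whole word.
print(grepl(x, c("cat", "concat", "a cat."), perl = TRUE))
```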

Finally, our biggest problem is that there's no way to express groups of characters, other than through rx_any_of(), but if we pass other rx expressions, values will be sanitized twice, meaning that we will get four backslashes before each symbol instead of two.

# this function is exactly like rx_any_of() but without sanitization
rx_group <- function(.data = NULL, value) {
  paste0(.data, "[", value, "]")
}

Solution

Here's what it looks like when we put all pieces together:

x <- rx_word_start() %>% 
  rx_group(
    rx() %>% 
      rx_char() %>% 
      rx_char(".%+-")
  ) %>%
  rx_one_or_more() %>% 
  rx_char("@") %>% 
  rx_group(
    rx() %>% 
      rx_char() %>% 
      rx_char(".-")
  ) %>% 
  rx_one_or_more() %>% 
  rx_char(".") %>% 
  rx_alpha() %>% 
  rx_count(2:6) %>% 
  rx_word_end()
x
#> [1] "\\b[\\w\\.%\\+-]+@[\\w\\.-]+\\.[[:alpha:]]{2,6}\\b"

txt <- "This text contains email [email protected] and [email protected]. The latter is no longer valid."
regmatches(txt, gregexpr(x, txt, perl = TRUE))
#> [[1]]
#> [1] "[email protected]" "[email protected]"  
stringr::str_extract_all(txt, x)
#> [[1]]
#> [1] "[email protected]" "[email protected]"  

The code works but I don't like it.

  1. The constructor rx looks redundant (I believe there's a way to get rid of it entirely using a specialized class, see below).
  2. It is not very clear what rx_one_or_more() is referring to. I wonder if all functions should have a rep argument with default option one and options some/any, in addition to what rx_count does today.
  3. Should rx_char() without arguments be called rx_wordchar?
  4. Should rx_char() with arguments be called rx_literal() or rx_plain?
  5. We should be very explicit about sanitization of arguments, to the extent that we should just mention: "input will be sanitized".
  6. rx_group is an artificial construct, a duplicate of rx_any_of but without sanitization. Here I see a couple of solutions.
    a. Allow "nested pipes" (as I have done above). Create an S3 class and this way detect when the value argument is not a plain character but an rx_string. Input of this class does not need to be sanitized, because it was sanitized at creation.
    b. Do not allow "nested pipes". Instead, define rx_any_of() to take ... and allow multiple arguments mixing functions and characters. Then a hypothetical pipe would look like this:
rx_word_edge() %>% 
  rx_any_of(rx_wordchar(), ".%+-", rep="some") %>%
  rx_literal("@") %>% 
  rx_any_of(rx_wordchar(), ".-", rep="some") %>% 
  rx_literal(".") %>% 
  rx_alpha(rep=2:6) %>% 
  rx_word_edge()
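
To see whether this is workable, here is a base R sketch that combines both ideas: arguments carrying the rx_string class are inserted untouched, plain strings are escaped, and a rep argument picks the quantifier. Every name in it (rx, escape_in_set, quantifier, rx_any_of_sketch) and the rep values are assumptions made for this illustration, not the package's API.

```r
# Mark a string as an already-built expression (hypothetical constructor).
rx <- function(x) structure(x, class = c("rx_string", "character"))

# Escape the characters that are special inside [ ]; backslash goes first.
escape_in_set <- function(x) {
  for (ch in c("\\", "]", "[", "^", "-")) {
    x <- gsub(ch, paste0("\\", ch), x, fixed = TRUE)
  }
  x
}

# Hypothetical mapping from a `rep` argument to a quantifier.
quantifier <- function(rep) {
  if (is.numeric(rep)) return(paste0("{", min(rep), ",", max(rep), "}"))
  switch(rep, one = "", some = "+", any = "*", maybe = "?")
}

rx_any_of_sketch <- function(.data = NULL, ..., rep = "one") {
  parts <- vapply(list(...), function(p) {
    # rx_string input was built by us already; plain text still needs escaping.
    if (inherits(p, "rx_string")) unclass(p) else escape_in_set(p)
  }, character(1))
  rx(paste0(.data, "[", paste0(parts, collapse = ""), "]", quantifier(rep)))
}

local_part <- rx_any_of_sketch(NULL, rx("\\w"), ".%+-", rep = "some")
domain <- rx_any_of_sketch(NULL, rx("\\w"), ".-", rep = "some")
pattern <- paste0("\\b", local_part, "@", domain, "\\.",
                  "[[:alpha:]]", quantifier(2:6), "\\b")

txt <- "Write to jane.doe@example.com or bob@mail.example.org today."
found <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]
print(found)
```

The class check is what prevents the double sanitization: "\\w" passes through as-is, while the hyphen in ".%+-" is escaped exactly once.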

It's a lot to digest, but somehow everything relates to one particular problem. Happy to split the issue once we identify the parts worth tackling.

Character class helpers

Should we add generic character class helpers:

# rx_digit() # done
rx_alnum()
rx_alpha()
rx_lowercase()
rx_uppercase()
rx_space()
rx_punctuation() 
rx_whitespace()
rx_non_whitespace()
rx_tab()
