rematch2

Match Regular Expressions with a Nicer 'API'

A small wrapper around the base R regular expression matching functions regexpr and gregexpr that returns the results in tidy data frames.
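For comparison, the sketch below does by hand what the wrapper automates, using only base R. The sample vector and variable names are illustrative, and this is not rematch2's internal implementation: it only shows the position-based attributes that regexpr returns with perl = TRUE and how they can be turned back into strings.

```r
# Base R only: extract the first match and its capture groups by hand.
dates <- c("2016-04-20", "not a date")
isodate <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])-(?<day>[0-3][0-9])"

m <- regexpr(isodate, dates, perl = TRUE)

# regexpr() reports -1 for strings that do not match.
no_match <- as.vector(m) == -1

# Whole match, rebuilt from the start position and match length.
whole <- substring(dates, m, m + attr(m, "match.length") - 1)
whole[no_match] <- NA_character_

# Capture groups are returned as matrices of start positions and lengths,
# with one row per input string and one column per group.
starts    <- attr(m, "capture.start")
lens      <- attr(m, "capture.length")
cap_names <- attr(m, "capture.names")

groups <- vapply(seq_along(cap_names), function(j) {
  out <- substring(dates, starts[, j], starts[, j] + lens[, j] - 1)
  out[no_match] <- NA_character_
  out
}, character(length(dates)))
colnames(groups) <- cap_names

whole
groups
```

With rematch2 the same information comes back from a single call, already arranged as one column per group plus .text and .match.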


Installation

install.packages("rematch2")

Rematch vs rematch2

Note that rematch2 is not compatible with the original rematch package. There are at least three major changes:

  • The order of the arguments for the functions is different. In rematch2 the text vector is first, and pattern is second.
  • In the result, .match is the last column instead of the first.
  • rematch2 returns tibble data frames. See https://github.com/hadley/tibble.
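
Because the argument order changed, code that has to be read alongside older rematch scripts is easier to follow when the arguments are named and result columns are picked by name rather than by position. A minimal sketch, assuming rematch2 is installed; the input strings and the names versions and version_rx are made up for illustration:

```r
library(rematch2)

versions <- c("curl 0.9.3", "no version here")
version_rx <- "(?<major>[0-9]+)\\.(?<minor>[0-9]+)\\.(?<patch>[0-9]+)"

# Naming both arguments makes the call independent of argument order.
res <- re_match(text = versions, pattern = version_rx)

# Select columns by name; .text and .match come after the capture groups.
res$.match
res[, c("major", "minor", "patch")]
```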

Usage

First match

library(rematch2)

With capture groups:

dates <- c("2016-04-20", "1977-08-08", "not a date", "2016",
  "76-03-02", "2012-06-30", "2015-01-21 19:58")
isodate <- "([0-9]{4})-([0-1][0-9])-([0-3][0-9])"
re_match(text = dates, pattern = isodate)
#> # A tibble: 7 x 5
#>      ``    ``    ``            .text     .match
#>   <chr> <chr> <chr>            <chr>      <chr>
#> 1  2016    04    20       2016-04-20 2016-04-20
#> 2  1977    08    08       1977-08-08 1977-08-08
#> 3  <NA>  <NA>  <NA>       not a date       <NA>
#> 4  <NA>  <NA>  <NA>             2016       <NA>
#> 5  <NA>  <NA>  <NA>         76-03-02       <NA>
#> 6  2012    06    30       2012-06-30 2012-06-30
#> 7  2015    01    21 2015-01-21 19:58 2015-01-21

Named capture groups:

isodaten <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])-(?<day>[0-3][0-9])"
re_match(text = dates, pattern = isodaten)
#> # A tibble: 7 x 5
#>    year month   day            .text     .match
#>   <chr> <chr> <chr>            <chr>      <chr>
#> 1  2016    04    20       2016-04-20 2016-04-20
#> 2  1977    08    08       1977-08-08 1977-08-08
#> 3  <NA>  <NA>  <NA>       not a date       <NA>
#> 4  <NA>  <NA>  <NA>             2016       <NA>
#> 5  <NA>  <NA>  <NA>         76-03-02       <NA>
#> 6  2012    06    30       2012-06-30 2012-06-30
#> 7  2015    01    21 2015-01-21 19:58 2015-01-21

A slightly more complex example:

github_repos <- c(
  "metacran/crandb",
  "jeroenooms/curl@v0.9.3",
  "jimhester/covr#47",
  "hadley/dplyr@*release",
  "r-lib/remotes@550a3c7d3f9e1493a2ba",
  "/$&@R64&3"
)
owner_rx   <- "(?:(?<owner>[^/]+)/)?"
repo_rx    <- "(?<repo>[^/@#]+)"
subdir_rx  <- "(?:/(?<subdir>[^@#]*[^@#/]))?"
ref_rx     <- "(?:@(?<ref>[^*].*))"
pull_rx    <- "(?:#(?<pull>[0-9]+))"
release_rx <- "(?:@(?<release>[*]release))"

subtype_rx <- sprintf("(?:%s|%s|%s)?", ref_rx, pull_rx, release_rx)
github_rx  <- sprintf(
  "^(?:%s%s%s%s|(?<catchall>.*))$",
  owner_rx, repo_rx, subdir_rx, subtype_rx
)
re_match(text = github_repos, pattern = github_rx)
#> # A tibble: 6 x 9
#>        owner    repo subdir                  ref  pull  release  catchall
#>        <chr>   <chr>  <chr>                <chr> <chr>    <chr>     <chr>
#> 1   metacran  crandb                                                     
#> 2 jeroenooms    curl                      v0.9.3                         
#> 3  jimhester    covr                                47                   
#> 4     hadley   dplyr                                   *release          
#> 5      r-lib remotes        550a3c7d3f9e1493a2ba                         
#> 6                                                               /$&@R64&3
#> # ... with 2 more variables: .text <chr>, .match <chr>

All matches

Extract all names, and also first names and last names:

name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
not <- re_match_all(notables, name_rex)
not
#> # A tibble: 2 x 4
#>       first      last                              .text    .match
#>      <list>    <list>                              <chr>    <list>
#> 1 <chr [2]> <chr [2]>   Ben Franklin and Jefferson Davis <chr [2]>
#> 2 <chr [1]> <chr [1]>               "\tMillard Fillmore" <chr [1]>
not$first
#> [[1]]
#> [1] "Ben"       "Jefferson"
#> 
#> [[2]]
#> [1] "Millard"
not$last
#> [[1]]
#> [1] "Franklin" "Davis"   
#> 
#> [[2]]
#> [1] "Fillmore"
not$.match
#> [[1]]
#> [1] "Ben Franklin"    "Jefferson Davis"
#> 
#> [[2]]
#> [1] "Millard Fillmore"
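
The list columns can be flattened into one row per match with base R alone. This is a sketch rather than a helper provided by rematch2; the names all_names and long, and the column layout, are choices made here for illustration. It relies on every list column holding the same number of elements per input string, which is the case in the output shown above.

```r
library(rematch2)

name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
all_names <- re_match_all(notables, name_rex)

# Number of matches found in each input string.
n <- lengths(all_names$.match)

# One row per match; `input` records which element of `notables` it came from.
long <- data.frame(
  input = rep(seq_along(notables), times = n),
  first = unlist(all_names$first),
  last  = unlist(all_names$last),
  match = unlist(all_names$.match),
  stringsAsFactors = FALSE
)
long
```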

Match positions

re_exec and re_exec_all are similar to re_match and re_match_all, but they also return match positions. These functions return match records. A match record has three components (match, start and end), and each component can be a vector, which makes it similar to a data frame.

pos <- re_exec(notables, name_rex)
pos
#> # A tibble: 2 x 4
#>        first       last                              .text     .match
#> *     <list>     <list>                              <chr>     <list>
#> 1 <list [3]> <list [3]>   Ben Franklin and Jefferson Davis <list [3]>
#> 2 <list [3]> <list [3]>               "\tMillard Fillmore" <list [3]>

Unfortunately R does not allow hierarchical data frames (i.e. a column of a data frame cannot be another data frame), but rematch2 defines some special classes and a $ operator to make it easier to extract parts of re_exec and re_exec_all matches. You simply query the match, start or end part of a column:

pos$first$match
#> [1] "Ben"     "Millard"
pos$first$start
#> [1] 3 2
pos$first$end
#> [1] 5 8
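
The positions are 1-based and inclusive at both ends, as the output above suggests (Ben occupies characters 3 to 5 after two leading spaces), so they can be passed directly to substring(). A small sketch that re-derives the matched text from the positions; it assumes rematch2 is installed, and the name first_from_pos is introduced here only for illustration:

```r
library(rematch2)

name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
pos <- re_exec(notables, name_rex)

# Cut the first names out of the original strings using start and end.
first_from_pos <- substring(notables, pos$first$start, pos$first$end)

# The result should agree with the match component of the same column.
all(first_from_pos == pos$first$match)
```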

re_exec_all is very similar, but these queries return lists, with an arbitrary number of matches:

allpos <- re_exec_all(notables, name_rex)
allpos
#> # A tibble: 2 x 4
#>        first       last                              .text     .match
#>       <list>     <list>                              <chr>     <list>
#> 1 <list [3]> <list [3]>   Ben Franklin and Jefferson Davis <list [3]>
#> 2 <list [3]> <list [3]>               "\tMillard Fillmore" <list [3]>
allpos$first$match
#> [[1]]
#> [1] "Ben"       "Jefferson"
#> 
#> [[2]]
#> [1] "Millard"
allpos$first$start
#> [[1]]
#> [1]  3 20
#> 
#> [[2]]
#> [1] 2
allpos$first$end
#> [[1]]
#> [1]  5 28
#> 
#> [[2]]
#> [1] 8

License

MIT © Mango Solutions, Gábor Csárdi

Contributors

brodieg, gaborcsardi, krlmlr, wibeasley
