sgreben / regex-builder

Write regular expressions in pure Java

Java 98.21% Dockerfile 1.26% Makefile 0.53%
expression-builder capture-groups java regex builder fluent wrapper

regex-builder's Introduction

Java Regex Builder

Write regexes as plain Java code. Unlike opaque regex strings, commenting your expressions and reusing regex fragments is straightforward.

The regex-builder library is implemented as a light-weight wrapper around java.util.regex. It consists of three main components: the expression builder Re, its fluent API equivalent FluentRe, and the character class builder CharClass. The components are introduced in the examples below as well as in the API overview tables at the end of this document.

There's a discussion of this project over on the Java subreddit.

Maven dependency

<dependency>
  <groupId>com.github.sgreben</groupId>
  <artifactId>regex-builder</artifactId>
  <version>1.2.1</version>
</dependency>

Examples

Imports:

import com.github.sgreben.regex_builder.CaptureGroup;
import com.github.sgreben.regex_builder.Expression;
import com.github.sgreben.regex_builder.Pattern;
import static com.github.sgreben.regex_builder.CharClass.*;
import static com.github.sgreben.regex_builder.Re.*;

Apache log

  • Regex string: (\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(\\S+) (\\S+) (\\S+)\" (\\d{3}) (\\d+)
  • Java code:
CaptureGroup ip, client, user, dateTime, method, request, protocol, responseCode, size;
Expression token = repeat1(nonWhitespaceChar());

ip = capture(token);
client = capture(token);
user = capture(token);
dateTime = capture(sequence(
  repeat1(union(wordChar(),':','/')),  whitespaceChar(), oneOf("+\\-"), repeat(digit(), 4)
));
method = capture(token);
request = capture(token);
protocol = capture(token);
responseCode = capture(repeat(digit(), 3));
size = capture(number());

Pattern p = Pattern.compile(sequence(
  ip, ' ', client, ' ', user, " [", dateTime, "] \"", method, ' ', request, ' ', protocol, "\" ", responseCode, ' ', size
));

Note that capture groups are plain Java objects, so there is no need to juggle group indices or string group names. You can use the pattern like this:

String logLine = "127.0.0.1 - - [21/Jul/2014:9:55:27 -0800] \"GET /home.html HTTP/1.1\" 200 2048";
Matcher m = p.matcher(logLine);

assertTrue(m.matches());

assertEquals("127.0.0.1", m.group(ip));
assertEquals("-", m.group(client));
assertEquals("-", m.group(user));
assertEquals("21/Jul/2014:9:55:27 -0800", m.group(dateTime));
assertEquals("GET", m.group(method));
assertEquals("/home.html", m.group(request));
assertEquals("HTTP/1.1", m.group(protocol));
assertEquals("200", m.group(responseCode));
assertEquals("2048", m.group(size));

Or, if you'd like to rewrite the log to a simpler "ip - request - response code" format, you can simply do

String result = m.replaceFirst(replacement(ip, " - ", request, " - ", responseCode));
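
For comparison, here is roughly what the same rewrite looks like with plain java.util.regex and the regex string shown at the start of this example. This is only a sketch: the class and method names are made up for illustration and are not part of regex-builder. The point is that the replacement has to refer to groups by position.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain java.util.regex version of the log rewrite, for comparison only.
// IndexedLogRewrite and rewrite() are hypothetical names, not regex-builder API.
final class IndexedLogRewrite {
    private static final Pattern LOG = Pattern.compile(
        "(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(\\S+) (\\S+) (\\S+)\" (\\d{3}) (\\d+)");

    // Groups are addressed by index: 1 = ip, 6 = request, 8 = response code.
    static String rewrite(String logLine) {
        Matcher m = LOG.matcher(logLine);
        return m.replaceFirst("$1 - $6 - $8");
    }

    public static void main(String[] args) {
        String logLine = "127.0.0.1 - - [21/Jul/2014:9:55:27 -0800] \"GET /home.html HTTP/1.1\" 200 2048";
        System.out.println(rewrite(logLine));
    }
}
```

If a group is later added or removed in front of the request, the indices 6 and 8 silently point at the wrong groups, which is the problem the CaptureGroup objects avoid.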

Apache log (fluent API)

The above example can also be expressed using the fluent API implemented in FluentRe. To use it, you have to import it as

import static com.github.sgreben.regex_builder.CharClass.*;
import com.github.sgreben.regex_builder.FluentRe;
CaptureGroup ip, client, user, dateTime, method, request, protocol, responseCode, size;
FluentRe nonWhitespace = FluentRe.match(nonWhitespaceChar()).repeat1();

ip = nonWhitespace.capture();
client = nonWhitespace.capture();
user = nonWhitespace.capture();
dateTime = FluentRe
    .match(union(wordChar(), oneOf(":/"))).repeat1()
    .then(whitespaceChar())
    .then(oneOf("+\\-"))
    .then(FluentRe.match(digit()).repeat(4))
    .capture();
method = nonWhitespace.capture();
request = nonWhitespace.capture();
protocol = nonWhitespace.capture();
responseCode = FluentRe.match(digit()).repeat(3).capture();
size = FluentRe.match(digit()).repeat1().capture();

Pattern p = FluentRe.match(beginInput())
    .then(ip).then(' ')
    .then(client).then(' ')
    .then(user).then(" [")
    .then(dateTime).then("] \"")
    .then(method).then(' ')
    .then(request).then(' ')
    .then(protocol).then("\" ")
    .then(responseCode).then(' ')
    .then(size)
    .then(endInput())
    .compile();

Date (DD/MM/YYYY HH:MM:SS)

  • Regex string: (\d\d)\/(\d\d)\/(\d\d\d\d) (\d\d):(\d\d):(\d\d)
  • Java code:
Expression twoDigits = repeat(digit(), 2);
Expression fourDigits = repeat(digit(), 4);
CaptureGroup day = capture(twoDigits);
CaptureGroup month = capture(twoDigits);
CaptureGroup year = capture(fourDigits);
CaptureGroup hour = capture(twoDigits);
CaptureGroup minute = capture(twoDigits);
CaptureGroup second = capture(twoDigits);
Expression dateExpression = sequence(
  day, '/', month, '/', year, ' ', // DD/MM/YYYY
  hour, ':', minute, ':', second   // HH:MM:SS
);

Use the expression like this:

Pattern p = Pattern.compile(dateExpression);
Matcher m = p.matcher("01/05/2015 12:30:22");
m.find();
assertEquals("01", m.group(day));
assertEquals("05", m.group(month));
assertEquals("2015", m.group(year));
assertEquals("12", m.group(hour));
assertEquals("30", m.group(minute));
assertEquals("22", m.group(second));

Hex color

  • Regex string: #([a-fA-F0-9]){3}(([a-fA-F0-9]){3})?
  • Java code:
Expression threeHexDigits = repeat(hexDigit(), 3);
CaptureGroup hexValue = capture(
    threeHexDigits,              // #FFF
    optional(threeHexDigits)  // #FFFFFF
);
Expression hexColor = sequence(
  '#', hexValue
);

Use the expression like this:

Pattern p = Pattern.compile(hexColor);
Matcher m = p.matcher("#0FAFF3 and #1bf");
m.find();
assertEquals("0FAFF3", m.group(hexValue));
m.find();
assertEquals("1bf", m.group(hexValue));
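
Each call to find() advances to the next match, so a loop can collect every color in the input. The sketch below shows the same idea with plain java.util.regex. The class name is hypothetical, and the pattern wraps the whole hex value in a single group, which is an assumption about the intent rather than a copy of the regex string listed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Collects all hex color values in a text using only java.util.regex.
// HexColorScan is an illustrative name, not part of regex-builder.
final class HexColorScan {
    // One capture group around the whole value: three hex digits, optionally three more.
    private static final Pattern HEX_COLOR =
        Pattern.compile("#([a-fA-F0-9]{3}(?:[a-fA-F0-9]{3})?)");

    static List<String> hexValues(String text) {
        List<String> values = new ArrayList<>();
        Matcher m = HEX_COLOR.matcher(text);
        // find() resumes after the previous match, so every color is visited once.
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(hexValues("#0FAFF3 and #1bf"));
    }
}
```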

Reusing expressions

To reuse an expression cleanly, it should be packaged as a class. To access the capture groups contained in the expression, each capture group should be exposed as a final field or method.

To allow the resulting object to be used as an expression, regex-builder provides a utility class ExpressionWrapper, which exposes a method setExpression(Expression expr) and implements the Expression interface.

import com.github.sgreben.regex_builder.ExpressionWrapper;

To use the class, simply extend it and call setExpression in your constructor or initialization block. You can then pass it to any regex-builder method that expects an Expression.

Reusable Apache log expression

Using ExpressionWrapper, we can package the Apache log example above as follows:

public class ApacheLog extends ExpressionWrapper {
    public final CaptureGroup ip, client, user, dateTime, method, request, protocol, responseCode, size;

    {
        Expression nonWhitespace = repeat1(CharClass.nonWhitespaceChar());
        ip = capture(nonWhitespace);
        client = capture(nonWhitespace);
        user = capture(nonWhitespace);
        dateTime = capture(sequence(
            repeat1(union(wordChar(), ':', '/')),
            whitespaceChar(),
            oneOf("+\\-"),
            repeat(digit(), 4)
        ));
        method = capture(nonWhitespace);
        request = capture(nonWhitespace);
        protocol = capture(nonWhitespace);
        responseCode = capture(repeat(CharClass.digit(), 3));
        size = capture(repeat1(CharClass.digit()));

        Expression expression = sequence(
            ip, ' ', client, ' ', user, " [", dateTime, "] \"", method, ' ', request, ' ', protocol, "\" ", responseCode, ' ', size
        );
        setExpression(expression);
    }
}

We can then use instances of the packaged expression like this:

public static boolean sameIP(String twoLogs) {
    ApacheLog log1 = new ApacheLog();
    ApacheLog log2 = new ApacheLog();
    Pattern p = Pattern.compile(sequence(
        log1, ' ', log2
    ));
    Matcher m = p.matcher(twoLogs);
    m.find();
    return m.group(log1.ip).equals(m.group(log2.ip));
}
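
To see what the wrapper saves you, here is a sketch of the same "two log lines, same IP" check using only java.util.regex. Because both fragments end up in one pattern, each needs its own group name, so the fragment takes a name prefix. PrefixedLogFragment and its methods are hypothetical helpers invented for this comparison, and only the IP is captured to keep the sketch short.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A reusable log fragment built from strings and named groups (stdlib only).
// All names here are illustrative; none of this is regex-builder API.
final class PrefixedLogFragment {
    private final String prefix;

    PrefixedLogFragment(String prefix) {
        this.prefix = prefix;
    }

    // Group names must be unique within a pattern, hence the prefix.
    String ipGroup() {
        return prefix + "Ip";
    }

    String regex() {
        return "(?<" + ipGroup() + ">\\S+) \\S+ \\S+ \\[[\\w:/]+\\s[+\\-]\\d{4}\\] "
             + "\"\\S+ \\S+ \\S+\" \\d{3} \\d+";
    }

    static boolean sameIp(String twoLogs) {
        PrefixedLogFragment log1 = new PrefixedLogFragment("first");
        PrefixedLogFragment log2 = new PrefixedLogFragment("second");
        Matcher m = Pattern.compile(log1.regex() + " " + log2.regex()).matcher(twoLogs);
        // Unlike the example above, a failed find() is reported as "not the same".
        return m.find() && m.group(log1.ipGroup()).equals(m.group(log2.ipGroup()));
    }
}
```

With ExpressionWrapper the two instances carry their own CaptureGroup objects, so no naming scheme is needed.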

API

Expression builder

Builder method                java.util.regex syntax
repeat(e, N)                  e{N}
repeat(e)                     e*
repeat(e).possessive()        e*+
repeatPossessive(e)           e*+
repeat1(e)                    e+
repeat1(e).possessive()       e++
repeat1Possessive(e)          e++
optional(e)                   e?
optional(e).possessive()      e?+
optionalPossessive(e)         e?+
capture(e)                    (e)
positiveLookahead(e)          (?=e)
negativeLookahead(e)          (?!e)
positiveLookbehind(e)         (?<=e)
negativeLookbehind(e)         (?<!e)
backReference(g)              \g
separatedBy(sep, e)           (?:e((?:sep)(?:e))*)?
separatedBy1(sep, e)          e(?:(?:sep)(?:e))*
choice(e1,...,eN)             (?:e1|...|eN)
sequence(e1,...,eN)           e1...eN
string(s)                     \Qs\E
word()                        \w+
number()                      \d+
whitespace()                  \s*
whitespace1()                 \s+
CaptureGroup g = capture(e)   (?g e)
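
A few rows of this table can be checked directly against java.util.regex. The sketch below uses only the standard library and takes the right-hand column at face value. BuilderSyntaxCheck is a made-up name, and nothing here calls regex-builder itself.

```java
import java.util.regex.Pattern;

// Spot checks of the java.util.regex syntax listed in the table (stdlib only).
final class BuilderSyntaxCheck {
    // string(s) row: Pattern.quote wraps a literal without "\E" in \Q...\E.
    static String quoted(String literal) {
        return Pattern.quote(literal);
    }

    // repeat(e) row: the greedy a* gives back one 'a' so the trailing 'a' can match.
    static boolean greedyMatches(String input) {
        return Pattern.matches("a*a", input);
    }

    // repeatPossessive(e) row: a*+ keeps every 'a', so the trailing 'a' never matches.
    static boolean possessiveMatches(String input) {
        return Pattern.matches("a*+a", input);
    }

    // separatedBy1(sep, e) row with sep = "," and e = \d+.
    static boolean commaSeparatedNumbers(String input) {
        return Pattern.matches("\\d+(?:(?:,)(?:\\d+))*", input);
    }
}
```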

CharClass builder

Builder method                  java.util.regex syntax
range(from, to)                 [from-to]
range(f1, t1, ..., fN, tN)      [f1-t1f2-t2...fN-tN]
oneOf("abcde")                  [abcde]
union(class1, ..., classN)      [[class1]...[classN]]
complement(class1)              [^[class1]]
anyChar()                       .
digit()                         \d
nonDigit()                      \D
hexDigit()                      [a-fA-F0-9]
nonHexDigit()                   [^[a-fA-F0-9]]
wordChar()                      \w
nonWordChar()                   \W
wordBoundary()                  \b
nonWordBoundary()               \B
whitespaceChar()                \s
nonWhitespaceChar()             \S
verticalWhitespaceChar()        \v
nonVerticalWhitespaceChar()     \V
horizontalWhitespaceChar()      \h
nonHorizontalWhitespaceChar()   \H

regex-builder's People

Contributors

dependabot[bot], sgreben, stefanlobbenmeier


regex-builder's Issues

Complement still not working as expected

So I already made a pull request #4 to fix the issue partially, but I think there is a larger issue left, and I am not quite sure how to fix it. While the code now matches your API documentation in the markdown, I think the definition in the markdown is not accurate.

For example, while your documentation states that

nonHexDigit() [^[a-fA-F0-9]]

It should actually be

nonHexDigit() [^a-fA-F0-9]

I am not sure how to fix it since your complement is supposed to apply to any CharClass. For now, I will add a NoneOf class.

Support for named capture groups

Thanks for sharing this library @sgreben, it looks really neat!

Is it possible to give capture groups a name? So that I can generate a regexp like:

(?<date>\d{2})-(?<month>\d{2})-(?<year>\d{4})
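
For reference, java.util.regex itself accepts this syntax and lets you read the groups back by name. The sketch below shows that with the standard library only. It says nothing about whether regex-builder can generate such groups, and NamedDateGroups is a made-up class name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Named capture groups with plain java.util.regex (not regex-builder).
final class NamedDateGroups {
    private static final Pattern DATE =
        Pattern.compile("(?<date>\\d{2})-(?<month>\\d{2})-(?<year>\\d{4})");

    // Returns {date, month, year}, or null if the whole input does not match.
    static String[] parse(String input) {
        Matcher m = DATE.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group("date"), m.group("month"), m.group("year") };
    }
}
```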

Use \ for Fluent.match(char) and Fluent.match(String)

Currently, Fluent.match(char) will take the character (e.g. $) and wrap it with \Q and \E (e.g. \Q$\E). This is harder to read than \ (e.g. \$). Please make Fluent.match(char) smart in that it will add \ only when necessary.

Furthermore, please make Fluent.match(String) smart in that it will add \ for each character that needs it and if there are 5 or more characters in sequence that need escaping then use \Q and \E. Why 5 or more? To minimize the total length of the regular expression. Adding \Q and \E adds 4 characters. If there are 4 or fewer characters needing escaping, then adding 4 or fewer \ will be easier to read.

This will make it easier for humans to read the regular expression.
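
A rough sketch of the heuristic described in this request, written against java.util.regex only. It is not how regex-builder escapes literals. The class name, the metacharacter set, and the threshold constant are all assumptions made for illustration. Runs of fewer than five consecutive metacharacters get one backslash each, and longer runs are wrapped in \Q and \E.

```java
import java.util.regex.Pattern;

// Hypothetical escaper: backslashes for short runs of metacharacters,
// \Q...\E for runs of QUOTE_THRESHOLD or more. Not part of regex-builder.
final class CompactEscaper {
    // Assumed set of characters that need escaping outside a character class.
    private static final String META = "\\^$.|?*+()[]{}";
    private static final int QUOTE_THRESHOLD = 5;

    static String escape(String literal) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < literal.length()) {
            // Measure the run of consecutive metacharacters starting at i.
            int runEnd = i;
            while (runEnd < literal.length() && META.indexOf(literal.charAt(runEnd)) >= 0) {
                runEnd++;
            }
            int runLength = runEnd - i;
            if (runLength == 0) {
                out.append(literal.charAt(i));   // ordinary character, copy as-is
                i++;
            } else if (runLength >= QUOTE_THRESHOLD) {
                out.append("\\Q").append(literal, i, runEnd).append("\\E");
                i = runEnd;
            } else {
                for (int k = i; k < runEnd; k++) {
                    out.append('\\').append(literal.charAt(k));
                }
                i = runEnd;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String literal = "1+1=(2)";
        System.out.println(escape(literal));
        System.out.println(Pattern.matches(escape(literal), literal));
    }
}
```

A quoted run can never contain the sequence \E, because E is not a metacharacter and therefore ends the run.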

Simplify the regular expression from CharClass.union()

Let's say I have the following code CharClass.union(CharClass.digit(), ','). Currently, it makes this regular expression: [\d[,]]. Please simplify to [\d,].

This will make it easier for humans to read the regular expression.
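
The two forms accept the same characters, which can be checked with java.util.regex alone. The sketch below compares two character classes over the ASCII range. UnionEquivalence is a made-up helper, and the check is a spot test over 128 characters, not a proof.

```java
import java.util.regex.Pattern;

// Compares two character classes on every ASCII character (stdlib only).
final class UnionEquivalence {
    static boolean sameOnAscii(String classA, String classB) {
        Pattern a = Pattern.compile(classA);
        Pattern b = Pattern.compile(classB);
        for (char c = 0; c < 128; c++) {
            String s = String.valueOf(c);
            if (a.matcher(s).matches() != b.matcher(s).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(sameOnAscii("[\\d[,]]", "[\\d,]"));
    }
}
```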

replaceAll API

From lukaseder on reddit:

Feature request: One of the biggest pains when using large regular expressions is to match capturing groups with group indexes (e.g. when using replaceAll). It would be great if I could also compose replaceAll patterns by re-using the groups in the regex.
Example:

"abc".replaceAll("(b)", "$1$1");

In your API (pseudo code):

Group b = group("b");
Expression sequence = sequence(b);
"abc".replaceAll(sequence, sequence.index(b) + sequence.index(b));

If I now add another group in front of group b, the index will be recalculated to $2.
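
The index problem described here can be reproduced with plain java.util.regex, which also offers named references in replacement strings as a partial workaround. A small sketch follows. The class and method names are illustrative only, and this is standard-library behavior, not regex-builder's replacement API.

```java
// Index-based versus name-based group references in String.replaceAll.
final class NamedReplacement {
    // Original request: duplicate the captured "b" by index.
    static String byIndex(String input) {
        return input.replaceAll("(b)", "$1$1");
    }

    // A group added in front shifts the indices: $1 now refers to "a".
    static String staleIndex(String input) {
        return input.replaceAll("(a)(b)", "$1$1");
    }

    // A named reference keeps pointing at the same group after the change.
    static String byName(String input) {
        return input.replaceAll("(a)(?<b>b)", "$1${b}${b}");
    }
}
```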

Add toString()

Please add toString() to all classes and return the current regular expression. This will help when stepping through code that is building a regular expression.
