This project is forked from juxt/reap.

reap

Regular Expressions for Accurate Parsing

A Clojure library for decoding and encoding strings used by web protocols.

Warning

STATUS: Alpha. Ready to use but the API is likely to change in future versions.

Quick Start

Suppose you want to decode an Accept header from an HTTP request. For example, Firefox sends one like this:

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

whereas a Chrome browser on Windows 7 might send:

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9

You can use reap to parse that header’s value into data you can more easily work with.

Here’s how:

(require
  '[juxt.reap.alpha.decoders.rfc7231 :refer [accept]]
  '[juxt.reap.alpha.regex :as re])

(let [decoder (accept {})]
  (decoder
    (re/input
      "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9")))

This will return the following sequence of items:

#:juxt.reap.alpha.rfc7231
{:media-range "text/html",
 :type "text",
 :subtype "html",
 :parameters {}}

#:juxt.reap.alpha.rfc7231
{:media-range "application/xhtml+xml",
 :type "application",
 :subtype "xhtml+xml",
 :parameters {}}

#:juxt.reap.alpha.rfc7231
{:media-range "application/xml",
 :type "application",
 :subtype "xml",
 :parameters {},
 :qvalue 0.9}

#:juxt.reap.alpha.rfc7231
{:media-range "image/webp",
 :type "image",
 :subtype "webp",
 :parameters {}}

#:juxt.reap.alpha.rfc7231
{:media-range "image/apng",
 :type "image",
 :subtype "apng",
 :parameters {}}

#:juxt.reap.alpha.rfc7231
{:media-range "*/*",
 :type "*",
 :subtype "*",
 :parameters {},
 :qvalue 0.8}

#:juxt.reap.alpha.rfc7231
{:media-range "application/signed-exchange",
 :type "application",
 :subtype "signed-exchange",
 :parameters {"v" "b3"},
 :qvalue 0.9}
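Because the result is plain Clojure data, ordinary sequence functions apply to it. As a minimal sketch, assuming the map shape shown above, the items can be ordered by preference. The function name by-preference is made up for this example and is not part of reap. Treating a missing :qvalue as 1.0 follows the default weight in RFC 7231 and is handled here by the sketch, not by the decoder.

```clojure
;; Sample data: three of the decoded items shown above, deliberately shuffled.
(def decoded
  [#:juxt.reap.alpha.rfc7231{:media-range "*/*"
                             :type "*"
                             :subtype "*"
                             :parameters {}
                             :qvalue 0.8}
   #:juxt.reap.alpha.rfc7231{:media-range "application/xml"
                             :type "application"
                             :subtype "xml"
                             :parameters {}
                             :qvalue 0.9}
   #:juxt.reap.alpha.rfc7231{:media-range "text/html"
                             :type "text"
                             :subtype "html"
                             :parameters {}}])

(defn by-preference
  "Hypothetical helper: orders decoded media ranges from most to least
  preferred. Items without a :qvalue are given a weight of 1.0."
  [items]
  ;; sort-by is stable, so items with equal weight keep their original order.
  (sort-by #(- (get % :juxt.reap.alpha.rfc7231/qvalue 1.0)) items))

(map :juxt.reap.alpha.rfc7231/media-range (by-preference decoded))
;; => ("text/html" "application/xml" "*/*")
```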

reap contains parsers for most things you’d want to parse when writing web applications, so you can focus on writing your app without worrying about writing parsers. It’s fast too, so you don’t have to worry about a performance impact.

Introduction

The Internet is a system of interoperable computer software written to a set of exacting specifications (RFCs) published by the Internet Engineering Task Force.

Many Internet protocols, notably HTTP, are textual in nature.

Software components of the Internet must be able to encode and decode strings of text efficiently and accurately in order to process them correctly.

Problem Statement

There are not many tools of sufficient quality which can help with the decoding and encoding of text strings, especially those defined in RFCs.

Therefore, programmers are often left to write their own 'quick and dirty' code. This leads to software that does not properly implement (and is not fully conformant with) the rules defined in the RFCs.

Programmers often have to strike a balance between conforming to the rules laid down by the RFCs and competing priorities such as meeting performance requirements and project deadlines.

Unfortunately, code that violates any aspect of a specification can lead to an unhealthy Internet. Time is wasted debugging interoperability problems, and buggy implementations cause problems for users and, in some cases, lead to security vulnerabilities.

Example: the HTTP Accept header

In RFC 7231 (which defines part of HTTP), the Accept header is specified by the following rule:

Accept = [ ( "," / ( media-range [ accept-params ] ) ) *( OWS "," [
    OWS ( media-range [ accept-params ] ) ] ) ]

As well as indicating the ways that various punctuation and other characters can be combined, the rule makes reference to other rules, such as media-range:

media-range = ( "*/*" / ( type "/*" ) / ( type "/" subtype ) ) *( OWS
    ";" OWS parameter )

A type here is a token, defined in another RFC (RFC 7230), which states a token is a sequence of at least one tchar:

token = 1*tchar
tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" / "-" / "." /
    "^" / "_" / "`" / "|" / "~" / DIGIT / ALPHA
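To give a feel for what the token rule amounts to in practice, here is a hand-written Clojure regex for it. This is only a sketch: the name token-re is made up for this example, and the pattern is written by hand, not generated by reap.

```clojure
;; token = 1*tchar, with DIGIT and ALPHA expanded to 0-9, A-Z and a-z.
;; The hyphen is escaped so that it is not read as a range operator.
(def token-re #"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")

(re-matches token-re "xhtml+xml") ;; => "xhtml+xml"
(re-matches token-re "text/html") ;; => nil, because "/" is not a tchar
(re-matches token-re "")          ;; => nil, a token needs at least one tchar
```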

Let’s leave aside DIGIT and ALPHA and return to the parameter rule, which itself is non-trivial:

parameter = token "=" ( token / quoted-string )

The rule tells us that values can be tokens, but can alternatively be enclosed in quotation marks:

quoted-string = DQUOTE *( qdtext / quoted-pair ) DQUOTE

What is contained within these quotation marks is subject to further exacting rules about which characters and character ranges are valid and how characters can be escaped by using quoted-pairs:

qdtext = HTAB / SP / "!" / %x23-5B ; '#'-'['
    / %x5D-7E ; ']'-'~'
    / obs-text
obs-text = %x80-FF
quoted-pair = "\" ( HTAB / SP / VCHAR / obs-text )
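The quoted-string rules above can be sketched as a single Clojure regex plus a small unescaping step. This is an illustration under stated assumptions, not reap's implementation: the names quoted-string-re and unquote-string are made up, and obs-text (octets %x80-FF) is approximated by the characters U+0080 to U+00FF because Java strings hold characters, not octets.

```clojure
(require '[clojure.string :as str])

;; DQUOTE *( qdtext / quoted-pair ) DQUOTE
;; Group 1 captures the content between the quotation marks.
(def quoted-string-re
  #"\"((?:[\t !\x23-\x5B\x5D-\x7E\x80-\xFF]|\\[\t \x21-\x7E\x80-\xFF])*)\"")

(defn unquote-string
  "Hypothetical helper: returns the content of a quoted-string with each
  quoted-pair replaced by the character it escapes, or nil if s is not a
  valid quoted-string."
  [s]
  (when-let [[_ inner] (re-matches quoted-string-re s)]
    ;; A quoted-pair is a backslash followed by one character; keep the character.
    (str/replace inner #"\\(.)" "$1")))

(unquote-string "\"b3\"")                ;; => "b3"
(unquote-string "\"say \\\"hi\\\" ok\"") ;; => "say \"hi\" ok"
(unquote-string "\"unterminated")        ;; => nil
```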

A media-range, which may itself carry parameters (where values are required), can optionally be followed by a special parameter (q) indicating the term's weight, which in turn can be followed by further parameters (where values are optional) called accept extensions.

These are the rules for just one HTTP request header, and it’s far from the most complex!

So it’s no surprise that programmers who resort to writing custom parsing code might skip a few details.

Alternatives

There are a number of excellent tools for generating text parsers, from venerable ones such as flex/bison to more modern ones including Antlr and Instaparse.

Unfortunately, these tools tend to be designed more for parsing languages than strings of characters. I haven’t found one which has built-in support for even some Internet RFCs. They also tend to be less efficient than Regular Expressions, which have been around for decades and have been heavily optimised in that time.

Ingredients

reap is built from some old ideas.

Lisp (1958)

Clojure is used as the implementation language to facilitate faster research and prototyping. If this project proves useful/stable it might be a good idea to port to Java and provide a Clojure wrapper.

Regular Expressions (1950s)

Everything in reap is ultimately compiled into a regular expression. Regexes provide the performance.

Allen’s Interval Algebra (1983)

Allen’s interval algebra allows character intervals to be manipulated and combined into a minimal set of ranges, which improves the performance of the regular expression.
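One thing this kind of interval reasoning makes possible is collapsing many small character intervals into a few ranges. The sketch below is an assumption about the general idea, not reap's code: merge-intervals is a made-up name, and it only handles the cases where two inclusive code-point intervals overlap or sit directly next to each other (the discrete counterparts of Allen's "overlaps" and "meets" relations).

```clojure
(defn merge-intervals
  "Hypothetical helper: merges overlapping or adjacent inclusive [lo hi]
  code-point intervals into the smallest equivalent list of intervals."
  [intervals]
  (reduce
    (fn [acc [lo hi]]
      (let [[prev-lo prev-hi] (peek acc)]
        (if (and prev-hi (<= lo (inc prev-hi)))
          ;; Overlaps or touches the previous interval: extend it.
          (conj (pop acc) [prev-lo (max prev-hi hi)])
          ;; Otherwise start a new interval.
          (conj acc [lo hi]))))
    []
    (sort intervals)))

;; DIGIT, upper-case ALPHA, "^", "_", "`" and lower-case ALPHA as code points.
(merge-intervals [[97 122] [48 57] [94 94] [95 95] [96 96] [65 90]])
;; => [[48 57] [65 90] [94 122]]
```

In this example the three single characters and the lower-case letters merge into one range, so a character class built from the result needs three ranges where the input had six.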

Parser Combinators (1989)

Parser combinators are used to combine parsers built from regular expressions.
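As a rough sketch of the idea (not reap's actual API), a parser can be modelled as a function from an input string to a pair of parsed value and remaining input, or nil on failure. The names regex-parser and sequence-of are made up for this example, and reap's own parsers take a java.util.regex.Matcher instead of a string.

```clojure
(defn regex-parser
  "Hypothetical helper: builds a parser that matches pattern at the start of
  the input and returns [matched-text remaining-input], or nil on no match."
  [^java.util.regex.Pattern pattern]
  (fn [^String input]
    (let [m (.matcher pattern input)]
      (when (.lookingAt m)
        [(.group m) (subs input (.end m))]))))

(defn sequence-of
  "Hypothetical combinator: runs parsers one after another and returns
  [vector-of-values remaining-input], or nil as soon as one of them fails."
  [& parsers]
  (fn [input]
    (loop [ps parsers, in input, acc []]
      (if-let [p (first ps)]
        (when-let [[v rest-in] (p in)]
          (recur (rest ps) rest-in (conj acc v)))
        [acc in]))))

;; type "/" subtype, built from two regex-backed parsers.
(def token (regex-parser #"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"))
(def slash (regex-parser #"/"))
(def media-type (sequence-of token slash token))

(media-type "text/html;q=0.9") ;; => [["text" "/" "html"] ";q=0.9"]
(media-type "/html")           ;; => nil
```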

User Guide

Functions marked with the metadata tag :juxt.reap/codec take an 'options' argument and return a map containing the following entries:

:juxt.reap/decode

A single-arity parser function, taking a java.util.regex.Matcher as the only argument and returning a Clojure map or sequence.

:juxt.reap/encode

A single-arity function, taking a Clojure map or sequence and returning a string.

Options

The 'options' argument is a map containing the following optional entries:

:juxt.reap/decode-preserve-case

Set to true to prevent the parser from transforming tokens that are treated as case-insensitive to lower-case. This lossy transformation simplifies case-insensitive comparisons. Defaults to nil (false).

:juxt.reap/encode-case-transform

Set to :lower to transform generated tokens to lower-case, where applicable (where the token is semantically case-insensitive). Set to :canonical to transform tokens and header values to their canonical case. Defaults to nil.
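To make the shape of a codec concrete, here is a toy one written from scratch. It is a sketch under stated assumptions: toy-token-codec is hypothetical and not part of reap, it handles a single token only, and it ignores :canonical. It does follow the description above: it takes an options map and returns a map with :juxt.reap/decode and :juxt.reap/encode entries.

```clojure
(require '[clojure.string :as str])

(defn toy-token-codec
  "Hypothetical codec for a single case-insensitive token. Not part of reap."
  [opts]
  (let [pattern #"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"]
    {:juxt.reap/decode
     (fn [^java.util.regex.Matcher matcher]
       (.usePattern matcher pattern)
       (when (.lookingAt matcher)
         ;; Lower-case by default; keep the original case only when asked to.
         (cond-> (.group matcher)
           (not (:juxt.reap/decode-preserve-case opts)) str/lower-case)))
     :juxt.reap/encode
     (fn [token]
       (case (:juxt.reap/encode-case-transform opts)
         :lower (str/lower-case token)
         ;; :canonical is not implemented in this sketch; pass through unchanged.
         token))}))

(let [{:juxt.reap/keys [decode]} (toy-token-codec {})]
  (decode (re-matcher #"" "GZIP")))
;; => "gzip"

(let [{:juxt.reap/keys [decode]}
      (toy-token-codec {:juxt.reap/decode-preserve-case true})]
  (decode (re-matcher #"" "GZIP")))
;; => "GZIP"
```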

Contributors

andreacrotti, malcolmsparks