ashtonthomas / markup.ml

This project forked from aantron/markup.ml

Error-recovering streaming HTML5 and XML parsers.

Home Page: http://aantron.github.io/markup.ml

Makefile 1.04% OCaml 98.77% Standard ML 0.19%

markup.ml's Introduction

Markup.ml (version 0.7)

Markup.ml is a pair of best-effort parsers implementing the HTML5 and XML specifications. Usage is simple, because each parser is just a function from byte streams to parsing signal streams:

[Figure: usage example]
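
A minimal sketch of that kind of pipeline, assuming the markup package is installed. The helper name `repair_html` is hypothetical and not part of the library; it only chains the byte-stream and signal-stream functions described in this README.

```ocaml
(* Requires the markup package (opam install markup). *)

(* Hypothetical helper: parse possibly malformed HTML, then serialize the
   recovered signal stream back into a string. *)
let repair_html (html : string) : string =
  Markup.string html
  |> Markup.parse_html
  |> Markup.signals
  |> Markup.write_html
  |> Markup.to_string

let repaired = repair_html "<p><em>Markup.ml<p>rocks!"

let () = print_endline repaired
```

Every stage is an ordinary function, so stages can be added or removed without changing the rest of the pipeline.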

In addition to being error-correcting, the parsers are:

  • streaming: capable of parsing partial input and emitting signals while more input is still being received;
  • lazy: not parsing input unless it is needed to emit the next parsing signal, so you can easily stop parsing partway through a document;
  • non-blocking: they can be used with Lwt, but still provide a straightforward synchronous interface for simple usage; and
  • one-pass: memory consumption is limited since the parsers don't build up a document representation, nor buffer input beyond a small amount of lookahead.
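
The lazy, pull-based behavior can be observed with a small sketch like the one below. It assumes the markup package is installed; the counter and the sample document are made up for illustration. The byte source is a function that the parser calls only when it needs another byte, and the program asks for a single signal.

```ocaml
(* Requires the markup package (opam install markup). *)

let input = "<root><a>first</a><b>second</b></root>"

(* Number of bytes the parser has requested so far. *)
let consumed = ref 0

(* A byte stream backed by a function; it is called on demand only. *)
let bytes =
  Markup.fn (fun () ->
    if !consumed >= String.length input then None
    else begin
      let c = input.[!consumed] in
      incr consumed;
      Some c
    end)

let signal_stream = bytes |> Markup.parse_xml |> Markup.signals

(* Pull exactly one signal; nothing forces the rest of the input to be read. *)
let first = Markup.next signal_stream

let () =
  (match first with
   | Some (`Start_element ((_, local), _)) ->
     Printf.printf "first element: %s\n" local
   | _ -> print_endline "no start element");
  Printf.printf "bytes requested so far: %d of %d\n"
    !consumed (String.length input)
```

How many bytes are requested before the first signal depends on the parser's lookahead, so the sketch prints the count instead of assuming a value.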

The parsers detect character encodings automatically. Strings emitted are in UTF-8.
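
As a hedged illustration of the UTF-8 output, the sketch below bypasses automatic detection and names the input encoding explicitly through the optional `~encoding` argument. It assumes the markup package is installed; `text_of_signals` is a hypothetical helper written for this example.

```ocaml
(* Requires the markup package (opam install markup). *)

(* "caf" followed by byte 0xE9, which is e-acute in ISO-8859-1. *)
let latin1 = "<p>caf\xE9</p>"

(* Hypothetical helper: concatenate the contents of all `Text signals. *)
let text_of_signals signals =
  Markup.fold
    (fun acc signal ->
       match signal with
       | `Text parts -> acc ^ String.concat "" parts
       | _ -> acc)
    "" signals

(* The text comes back re-encoded as UTF-8, whatever the input encoding. *)
let utf8_text =
  Markup.string latin1
  |> Markup.parse_xml ~encoding:Markup.Encoding.iso_8859_1
  |> Markup.signals
  |> text_of_signals

let () = print_endline utf8_text
```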

Here is a breakdown showing the signal stream and errors emitted during the parsing and pretty-printing of bad_html:

string bad_html         "<body><p><em>Markup.ml<p>rocks!"

|> parse_html           `Start_element "body"
|> signals              `Start_element "p"
                        `Start_element "em"
                        `Text ["Markup.ml"]
                        ~report (1, 10) (`Unmatched_start_tag "em")
                        `End_element                   (* /em: recovery *)
                        `End_element                   (* /p: not an error *)
                        `Start_element "p"
                        `Start_element "em"            (* recovery *)
                        `Text ["rocks!"]
                        `End_element                   (* /em *)
                        `End_element                   (* /p *)
                        `End_element                   (* /body *)

|> pretty_print         (* adjusts the `Text signals *)

|> write_html
|> to_channel stdout;;  "...shown above..."            (* valid HTML *)
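
A breakdown like the one above can be approximated with a short program. This is a sketch that assumes the markup package is installed; it collects errors through `~report` and renders each signal with `signal_to_string`. The exact signals and error messages printed depend on the library version, so none are claimed here.

```ocaml
(* Requires the markup package (opam install markup). *)

let bad_html = "<body><p><em>Markup.ml<p>rocks!"

(* Error messages, most recent first. *)
let errors = ref []

let report location error =
  errors := Markup.Error.to_string ~location error :: !errors

(* One printable line per parsing signal. *)
let signal_lines =
  Markup.string bad_html
  |> Markup.parse_html ~report
  |> Markup.signals
  |> Markup.map Markup.signal_to_string
  |> Markup.to_list

let () =
  List.iter print_endline signal_lines;
  List.iter prerr_endline (List.rev !errors)
```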

The parsers are subjected to thorough testing.

For a higher-level parser, see Lambda Soup, which is based on Markup.ml, but can search documents using CSS selectors, and perform various manipulations.

Overview and basic usage

The interface is centered around four functions between byte streams and signal streams: parse_html, write_html, parse_xml, and write_xml. These have several optional arguments for fine-tuning their behavior. The rest of the functions either input or output byte streams, or transform signal streams in some interesting way.
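
To make the shape of the interface concrete, here is a minimal sketch that parses XML, transforms the signal stream with `Markup.map`, and writes it back out. It assumes the markup package is installed; `shout` is a hypothetical name chosen for this example.

```ocaml
(* Requires the markup package (opam install markup). *)

(* Hypothetical helper: uppercase every text node, leave other signals alone. *)
let shout (xml : string) : string =
  Markup.string xml
  |> Markup.parse_xml
  |> Markup.signals
  |> Markup.map (function
       | `Text parts -> `Text (List.map String.uppercase_ascii parts)
       | other -> other)
  |> Markup.write_xml
  |> Markup.to_string

let shouted = shout "<root><a>text</a></root>"

let () = print_endline shouted
```

The transformation sits between `signals` and `write_xml`, so it never sees raw bytes and never needs the whole document in memory.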

Here is an example with an optional argument:

open Markup

(* Show up to 10 XML well-formedness errors to the user. Stop after
   the 10th, without reading more input. *)
let report =
  let count = ref 0 in
  fun location error ->
    error |> Error.to_string ~location |> prerr_endline;
    count := !count + 1;
    if !count >= 10 then raise_notrace Exit

(* file returns the stream paired with a function that closes the file. *)
let () =
  file "some.xml" |> fst |> parse_xml ~report |> signals |> drain
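
The `Exit` exception raised by such a `report` function reaches whoever is consuming the stream, so a caller will usually want to catch it. The following sketch shows one way to do that with in-memory input. It assumes the markup package is installed; the limit of 3 and the malformed sample input are made up for illustration.

```ocaml
(* Requires the markup package (opam install markup). *)

let max_errors = 3
let count = ref 0

(* Print each error; give up once max_errors have been seen. *)
let report location error =
  prerr_endline (Markup.Error.to_string ~location error);
  incr count;
  if !count >= max_errors then raise_notrace Exit

(* true if parsing was abandoned because the error limit was reached. *)
let stopped_early =
  (* Deliberately malformed: ten end tags with no matching start tags. *)
  let xml = String.concat "" (List.init 10 (fun _ -> "</x>")) in
  match
    Markup.string xml
    |> Markup.parse_xml ~report
    |> Markup.signals
    |> Markup.drain
  with
  | () -> false
  | exception Exit -> true

let () =
  Printf.printf "errors seen: %d, stopped early: %b\n" !count stopped_early
```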

Advanced: Cohttp + Markup.ml + Lambda Soup + Lwt

The code below is a complete program that requests a Google search, then performs a streaming scrape of result titles. The first GitHub link is printed, then the program exits without waiting for the rest of the input. Early exit may not matter much for a Google results page, but it can be necessary for large documents. Memory consumption is low because only the h3 elements are converted into DOM-like trees.

open Lwt.Infix

let () =
  Markup_lwt.ensure_tail_calls ();    (* Workaround for current Lwt :( *)

  Lwt_main.run begin
    Uri.of_string "https://www.google.com/search?q=markup.ml"
    |> Cohttp_lwt_unix.Client.get
    >|= snd                           (* Assume success and get body. *)
    >|= Cohttp_lwt_body.to_stream     (* Now an Lwt_stream.t. *)
    >|= Markup_lwt.lwt_stream         (* Now a Markup.stream. *)
    >|= Markup.strings_to_bytes
    >|= Markup.parse_html
    >|= Markup.signals
    >|= Markup.elements (fun name _ -> snd name = "h3")
    >>= Markup_lwt.iter begin fun h3_subtree ->
      h3_subtree
      |> Markup_lwt.to_list
      >|= Markup.of_list
      >|= Soup.from_signals
      >|= fun soup ->
        let open Soup in
        match soup $? "a[href*=github]" with
        | None -> ()
        | Some a ->
          a |> texts |> List.iter print_string;
          print_newline ();
          exit 0
    end
  end

This prints aantron/markup.ml · GitHub. To run it, do:

ocamlfind opt -linkpkg -package lwt.unix -package cohttp.lwt \
    -package markup.lwt -package lambdasoup scrape.ml && ./a.out

You can get all the necessary packages by running:

opam install lwt ssl cohttp lambdasoup markup
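
To experiment with the same `elements` technique without a network connection, Lwt, or Lambda Soup, a synchronous sketch like the following may help. It assumes only the markup package. The sample document and the `text_of` helper are made up for illustration, and plain text extraction stands in for the CSS selector query used above.

```ocaml
(* Requires the markup package (opam install markup). *)

let html =
  "<!DOCTYPE html><html><body>\
   <h3>First</h3><p>skipped</p><h3>Second</h3>\
   </body></html>"

(* Hypothetical helper: concatenate the text content of a list of signals. *)
let text_of signals =
  signals
  |> List.map (function `Text parts -> String.concat "" parts | _ -> "")
  |> String.concat ""

(* Text of each h3 element, in document order. *)
let titles =
  let acc = ref [] in
  Markup.string html
  |> Markup.parse_html
  |> Markup.signals
  |> Markup.elements (fun name _ -> snd name = "h3")
  |> Markup.iter (fun h3_subtree ->
       (* Only the signals of this one element are held in memory. *)
       acc := text_of (Markup.to_list h3_subtree) :: !acc);
  List.rev !acc

let () = List.iter print_endline titles
```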

Installing

opam install markup

Documentation

The interface of Markup.ml consists of three modules: Markup, Markup_lwt, and Markup_lwt_unix. The last two are available only if you have Lwt installed.

The documentation includes a summary of the conformance status of Markup.ml.

Help wanted

Parsing markup has more applications than one person can easily think of, which makes it difficult to do exhaustive testing. I would greatly appreciate any bug reports.

Although the parsers are in an "advanced" state of completion, there is still considerable work to be done on standards conformance and speed. Again, any help would be appreciated.

I have much more experience with Lwt than Async, so if you would like to create an Async interface, it would be very welcome.

Please see the CONTRIBUTING file. Feel free to open issues on GitHub, or send me an email.

License

Markup.ml is distributed under the BSD license. The Markup.ml source distribution includes a copy of the HTML5 entity list, which is distributed under the W3C document license. The copyright notices and text of this license are found in LICENSE.
