Markup.ml is a pair of best-effort parsers implementing the HTML5 and XML specifications. Usage is simple, because each parser is just a function from byte streams to parsing signal streams:
In addition to being error-correcting, the parsers are:
- streaming: capable of parsing partial input and emitting signals while more input is still being received;
- lazy: not parsing input unless it is needed to emit the next parsing signal, so you can easily stop parsing partway through a document;
- non-blocking: they can be used with Lwt, but still provide a straightforward synchronous interface for simple usage; and
- one-pass: memory consumption is limited since the parsers don't build up a document representation, nor buffer input beyond a small amount of lookahead.
The parsers detect character encodings automatically. Strings emitted are in UTF-8.
Here is a breakdown showing the signal stream and errors emitted during the
parsing and pretty-printing of bad_html
:
string bad_html "<body><p><em>Markup.ml<p>rocks!"
|> parse_html `Start_element "body"
|> signals `Start_element "p"
`Start_element "em"
`Text ["Markup.ml"]
~report (1, 10) (`Unmatched_start_tag "em")
`End_element (* /em: recovery *)
`End_element (* /p: not an error *)
`Start_element "p"
`Start_element "em" (* recovery *)
`Text ["rocks!"]
`End_element (* /em *)
`End_element (* /p *)
`End_element (* /body *)
|> pretty_print (* adjusts the `Text signals *)
|> write_html
|> to_channel stdout;; "...shown above..." (* valid HTML *)
The parsers are subjected to thorough testing.
For a higher-level parser, see Lambda Soup, which is based on Markup.ml, but can search documents using CSS selectors, and perform various manipulations.
The interface is centered around four functions between byte streams and signal
streams: parse_html
, write_html
,
parse_xml
, and write_xml
. These have several
optional arguments for fine-tuning their behavior. The rest of the functions
either input or output byte streams, or
transform signal streams in some interesting way.
Here is an example with an optional argument:
(* Show up to 10 XML well-formedness errors to the user. Stop after
the 10th, without reading more input. *)
let report =
let count = ref 0 in
fun location error ->
error |> Error.to_string ~location |> prerr_endline;
count := !count + 1;
if !count >= 10 then raise_notrace Exit
file "some.xml" |> parse_xml ~report |> signals |> drain
Advanced: Cohttp + Markup.ml + Lambda Soup + Lwt
The code below is a complete program that requests a Google search, then
performs a streaming scrape of result titles. The first GitHub link is printed,
then the program exits without waiting for the rest of input. Perhaps early exit
is not so important for a Google results page, but it may be needed for large
documents. Memory consumption is low because only the h3
elements are
converted into DOM-like trees.
open Lwt.Infix
let () =
Markup_lwt.ensure_tail_calls (); (* Workaround for current Lwt :( *)
Lwt_main.run begin
Uri.of_string "https://www.google.com/search?q=markup.ml"
|> Cohttp_lwt_unix.Client.get
>|= snd (* Assume success and get body. *)
>|= Cohttp_lwt_body.to_stream (* Now an Lwt_stream.t. *)
>|= Markup_lwt.lwt_stream (* Now a Markup.stream. *)
>|= Markup.strings_to_bytes
>|= Markup.parse_html
>|= Markup.signals
>|= Markup.elements (fun name _ -> snd name = "h3")
>>= Markup_lwt.iter begin fun h3_subtree ->
h3_subtree
|> Markup_lwt.to_list
>|= Markup.of_list
>|= Soup.from_signals
>|= fun soup ->
let open Soup in
match soup $? "a[href*=github]" with
| None -> ()
| Some a ->
a |> texts |> List.iter print_string;
print_newline ();
exit 0
end
end
This prints aantron/markup.ml · GitHub
. To run it, do:
ocamlfind opt -linkpkg -package lwt.unix -package cohttp.lwt \
-package markup.lwt -package lambdasoup scrape.ml && ./a.out
You can get all the necessary packages by
opam install lwt ssl cohttp lambdasoup markup
opam install markup
The interface of Markup.ml is three modules: Markup
,
Markup_lwt
, and Markup_lwt_unix
. The last
two are available only if you have Lwt installed.
The documentation includes a summary of the conformance status of Markup.ml.
Parsing markup has more applications than one person can easily think of, which makes it difficult to do exhaustive testing. I would greatly appreciate any bug reports.
Although the parsers are in an "advanced" state of completion, there is still considerable work to be done on standard conformance and speed. Again, any help would be appreciated.
I have much more experience with Lwt than Async, so if you would like to create an Async interface, it would be very welcome.
Please see the CONTRIBUTING
file. Feel free to open issues on
GitHub, or send me an email at [email protected].
Markup.ml is distributed under the BSD license. The Markup.ml source
distribution includes a copy of the HTML5 entity list, which is distributed
under the W3C document license. The copyright notices and text of this license
are found in LICENSE
.