marked's Introduction

The Märkəd Project

A Rust language project for parsing, filtering, selecting and serializing HTML and XML mark-up.

See the marked or marked-cli crates, or the README(s) and CHANGELOG(s) under this (GitHub-hosted) source tree and cargo workspace.

Feature Overview

Currently implemented features:

A vector-allocated, indexed, DOM-like tree structure

The marked::Document is a DOM-like tree structure suitable for HTML and XML. This was forked from the victor project (same author as html5ever) and further optimized. It is implemented as a (std) Vec of Node types, which reference parent, siblings and children via (std) NonZeroU32 indexes for space efficiency.
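
The layout described above can be illustrated with a minimal, std-only sketch. This is not the marked implementation: the names Tree, Node, NodeId and append are hypothetical, and reserving slot 0 as padding is an assumption made here so that every real index is non-zero.

```rust
use std::num::NonZeroU32;

/// Index of a node in the tree's Vec. Zero is never a valid index, so
/// Option<NodeId> occupies the same 4 bytes as a plain u32.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct NodeId(NonZeroU32);

pub struct Node {
    pub name: String,
    pub parent: Option<NodeId>,
    pub first_child: Option<NodeId>,
    pub last_child: Option<NodeId>,
    pub next_sibling: Option<NodeId>,
}

impl Node {
    fn new(name: &str, parent: Option<NodeId>) -> Self {
        Node {
            name: name.to_string(),
            parent,
            first_child: None,
            last_child: None,
            next_sibling: None,
        }
    }
}

pub struct Tree {
    // Assumption of this sketch: slot 0 holds an unused padding node.
    nodes: Vec<Node>,
}

impl Tree {
    pub fn new() -> Self {
        Tree { nodes: vec![Node::new("#padding", None)] }
    }

    pub fn node(&self, id: NodeId) -> &Node {
        &self.nodes[id.0.get() as usize]
    }

    fn node_mut(&mut self, id: NodeId) -> &mut Node {
        &mut self.nodes[id.0.get() as usize]
    }

    /// Allocates a node at the end of the Vec and links it as the last
    /// child of `parent`, if a parent is given.
    pub fn append(&mut self, parent: Option<NodeId>, name: &str) -> NodeId {
        assert!(self.nodes.len() < u32::MAX as usize, "too many nodes");
        let index = self.nodes.len() as u32;
        let id = NodeId(NonZeroU32::new(index).expect("slot 0 is reserved"));
        self.nodes.push(Node::new(name, parent));
        if let Some(p) = parent {
            let last = self.node(p).last_child;
            match last {
                Some(prev) => self.node_mut(prev).next_sibling = Some(id),
                None => self.node_mut(p).first_child = Some(id),
            }
            self.node_mut(p).last_child = Some(id);
        }
        id
    }

    /// Collects the direct children of `parent` by following sibling links.
    pub fn children(&self, parent: NodeId) -> Vec<NodeId> {
        let mut out = Vec::new();
        let mut next = self.node(parent).first_child;
        while let Some(id) = next {
            out.push(id);
            next = self.node(id).next_sibling;
        }
        out
    }
}
```

Because all links are plain indexes into one Vec, the whole tree is a single allocation that can be mutated through one mutable reference, without reference counting or interior mutability.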

html5ever integration

Includes HTML5 document and fragment parsing and HTML5 serialization (mark-up output). With the marked::Document (DOM), parsing and serialization are measurably faster (see benchmarks in source tree) than with the RcDom previously included with html5ever associated crates, and mutating the Document is more straightforward, via a mutable reference.

xml-rs integration

Strict, UTF-8 XML parsing to marked::Document is currently supported by integration of the xml-rs crate.

Legacy character encoding support

An estimated 5% of the web remains in encodings other than UTF-8; too common to be treated as an error. Via marked::html::parse_buffered:

  • Decoding via encoding_rs which implements The Encoding Standard including alternative names (labels) for supported encodings.

  • HTML5 parsing restart from initial (4k) buffer with new encoding hints obtained from <head>/<meta> charset or an http-equiv content-type with charset.

  • Byte-Order-Mark (BOM) sniffing as high priority EncodingHint for UTF-8, UTF-16 Big-Endian and UTF-16 Little-Endian.

  • "Impossible" hints from the above are ignored. For example, a hint read from a document decoded as UTF-8 that says it's UTF-16LE is discarded: if the document really were UTF-16LE, the hint could not have been read that way in the first place.

(Note that the detection features are not currently provided by html5ever and associated crates.)
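
The BOM sniffing and "impossible hint" rules listed above can be sketched with the standard library alone. This is only an illustration of the idea, not the code behind marked::html::parse_buffered: the names BomEncoding, sniff_bom and meta_hint_is_possible are hypothetical, and the label check is a deliberate simplification of the label table in The Encoding Standard.

```rust
/// Encodings that can be identified from a leading byte-order mark.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum BomEncoding {
    Utf8,
    Utf16Be,
    Utf16Le,
}

/// Returns the encoding implied by a leading BOM, together with the BOM's
/// length in bytes, or None when the buffer does not start with a BOM.
pub fn sniff_bom(buf: &[u8]) -> Option<(BomEncoding, usize)> {
    if buf.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some((BomEncoding::Utf8, 3))
    } else if buf.starts_with(&[0xFE, 0xFF]) {
        Some((BomEncoding::Utf16Be, 2))
    } else if buf.starts_with(&[0xFF, 0xFE]) {
        Some((BomEncoding::Utf16Le, 2))
    } else {
        None
    }
}

/// A <meta> charset hint can only have been read while decoding the bytes
/// as an ASCII-compatible encoding, so a hint naming UTF-16 contradicts
/// itself and is rejected. Simplification: only labels beginning with
/// "utf-16" are treated as UTF-16 here.
pub fn meta_hint_is_possible(label: &str) -> bool {
    !label.trim().to_ascii_lowercase().starts_with("utf-16")
}
```

A caller would typically try sniff_bom on the initial buffer first, and only fall back to lower priority hints such as <meta> charset when no BOM is present.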

Rust "selectors" API

A NodeRef type with "CSS selectors"-like methods to recursively select and find elements using closure predicates. We prefer direct Rust language compiler support for writing such selection logic over CSS or another interpreted DSL.
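
To show what selection by closure predicate looks like in practice, here is a small self-contained sketch. It does not use the marked NodeRef type: Element, find_all and the builder methods are hypothetical stand-ins, and the tree owns its children directly for brevity.

```rust
/// A simplified element with owned children (hypothetical; not NodeRef).
pub struct Element {
    pub name: String,
    pub attrs: Vec<(String, String)>,
    pub children: Vec<Element>,
}

impl Element {
    pub fn new(name: &str) -> Self {
        Element { name: name.to_string(), attrs: Vec::new(), children: Vec::new() }
    }

    /// Builder-style helper: adds an attribute.
    pub fn attr(mut self, key: &str, value: &str) -> Self {
        self.attrs.push((key.to_string(), value.to_string()));
        self
    }

    /// Builder-style helper: appends a child element.
    pub fn child(mut self, child: Element) -> Self {
        self.children.push(child);
        self
    }

    pub fn get_attr(&self, key: &str) -> Option<&str> {
        self.attrs
            .iter()
            .find(|pair| pair.0 == key)
            .map(|pair| pair.1.as_str())
    }

    /// Returns all descendants matching `pred`, in document order.
    pub fn find_all<P>(&self, pred: P) -> Vec<&Element>
    where
        P: Fn(&Element) -> bool,
    {
        let mut found = Vec::new();
        // Explicit stack; children are pushed in reverse so that the first
        // child is visited first (pre-order traversal).
        let mut stack: Vec<&Element> = self.children.iter().rev().collect();
        while let Some(el) = stack.pop() {
            if pred(el) {
                found.push(el);
            }
            stack.extend(el.children.iter().rev());
        }
        found
    }
}
```

A selection such as "all a elements with an href attribute" is then an ordinary closure, for example doc.find_all(|e| e.name == "a" && e.get_attr("href").is_some()), which the compiler type-checks like any other Rust code.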

HTML tag and attribute metadata

See marked::html::t (tags) and marked::html::a (attributes) modules.

Tree walking filters API

Bulk modifications to the DOM are easily and efficiently achieved with mutating filter functions/closures and a tree walker (depth or breadth-first) implementation in marked. This style of interface is sometimes called the "visitor pattern". See Document::filter_at for details. The crate also includes the following built-in filters (a partial list):

detach_banned_element : Detach known banned (via metadata) and unknown elements

retain_basic_attributes : Remove all attributes that are not part of the "basic" logical set (via metadata)

fold_empty_inline : Fold empty or meaninglessly "inline" elements

text_normalize : Normalize text nodes by merging, replacing control characters and minimizing white-space.
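
The tree walking filter style described above can be sketched as follows. This is an illustration under stated assumptions rather than the marked API: Node, Action, filter_children and example_filter are hypothetical names, the tree owns its children directly instead of using a Vec of indexed nodes, and the walk is depth-first with children filtered before their parent.

```rust
#[derive(Debug, PartialEq)]
pub enum Node {
    Text(String),
    Elem { name: String, children: Vec<Node> },
}

/// What the walker should do with a node after the filter has seen it.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Action {
    /// Keep the node as is.
    Continue,
    /// Remove the node together with its descendants.
    Detach,
    /// Remove the node but keep its children in its place.
    Fold,
}

/// Depth-first walk: each node's children are filtered before the node
/// itself is passed to `f`.
pub fn filter_children<F>(children: &mut Vec<Node>, f: &mut F)
where
    F: FnMut(&mut Node) -> Action,
{
    let old = std::mem::take(children);
    for mut node in old {
        if let Node::Elem { children: ref mut kids, .. } = node {
            filter_children(kids, f);
        }
        match f(&mut node) {
            Action::Continue => children.push(node),
            Action::Detach => {}
            Action::Fold => {
                // Folding a text node has nothing to keep, so it is dropped.
                if let Node::Elem { children: kids, .. } = node {
                    children.extend(kids);
                }
            }
        }
    }
}

/// Hypothetical example filter: detach script elements, fold span elements.
pub fn example_filter(node: &mut Node) -> Action {
    match *node {
        Node::Elem { ref name, .. } => match name.as_str() {
            "script" => Action::Detach,
            "span" => Action::Fold,
            _ => Action::Continue,
        },
        Node::Text(_) => Action::Continue,
    }
}
```

Returning an action from the closure, instead of mutating the parent's child list inside it, keeps each filter small and lets the walker apply all structural changes in a single pass.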

An unreleased example, compatibility test and benchmark of filtering equivalent to the ammonia crate (for hygiene and safety) is included in the source tree (./ammonia-compare).

Roadmap

Features incomplete or unstarted which may be included in this project in the future (PRs welcome):

  • Complete (faster, more correct, legacy encodings) strict-mode XML parsing

  • Lenient-mode XML parsing

  • Optional (opt-in) direct charset detection (initial read buffer or entire document) via something like chardet, integrated as high priority EncodingHint.

  • XML/HTML pretty-indenting serialization (combines well with the existing white-space normalization features)

  • XML (and XHTML) serialization

License

This project is dual licensed under either of the following:

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the märkəd project by you, as defined by the Apache License, shall be dual licensed as above, without any additional terms or conditions.

marked's Issues

Dependencies in last release are too broad

For example all of these, which leaked through from our private development days:

html5ever       = { version=">=0.25.1, <2.0" }
tendril         = { version=">=0.4.1,  <2.0", features=["encoding_rs"] }
encoding_rs     = { version=">=0.8.13, <2.0" }
