
rusterlium / html5ever_elixir


NIF wrapper of html5ever using Rustler

Home Page: https://hexdocs.pm/html5ever

License: Apache License 2.0

Elixir 25.33% Rust 26.72% HTML 47.95%
html5ever binding elixir erlang nif html-parser rustler

html5ever_elixir's People

Contributors

benjamin-philip, dependabot[bot], hansihe, mfeckie, notriddle, philss, tristanperalta


html5ever_elixir's Issues

html5ever mix compile failed.

Installing html5ever 0.7.0 causes mix compile to fail:
`
Is the erlang_nif-sys version up to date in the Cargo.toml?
Does 'cargo update' fix it?
If not please report at https://github.com/goertzenator/erlang_nif-sys.

--- stderr
thread 'main' panicked at 'gen_api.erl encountered an error.', /home/parsingpeppers/.cargo/registry/src/github.com-1ecc6299db9ec823/erlang_nif-sys-0.6.4/build.rs:28:13
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace.

warning: build failed, waiting for other jobs to finish...
error: build failed
could not compile dependency :html5ever, "mix compile" failed. You can recompile this dependency with "mix deps.compile html5ever", update it with "mix deps.update html5ever" or clean it with "mix deps.clean html5ever"
** (RuntimeError) Rust NIF compile error (rustc exit code 101)
lib/mix/tasks/compile.rustler.ex:60: Mix.Tasks.Compile.Rustler.compile_crate/1
(elixir) lib/enum.ex:1336: Enum."-map/2-lists^map/1-0-"/2
lib/mix/tasks/compile.rustler.ex:14: Mix.Tasks.Compile.Rustler.run/1
(mix) lib/mix/task.ex:331: Mix.Task.run_task/3
(mix) lib/mix/tasks/compile.all.ex:73: Mix.Tasks.Compile.All.run_compiler/2
(mix) lib/mix/tasks/compile.all.ex:53: Mix.Tasks.Compile.All.do_compile/4
(mix) lib/mix/tasks/compile.all.ex:24: anonymous fn/1 in Mix.Tasks.Compile.All.run/1
(mix) lib/mix/tasks/compile.all.ex:40: Mix.Tasks.Compile.All.with_logger_app/1
`

Error during build: "Unsupported Erlang version" with OTP 22.0

Generated rustler app
==> html5ever
Compiling NIF crate :html5ever_nif (native/html5ever_nif)...
   Compiling erlang_nif-sys v0.6.4
   Compiling rustler_codegen v0.18.0
   Compiling syn v0.15.22
   Compiling tendril v0.4.1
error: failed to run custom build command for `erlang_nif-sys v0.6.4`
process didn't exit successfully: `/home/user/app/_build/dev/rustler_crates/html5ever_nif/release/build/erlang_nif-sys-d37b2e3dcb9ae709/build-script-build` (exit code: 101)
--- stdout
Unsupported Erlang version.

Thanks!

Segfaults on highly nested html

Example:

iex(4)> Enum.reduce(1..4000, "", fn _,acc -> "<div>" <> acc end) |> Html5ever.parse()
Segmentation fault

While I doubt any HTML document really needs 4k nested tags, this could be abused by attackers if the library is used to parse user-generated content.

I am not too familiar with Rust, but I am pretty sure you are hitting a recursion depth limit when transforming the parsed tree into Erlang terms.
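
If the crash does come from recursion, one common remedy is to walk the tree with an explicit heap-allocated stack instead of the call stack. The following is a minimal Rust sketch of that idea, not the library's actual code: the `Node` type, the index-based flat layout, and the `render` and `nested_divs` helpers are all hypothetical names chosen for illustration, and text is emitted without escaping.

```rust
/// Hypothetical flat DOM: all nodes live in one Vec and refer to their
/// children by index, so building or dropping the tree never recurses.
#[derive(Debug)]
enum Node {
    Element { name: String, children: Vec<usize> },
    Text(String),
}

/// Work items kept on the explicit stack.
enum Step {
    /// Emit the node (and schedule its children).
    Visit(usize),
    /// Emit the closing tag of an element whose children are done.
    Close(usize),
}

/// Serializes the subtree rooted at `root` without using recursion.
/// Text is written as-is (no escaping) to keep the sketch short.
fn render(nodes: &[Node], root: usize) -> String {
    let mut out = String::new();
    let mut stack = vec![Step::Visit(root)];

    while let Some(step) = stack.pop() {
        match step {
            Step::Visit(id) => match &nodes[id] {
                Node::Text(text) => out.push_str(text),
                Node::Element { name, children } => {
                    out.push('<');
                    out.push_str(name);
                    out.push('>');
                    // The closing tag is popped only after all children.
                    stack.push(Step::Close(id));
                    // Push in reverse so children pop in document order.
                    for &child in children.iter().rev() {
                        stack.push(Step::Visit(child));
                    }
                }
            },
            Step::Close(id) => {
                if let Node::Element { name, .. } = &nodes[id] {
                    out.push_str("</");
                    out.push_str(name);
                    out.push('>');
                }
            }
        }
    }
    out
}

/// Builds `depth` nested <div> elements around a single text node and
/// returns the node storage together with the index of the root.
fn nested_divs(depth: usize) -> (Vec<Node>, usize) {
    let mut nodes = vec![Node::Text("x".to_string())];
    for _ in 0..depth {
        let child = nodes.len() - 1;
        nodes.push(Node::Element {
            name: "div".to_string(),
            children: vec![child],
        });
    }
    let root = nodes.len() - 1;
    (nodes, root)
}
```

Because the pending work lives in a `Vec`, the supported depth is bounded by available memory rather than by the thread's stack size, so a call such as `render(&nodes, root)` on the result of `nested_divs(100_000)` completes without overflowing.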

New release needed?

Hi, I am not able to use html5ever, apparently because of a dependency issue. The latest release, 1.14.0, only works with rustler_precompiled ~> 0.5.2, but I'm using another package, mjml_eex, that already depends on rustler_precompiled ~> 0.6.0. The latest master branch of html5ever does work with rustler_precompiled ~> 0.6.0, so I configured the html5ever dependency in my mix.exs to point to the master branch of the git repo. When I deployed my app, however, it didn't use a precompiled NIF. Instead, it tried to compile the project with cargo. I then installed cargo, but it couldn't find the command.

Is there a step missing that normally runs when a release of html5ever is created? Is that why no precompiled NIF was found when using the master branch?

panic: Templates not supported

I encounter a panic on some html:

thread '<unnamed>' panicked at src/flat_dom.rs:218:9:
Templates not supported

I understand that proper support is hard; is it possible to gracefully degrade by ignoring or emitting raw template tags?

In general, I think panicking should be avoided in a parser.

Add Cargo.lock

html5ever_elixir is impossible to build in a sandbox because of an unpinned build-time dependency fetch. Could you lock Cargo.toml?

Allow extracting of comments from an HTML document

I wonder if there is an easy way to extract comments embedded inside an HTML document.

I tried using html5ever with Floki. With the default parser, comments are present in the parsed document as

{:comment, "My Comment"}

but when I switch the parser to html5ever they are stripped. This can be verified by running:

html = """
<html><title>Some Title</title><body><!-- some comment --></body></html>
"""

Floki.parse_document(html)
|> IO.inspect()

Floki.parse_document(html, html_parser: Floki.HTMLParser.Html5ever)
|> IO.inspect()

that results in this output:

{:ok,
 [
   {"html", [],
    [{"title", [], ["Some Title"]}, {"body", [], [comment: " some comment "]}]}
 ]}
{:ok,
 [
   {"html", [],
    [{"head", [], [{"title", [], ["Some Title"]}]}, {"body", [], ["\n"]}]}
 ]}

Parsing non-UTF-8 pages

Parsing pages not written in UTF-8 currently produces errors:

> %HTTPoison.Response{body: body} = HTTPoison.get!("http://manybooks.net/index.xml")
> Html5ever.parse(body)

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Utf8Error { valid_up_to: 4070 }', src/libcore/result.rs:859
note: Run with `RUST_BACKTRACE=1` for a backtrace.
{:error, "called `Result::unwrap()` on an `Err` value: Utf8Error { valid_up_to: 4070 }"}

In this case, the XML feed declares its encoding in the XML preamble:

<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="2.0">
...

Can I get around this problem or can the library be fixed to handle this situation?
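
One possible workaround, sketched here in Rust under the assumption that the input really is ISO-8859-1, is to transcode the bytes to UTF-8 before handing them to the parser. Each ISO-8859-1 byte maps to the Unicode code point with the same value, so the conversion needs no lookup table. The function names `latin1_to_utf8` and `to_utf8` are hypothetical and not part of html5ever, and the fallback rule (treat anything that is not valid UTF-8 as ISO-8859-1) is an assumption made for this example rather than real encoding detection.

```rust
/// Decodes ISO-8859-1 bytes into a UTF-8 `String`.
/// Every byte 0x00..=0xFF corresponds to the code point U+0000..=U+00FF,
/// so a plain `u8 as char` cast performs the mapping.
fn latin1_to_utf8(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Returns the input as UTF-8 text.
/// Assumption for this sketch: input that fails UTF-8 validation is
/// ISO-8859-1. Real code should honor the declared encoding instead.
fn to_utf8(bytes: &[u8]) -> String {
    match std::str::from_utf8(bytes) {
        Ok(text) => text.to_owned(),
        Err(_) => latin1_to_utf8(bytes),
    }
}
```

On the Elixir side, the same conversion can be done before calling the parser with `:unicode.characters_to_binary(body, :latin1)`, which returns a UTF-8 binary.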

Large performance differences between using `parse_sync` or `parse_async`

When parsing a large (314 kB) HTML document to create a Meeseeks.Document, Html5ever.Native.parse_sync runs about 2x faster than Html5ever.Native.parse_async.

Example project: https://github.com/mischov/meeseeks_html5ever_parse

Example output:

$ MIX_ENV=prod mix run -e MeeseeksHtml5everParse.run
Running tests...
Parsed with Html5ever async in 17250.7 us
Parsed with Html5ever sync in 18877.3 us

Created Meeseeks Document from tuples in 6883.2 us

Parsed with Meeseeks async in 66956.9 us
Parsed with Meeseeks sync in 33076.9 us

Edit: I'm running Erlang/OTP 19.

Compilation issue on Mac OS X

I'm on Mac OS X El Capitan (10.11.6). Elixir version

Erlang/OTP 19 [erts-8.2] [source] [64-bit] [smp:8:8] [async-threads:10] [hipe] [kernel-poll:false] [dtrace]

Elixir 1.4.1

==> html5ever
Compiling NIF crate :html5ever_nif (native/html5ever_nif)...
could not compile dependency :html5ever, "mix compile" failed. You can recompile this dependency with "mix deps.compile html5ever", update it with "mix deps.update html5ever" or clean it with "mix deps.clean html5ever"
** (ErlangError) erlang error: :enoent
    (elixir) lib/system.ex:564: System.cmd("cargo", ["rustc", "--no-default-features", "--release", "--", "--codegen", "link-args=-flat_namespace -undefined suppress"], [cd: "/Users/jonathanlin/Documents/blitz/blitz-cms/deps/html5ever/native/html5ever_nif", stderr_to_stdout: true, env: [{"CARGO_TARGET_DIR", "/Users/jonathanlin/Documents/blitz/blitz-cms/_build/dev/rustler_crates/html5ever_nif"}], into: %IO.Stream{device: :standard_io, line_or_bytes: :line, raw: false}])
    lib/mix/tasks/compile.rustler.ex:49: Mix.Tasks.Compile.Rustler.compile_crate/1
    (elixir) lib/enum.ex:1229: Enum."-map/2-lists^map/1-0-"/2
    lib/mix/tasks/compile.rustler.ex:12: Mix.Tasks.Compile.Rustler.run/1
    (mix) lib/mix/task.ex:294: Mix.Task.run_task/3
    (elixir) lib/enum.ex:1229: Enum."-map/2-lists^map/1-0-"/2
    (mix) lib/mix/tasks/compile.all.ex:19: anonymous fn/1 in Mix.Tasks.Compile.All.run/1
    (mix) lib/mix/tasks/compile.all.ex:37: Mix.Tasks.Compile.All.with_logger_app/1

Erlang 20 is unsupported

I'm running the latest version of html5ever (0.4.0). After upgrading to Erlang 20, I got a compilation error on Ubuntu 16.04. Stack trace from running mix deps.compile html5ever:

Compiling NIF crate :html5ever_nif (native/html5ever_nif)...
   Compiling erlang_nif-sys v0.6.1
   Compiling rustler_codegen v0.14.0
   Compiling html5ever v0.16.0
   Compiling string_cache v0.5.0
error: failed to run custom build command for `erlang_nif-sys v0.6.1`
process didn't exit successfully: `/root/m_c_a/_build/dev/rustler_crates/html5ever_nif/release/build/erlang_nif-sys-ae2db8d6f62d8a63/build-script-build` (exit code: 101)
--- stdout
Unsupported Erlang version.

Is the erlang_nif-sys version up to date in the Cargo.toml?
Does 'cargo update' fix it?
If not please report at https://github.com/goertzenator/erlang_nif-sys.

--- stderr
thread 'main' panicked at 'gen_api.erl encountered an error.', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/erlang_nif-sys-0.6.1/build.rs:28
note: Run with `RUST_BACKTRACE=1` for a backtrace.

Build failed, waiting for other jobs to finish...
error: build failed
could not compile dependency :html5ever, "mix compile" failed. You can recompile this dependency with "mix deps.compile html5ever", update it with "mix deps.update html5ever" or clean it with "mix deps.clean html5ever"
** (RuntimeError) Rust NIF compile error (rustc exit code 101)
    lib/mix/tasks/compile.rustler.ex:58: Mix.Tasks.Compile.Rustler.compile_crate/1
    (elixir) lib/enum.ex:1229: Enum."-map/2-lists^map/1-0-"/2
    lib/mix/tasks/compile.rustler.ex:12: Mix.Tasks.Compile.Rustler.run/1
    (mix) lib/mix/task.ex:300: Mix.Task.run_task/3
    (elixir) lib/enum.ex:1229: Enum."-map/2-lists^map/1-0-"/2
    (mix) lib/mix/tasks/compile.all.ex:19: anonymous fn/1 in Mix.Tasks.Compile.All.run/1
    (mix) lib/mix/tasks/compile.all.ex:37: Mix.Tasks.Compile.All.with_logger_app/1
    (mix) lib/mix/task.ex:300: Mix.Task.run_task/3
