
kuchiki's People

Contributors

alexanderkjall, anonyfox, atouchet, atul9, benoitzugmeyer, cuviper, dfrankland, ferristseng, forbjok, guillaumegomez, jdm, jyn514, kennytm, mitaa, mwcz, notriddle, quininer, razielgn, rofrol, simonsapin, ucarion, untitaker, ygg01

kuchiki's Issues

"Plugin" API

I realize I can manipulate the tree structure after kuchiki has parsed the HTML. But could there be a way for the user of this crate to provide one (or multiple) TreeSink-like structs, so that they can manipulate nodes as the tree is being built, and thus do everything they need in a single pass?

I haven't thought about the API design that much. Would this be in the scope of this project?

XML serializing of NodeRef

While it's clear that Kuchiki is only interested in parsing and serializing HTML files, is there a way to serialize a NodeRef to XML-compliant output?

Sorry that this is off topic.

Support for parsing fragments

Hello, I wanted to know if there are plans to add support for parsing fragments. I am extracting the content within a noscript tag, which is represented as a string; since it can contain a number of HTML nodes after parsing, the only way to handle it is to parse it as a fragment.

Question on removing whitespace text nodes

I have tried removing text nodes containing only whitespace using

let node = kuchiki::parse_html().one(html_str);
node.descendants().filter(|x| match x.data(){
   NodeData::Text(t) => t.borrow().trim().is_empty(),
   _ => false,
}).for_each(|x| x.detach());

However, the text nodes are still there when I later traverse the tree.
Is there a better way of handling this?
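One likely culprit is that detach() runs while the lazy descendants() traversal is still walking the tree, which cuts the iteration short. A minimal sketch of a collect-first variant, using the same html_str and NodeData API as in the snippet above:

use kuchiki::traits::*;
use kuchiki::NodeData;

let node = kuchiki::parse_html().one(html_str);
// Collect the whitespace-only text nodes first, so the traversal finishes
// before the tree is mutated.
let whitespace_only: Vec<_> = node
    .descendants()
    .filter(|x| match x.data() {
        NodeData::Text(t) => t.borrow().trim().is_empty(),
        _ => false,
    })
    .collect();
for text_node in whitespace_only {
    text_node.detach();
}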

No way to modify a tag name?

Currently, as far as I can see, the name property on a node is read-only. This means that there is no way to modify the tag name, e.g. to change div to p. Maybe there is a simpler way around that than rebuilding a tree?
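For reference, a workaround sketch (rename_element is a hypothetical helper, not a kuchiki API): build a replacement element with the new local name, copy the attributes, move the children across, and swap it into the tree.

use html5ever::{LocalName, QualName};
use kuchiki::NodeRef;

fn rename_element(old: &NodeRef, new_name: &str) -> Option<NodeRef> {
    let data = old.as_element()?;
    let renamed = NodeRef::new_element(
        QualName::new(None, data.name.ns.clone(), LocalName::from(new_name)),
        data.attributes.borrow().map.clone(),
    );
    // Collect first so the tree is not mutated underneath a live iterator;
    // append() detaches each child from `old` before re-attaching it.
    let children: Vec<_> = old.children().collect();
    for child in children {
        renamed.append(child);
    }
    old.insert_after(renamed.clone());
    old.detach();
    Some(renamed)
}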

Example of re-use of a compiled set of selectors

Hi there,

I’m trying to use this library to run a fixed query against a large number of documents.

I’m parsing the document with kuchiki::parse_html().one(), and I’m stuck on how to actually run the filter against this. I know I can call select on the returned document NodeRef, but I have already gone to the trouble of compiling my selector, and I can’t seem to figure out how to actually make use of it!

I’ve tried using .inclusive_descendants() on my NodeRef, but it then complains of a type mismatch.

I’d love to know what the expected way to approach this is - the language used to describe the Selector objects seems to suggest reusing them in compiled form is intended to be possible!
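A sketch of what reuse could look like, assuming the Selectors::filter method and the elements() iterator adapter exported through kuchiki::traits; documents here stands for whatever collection of HTML strings is being processed:

use kuchiki::traits::*;
use kuchiki::Selectors;

let selectors = Selectors::compile(".article a").unwrap(); // compiled once, reused per document

for html_str in documents {
    let document = kuchiki::parse_html().one(html_str);
    // inclusive_descendants() yields NodeRef; elements() narrows the iterator to
    // element nodes, which is the item type the compiled Selectors filter expects.
    for matching in selectors.filter(document.inclusive_descendants().elements()) {
        println!("{}", matching.as_node().to_string());
    }
}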

Kuchiki for generating HTML reports

@brson asked on IRC:

I have a tool that wants to output html reports. Is kuchiki right for that? Is it ready?

You can in theory build up a tree with NodeRef::new_element, NodeRef::new_text, and NodeRef::append and then call the HTML serializer on it, but it’s not gonna be convenient.

Right now, the best I can recommend is write!’ing repeatedly to a String. Eventually, I expect the Rust ecosystem to grow templating libraries like http://jinja.pocoo.org/docs/dev/ . Jinja just emits text, and sometimes that text happens to be in HTML syntax. Genshi http://genshi.edgewall.org/ is a bit different in that it is based on XML/HTML trees rather than text, so the end result is guaranteed to be well-formed, but that’s not really a big deal now that HTML has standardized permissive error recovery in the parser. A Genshi-like templating library in Rust could be based on Kuchiki, but would be more than just what Kuchiki does.

Regarding Kuchiki being ready: it lacks tutorial-style documentation, but other than that it should work. In theory, at least, as I’m not aware of “real” projects using it yet. Feedback welcome!
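To make that “in theory” path concrete, here is a minimal sketch; the html_element helper is hypothetical, the XHTML namespace string is written out to avoid the ns!/local_name! macros, and to_string() is the serializer shortcut used elsewhere on this page:

use html5ever::QualName;
use kuchiki::NodeRef;

// Hypothetical helper: build an HTML element with no attributes.
fn html_element(local_name: &str) -> NodeRef {
    NodeRef::new_element(
        QualName::new(None, "http://www.w3.org/1999/xhtml".into(), local_name.into()),
        Vec::new(),
    )
}

let body = html_element("body");
let p = html_element("p");
p.append(NodeRef::new_text("Hello, report!"));
body.append(p);

// Serialize the subtree to an HTML string.
println!("{}", body.to_string());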

How to get all text of document except text inside script/style/noscript tags?

Kind of a cross-post of this.

I'm trying to get all visible text in a document (text that is not part of script/style/noscript tags).

I've come up with the following algo:

let parser = kuchiki::parse_html().one(content);
for child in parser.inclusive_descendants() {
    if let Some(el) = child.as_element() {
        let tag_name = &el.name.local;
        if tag_name == "script" || tag_name == "style" || tag_name == "noscript" {
            child.detach();
        }
    }
}
let text = parser.text_contents();
println!("{}", text);

However, this doesn't seem to work. parser.text_contents() still returns the contents of the script and style tags.

Am I using the detach API incorrectly?

How to replace an element with new element defined as string?

Hi,

I have working code:

extern crate kuchiki;
use kuchiki::traits::*;

fn main() {        
    let html = "
    <html>
        <head></head>
        <body>
            <p class='foo'>Hello, world!</p>
            <p class='foo'>I love HTML</p>
        </body>
    </html>";

    let document = kuchiki::parse_html().one(html);
    let paragraph = document.select("p").unwrap().collect::<Vec<_>>();

    for element in paragraph {
        let new_p_element = "<p class='newp'>Hello, from loved HTML</p>";
        element.as_node().detach()
    }

    println!("{}", document.to_string());
}

Instead of detaching/removing the p elements, I'd like to replace them with the element that is defined in new_p_element. How would I achieve something like element.as_node().replace(&new_p_element), just with code that actually compiles?

Thanks!
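For reference, one way to approximate a replace without a fragment-parsing API is to parse the replacement markup as its own small document, pull the new node out of it, and splice it in before detaching the old element. A sketch, assuming select_first and insert_after are available and reusing the html string from above:

use kuchiki::traits::*;

let document = kuchiki::parse_html().one(html); // `html` as defined above
let paragraphs = document.select("p").unwrap().collect::<Vec<_>>();

for element in paragraphs {
    // Parse the replacement markup as a standalone document, then pull the node out.
    let snippet = kuchiki::parse_html().one("<p class='newp'>Hello, from loved HTML</p>");
    let new_p = snippet.select_first("p.newp").unwrap();

    // insert_after moves the new node next to the old element; detach then removes the old one.
    element.as_node().insert_after(new_p.as_node().clone());
    element.as_node().detach();
}

println!("{}", document.to_string());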

Pseudo class selector support

Is it possible to add support for selectors like "div.class p:has(span.something:contains(SomeText)) a" which have pseudo class selectors like ":has" and ":contains"?

Helping maintain kuchiki

I see a lot of PRs rotting on the vine; maybe I or someone else could get access to Kuchiki and help fix some of these issues? I volunteer if no one else wants to do it.

Error "type mismatch resolving" with from_file() method

I tried to use kuchiki like this:

let html_file = Path::new("cocon1.html");
let document = kuchiki::parse_html().from_file(html_file);

But I got this error:

src\main.rs:44:42: 44:51 error: type mismatch resolving `<html5ever::tendril::fmt::UTF8 as html5ever::tendril::fmt::SliceFormat>::Slice == [u8]`:
 expected str,
    found slice [E0271]
src\main.rs:44     let document = kuchiki::parse_html().from_file(html_file);
                                                        ^~~~~~~~~

Note: with html5ever, it's OK with this code:

let dom = parse_document(RcDom::default(), Default::default())
                .from_utf8()
                .from_file(html_file)
                .unwrap();

Parsing XML with <element />

Hello,

I have a question about parsing an XML file, and whether I should be expecting children or siblings from the next tag down.

Here's a snippet:

<body copyright="All data copyright Massachusetts Institute of Technology 2016.">
<route tag="saferidecambwest" title="Saferide Cambridge West" color="cc66ff" oppositeColor="000000" latMin="42.3547905" latMax="42.364801" lonMin="-71.1124847" lonMax="-71.0843897">
<stop tag="mass84_d" title="84 Mass Ave" shortTitle="84 Mass" lat="42.3595199" lon="-71.09416" stopId="03"/>
<stop tag="mccrmk" title="McCormick Hall" lat="42.35766" lon="-71.0947" stopId="30"/>
<stop tag="burtho" title="Burton House" lat="42.3560823" lon="-71.098703" stopId="16"/>
<stop tag="newho" title="New House" lat="42.35559" lon="-71.1002099" stopId="35"/>
<stop tag="tangwest" title="Tang/Westgate" lat="42.3547905" lon="-71.1026282" stopId="51"/>
<stop tag="simmhl" title="Simmons hall" lat="42.3565968" lon="-71.1020453" stopId="47"/>
<stop tag="ww15" title="WW15" lat="42.3557087" lon="-71.1097712" stopId="55"/>
<stop tag="brookchest" title="Brookline St at Chestnut St" shortTitle="Brookline @ Chestnut" lat="42.3565399" lon="-71.1089" stopId="14"/>
<stop tag="putmag" title="Putnum Ave at Magazine St" shortTitle="Putnum @ Magazine" lat="42.35904" lon="-71.11096" stopId="43"/>
<stop tag="rivfair" title="River St at Fairmont St" shortTitle="River @ Fairmont" lat="42.3626179" lon="-71.1124847" stopId="44"/>
<stop tag="rivpleas" title="River St at Pleasant St" shortTitle="River @ Pleasant" lat="42.3639197" lon="-71.1086206" stopId="46"/>
<stop tag="rivfrank" title="River Street@Franklin Street" shortTitle="River Street@Franklin" lat="42.364801" lon="-71.105938" stopId="45"/>
<stop tag="sydgreen" title="Sydney St at Green St" shortTitle="Sydney @ Green" lat="42.3624601" lon="-71.1001597" stopId="50"/>
<stop tag="paci70" title="70 Pacific Street" shortTitle="70 Pacific" lat="42.36023" lon="-71.1023" stopId="40"/>
<stop tag="whou" title="NW30/Warehouse" lat="42.3590332" lon="-71.1002949" stopId="54"/>
<stop tag="edge" title="Edgerton" lat="42.3601694" lon="-71.0975111" stopId="23"/>
<stop tag="kendsq" title="Kendall Square T" lat="42.36237" lon="-71.08613" stopId="01"/>
<stop tag="wadse40" title="Wadsworth@E40" lat="42.3612719" lon="-71.0843897" stopId="53"/>
<stop tag="mass77" title="77 Mass Ave" shortTitle="77 Mass" lat="42.3592699" lon="-71.0936799" stopId="02"/>
<direction tag="frcamp" title="from Campus" name="" useForUI="true">
  <stop tag="mass84_d" />
  <stop tag="mccrmk" />
  <stop tag="burtho" />
  <stop tag="newho" />
  <stop tag="tangwest" />
  <stop tag="simmhl" />
  <stop tag="ww15" />
  <stop tag="brookchest" />
  <stop tag="putmag" />
  <stop tag="rivfair" />
  <stop tag="rivpleas" />
</direction>
<direction tag="tocamp" title="to Campus" name="" useForUI="true">
  <stop tag="rivfrank" />
  <stop tag="sydgreen" />
  <stop tag="paci70" />
  <stop tag="whou" />
  <stop tag="edge" />
  <stop tag="kendsq" />
  <stop tag="wadse40" />
  <stop tag="mass77" />
</direction>
</route>
</body>

Here's some example code I'm using to try to walk the tree:

let xml = kuchiki::parse_html().one(body); // I also made sure to clean out \n, which was causing some problems.

for route in xml.select("route").unwrap() {

// Here I want to get the stops that are just children but not all descendants

    // This only prints one stop, the first one (84 Mass Ave). Expected many stops.
    for stop in route.as_node().children() {
        println!(":?", stop);  
    }
    // This returns the first stop as expected (84 Mass Ave)
    let first_child = route.as_node().first_child().unwrap();

    // This returns the next stop (McCormick Hall), which I thought was a sibling
    let child_of_child = first_child.first_child().unwrap();    

    // This returns all elements in the rest of the file
    let first_child_descendants = first_child.descendants();
}

Am I misunderstanding something? I expected all the <stop> elements that are next to each other to be siblings, not descendants.

Node.to_string()

It would be nice if there were a function on a node that turned the node and all its children into the equivalent HTML.

How can I wrap an Element with a new Element?

Thanks @SimonSapin for this awesome library! I am having a bit of trouble manipulating the DOM though and can't find any examples of how to do this.

I am using traverse() to iterate over NodeEdges, then parse their attributes. I want to wrap certain elements I find in a new Element I define during the iteration.

For example, I want to find <a href="whatever">my link</a> and transform it to this: <h1><a href="whatever">my link</a></h1>.

How can I go about doing so?

Thank you
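One possible approach (a sketch, not an official recipe): create the wrapper with NodeRef::new_element, insert it where the link currently sits, then move the link inside it. The html variable and the plain "a" selector are stand-ins for whatever is being matched during the traversal:

use kuchiki::traits::*;
use kuchiki::NodeRef;
use html5ever::QualName;

let document = kuchiki::parse_html().one(html);

// Collect first so the tree is not mutated while the select iterator is still running.
let links: Vec<_> = document.select("a").unwrap().collect();

for link in links {
    let wrapper = NodeRef::new_element(
        QualName::new(None, "http://www.w3.org/1999/xhtml".into(), "h1".into()),
        Vec::new(),
    );
    // Put the wrapper where the link currently is, then move the link inside it;
    // append() detaches the link from its old position first.
    link.as_node().insert_before(wrapper.clone());
    wrapper.append(link.as_node().clone());
}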

Error parsing Tag when no close Tag

Hello,

I want to get all links on this page: http://cocon.se/
Here's the code I'm using to try to get the links (for the test, the HTML page is saved in a local file, cocon1.html):

    let html_file =  Path::new("cocon1.html").to_path_buf();
    let document = kuchiki::parse_html().from_utf8().from_file(&html_file).unwrap();
    for link in document.select("a").unwrap() {
        let attributes = link.attributes.borrow();
        println!("{:?} {}", attributes.get("href"), link.text_contents());
    }

I got this result:

Some("http://cocon.se/")
Some("http://cocon.se/visualisation/") Visualisations
Some("http://cocon.se/cocon-semantique/") Cocon Sémantique
Some("http://cocon.se/a-propos/") A propos
Some("http://cocon.se/#focus") Quoi c'est ?
Some("http://cocon.se/#focus") J'en suis !
Some("http://twitter.com/Lemoussel")
Some("http://twitter.com/SylvainDeaure")
None Johann
None Sylvain V.
None Fabrice
Some("/cas/") Etude de cas
Some("/actus/") Actus
Some("/historique/") Historique
Some("http://cocon.se/cas/astuce1.html")
Some("http://cocon.se/cas/astuce1.html") Une astuce de grand mère
Some("http://cocon.se/cas/jouer.html")
Some("http://cocon.se/cas/jouer.html") A vous de jouer !
Some("http://cocon.se/actus/gagner-tableau.html")
Some("http://cocon.se/actus/gagner-tableau.html") Qui veut gagner des tableaux ?
Some("http://cocon.se/actus/gagner-tableau.html") […]
Some("http://cocon.se/historique/cocon-con.html")
Some("http://cocon.se/historique/cocon-con.html") Un cocon moins con
Some("#carousel-homepage-latestnews") Previous
Some("#carousel-homepage-latestnews") Next
Some("mailto:[email protected]") [email protected]
Some("http://twitter.com/SylvainDeaure")
Some("/mentions/") Mentions légales
Some("/mentions/")

Some("/mentions/")

Some("/mentions/")

/* */

The last 3 lines with Some("/mentions/") are incorrect. There is only one link to /mentions/, with the text Mentions légales, in the HTML page.

This error seems to occur because the last <a> tag is not closed properly: </a</div>
class="zerif-copyright" href="/mentions/">Mentions légales</a</div>

Is it possible to parse an HTML page that contains errors like this?

Support for the :scope pseudo-class

It would be useful if :scope could be used to anchor selection to the root of the inclusive-descendant tree. For example, in the following document, div_b.select(":scope > div") should return the div.c element (as it does in Javascript), but instead returns nothing because the document root is being used as the selection scope.

<html>
    <body>
        <div class="a">
            <div class="b">
                <div class="c">
                </div>
            </div>
        </div>
    </body>
</html>

From a brief glance at the selectors crate, it appears that :scope is supported, but the MatchingContext::scope_element field must be directly set.

Creating a duplicate node

I'd like to be able to create a duplicate of a node, like the DOM's cloneNode method. Note that this is different from Rust's Clone trait in that it would not return another reference to the same node, but rather a new NodeRef that doesn't have a parent and doesn't share mutations with the original.

It looks like, currently, there's no way to do this short of implementing a tree-walk myself, which is perfectly feasible, but it feels like something the library could usefully offer - especially with a clarification that .clone() may not do what you want (which is an easy mistake to make!).
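For what it's worth, such a tree walk could look roughly like the sketch below; deep_clone is a hypothetical helper (not a kuchiki API), it assumes the public NodeData variants and constructors, and it flattens node kinds other than elements, text and comments into empty text for brevity:

use kuchiki::{NodeData, NodeRef};

fn deep_clone(node: &NodeRef) -> NodeRef {
    let copy = match node.data() {
        NodeData::Element(data) => NodeRef::new_element(
            data.name.clone(),
            data.attributes.borrow().map.clone(),
        ),
        NodeData::Text(text) => NodeRef::new_text(text.borrow().clone()),
        NodeData::Comment(text) => NodeRef::new_comment(text.borrow().clone()),
        // Other node kinds (doctype, document, ...) are reduced to empty text here.
        _ => NodeRef::new_text(String::new()),
    };
    for child in node.children() {
        copy.append(deep_clone(&child));
    }
    copy
}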

Xml escaped string with script tag.

It seems that a newly generated Text node inside a script node is serialized as an XML-escaped string.

fn create_element(name: &str, att: Option<Vec<(&str, &str)>>) -> NodeRef {
    let att = if let Some(att) = att {
        att.iter()
            .map(|(k, v)| {
                (
                    ExpandedName::new(ns!(), LocalName::from(*k)),
                    Attribute {
                        prefix: None,
                        value: v.to_string(),
                    },
                )
            })
            .collect()
    } else {
        Vec::new()
    };

    let local_name = if name == "script" {
        // I've already used the cached local name instead of creating a new one, but it still does not work.
        local_name!("script")
    } else {
        LocalName::from(name)
    };

    NodeRef::new_element(QualName::new(None, ns!(), local_name), att)
}


let script_el = create_element("script", None);
script_el.append(NodeRef::new_text(include_str!("../script.js" /* non-escaped script! */)));

Inside html5ever, mod.rs#221:

fn write_text(&mut self, text: &str) -> io::Result<()> {
    let escape = match self.parent().html_name {
        Some(local_name!("style")) |
        Some(local_name!("script")) |
        Some(local_name!("xmp")) |
        Some(local_name!("iframe")) |
        Some(local_name!("noembed")) |
        Some(local_name!("noframes")) |
        Some(local_name!("plaintext")) => false,
  
        Some(local_name!("noscript")) => !self.opts.scripting_enabled,
  
        _ => true,
    };
  
    if escape {
        self.write_escaped(text, false)
    } else {
        self.writer.write_all(text.as_bytes())
    }
}

selectors::Element not properly exported

I'm trying to use the methods from: https://simonsapin.github.io/kuchiki/selectors/trait.Element.html

In particular: get_local_name(), get_namespace().

However I get the error:

src/main.rs:64:14: 64:27 error: no method named `get_namespace` found for type `kuchiki::node_data_ref::NodeDataRef<kuchiki::tree::ElementData>` in the current scope
src/main.rs:64  m.get_namespace();
                  ^~~~~~~~~~~~~
src/main.rs:64:14: 64:27 help: items from traits can only be used if the trait is in scope; the following trait is implemented but not in scope, perhaps add a `use` for it:
src/main.rs:64:14: 64:27 help: candidate #1: `use selectors::tree::Element`
error: aborting due to previous error

But I cannot use selectors::tree::Element; because it is private.

If I extern crate selectors; use selectors::Element; I get exactly the same error.

Relicense under dual MIT/Apache-2.0

This issue was automatically generated. Feel free to close without ceremony if
you do not agree with re-licensing or if it is not possible for other reasons.
Respond to @cmr with any questions or concerns, or pop over to
#rust-offtopic on IRC to discuss.

You're receiving this because someone (perhaps the project maintainer)
published a crates.io package with the license as "MIT" xor "Apache-2.0" and
the repository field pointing here.

TL;DR the Rust ecosystem is largely Apache-2.0. Being available under that
license is good for interoperation. The MIT license as an add-on can be nice
for GPLv2 projects to use your code.

Why?

The MIT license requires reproducing countless copies of the same copyright
header with different names in the copyright field, for every MIT library in
use. The Apache license does not have this drawback. However, this is not the
primary motivation for me creating these issues. The Apache license also has
protections from patent trolls and an explicit contribution licensing clause.
However, the Apache license is incompatible with GPLv2. This is why Rust is
dual-licensed as MIT/Apache (the "primary" license being Apache, MIT only for
GPLv2 compat), and doing so would be wise for this project. This also makes
this crate suitable for inclusion and unrestricted sharing in the Rust
standard distribution and other projects using dual MIT/Apache, such as my
personal ulterior motive, the Robigalia project.

Some ask, "Does this really apply to binary redistributions? Does MIT really
require reproducing the whole thing?" I'm not a lawyer, and I can't give legal
advice, but some Google Android apps include open source attributions using
this interpretation. Others also agree with it.
But, again, the copyright notice redistribution is not the primary motivation
for the dual-licensing. It's stronger protections to licensees and better
interoperation with the wider Rust ecosystem.

How?

To do this, get explicit approval from each contributor of copyrightable work
(as not all contributions qualify for copyright, due to not being a "creative
work", e.g. a typo fix) and then add the following to your README:

## License

Licensed under either of

 * Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
 * MIT license ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)

at your option.

### Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted
for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any
additional terms or conditions.

and in your license headers, if you have them, use the following boilerplate
(based on that used in Rust):

// Copyright 2016 kuchiki developers
//
// Licensed under the Apache License, Version 2.0 <LICENSE-APACHE or
// http://www.apache.org/licenses/LICENSE-2.0> or the MIT license
// <LICENSE-MIT or http://opensource.org/licenses/MIT>, at your
// option. This file may not be copied, modified, or distributed
// except according to those terms.

It's commonly asked whether license headers are required. I'm not comfortable
making an official recommendation either way, but the Apache license
recommends it in their appendix on how to use the license.

Be sure to add the relevant LICENSE-{MIT,APACHE} files. You can copy these
from the Rust repo for a plain-text
version.

And don't forget to update the license metadata in your Cargo.toml to:

license = "MIT/Apache-2.0"

I'll be going through projects which agree to be relicensed and have approval
by the necessary contributors and doing these changes, so feel free to leave
the heavy lifting to me!

Contributor checkoff

To agree to relicensing, comment with:

I license past and future contributions under the dual MIT/Apache-2.0 license, allowing licensees to chose either at their option.

Or, if you're a contributor, you can check the box in this repo next to your
name. My scripts will pick this exact phrase up and check your checkbox, but
I'll come through and manually review this issue later as well.

Can't reference Sink in struct

The Sink type is private, but parse_html returns a Parser<Sink> type.

This makes it impossible to reference the return type in a struct:

struct App {
    config: Config,
    current_parser: Option<Parser<Sink>>, // <-- Can't do this
    buffer: [u8; 1024],
    counter: usize,
}

No easy way to access attributes

The attributes field on ElementData is a HashMap<QualName, String>. Since QualName doesn't impl Eq for anything but itself, accessing a specific attribute requires constructing a QualName object.

Cannot build with a local cssparser copy

I've downloaded rust-cssparser-0.24.1 and added cssparser = { path = "../rust-cssparser-0.24.1" } to the Cargo.toml, but I'm getting:

error[E0277]: the trait bound `select::PseudoElement: cssparser::serializer::ToCss` is not satisfied
  --> src/select.rs:25:6
   |
25 | impl SelectorImpl for KuchikiSelectors {
   |      ^^^^^^^^^^^^ the trait `cssparser::serializer::ToCss` is not implemented for `select::PseudoElement`

error[E0277]: the trait bound `select::PseudoClass: cssparser::serializer::ToCss` is not satisfied
  --> src/select.rs:25:6
   |
25 | impl SelectorImpl for KuchikiSelectors {
   |      ^^^^^^^^^^^^ the trait `cssparser::serializer::ToCss` is not implemented for `select::PseudoClass`

error[E0277]: the trait bound `select::PseudoClass: cssparser::serializer::ToCss` is not satisfied
  --> src/select.rs:82:6
   |
82 | impl NonTSPseudoClass for PseudoClass {
   |      ^^^^^^^^^^^^^^^^ the trait `cssparser::serializer::ToCss` is not implemented for `select::PseudoClass`

error[E0277]: the trait bound `select::PseudoElement: cssparser::serializer::ToCss` is not satisfied
   --> src/select.rs:117:6
    |
117 | impl selectors::parser::PseudoElement for PseudoElement {
    |      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ the trait `cssparser::serializer::ToCss` is not implemented for `select::PseudoElement`

Not sure why it doesn't work, because if we use cssparser-0.24.1 from crates.io, everything is OK.

Can't use kuchiki as dependency.

I tried using kuchiki as dependency on rust stable and nightly, and it didn't work.

Here are steps to reproduce:

  1. Run cargo new test and navigate to project
  2. In Cargo.toml add following lines:
    [dependencies]
    kuchiki = "0.0.1"
  3. Run cargo build on the created project.

Expected: build project
Got: Unable to get packages from source.

Perhaps a new version should be deployed to crates.io? I've added other projects just fine (e.g. hyper).

Implement AsRef<NodeRef> for NodeDataRef?

NodeDataRef provides a reference to its NodeRef with the as_node() method. Could it also implement AsRef<NodeRef>? That would make it easier to write methods that can accept either a NodeDataRef or a NodeRef.

Put selector methods behind a feature

Currently this crate pulls in cssparser and selectors to add the possibility to lookup nodes with CSS selectors.

I'm working on a project that needs to do some HTML manipulation to sanitize the input, but we don't need to use CSS selectors so we would like to avoid pulling crates we don't need.

Would it make sense for this project to move all the selector features behind a Cargo feature, to reduce the number of dependencies for projects that don't need them?

Shareable (Send) Document

I'm currently writing a media application where users are able to extract data from html pages with custom scripts written in Gluon.

Gluon makes it pretty easy to access Rust data types, but they must be Send.

The html5ever/kuchiki nodes use Rc<> and so can't be Send.

I've come up with two alternatives that sort of make this work, but they are both ridiculously ugly hacks: (a) have a dedicated parser thread that owns the DOM and handles queries via channel communication; (b) re-parse on each query and return custom nodes that don't use Rc.

Do you see any alternative of how I could make this more sensible?

This would obviously be easier with arena-style allocation, but kuchiki has the DOM hard-coded, so I don't see that happening.

Table-related tags seem to get ignored completely

Hi

Tags such as <tbody>, <tr> and <td> seem to get ignored completely by the parser.

let html = "
    <tbody>
        <td colspan=\"2\">Example 1</td>
    </tbody>
    <p>Example 2</p>
";

let document = kuchiki::parse_html().one(html);

for node in document.descendants() {
    if let Some(element) = node.as_element() {
        println!("{:?}", element.name.local);
    }
}

Output

Atom('html' type=static)
Atom('head' type=static)
Atom('body' type=static)
Atom('p' type=static)

Does html5ever currently omit those tags, or are they handled in a different way? Calling .data() on NodeRef does not show any presence of those tags either.

Any advice is greatly appreciated.
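If the goal is simply to keep those elements, one workaround (relying on standard HTML parsing error recovery rather than anything kuchiki-specific) is to make sure the fragment sits inside a <table>; outside a table context the parser drops stray table-structure tags and keeps only their text. A sketch:

use kuchiki::traits::*;

let html = "
    <table><tbody>
        <td colspan=\"2\">Example 1</td>
    </tbody></table>
    <p>Example 2</p>
";

let document = kuchiki::parse_html().one(html);

for node in document.descendants() {
    if let Some(element) = node.as_element() {
        println!("{:?}", element.name.local);
    }
}
// The output should now also include table, tbody, tr (implied by the parser) and td.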

README.md update with working recipes

Hi Simon,

following your "Sorry for the lack of tutorial-like documentation for html5ever and tendril…" statement in https://stackoverflow.com/questions/35654525/parsing-html-page-content-in-a-stream-with-hyper-and-html5ever/35660699#35660699 I'd like to ask whether you'd accept a pull request to modify kuchiki's repo README.md with a few basic copy-paste ready examples on how to:

  • parse html document from string;
  • select first element matched by selector;
  • add all elements matched by selector to a Vector;
  • remove element(s) from the tree;
  • serialize document to string;

That's all I've managed to achieve so far :)

I am writing my first Rust code ever, without any knowledge of DOM manipulation algorithms, with a library that has close to zero documentation. That is painful and has already taken too many hours to get even the most basic examples working... which could have just been copy-pasted from docs. Some examples are in src/tests.rs, but all the beginners will probably be diverted to https://docs.rs/kuchiki/, which they will dive into only if they are very brave, and most likely won't get anywhere from there anyway.

Currently I do the examples research by following a comment on your post in stackoverflow mentioned above - "For example code you can run a search for extern crate kuchiki on github". That is not an elegant process :)

I sense you'd be keen on preparing the 'proper' tutorial, and that definitely takes a lot of time; my suggestion is to just drop 10-20 copy-pasteable examples in the README. It'd be 20% of the effort achieving 80% of the value, at least for newbies like me. I'm ready to get started on a pull request with my not necessarily elegant, but working, examples.

Remove all elements with selector

How do I remove every element matching a CSS selector?
Currently, if I detach() a node in the select() loop, it just stops.
i.e.

for css_match in document.select(".someclass").unwrap() {
    css_match.as_node().detach();
    // Stops after one iteration, even if there are multiple in the document.
}

Do I have to keep running select until all instances are removed?
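What appears to happen is that detaching the current match breaks the lazy traversal that select() is still driving. Collecting the matches first and then detaching avoids mutating the tree under a live iterator; a minimal sketch using the same document and selector:

let matches: Vec<_> = document
    .select(".someclass")
    .unwrap()
    .map(|css_match| css_match.as_node().clone())
    .collect();

for node in matches {
    node.detach();
}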

Support for :contains("foo")

I'm not sure if this needs to be implemented in one of the servo crates or kuchiki, but I noticed it's currently not supported and it would be very useful to have.

Thanks!

Cannot compile

I can't compile kuchiki. I get the same error when I try to add selectors to Cargo.toml, so maybe that is where the problem lies. I am using the nightly builds (1.5).

Below is the error I'm getting when running: cargo build --verbose

unable to get packages from source

Caused by:
Failed to unpack package kuchiki v0.0.1

Caused by:
No such file or directory (os error 2)

XML support?

It seems that xml5ever has now somewhat merged into html5ever... does this mean that I can use kuchiki safely on XML documents (like RSS feeds)?

Question on naming conventions of ids starting with a number.

It seems that Kuchiki will return an Err when calling select_first if the id begins with a number. For example, if the HTML has something like this:

<p id="1">Some foo content</p>

This would be accessed by calling:

let p_node = node_ref.select_first("p#1").unwrap();

However, this will just return an Err. Is this a bug in the way the CSS selector is parsed, or is it that the CSS spec requires ids to start with an alphabetic character?
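For context, CSS identifiers may not start with an unescaped digit, so p#1 fails at selector-parsing time rather than at matching time. Two workarounds that should parse (a sketch, reusing node_ref from above):

// Match on the attribute value instead of using an ID selector.
let p_node = node_ref.select_first("p[id='1']").unwrap();

// Or escape the leading digit as the CSS syntax allows ("\31 " is the escape for "1").
let p_node = node_ref.select_first("p#\\31 ").unwrap();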

Configureable memory limit

I’ve been running into an issue where parsing large documents causes massive amounts of memory usage that can slow or even crash my server, so the ability to limit kuchiki's memory usage would be invaluable. Having a memory limit could also allow the use of preallocated buffers, which would do wonders for performance as well.

Question: Equivalent of innerHTML method

The innerHTML property in the DOM API returns a string of the HTML structure of an element node. Is there an efficient way to make such a call in Kuchiki? The thing that comes closest for me is:

node_ref.serialize()

which can then be converted to a String, but this looks very inefficient if it is called repeatedly.
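For what it's worth, a hypothetical inner_html helper along these lines would serialize each child (rather than the node itself) into a single buffer, which mirrors what innerHTML returns; reusing a buffer across calls would be where any real saving comes from:

use kuchiki::NodeRef;

fn inner_html(node: &NodeRef) -> String {
    let mut buf = Vec::new();
    for child in node.children() {
        // serialize() writes the child and its subtree, so the node's own tag is
        // excluded, matching what innerHTML returns.
        child.serialize(&mut buf).expect("serialization failed");
    }
    String::from_utf8(buf).expect("the HTML serializer emits UTF-8")
}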

Discoverability on crates.io

About a week ago I was searching for a library to manipulate HTML documents using a DOM like structure. I found many crates, none of which were really ready in my opinion, but I did not find this crate.

In fact, if you go to crates.io now and search for any of the following, you will not find it either:

  • html
  • html dom
  • html manipulation
  • html tree

I only found out about this project by googling "rust rctree html" which led me to the users.rust-lang.org forum thread.

It seems that crates.io has an issue with your written description:

(朽木) HTML/XML tree manipulation library

It sees "HTML/XML" as a single word and as a result if you just search for "html" or "xml" you will not find the crate. I will create an issue regarding this on the crates.io repo if one doesn't already exist.

However for the time being, you can probably solve this by adding the keywords property to Cargo.toml:

keywords = ["dom", "xml", "html", "markup", "language"]

Question on `select` and its inclusive nature

The select function on NodeRef is used to get all nodes matching a given CSS selector, and it is inclusive of the NodeRef that calls it. This works fine but can be a cause for concern if the iterator it returns is then used to delete nodes.
Is there a way to get a non-inclusive iterator that does not involve using an if statement inside the loop?

Consider the following HTML

<div id="top">
  <div>foo</div>
  <div>bar</div>
  <p>baz</p>
</div>

In the following Rust code, we assume the node with the div#top selector is bound to a variable x:

let mut nodes = x.select("div").unwrap();
while let Some(node_ref) = nodes.next() {
  node_ref.as_node().detach();
}

This would delete div#top, which is probably unintentional if a user only wanted to remove its <div> children.
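A possible non-inclusive variant (a sketch, assuming the iterator-level select adapter re-exported through kuchiki::traits): run the selector over descendants(), which does not include the starting node, and collect before detaching so the traversal is not cut short:

use kuchiki::traits::*;

// descendants() excludes `x` itself, so div#top is never matched.
let matches: Vec<_> = x
    .descendants()
    .select("div")
    .unwrap()
    .map(|m| m.as_node().clone())
    .collect();

for node in matches {
    node.detach();
}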

Compiling selectors with cssparser::Parser

Hi there!

I'm using Kuchiki to write a toy browser engine. I need to implement the cssparser::QualifiedRuleParser trait to parse CSS stylesheets into kuchiki::Selectors and declarations belonging to those selectors (e.g. margin: auto), similar to what Servo does here.

However, the KuchikiParser is private, so I can't pass it as the first argument to selectors::parser::SelectorList::parse.

There is kuchiki::Selectors::compile(&str), but I don't have a &str in parse_prelude and parse_block. Instead, I have a cssparser::Parser, which can go right into the second argument of selectors::parser::SelectorList::parse.


My issue can be solved in one of two ways:

  1. Expose the KuchikiParser as part of the public API.
  2. Add a new kuchiki::Selectors::compile method that takes a cssparser::Parser as input instead of a &str.

It's possible I might be better off using the html5ever, selectors, and cssparser crates directly, but I wanted to explore these options, too. Do either of these fit with your vision for the Kuchiki API?
