kuchiki-rs / kuchiki
(朽木) HTML/XML tree manipulation library for Rust
License: MIT License
I realize I can manipulate the tree structure after kuchiki has parsed the HTML. But could there be a way for the user of this crate to provide one (or multiple) TreeSink or TreeSink-like structs, so that they can manipulate nodes as the tree is being built, and thus do everything they would need to do in only one step?
I haven't thought about the api design that much. Would this be in the scope of this project?
While it's clear that Kuchiki is only interested in parsing and serializing HTML files, is there a way to serialize a NodeRef
to an XML compliant output?
Sorry that this is off topic.
Hello, I wanted to know if there are plans to add support for parsing fragments. I am extracting the content within a noscript tag, which is represented as a string, and since it can contain a number of HTML nodes after parsing, the only way this can be done is if there were a way to parse fragments.
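In the meantime, a workaround that may be good enough (a sketch, assuming the noscript contents are body-level markup): parse the string as a full document and lift the nodes back out of the <body> the parser wraps them in.

use kuchiki::traits::*;

// Hypothetical input: the string extracted from the <noscript> tag.
let fragment = "<img src='fallback.png'><p>JavaScript is disabled.</p>";

// Parse it as a complete document; the parser wraps it in <html><body>.
let body = kuchiki::parse_html()
    .one(fragment)
    .select_first("body")
    .unwrap();

// The fragment's nodes are now the children of that <body>.
let nodes: Vec<_> = body.as_node().children().collect();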
I have tried removing text nodes containing only whitespace using
use kuchiki::traits::*;
use kuchiki::NodeData;

let node = kuchiki::parse_html().one(html_str);
node.descendants()
    .filter(|x| match x.data() {
        NodeData::Text(t) => t.borrow().trim().is_empty(),
        _ => false,
    })
    .for_each(|x| x.detach());
however, the text nodes still end up being there when I later traverse the nodes.
Is there a better way of handling this?
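One thing worth trying (a sketch built on the assumption that detaching nodes while the descendants iterator is still walking them is what goes wrong): collect the whitespace-only text nodes first, then detach them in a second pass.

use kuchiki::traits::*;

let node = kuchiki::parse_html().one(html_str);

// Collect first, so the tree is not mutated while it is being traversed.
let whitespace_nodes: Vec<_> = node
    .descendants()
    .filter(|n| n.as_text().map_or(false, |t| t.borrow().trim().is_empty()))
    .collect();

for text_node in whitespace_nodes {
    text_node.detach();
}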
Currently, as far as I can see, the name
property on a node is read-only. This means that there is no way to modify the tag name, e.g. to change div
to p
. Maybe there is a simpler way around that than rebuilding a tree?
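For what it's worth, a possible workaround without rebuilding the whole tree, sketched with NodeRef::new_element, append, insert_after and detach; rename_element is a hypothetical helper, not part of kuchiki, and it assumes Attributes exposes its underlying map as in recent kuchiki versions.

use html5ever::{LocalName, QualName};
use kuchiki::NodeRef;

// Build a new element with the desired name, copy the attributes, move the
// children over, and swap it into the old node's position.
fn rename_element(old: &NodeRef, new_name: &str) -> NodeRef {
    let old_el = old.as_element().expect("not an element");
    let new_node = NodeRef::new_element(
        QualName::new(None, old_el.name.ns.clone(), LocalName::from(new_name)),
        old_el.attributes.borrow().map.clone(),
    );

    // Collect first so the sibling chain is not walked while re-parenting.
    let children: Vec<_> = old.children().collect();
    for child in children {
        new_node.append(child);
    }

    old.insert_after(new_node.clone());
    old.detach();
    new_node
}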
Hi there,
I’m trying to use this library to run a set query on a large number of documents.
I’m parsing the document with kuchiki::parse_html().one()
, and I’m stuck on how to actually run the filter against this. I know I can call select
on the returned document NodeRef
, but I have already gone to the trouble of compiling my selector, and I can’t seem to figure out how to actually make use of it!
I’ve tried using .inclusive_descendants()
on my NodeRef
, but it then complains of a type mismatch.
I’d love to know what the expected way to approach this is - the language used to describe the Selector objects seems to suggest reusing them in compiled form is intended to be possible!
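In case it helps, here is a sketch of the pattern I believe is intended, assuming Selectors::compile and Selectors::filter exist as in recent kuchiki versions and that the elements() adapter from kuchiki's iterator traits turns NodeRefs into the NodeDataRef<ElementData> items that filter expects; the documents vector is just stand-in data.

use kuchiki::traits::*;
use kuchiki::Selectors;

// Compile the selector once...
let selectors = Selectors::compile("p.foo").unwrap();

// ...then run it against any number of documents.
let documents = vec!["<p class='foo'>one</p>", "<p class='foo'>two</p>"];
for html in documents {
    let document = kuchiki::parse_html().one(html);
    for matching in selectors.filter(document.inclusive_descendants().elements()) {
        println!("{}", matching.text_contents());
    }
}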
@brson asked on IRC:
I have a tool that wants to output html reports. Is kuchiki right for that? Is it ready?
You can in theory build up a tree with NodeRef::new_element
, NodeRef::new_text
, and NodeRef::append
and then call the HTML serializer on it, but it’s not gonna be convenient.
Right now, the best I can recommend is write!
’ing repeatedly to a String
. Eventually, I expect the Rust ecosystem to grow templating libraries like http://jinja.pocoo.org/docs/dev/ . Jinja just emits text, and sometimes that text happens to be in HTML syntax. Genshi http://genshi.edgewall.org/ is a bit different in that it is based on XML/HTML trees rather than text, so the end result is guaranteed to be well-formed, but that’s not really a big deal now that HTML has standardized permissive error recovery in the parser. A Genshi-like templating library in Rust could be based on Kuchiki, but would be more than just what Kuchiki does.
Regarding Kuchiki being ready: it lacks tutorial-style documentation, but other than that it should work, in theory, as I'm not aware of "real" projects using it yet. Feedback welcome!
Kind of a cross-post of this.
I'm trying to get all visible text in a document (text that is not part of script/style/noscript tags).
I've come up with the following algo:
let parser = kuchiki::parse_html().one(content);

for child in parser.inclusive_descendants() {
    if let Some(el) = child.as_element() {
        let tag_name = &el.name.local;
        if tag_name == "script" || tag_name == "style" || tag_name == "noscript" {
            child.detach();
        }
    }
}

let text = parser.text_contents();
println!("{}", text);
However, this doesn't seem to work. parser.text_contents()
still returns the inline JavaScript from the script/style tags.
Am I using the detach
API incorrectly?
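One guess at the cause (not a confirmed diagnosis): the tree is being mutated while inclusive_descendants() is still iterating over it. A sketch of a collect-first variant, using a selector to find the unwanted elements:

use kuchiki::traits::*;

let document = kuchiki::parse_html().one(content);

// Collect the matches first, then detach them in a second pass.
let unwanted: Vec<_> = document
    .select("script, style, noscript")
    .unwrap()
    .collect();

for css_match in unwanted {
    css_match.as_node().detach();
}

println!("{}", document.text_contents());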
Hi,
I have some working code:
extern crate kuchiki;

use kuchiki::traits::*;

fn main() {
    let html = "
<html>
<head></head>
<body>
<p class='foo'>Hello, world!</p>
<p class='foo'>I love HTML</p>
</body>
</html>";

    let document = kuchiki::parse_html().one(html);
    let paragraph = document.select("p").unwrap().collect::<Vec<_>>();

    for element in paragraph {
        let new_p_element = "<p class='newp'>Hello, from loved HTML</p>";
        element.as_node().detach()
    }

    println!("{}", document.to_string())
}
Instead of detaching/removing the p elements, I'd like to replace them with the element defined in new_p_element. How would I achieve something like element.as_node().replace(&new_p_element), just with code that actually compiles?
Thanks!
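One possible shape for this, building on the code above (a sketch rather than a definitive answer; it assumes insert_after moves the new sibling out of its previous parent and that select_first behaves as in recent kuchiki versions): parse the replacement snippet, pull the element out of the throwaway document it lands in, and swap it in with insert_after followed by detach.

for element in document.select("p.foo").unwrap().collect::<Vec<_>>() {
    // Parse a fresh copy for every replacement: a node can only live in one
    // place in the tree, and kuchiki has no deep-clone.
    let new_p = kuchiki::parse_html()
        .one("<p class='newp'>Hello, from loved HTML</p>")
        .select_first("body > p")
        .unwrap();

    element.as_node().insert_after(new_p.as_node().clone());
    element.as_node().detach();
}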
mem::size_of::<Node>()
reports 120 bytes, and mem::size_of::<NodeData>()
makes up 80 of those bytes. If we box most of the variants of the NodeData enum, this can shrink to 56 and 16 bytes respectively.
Is it possible to add support for selectors like "div.class p:has(span.something:contains(SomeText)) a" which have pseudo class selectors like ":has" and ":contains"?
I see a lot of PRs rotting on the vine; maybe I or someone else can get access to Kuchiki and help fix some of these issues? I volunteer if no one else wants to do it.
tried to use kuchiki like this
let html_file = Path::new("cocon1.html");
let document = kuchiki::parse_html().from_file(html_file);
But I got this error:
src\main.rs:44:42: 44:51 error: type mismatch resolving `<html5ever::tendril::fmt::UTF8 as html5ever::tendril::fmt::SliceFormat>::Slice == [u8]`:
expected str,
found slice [E0271]
src\main.rs:44 let document = kuchiki::parse_html().from_file(html_file);
^~~~~~~~~
Note: with html5ever, it's OK with this code:
let dom = parse_document(RcDom::default(), Default::default())
    .from_utf8()
    .from_file(html_file)
    .unwrap();
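For reference, the kuchiki call appears to need the same from_utf8 adapter the html5ever version uses (an assumption based on the error above, not a confirmed fix):

use kuchiki::traits::*;
use std::path::Path;

let html_file = Path::new("cocon1.html");
let document = kuchiki::parse_html()
    .from_utf8()
    .from_file(html_file)
    .unwrap();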
Hello,
I have a question about parsing an XML file, and whether I should be expecting children or siblings from the next tag down.
Here's a snippet:
<body copyright="All data copyright Massachusetts Institute of Technology 2016.">
<route tag="saferidecambwest" title="Saferide Cambridge West" color="cc66ff" oppositeColor="000000" latMin="42.3547905" latMax="42.364801" lonMin="-71.1124847" lonMax="-71.0843897">
<stop tag="mass84_d" title="84 Mass Ave" shortTitle="84 Mass" lat="42.3595199" lon="-71.09416" stopId="03"/>
<stop tag="mccrmk" title="McCormick Hall" lat="42.35766" lon="-71.0947" stopId="30"/>
<stop tag="burtho" title="Burton House" lat="42.3560823" lon="-71.098703" stopId="16"/>
<stop tag="newho" title="New House" lat="42.35559" lon="-71.1002099" stopId="35"/>
<stop tag="tangwest" title="Tang/Westgate" lat="42.3547905" lon="-71.1026282" stopId="51"/>
<stop tag="simmhl" title="Simmons hall" lat="42.3565968" lon="-71.1020453" stopId="47"/>
<stop tag="ww15" title="WW15" lat="42.3557087" lon="-71.1097712" stopId="55"/>
<stop tag="brookchest" title="Brookline St at Chestnut St" shortTitle="Brookline @ Chestnut" lat="42.3565399" lon="-71.1089" stopId="14"/>
<stop tag="putmag" title="Putnum Ave at Magazine St" shortTitle="Putnum @ Magazine" lat="42.35904" lon="-71.11096" stopId="43"/>
<stop tag="rivfair" title="River St at Fairmont St" shortTitle="River @ Fairmont" lat="42.3626179" lon="-71.1124847" stopId="44"/>
<stop tag="rivpleas" title="River St at Pleasant St" shortTitle="River @ Pleasant" lat="42.3639197" lon="-71.1086206" stopId="46"/>
<stop tag="rivfrank" title="River Street@Franklin Street" shortTitle="River Street@Franklin" lat="42.364801" lon="-71.105938" stopId="45"/>
<stop tag="sydgreen" title="Sydney St at Green St" shortTitle="Sydney @ Green" lat="42.3624601" lon="-71.1001597" stopId="50"/>
<stop tag="paci70" title="70 Pacific Street" shortTitle="70 Pacific" lat="42.36023" lon="-71.1023" stopId="40"/>
<stop tag="whou" title="NW30/Warehouse" lat="42.3590332" lon="-71.1002949" stopId="54"/>
<stop tag="edge" title="Edgerton" lat="42.3601694" lon="-71.0975111" stopId="23"/>
<stop tag="kendsq" title="Kendall Square T" lat="42.36237" lon="-71.08613" stopId="01"/>
<stop tag="wadse40" title="Wadsworth@E40" lat="42.3612719" lon="-71.0843897" stopId="53"/>
<stop tag="mass77" title="77 Mass Ave" shortTitle="77 Mass" lat="42.3592699" lon="-71.0936799" stopId="02"/>
<direction tag="frcamp" title="from Campus" name="" useForUI="true">
<stop tag="mass84_d" />
<stop tag="mccrmk" />
<stop tag="burtho" />
<stop tag="newho" />
<stop tag="tangwest" />
<stop tag="simmhl" />
<stop tag="ww15" />
<stop tag="brookchest" />
<stop tag="putmag" />
<stop tag="rivfair" />
<stop tag="rivpleas" />
</direction>
<direction tag="tocamp" title="to Campus" name="" useForUI="true">
<stop tag="rivfrank" />
<stop tag="sydgreen" />
<stop tag="paci70" />
<stop tag="whou" />
<stop tag="edge" />
<stop tag="kendsq" />
<stop tag="wadse40" />
<stop tag="mass77" />
</direction>
</route>
</body>
Here's some example code I'm using to try to walk the tree:
let xml = kuchiki::parse_html().one(body); // I also made sure to clean out \n, which was causing some problems.

for route in xml.select("route").unwrap() {
    // Here I want to get the stops that are just children but not all descendants.
    // This only prints one stop, the first one (84 Mass Ave). Expected many stops.
    for stop in route.as_node().children() {
        println!("{:?}", stop);
    }

    // This returns the first stop as expected (84 Mass Ave).
    let first_child = route.as_node().first_child().unwrap();

    // This returns the next stop (McCormick Hall), which I thought was a sibling.
    let child_of_child = first_child.first_child().unwrap();

    // This returns all elements in the rest of the file.
    let first_child_descendants = first_child.descendants();
}
Am I misunderstanding something? I expected all the <stop>
that are next to each other to be siblings, not descendants.
It would be nice if there were a function for a node that turned the node and all its children into the equivalent HTML.
Thanks @SimonSapin for this awesome library! I am having a bit of trouble manipulating the DOM though and can't find any examples of how to do this.
I am using traverse()
to iterate over NodeEdge
s, then parse their attributes. I want to wrap certain elements I find in a new Element I define during the iteration.
For example, I want to find <a href="whatever">my link</a>
and transform it to this: <h1><a href="whatever">my link</a></h1>
.
How can I go about doing so?
Thank you
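One way I can picture this working, sketched under the assumption that insert_after and append behave like their DOM counterparts (appending a node moves it out of its old position); wrap_in is a hypothetical helper, not part of kuchiki.

use html5ever::{LocalName, QualName};
use kuchiki::traits::*;
use kuchiki::NodeRef;

// Put `wrapper` where `node` currently is, then move `node` inside it.
fn wrap_in(node: &NodeRef, wrapper: NodeRef) {
    node.insert_after(wrapper.clone());
    wrapper.append(node.clone());
}

// Usage sketch: wrap every <a> element in a freshly created <h1>.
let document = kuchiki::parse_html().one(r#"<a href="whatever">my link</a>"#);
for link in document.select("a").unwrap().collect::<Vec<_>>() {
    let h1 = NodeRef::new_element(
        QualName::new(None, link.name.ns.clone(), LocalName::from("h1")),
        Vec::new(),
    );
    wrap_in(link.as_node(), h1);
}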
Hello,
I want to get all links on this page: http://cocon.se/
Here's the code I'm using to try to get the links (for testing, the HTML page is saved in the local file cocon1.html):
let html_file = Path::new("cocon1.html").to_path_buf();
let document = kuchiki::parse_html().from_utf8().from_file(&html_file).unwrap();
for link in document.select("a").unwrap() {
    let attributes = link.attributes.borrow();
    println!("{:?} {}", attributes.get("href"), link.text_contents());
}
I got this result :
Some("http://cocon.se/")
Some("http://cocon.se/visualisation/") Visualisations
Some("http://cocon.se/cocon-semantique/") Cocon Sémantique
Some("http://cocon.se/a-propos/") A propos
Some("http://cocon.se/#focus") Quoi c'est ?
Some("http://cocon.se/#focus") J'en suis !
Some("http://twitter.com/Lemoussel")
Some("http://twitter.com/SylvainDeaure")
None Johann
None Sylvain V.
None Fabrice
Some("/cas/") Etude de cas
Some("/actus/") Actus
Some("/historique/") Historique
Some("http://cocon.se/cas/astuce1.html")
Some("http://cocon.se/cas/astuce1.html") Une astuce de grand mère
Some("http://cocon.se/cas/jouer.html")
Some("http://cocon.se/cas/jouer.html") A vous de jouer !
Some("http://cocon.se/actus/gagner-tableau.html")
Some("http://cocon.se/actus/gagner-tableau.html") Qui veut gagner des tableaux ?
Some("http://cocon.se/actus/gagner-tableau.html") […]
Some("http://cocon.se/historique/cocon-con.html")
Some("http://cocon.se/historique/cocon-con.html") Un cocon moins con
Some("#carousel-homepage-latestnews") Previous
Some("#carousel-homepage-latestnews") Next
Some("mailto:[email protected]") [email protected]
Some("http://twitter.com/SylvainDeaure")
Some("/mentions/") Mentions légales
Some("/mentions/")Some("/mentions/")
Some("/mentions/")
/* */
The last three Some("/mentions/") lines are incorrect: there is only one /mentions/ link, with the text Mentions légales, in the HTML page.
This error seems to be caused by the last <a> tag not being closed properly: <a class="zerif-copyright" href="/mentions/">Mentions légales</a</div>
Is it possible to parse an HTML page with errors?
It would be useful if :scope
could be used to anchor selection to the root of the inclusive-descendant tree. For example, in the following document, div_b.select(":scope > div")
should return the div.c
element (as it does in Javascript), but instead returns nothing because the document root is being used as the selection scope.
<html>
<body>
<div class="a">
<div class="b">
<div class="c">
</div>
</div>
</div>
</body>
</html>
From a brief glance at the selectors
crate, it appears that :scope
is supported, but the MatchingContext::scope_element
field must be directly set.
I'd like to be able to create a duplicate of a node, like the DOM's cloneNode
method. Note that this is different to Rust's Clone
trait in not returning another reference to the same node, but rather would be returning a new NodeRef
that doesn't have a parent, and which doesn't share mutations with the other one.
It looks like, currently, there's no way to do this short of implementing a tree-walk myself, which is perfectly feasible, but it feels like something the library could usefully offer - especially with a clarification that .clone()
may not do what you want (which is an easy mistake to make!).
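For anyone needing this today, a rough sketch of the manual tree-walk mentioned above; deep_clone is a hypothetical helper, it assumes the usual NodeRef constructors and that Attributes exposes its map, and it only handles element, text and comment nodes properly.

use kuchiki::{NodeData, NodeRef};

fn deep_clone(node: &NodeRef) -> NodeRef {
    let copy = match node.data() {
        NodeData::Element(el) => NodeRef::new_element(
            el.name.clone(),
            el.attributes.borrow().map.clone(),
        ),
        NodeData::Text(t) => NodeRef::new_text(t.borrow().clone()),
        NodeData::Comment(c) => NodeRef::new_comment(c.borrow().clone()),
        // Doctype, document, fragment and PI nodes are approximated with a
        // fresh document node in this sketch.
        _ => NodeRef::new_document(),
    };

    for child in node.children() {
        copy.append(deep_clone(&child));
    }
    copy
}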
It seems that a newly generated text node inside a script node gets serialized with its contents XML-escaped.
fn create_element(name: &str, att: Option<Vec<(&str, &str)>>) -> NodeRef {
    let att = if let Some(att) = att {
        att.iter()
            .map(|(k, v)| {
                (
                    ExpandedName::new(ns!(), LocalName::from(*k)),
                    Attribute {
                        prefix: None,
                        value: v.to_string(),
                    },
                )
            })
            .collect()
    } else {
        Vec::new()
    };

    let local_name = if name == "script" {
        // I've already replaced the local name with the cached one instead of creating a new local name, but it still does not work.
        local_name!("script")
    } else {
        LocalName::from(name)
    };

    NodeRef::new_element(QualName::new(None, ns!(), local_name), att)
}

let script_el = create_element("script", None);
script_el.append(NodeRef::new_text(include_str!("../script.js" /* non-escaped script! */)));
Inside html5ever, mod.rs#221:
fn write_text(&mut self, text: &str) -> io::Result<()> {
    let escape = match self.parent().html_name {
        Some(local_name!("style")) |
        Some(local_name!("script")) |
        Some(local_name!("xmp")) |
        Some(local_name!("iframe")) |
        Some(local_name!("noembed")) |
        Some(local_name!("noframes")) |
        Some(local_name!("plaintext")) => false,
        Some(local_name!("noscript")) => !self.opts.scripting_enabled,
        _ => true,
    };

    if escape {
        self.write_escaped(text, false)
    } else {
        self.writer.write_all(text.as_bytes())
    }
}
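A guess at the cause, based only on the serializer excerpt above: html_name is populated only for elements in the HTML namespace, and create_element builds the element with ns!() (the null namespace), so the script arm never matches and the text gets escaped. If that is right, the only change needed would be the namespace in the last line of create_element:

// Untested sketch: create the element in the HTML namespace so the
// serializer's html_name check can recognise it as a <script> element.
NodeRef::new_element(QualName::new(None, ns!(html), local_name), att)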
I think it would be worthwhile to add travis-cargo and get docs and code coverage (in addition to tests).
Addition is very easy, I managed to do so for xml5ever. Plus the docs for travis-cargo are quite good.
This could be a good candidate for a beginner level assignment :)
I'm trying to use the methods from: https://simonsapin.github.io/kuchiki/selectors/trait.Element.html
In particular: get_local_name(), get_namespace().
However I get the error:
src/main.rs:64:14: 64:27 error: no method named `get_namespace` found for type `kuchiki::node_data_ref::NodeDataRef<kuchiki::tree::ElementData>` in the current scope
src/main.rs:64 m.get_namespace();
^~~~~~~~~~~~~
src/main.rs:64:14: 64:27 help: items from traits can only be used if the trait is in scope; the following trait is implemented but not in scope, perhaps add a `use` for it:
src/main.rs:64:14: 64:27 help: candidate #1: `use selectors::tree::Element`
error: aborting due to previous error
But I cannot use selectors::tree::Element;
because it is private.
If I extern crate selectors; use selectors::Element;
I get exactly the same error.
I'm currently working on a project that requires compilation to wasm, so I would like to know if it is supported.
This issue was automatically generated. Feel free to close without ceremony if
you do not agree with re-licensing or if it is not possible for other reasons.
Respond to @cmr with any questions or concerns, or pop over to
#rust-offtopic
on IRC to discuss.
You're receiving this because someone (perhaps the project maintainer)
published a crates.io package with the license as "MIT" xor "Apache-2.0" and
the repository field pointing here.
TL;DR the Rust ecosystem is largely Apache-2.0. Being available under that
license is good for interoperation. The MIT license as an add-on can be nice
for GPLv2 projects to use your code.
The MIT license requires reproducing countless copies of the same copyright
header with different names in the copyright field, for every MIT library in
use. The Apache license does not have this drawback. However, this is not the
primary motivation for me creating these issues. The Apache license also has
protections from patent trolls and an explicit contribution licensing clause.
However, the Apache license is incompatible with GPLv2. This is why Rust is
dual-licensed as MIT/Apache (the "primary" license being Apache, MIT only for
GPLv2 compat), and doing so would be wise for this project. This also makes
this crate suitable for inclusion and unrestricted sharing in the Rust
standard distribution and other projects using dual MIT/Apache, such as my
personal ulterior motive, the Robigalia project.
Some ask, "Does this really apply to binary redistributions? Does MIT really
require reproducing the whole thing?" I'm not a lawyer, and I can't give legal
advice, but some Google Android apps include open source attributions using
this interpretation. Others also agree with
it.
But, again, the copyright notice redistribution is not the primary motivation
for the dual-licensing. It's stronger protections to licensees and better
interoperation with the wider Rust ecosystem.
To do this, get explicit approval from each contributor of copyrightable work
(as not all contributions qualify for copyright, due to not being a "creative
work", e.g. a typo fix) and then add the following to your README:
## License
Licensed under either of
* Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
* MIT license ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)
at your option.
### Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted
for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any
additional terms or conditions.
and in your license headers, if you have them, use the following boilerplate
(based on that used in Rust):
// Copyright 2016 kuchiki developers
//
// Licensed under the Apache License, Version 2.0 <LICENSE-APACHE or
// http://www.apache.org/licenses/LICENSE-2.0> or the MIT license
// <LICENSE-MIT or http://opensource.org/licenses/MIT>, at your
// option. This file may not be copied, modified, or distributed
// except according to those terms.
It's commonly asked whether license headers are required. I'm not comfortable
making an official recommendation either way, but the Apache license
recommends it in their appendix on how to use the license.
Be sure to add the relevant LICENSE-{MIT,APACHE}
files. You can copy these
from the Rust repo for a plain-text
version.
And don't forget to update the license
metadata in your Cargo.toml
to:
license = "MIT/Apache-2.0"
I'll be going through projects which agree to be relicensed and have approval
by the necessary contributors and doing these changes, so feel free to leave
the heavy lifting to me!
To agree to relicensing, comment with :
I license past and future contributions under the dual MIT/Apache-2.0 license, allowing licensees to chose either at their option.
Or, if you're a contributor, you can check the box in this repo next to your
name. My scripts will pick this exact phrase up and check your checkbox, but
I'll come through and manually review this issue later as well.
The Sink
type is private, but parse_html
returns a Parser<Sink>
type.
This makes it impossible to reference the return type in a struct:
struct App {
    config: Config,
    current_parser: Option<Parser<Sink>>, // <-- Can't do this
    buffer: [u8; 1024],
    counter: usize,
}
The attributes field on ElementData is a HashMap<QualName, String>. Since QualName doesn't impl Eq for anything but itself, accessing a specific attribute requires constructing a QualName object.
I've downloaded rust-cssparser-0.24.1
and added cssparser = { path = "../rust-cssparser-0.24.1" }
to the Cargo.toml
, but I'm getting:
error[E0277]: the trait bound `select::PseudoElement: cssparser::serializer::ToCss` is not satisfied
--> src/select.rs:25:6
|
25 | impl SelectorImpl for KuchikiSelectors {
| ^^^^^^^^^^^^ the trait `cssparser::serializer::ToCss` is not implemented for `select::PseudoElement`
error[E0277]: the trait bound `select::PseudoClass: cssparser::serializer::ToCss` is not satisfied
--> src/select.rs:25:6
|
25 | impl SelectorImpl for KuchikiSelectors {
| ^^^^^^^^^^^^ the trait `cssparser::serializer::ToCss` is not implemented for `select::PseudoClass`
error[E0277]: the trait bound `select::PseudoClass: cssparser::serializer::ToCss` is not satisfied
--> src/select.rs:82:6
|
82 | impl NonTSPseudoClass for PseudoClass {
| ^^^^^^^^^^^^^^^^ the trait `cssparser::serializer::ToCss` is not implemented for `select::PseudoClass`
error[E0277]: the trait bound `select::PseudoElement: cssparser::serializer::ToCss` is not satisfied
--> src/select.rs:117:6
|
117 | impl selectors::parser::PseudoElement for PseudoElement {
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ the trait `cssparser::serializer::ToCss` is not implemented for `select::PseudoElement`
Not sure why it doesn't work, because if we use cssparser-0.24.1 from crates.io, everything is OK.
I tried using kuchiki as dependency on rust stable
and nightly
, and it didn't work.
Here are steps to reproduce:
1. cargo new test and navigate to the project
2. In Cargo.toml, add the following lines:
3. cargo build on the created project

Expected: build project
Got: Unable to get packages from source.
Perhaps a new version should be deployed to crates.io? I've added other projects just fine (e.g. hyper).
NodeDataRef
provides a reference to its NodeRef
with the as_node()
method. Could it also implement AsRef<NodeRef>
? That would make it easier to write methods that can accept either a NodeDataRef
or a NodeRef
.
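The kind of signature this would enable, as a sketch (the AsRef<NodeRef> impl is the proposal here, not something kuchiki provides today):

use kuchiki::{ElementData, NodeDataRef, NodeRef};

// With the proposed impl, a NodeDataRef<ElementData> could be passed here
// directly, without the caller having to spell out .as_node().
fn first_heading<N: AsRef<NodeRef>>(node: N) -> Option<NodeDataRef<ElementData>> {
    node.as_ref().select_first("h1").ok()
}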
Currently this crate pulls in cssparser and selectors to add the possibility to lookup nodes with CSS selectors.
I'm working on a project that needs to do some HTML manipulation to sanitize the input, but we don't need to use CSS selectors so we would like to avoid pulling crates we don't need.
Would it make sense for this project to move all the selector features behind a Cargo feature, to reduce the number of dependencies for projects that don't need them?
I'm currently writing a media application where users are able to extract data from html pages with custom scripts written in Gluon.
Gluon makes it pretty easy to access Rust data types, but they must be Send
.
The html5ever/kuchiki nodes use Rc<> and so can't be Send.
I've come up with two alternatives that sort of make this work, but they are both ridiculously ugly hacks: (a) have a dedicated parser thread that owns the DOM and handles queries with channel communication; (b) re-parse on each query and return custom nodes that don't use Rc.
Do you see any alternative of how I could make this more sensible?
This would obviously be easier with arena style allocation, but kuchiki has the dom hard-coded so I don't see that happening.
Hi
Tags such as <tbody>
, <tr>
and <td>
seem to get ignored completely by the parser.
let html = "
<tbody>
<td colspan=\"2\">Example 1</td>
</tbody>
<p>Example 2</p>
";
let document = kuchiki::parse_html().one(html);
for node in document.descendants() {
if let Some(element) = node.as_element() {
println!("{:?}", element.name.local);
}
}
Output
Atom('html' type=static)
Atom('head' type=static)
Atom('body' type=static)
Atom('p' type=static)
Does html5ever currently omit those tags, or are they handled in a different way? Calling .data() on the NodeRef does not show any presence of those tags either.
Any advice is greatly appreciated.
Hi Simon,
following your "Sorry for the lack of tutorial-like documentation for html5ever and tendril…" statement in https://stackoverflow.com/questions/35654525/parsing-html-page-content-in-a-stream-with-hyper-and-html5ever/35660699#35660699 i'd like to ask whether you'd accept a pull request to modify kuchiki's repo README.md with a few basic copy-paste ready examples on how to:
That's all I've managed to achieve so far :)
I am writing my first Rust code ever, without any knowledge of DOM manipulation algorithms, with a library that has close to zero documentation. That is painful and has already taken too many hours to get even the most basic examples working... which could have been just copy-pasted from docs. Some examples are put in src/tests.rs, but all the beginners will probably be diverted to https://docs.rs/kuchiki/, which they will dive into only if they are very brave, and most likely not get anywhere from there anyway.
Currently I research examples by following a comment on your Stack Overflow post mentioned above - "For example code you can run a search for extern crate kuchiki on github". That is not an elegant process :)
I sense you'd be keen on preparing the 'proper' tutorial, and that definitely takes a lot of time; my suggestion is to just drop 10-20 copy-pasteable examples in the README. It'd be 20% of the effort achieving 80%, at least for newbies like me. I'm ready to get started on a pull request with my not necessarily elegant, but working, examples.
From what I understand, @quininer had some troubles adding XML support (see #31 , https://github.com/quininer/sanngaa) to kuchiki. Namely, there are no Processing Instructions token/node in kuchiki, which makes sense, seeing it was made to parse HTML, which doesn't have that token.
Is this something that kuchiki could work around, or should we wait for markup5ever?
How to remove every element matching a css-selector?
Currently if I detach()
a node in the select()
loop, it just stops.
i.e.
for css_match in document.select(".someclass").unwrap() {
    css_match.as_node().detach();
    // Stops after one iteration, even if there are multiple in the document.
}
Do I have to keep running select
until all instances are removed?
I'm not sure if this needs to be implemented in one of the servo crates or kuchiki, but I noticed it's currently not supported and it would be very useful to have.
Thanks!
At the moment I can't parse https://google.com
(after resolving redirects), because some character there is invalid UTF-8, apparently.
I can't compile kuchiki. I get the same error when I try to add selectors to Cargo.toml, so maybe that is where the problem lies. I am using the nightly builds (1.5).
Below is the error I'm getting when running: cargo build --verbose
unable to get packages from source
Caused by:
Failed to unpack package kuchiki v0.0.1
Caused by:
No such file or directory (os error 2)
It seems that xml5ever has now merged into html5ever somewhat ... does this mean that I can use kuchiki safely on xml documents (like RSS feeds)?
It seems that Kuchiki will return an Err
when calling select_first
if the id begins with a number. For example, if the HTML has something like this:
<p id="1">Some foo content</p>
This would be accessed by calling:
let p_node = node_ref.select_first("p#1").unwrap();
However this will just return an Err
. Is this a bug in the way a CSS selector is parsed or is it that the CSS spec requires ids to be named starting with an alphabetic character?
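If the digit-leading id is indeed what the selector parser rejects (CSS identifiers cannot start with an unescaped digit), one workaround might be an attribute selector, which sidesteps the identifier rules. A sketch:

// Match on the id attribute's value instead of the #id shorthand.
let p_node = node_ref.select_first(r#"p[id="1"]"#).unwrap();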
I’ve been running into an issue where large documents being parsed cause massive amounts of memory usage that can slow or even crash my server and the ability to limit the memory usage of kuchiki would be invaluable. Having a memory limit could also allow the usage of preallocated buffers, which would do wonders for performance as well
The innerHTML method in the DOM API returns a string of the HTML structure of an element node. Is there an efficient way to make such a call in Kuchiki? The thing that comes closest for me is node_ref.serialize(), which can then be converted to a String, but this looks very inefficient if it is repeatedly called.
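For what it's worth, a sketch of an innerHTML-style helper (hypothetical; it assumes NodeRef::serialize writes the node itself, i.e. the outerHTML): serializing the children into one shared buffer avoids building a separate String per child.

use kuchiki::NodeRef;

fn inner_html(node: &NodeRef) -> String {
    let mut out = Vec::new();
    for child in node.children() {
        // Serialize each child into the same buffer; the node itself is skipped.
        child.serialize(&mut out).expect("serialization failed");
    }
    String::from_utf8(out).expect("serializer produced invalid UTF-8")
}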
About a week ago I was searching for a library to manipulate HTML documents using a DOM like structure. I found many crates, none of which were really ready in my opinion, but I did not find this crate.
In fact, if you go to crates.io now and search for any of the following, you will not find it either:
I only found out about this project by googling "rust rctree html" which led me to the users.rust-lang.org forum thread.
It seems that crates.io has an issue with your written description:
(朽木) HTML/XML tree manipulation library
It sees "HTML/XML" as a single word and as a result if you just search for "html" or "xml" you will not find the crate. I will create an issue regarding this on the crates.io repo if one doesn't already exist.
However for the time being, you can probably solve this by adding the keywords property to Cargo.toml
:
keywords = ["dom", "xml", "html", "markup", "language"]
E.g. a link
element is an empty element. Looking through the API docs, examples and tests I couldn't figure out how to create such an element using kuchiki
.
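Here is my best guess at how to do it, as a sketch (it assumes kuchiki's own ExpandedName/Attribute types and html5ever's QualName/LocalName re-exports, and spells out the XHTML namespace URL instead of using the ns! macro):

use html5ever::{LocalName, QualName};
use kuchiki::{Attribute, ExpandedName, NodeRef};

// Build an empty <link rel="stylesheet" href="style.css"> element.
let link = NodeRef::new_element(
    QualName::new(
        None,
        "http://www.w3.org/1999/xhtml".into(),
        LocalName::from("link"),
    ),
    vec![
        (
            ExpandedName::new("", "rel"),
            Attribute { prefix: None, value: "stylesheet".to_string() },
        ),
        (
            ExpandedName::new("", "href"),
            Attribute { prefix: None, value: "style.css".to_string() },
        ),
    ],
);

// The HTML serializer treats <link> as a void element, so this should print
// something like <link href="style.css" rel="stylesheet">.
println!("{}", link.to_string());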
The select
function in NodeRef
is used to get all nodes matching a given CSS selector which is inclusive of the NodeRef
that calls it. This works fine but could be cause for concern if the iterator it returns is then used to delete nodes.
Is there a way to get a non-inclusive iterator that does not involve using an if statement before looping?
Consider the following HTML
<div id="top">
<div>foo</div>
<div>bar</div>
<p>baz</p>
</div>
In the following Rust code, we assume the node with the div#top
selector is bound to a variable x
:
let mut nodes = x.select("div").unwrap();
while let Some(node_ref) = nodes.next() {
    node_ref.as_node().detach();
}
This would delete div#top
which is probably unintentional if a user only wanted to remove its <div>
children.
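A sketch of what might already work, assuming descendants() excludes the starting node and that the select adapter on node iterators (from kuchiki's iterator traits) filters the iterated nodes by selector:

use kuchiki::traits::*;

// Query the descendants only, so div#top itself can never match.
let matches: Vec<_> = x.descendants().select("div").unwrap().collect();
for div in matches {
    div.as_node().detach();
}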
Hi there!
I'm using Kuchiki to write a toy browser engine. I need to implement the cssparser::QualifiedRuleParser trait to parse CSS stylesheets into kuchiki::Selectors and declarations belonging to those selectors (e.g. margin: auto
), similar to what Servo does here.
However, the KuchikiParser is private, so I can't pass it as the first argument to selectors::parser::SelectorList::parse.
There is kuchiki::Selectors::compile(&str), but I don't have a &str
in parse_prelude and parse_block. Instead, I have a cssparser::Parser, which can go right into the second argument of selectors::parser::SelectorList::parse.
My issue can be solved in one of two ways:
1. Make KuchikiParser part of the public API.
2. Add a kuchiki::Selectors::compile method that takes a cssparser::Parser as input instead of a &str.

It's possible I might be better off using the html5ever, selectors, and cssparser crates directly, but I wanted to explore these options, too. Do either of these fit with your vision for the Kuchiki API?
I tried to compile examples/hyper.rs and got the error: no method named from_http found for type html5ever::driver::Parser<kuchiki::parser::Sink> in the current scope.
Running the example code returns an invalid scheme for HTTP error.
http://stackoverflow.com/questions/41619187/hyper-says-invalid-scheme-for-http-for-https-urls
Fix seems pretty simple. Will create a PR. It adds a dependency on this crate though: https://docs.rs/hyper-native-tls/0.2.2/hyper_native_tls/