Error when no matching nodes (rvest, CLOSED)

tidyverse commented on August 22, 2024
Error when no matching nodes

from rvest.

Comments (13)

hrbrmstr commented on August 22, 2024

I was thinking about this from an SO post a while back. Perhaps either (a) there could be a separate html_nodes_exists function (similar to the one I hacked up at that link) or (b) html_nodes could have something like a check_first logical parameter that wraps the resultant XPath with a boolean() and does a test first or (c) some combo of both. I can take a stab at this depending on which way you'd be leaning (or if you have other ideas).
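For illustration, option (b) might look roughly like the sketch below. It assumes the xml2 package, which later rvest releases build on, and its xml_find_lgl() function. The helper name xpath_exists and the toy HTML are made up for this example and are not part of rvest or xml2.

```r
library(xml2)

# Toy document standing in for a scraped page (assumption for illustration)
doc <- read_html("<html><body><p class='a'>hi</p></body></html>")

# Hypothetical helper: wrap an XPath expression in boolean() so the query
# returns TRUE/FALSE instead of a node set.
xpath_exists <- function(doc, xpath) {
  xml_find_lgl(doc, sprintf("boolean(%s)", xpath))
}

has_p   <- xpath_exists(doc, "//p[@class='a']")  # a matching node exists
has_div <- xpath_exists(doc, "//div")            # no <div> in the toy document
```

The check runs as a single XPath query, so no node set is built just to test for existence. It does require writing XPath rather than a CSS selector.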

hadley commented on August 22, 2024

I think it would be better to return NULL for html_nodes(). That gives an easy existence check.

FrankCData commented on August 22, 2024

Not sure if this belongs here (or should be noted separately), but I'm having a related issue scraping a page with one missing value, i.e. for one "row" of data, a value is missing. The node is also completely absent from the HTML (not just empty or containing NA), so I get the error "arguments imply differing number of rows".

e.g.
library(rvest)

# read_html() replaces html(), which later rvest releases deprecated
Flickr_One_Camera_Data <- read_html("https://www.flickr.com/cameras/canon")

FACD <- data.frame(brand = "canon",
                   name = Flickr_One_Camera_Data %>% html_nodes("td a") %>% html_text(),
                   type = Flickr_One_Camera_Data %>% html_nodes("td:nth-child(6)") %>% html_text(),
                   stringsAsFactors = FALSE)

Rank (row) 28 has no entry for type, and I can't seem to trap this in any way or replace it with a default value. Presumably it is something related to html_nodes().

hrbrmstr commented on August 22, 2024

You can work around that by using html_table and setting fill=TRUE:

html_table(Flickr_One_Camera_Data, fill=TRUE)[[1]][28,]
##    Rank ▾  Name # of items Avg. daily users Activity Factor Type
## 28     28 EOS M  3,422,744              224               6 <NA>

Then clean up the columns afterwards. The resultant XPath that retrieves the td nodes in the html_nodes call is behaving as expected. I'm not sure html_nodes needs a fill option.

FrankCData commented on August 22, 2024

Thanks - that deals nicely with the missing value.

However, not being able to use html_nodes is giving me a lot of unwanted data in some columns I will need to clean up (as you mention).
So, if "fill" is undesirable, is there some other way of detecting a missing node value?

(In my original version, I only select the actual data I need - using SelectorGadget and CSS selectors. The node contains less data than a column, so no cleaning-up is needed - but I have the problem of missing nodes, hence my original post)

hadley commented on August 22, 2024

@FrankCPhoto how can rvest detect something that is missing? I don't see how you could tell which column is the one that's missing.

FrankCData commented on August 22, 2024

(Disclaimer first - I'm a relative beginner in R)

I was thinking along these lines: when reading tabular data, each column should be present, or there should be some way to trap it if not. Of course, I appreciate that not all scraped data is tabular, and a node could just be a single value from anywhere on a web page. So this would probably have to apply to tabular data only.

It's not a problem I've had before, i.e. the complete absence of a node, as opposed to an empty "field". But I imagine it might be a common enough issue for web scraping. If I can trap it somehow (I was trying things like is.na etc.) then I can handle it.
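One possible way to trap an absent cell is sketched below. It relies on the behaviour of later rvest releases (built on xml2), where html_node(), singular, applied to a set of rows returns one result per row and html_text() yields NA for rows without a match. The toy table, its class names, and the "unknown" default are assumptions for illustration and do not reflect the real Flickr markup.

```r
library(rvest)

# Toy table standing in for the scraped page (assumption for illustration);
# the second row has no "type" cell at all.
doc <- read_html('
<table>
  <tr><td class="name">EOS 5D</td><td class="type">DSLR</td></tr>
  <tr><td class="name">EOS M</td></tr>
  <tr><td class="name">PowerShot</td><td class="type">Compact</td></tr>
</table>')

# Select the rows first, then look up at most one cell per row.
rows <- html_nodes(doc, "tr")

cameras <- data.frame(
  name = html_text(html_node(rows, "td.name")),
  type = html_text(html_node(rows, "td.type")),  # NA where the cell is absent
  stringsAsFactors = FALSE
)

# Replace the NA left by the missing node with a default value.
cameras$type[is.na(cameras$type)] <- "unknown"
```

Because every column is derived from the same set of rows, the columns always have equal length, which avoids the "differing number of rows" error.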

hadley commented on August 22, 2024

@FrankCPhoto as @hrbrmstr suggested, you'll need to use fill = TRUE and then clean up yourself. I can't see any way that httr could help you more than that.

FrankCData commented on August 22, 2024

@hadley (& @hrbrmstr) Thanks for the help. I have managed to work around it using fill.

sebastianbarfort commented on August 22, 2024

I really think an html_nodes_exist function would be useful. I get the Error in class(out) <- "XMLNodeSet" : attempt to set an attribute on NULL error message a lot, and I can't figure out how to write one myself. The SO answer uses XPath, but that really defeats the purpose of using CSS selectors.
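A CSS-based check along these lines could be sketched as follows. It assumes a later rvest release in which html_nodes() returns an empty node set, rather than raising an error, when nothing matches. The function html_nodes_exist is hypothetical (named after the suggestion in this thread) and is not part of rvest.

```r
library(rvest)

# Hypothetical helper, not part of rvest: TRUE if the CSS selector matches
# at least one node. The is.null() guard covers a NULL result as well.
html_nodes_exist <- function(doc, css) {
  nodes <- html_nodes(doc, css)
  !is.null(nodes) && length(nodes) > 0
}

# Toy document standing in for a scraped page (assumption for illustration)
doc <- read_html("<html><body><p class='a'>hi</p></body></html>")

found   <- html_nodes_exist(doc, "p.a")          # selector matches one node
missing <- html_nodes_exist(doc, "div.missing")  # selector matches nothing
```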

wqr786 commented on August 22, 2024

I think html_nodes() should return NULL when a value doesn't exist.

I see @hadley committed and closed the issue, but I'm trying it out and it doesn't seem to work for me. I am trying to fetch some values from several pages, and for one of the pages it gives this error:
L::xmlValue, ..., .type = character(1)) :
Unknown input of class: NULL

Anyone know of this?

hrbrmstr commented on August 22, 2024

Can you post an example?

hadley commented on August 22, 2024

Please file a reproducible example as a new issue
