Comments (7)

hadley commented on July 21, 2024

Alternatively, you could write it like this:

library(rvest)

nodes <- "http://pyvideo.org/category/50/pycon-us-2014" %>%
  html %>%
  html_nodes("div.video-summary-data")

column <- function(x) nodes %>% html_node(xpath = x) %>% html_text()

df <- data.frame(
  title = column("div[1]//a"),
  author = column("div[3]//a"),
  date = column("div[4]//small[1]"),
  language = column("div[4]//small[2]"),
  description = column("div[5]//p"),
  stringsAsFactors = FALSE
)

I think this is simple enough that you don't need a wrapper.

from rvest.

abresler commented on July 21, 2024

Man, this is amazing!! Soon people will see how powerful R can be for web scraping.

hadley commented on July 21, 2024

I like it - I need to think through the syntax a bit though. It's a bit too reliant on xpath currently - i.e. there's no way to access an attribute with a css selector. It'd also be nice if it worked in a way where you could easily test each component before building up into a data frame.

(Also I think this will require a variant of html_node() that doesn't unlist(), and assumes the selector only pulls out one node. Now that I think of it, html_node() is basically [, and we need an equivalent of [[. Maybe html_node() should be renamed to html_nodes() and html_node() should only ever pull out one node per input element.)
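
To make the `[` versus `[[` analogy concrete, here is a minimal base R sketch. The list `x` and its contents are made up for illustration and have nothing to do with rvest itself; the point is only that `[` keeps the container while `[[` pulls out exactly one element.

```r
# Hypothetical list used only to illustrate the analogy.
x <- list(a = 1:3, b = "text")

sub <- x["a"]    # `[` keeps the container: still a list, of length 1
el  <- x[["a"]]  # `[[` extracts the single element: an integer vector

str(sub)
str(el)
```

Read this way, a function that behaves like `[` may return any number of matches, while one that behaves like `[[` returns exactly one result per input.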

hadley commented on July 21, 2024

Another idea is to implement the css selector extensions used by https://github.com/EricChiang/pup

renkun-ken commented on July 21, 2024

Great thinking! The design of html_node() and html_nodes() handles many problematic cases in which the nodes are not quite regular. The code above does not really work because of some exceptional cases in the webpage, but with these two functions it works correctly, without much worry about missing or blank values in some fields.
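
A small sketch of the missing-value case, assuming a reasonably recent rvest in which html_node() returns one result per input node and html_text() yields NA for a missing one. The HTML snippet is invented for illustration (the second entry deliberately has no author link) and is not taken from the pyvideo.org page.

```r
library(rvest)

# Hypothetical markup: the second entry is missing its author <div>.
page <- read_html('
  <div class="video-summary-data">
    <div><a href="#">Talk A</a></div>
    <div><a href="#">Speaker A</a></div>
  </div>
  <div class="video-summary-data">
    <div><a href="#">Talk B</a></div>
  </div>
')

nodes <- html_nodes(page, "div.video-summary-data")

# One value per entry, so both columns have the same length.
title  <- nodes %>% html_node(xpath = "div[1]//a") %>% html_text()
author <- nodes %>% html_node(xpath = "div[2]//a") %>% html_text()

df <- data.frame(title = title, author = author, stringsAsFactors = FALSE)
print(df)
```

Because each column has exactly one value per entry, the data frame stays rectangular and the missing author shows up as NA instead of shifting later rows.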

renkun-ken commented on July 21, 2024

Using html_node() and html_nodes(), the code can be much less error-prone.

library(rvest)

nodes <- "http://pyvideo.org/category/50/pycon-us-2014" %>%
  html %>%
  html_nodes("div.video-summary-data")

cols <- list(
  title = "div[1]//a//text()",
  author = "div[3]//a//text()",
  date = "div[4]//small[1]//text()",
  language = "div[4]//small[2]//text()",
  description = "div[5]//p//text()")

df <- cols %>%
  lapply(function(col) {
    nodes %>%
      html_node(xpath = col) %>%
      html_text
  }) %>%
  data.frame(stringsAsFactors = FALSE)

renkun-ken commented on July 21, 2024

@hadley, fantastic use, I just forgot about using a function :) This workflow looks very natural now.
