
Comments (5)

renkun-ken commented on July 21, 2024

Using the following packages:

  • rvest: for easy scraping
  • stringr: for easy string manipulation
library(rvest)    # install.packages("rvest")
library(stringr)  # install.packages("stringr")

url <- "http://colleges.usnews.rankingsandreviews.com/best-colleges/rankings/national-universities/data/page+%d"

# str_replace_all() replaces every match; str_replace() would stop at the first
trimNewline <- function(p) str_replace_all(p, "\n", "")
asInteger <- function(x) as.integer(str_replace_all(x, ",", ""))
fromPercent <- function(x) as.numeric(str_replace(x, "%", "")) / 100

table <- 1:3 %>%
  lapply(function(page) {
    nodes <- sprintf(url, page) %>%
      read_html() %>%   # named html() in early rvest releases
      html_nodes("table tbody tr")

    column <- function(xpath) nodes %>% html_node(xpath = xpath) %>% html_text(trim = TRUE)

    data.frame(
      rank = column("td[1]/div[1]/span") %>%
        str_replace("#(\\d+)(Tie)?","\\1") %>%
        as.integer,
      score = column("td[1]/span[1]/span") %>%
        str_replace("(\\d+) out of 100.","\\1") %>%
        asInteger,
      name = column("td[2]/a"),
      location = column("td[2]/p/text()[1]"),
      tuitionAndFees = column("td[3]/text()[1]") %>% trimNewline,
      totalEnrollment = column("td[4]/text()[1]") %>% asInteger,
      fall2013AcceptanceRate = column("td[5]/text()[1]") %>% fromPercent,
      averageFreshmanRetentionRate = column("td[6]/text()[1]") %>% fromPercent,
      sixYearGraduationRate = column("td[7]/text()[1]") %>% fromPercent,
      stringsAsFactors = FALSE
    )
  }) %>%
  do.call(rbind, .)
> head(table)
  rank score                  name      location tuitionAndFees
1    1   100  Princeton University Princeton, NJ        $41,820
2    2    99    Harvard University Cambridge, MA        $43,938
3    3    98       Yale University New Haven, CT        $45,800
4    4    95   Columbia University  New York, NY        $51,008
5    4    95   Stanford University  Stanford, CA        $44,757
6    4    95 University of Chicago   Chicago, IL        $48,253
  totalEnrollment fall2013AcceptanceRate averageFreshmanRetentionRate
1            8014                  0.074                         0.98
2           19882                  0.058                         0.97
3           12109                  0.069                         0.99
4           23606                  0.069                         0.99
5           18136                  0.057                         0.98
6           12539                  0.088                         0.99
  sixYearGraduationRate
1                  0.97
2                  0.97
3                  0.98
4                  0.96
5                  0.96
6                  0.93

You can change 1:3 to cover more pages, for example 1:11 for all 11. :)
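
As a small illustration of extending the page range: sprintf() is vectorized, so the same url template yields one address per page number. The name page_urls is only for this sketch, and the count of 11 pages is taken from the remark above.

```r
# URL template from the script above; %d is filled in with the page number
url <- "http://colleges.usnews.rankingsandreviews.com/best-colleges/rankings/national-universities/data/page+%d"

# sprintf() is vectorized over its arguments: one URL per page number
page_urls <- sprintf(url, 1:11)

length(page_urls)   # number of URLs built
page_urls[11]       # address of the last page
```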

Note that on the later pages some cells contain tooltips (which is annoying), so I use XPath expressions such as td[5]/text()[1] to make sure only the first text node is selected.
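
Here is a minimal sketch of why text()[1] helps, using a made-up table row rather than the real page. The span element and its "tip" class are assumptions for illustration only; the actual tooltip markup on the site may look different.

```r
library(rvest)

# Hypothetical row: the cell holds the value plus an extra tooltip element
page <- read_html('<table><tbody><tr>
  <td>7.4%<span class="tip">Fall 2013 figure</span></td>
</tr></tbody></table>')

row <- html_nodes(page, "table tbody tr")

# Selecting the whole cell also picks up the tooltip text
whole_text <- html_text(html_node(row, xpath = "td[1]"), trim = TRUE)

# Selecting only the first text node of the cell leaves the tooltip out
first_text <- html_text(html_node(row, xpath = "td[1]/text()[1]"), trim = TRUE)

print(whole_text)
print(first_text)
```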

from rvest.

ignacio82 commented on July 21, 2024

Very impressive. I need to go over it line by line to make sure I understand how you did your magic.
Thanks!


renkun-ken commented on July 21, 2024

Thanks, @ignacio82! If you have any questions about it, just ask here. Let me first point out the basic knowledge you need:

  • HTML
  • CSS selector
  • XPath selector
  • Regular expression

You don't have to dive deep, just get to know the very basics. You don't need to be a professional web developer to scrape a few web pages.
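
To give a feel for the regular-expression part, here is a minimal sketch in base R of the same patterns the script passes to str_replace. The input strings are made up for illustration; the real cell text may differ.

```r
# Made-up inputs resembling the rank, score and enrollment cells
rank_raw  <- c("#1", "#4Tie")
score_raw <- "95 out of 100."

# "#(\\d+)(Tie)?" captures the digits after "#"; "\\1" keeps only that group
rank <- as.integer(sub("#(\\d+)(Tie)?", "\\1", rank_raw))

# Same idea for the score: keep the leading number, drop the rest
score <- as.integer(sub("(\\d+) out of 100.", "\\1", score_raw))

# gsub() removes every comma, so longer numbers parse as well
enrollment <- as.integer(gsub(",", "", c("19,882", "1,234,567")))

print(rank)
print(score)
print(enrollment)
```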


ignacio82 commented on July 21, 2024

Where can I read about the following?

  • CSS selector
  • XPath selector

This is the first time I have heard about that stuff...


renkun-ken commented on July 21, 2024

http://www.w3schools.com/ offers great and basic tutorials on a wide variety of web stuff.

You can quickly go through HTML, CSS and XPath.

Basically speaking,

HTML is the markup language behind web pages; it defines the content and layout of a page. A page like the ranking is described by a deeply nested collection of tags, expressed as plain text, so that your web browser can receive the text from the server, analyze its structure and figure out how to render it.

CSS is a language for writing style sheets: rules that match tags or classes in the HTML, so that different groups of elements can have different styles (color, border, etc.) without repeating inline style declarations on every element. A CSS selector is the part of a rule that picks out a particular group of elements on the page, which helps the browser (and us) select them.
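
A minimal sketch of CSS selectors in rvest, assuming a tiny made-up table. The class names "name" and "loc" are invented for this example and are not taken from the real ranking page.

```r
library(rvest)

# Made-up fragment; the class names are assumptions for this sketch
page <- read_html('<table><tbody>
  <tr><td class="name">Princeton University</td><td class="loc">Princeton, NJ</td></tr>
  <tr><td class="name">Harvard University</td><td class="loc">Cambridge, MA</td></tr>
</tbody></table>')

# "table tbody tr": every <tr> nested inside a <tbody> inside a <table>
rows <- html_nodes(page, "table tbody tr")

# "td.name": every <td> whose class attribute contains "name"
school_names <- html_text(html_nodes(page, "td.name"))

print(length(rows))
print(school_names)
```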

Note that HTML is very close to XML, which is used to store and transmit data between different services. XML has no predefined tags, whereas HTML defines a set of tags so that the browser knows how to interpret an element from its tag name. XPath is a flexible and powerful query language for selecting a particular set of nodes in XML, and it mostly applies to HTML as well.
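
A minimal sketch of XPath queries against HTML, again on a made-up table rather than the real page. It shows an absolute path that searches the whole document and a relative path evaluated once per row, which is how the column() helper in the script uses it.

```r
library(rvest)

# Made-up fragment for illustration
page <- read_html('<table><tbody>
  <tr><td>Princeton University</td><td>Princeton, NJ</td></tr>
  <tr><td>Harvard University</td><td>Cambridge, MA</td></tr>
</tbody></table>')

# "//tr/td[2]": the second <td> child of every <tr>, anywhere in the document
locations <- html_text(html_nodes(page, xpath = "//tr/td[2]"))

# "td[1]" is relative: it is resolved separately against each selected row
rows <- html_nodes(page, "table tbody tr")
first_cells <- html_text(html_node(rows, xpath = "td[1]"))

print(locations)
print(first_cells)
```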

So you need to understand the basic motivation and the know-how to get started scraping web pages :)
