Is it possible to scrape the college rankings data from this website?

Scraping college rankings data? about rvest HOT 5 CLOSED

tidyverse commented on July 21, 2024

Scraping college rankings data?

from rvest.

Comments (5)

renkun-ken commented on July 21, 2024

Using the following packages:

rvest: for easy scraping
stringr: for easy string manipulation

library(rvest)    # install.packages("rvest")
library(stringr)  # install.packages("stringr")

url <- "http://colleges.usnews.rankingsandreviews.com/best-colleges/rankings/national-universities/data/page+%d"

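# Helpers that clean the scraped text; str_replace_all() removes every occurrence, not just the first
trimNewline <- function(p) str_replace_all(p, "\n", "")
asInteger <- function(x) as.integer(str_replace_all(x, ",", ""))
fromPercent <- function(x) as.numeric(str_replace_all(x, "%", "")) / 100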

table <- 1:3 %>%
  lapply(function(page) {
    nodes <- sprintf(url, page) %>%
      read_html() %>%  # read_html() replaces html(), which was deprecated in later rvest releases
      html_nodes("table tbody tr")

    column <- function(xpath) nodes %>% html_node(xpath = xpath) %>% html_text(trim = TRUE)

    data.frame(
      rank = column("td[1]/div[1]/span") %>%
        str_replace("#(\\d+)(Tie)?","\\1") %>%
        as.integer,
      score = column("td[1]/span[1]/span") %>%
        str_replace("(\\d+) out of 100.","\\1") %>%
        asInteger,
      name = column("td[2]/a"),
      location = column("td[2]/p/text()[1]"),
      tuitionAndFees = column("td[3]/text()[1]") %>% trimNewline,
      totalEnrollment = column("td[4]/text()[1]") %>% asInteger,
      fall2013AcceptanceRate = column("td[5]/text()[1]") %>% fromPercent,
      averageFreshmanRetentionRate = column("td[6]/text()[1]") %>% fromPercent,
      sixYearGraduationRate = column("td[7]/text()[1]") %>% fromPercent,
      stringsAsFactors = FALSE
    )
  }) %>%
  do.call(rbind, .)

> head(table)
  rank score                  name      location tuitionAndFees
1    1   100  Princeton University Princeton, NJ        $41,820
2    2    99    Harvard University Cambridge, MA        $43,938
3    3    98       Yale University New Haven, CT        $45,800
4    4    95   Columbia University  New York, NY        $51,008
5    4    95   Stanford University  Stanford, CA        $44,757
6    4    95 University of Chicago   Chicago, IL        $48,253
  totalEnrollment fall2013AcceptanceRate averageFreshmanRetentionRate
1            8014                  0.074                         0.98
2           19882                  0.058                         0.97
3           12109                  0.069                         0.99
4           23606                  0.069                         0.99
5           18136                  0.057                         0.98
6           12539                  0.088                         0.99
  sixYearGraduationRate
1                  0.97
2                  0.97
3                  0.98
4                  0.96
5                  0.96
6                  0.93

You can change 1:3 to cover more pages, for example 1:11 for all 11. :)

Note that on the later pages some cells contain tooltips (which is annoying), so I have to use an XPath such as td[5]/text()[1] to ensure that only the first text node is selected.
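To make the text()[1] point concrete, here is a minimal sketch on made-up markup. The table, the school names and the tip class are all assumptions for illustration, not the real structure of the rankings page.

```r
library(rvest)

# Hypothetical markup: the second row's cell carries an extra tooltip element.
page <- read_html('<table><tbody>
  <tr><td>Plain University</td><td>7.4%</td></tr>
  <tr><td>Tooltip University</td><td>5.8%<span class="tip">Fall 2013 figure</span></td></tr>
</tbody></table>')

rows <- html_nodes(page, "table tbody tr")

# Whole cell: the tooltip text is glued onto the value.
whole <- rows %>% html_node(xpath = "td[2]") %>% html_text(trim = TRUE)

# First text node only: the tooltip is left out.
first <- rows %>% html_node(xpath = "td[2]/text()[1]") %>% html_text(trim = TRUE)

print(whole)
print(first)
```

Selecting the whole cell would make the later percent conversion fail on the second row, while the first text node is still a clean value.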

ignacio82 commented on July 21, 2024

Very impressive. I need to go over it line by line to make sure I understand how you did your magic.
Thanks!

renkun-ken commented on July 21, 2024

Thanks, @ignacio82! If you have any questions about it, just ask here. Let me first point out the basic knowledge you need:

HTML
CSS selector
XPath selector
Regular expression

You don't have to dive deep, just get to know the very basics. You don't have to be a professional web developer to scrape some web pages.
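As a small taste of the regular expression part, here is a sketch of the kind of cleanup the scraping code does. The input strings are assumed examples of raw cell text, not values copied from the site.

```r
library(stringr)

# Assumed raw cell text (illustrative only).
ranks  <- c("#1", "#4Tie", "#12Tie")
scores <- c("100 out of 100.", "95 out of 100.")
rates  <- c("7.4%", "5.8%")

# (\\d+) captures the digits; \\1 puts only the captured digits back.
ranks_int  <- as.integer(str_replace(ranks, "#(\\d+)(Tie)?", "\\1"))
scores_int <- as.integer(str_replace(scores, "(\\d+) out of 100.", "\\1"))

# Drop the percent sign, then convert to a fraction.
rates_num <- as.numeric(str_replace(rates, "%", "")) / 100

print(ranks_int)
print(scores_int)
print(rates_num)
```

The pattern matches the whole string and the replacement keeps just the capture group, which is why the "#" and the optional "Tie" disappear.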

ignacio82 commented on July 21, 2024

Where can I read about:

CSS selector
XPath selector
?
This is the first time I've heard about that stuff...

renkun-ken commented on July 21, 2024

http://www.w3schools.com/ offers great and basic tutorials on a wide variety of web stuff.

You can quickly go through HTML, CSS and XPath.

Basically speaking,

HTML is the markup language behind web pages; it defines the content and layout of a page. A web page like the rankings page is described by a deeply nested collection of tags expressed in plain text, so your web browser can receive the text from the server, analyze its structure, and figure out how to render it.

CSS is a language for defining style sheets that match tags or classes in HTML, so that different groups of elements can have different styles (color, border, etc.) without redundant inline style declarations on each element. A CSS selector helps the browser (and us) select a particular group of elements in the web page.
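Here is a rough sketch of a CSS selector at work in rvest. The list markup, the school and ad classes, and the hrefs are invented for the example.

```r
library(rvest)

# Hypothetical markup with two "school" items and one unrelated item.
page <- read_html('<ul>
  <li class="school"><a href="/princeton">Princeton University</a></li>
  <li class="school"><a href="/harvard">Harvard University</a></li>
  <li class="ad"><a href="/ad">Sponsored link</a></li>
</ul>')

# "li.school a" means: every <a> inside an <li> whose class is "school".
links <- html_nodes(page, "li.school a")

school_names <- html_text(links)
school_hrefs <- html_attr(links, "href")

print(school_names)
print(school_hrefs)
```

The same selector string would also work in a browser's developer tools, which is a handy way to try a selector before putting it in a script.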

Note that HTML is very close to XML, which is used to store and transmit data between different services. XML has no predefined tags, whereas HTML defines a set of tags so that the browser knows how to interpret an element from its tag name. XPath is a flexible and powerful way to describe a query for a particular set of nodes in XML, and it mostly applies to HTML as well.
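For comparison, a minimal sketch of an XPath query on HTML, again on invented markup. It selects the same links once with a CSS selector and once with XPath, then uses a positional predicate, which is where XPath starts to be more expressive.

```r
library(rvest)

# Hypothetical markup (same shape as a simple list of schools).
page <- read_html('<ul>
  <li class="school"><a href="/princeton">Princeton University</a></li>
  <li class="school"><a href="/harvard">Harvard University</a></li>
  <li class="ad"><a href="/ad">Sponsored link</a></li>
</ul>')

# The same query written two ways.
by_css   <- html_nodes(page, "li.school a") %>% html_text()
by_xpath <- html_nodes(page, xpath = "//li[@class='school']/a") %>% html_text()

# Positional predicate: the link inside the second <li> of its parent.
second <- html_nodes(page, xpath = "//li[2]/a") %>% html_text()

print(by_css)
print(by_xpath)
print(second)
```

Which one to use is mostly a matter of taste; CSS selectors tend to be shorter, while XPath can address things like text nodes and positions more directly.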

So you only have to understand the basic motivation and know-how to get started scraping web pages :)
