Comments (5)
Using the following packages:
- rvest: for easy scraping
- stringr: for easy string manipulation
library(rvest)   # devtools::install_github("rvest","hadley")
library(stringr) # install.packages("stringr")
url <- "http://colleges.usnews.rankingsandreviews.com/best-colleges/rankings/national-universities/data/page+%d"
# Small helpers for cleaning the scraped cell text
trimNewline <- function(p) str_replace(p, "\n","")
asInteger <- function(x) as.integer(str_replace(x,",",""))
fromPercent <- function(x) as.numeric(str_replace(x, "%", ""))/100
table <- 1:3 %>%
  lapply(function(page) {
    # Fetch one results page and select its table rows
    nodes <- sprintf(url, page) %>%
      html() %>%  # note: later rvest releases use read_html() instead of html()
      html_nodes("table tbody tr")
    # Extract one column from every row, given an XPath relative to the row
    column <- function(xpath) nodes %>% html_node(xpath = xpath) %>% html_text(trim = TRUE)
    data.frame(
      rank = column("td[1]/div[1]/span") %>%
        str_replace("#(\\d+)(Tie)?","\\1") %>%
        as.integer,
      score = column("td[1]/span[1]/span") %>%
        str_replace("(\\d+) out of 100.","\\1") %>%
        asInteger,
      name = column("td[2]/a"),
      location = column("td[2]/p/text()[1]"),
      tuitionAndFees = column("td[3]/text()[1]") %>% trimNewline,
      totalEnrollment = column("td[4]/text()[1]") %>% asInteger,
      fall2013AcceptanceRate = column("td[5]/text()[1]") %>% fromPercent,
      averageFreshmanRetentionRate = column("td[6]/text()[1]") %>% fromPercent,
      sixYearGraduationRate = column("td[7]/text()[1]") %>% fromPercent,
      stringsAsFactors = FALSE
    )
  }) %>%
  do.call(rbind, .)  # stack the per-page data frames into one
> head(table)
  rank score                  name      location tuitionAndFees
1    1   100  Princeton University Princeton, NJ        $41,820
2    2    99    Harvard University Cambridge, MA        $43,938
3    3    98       Yale University New Haven, CT        $45,800
4    4    95   Columbia University  New York, NY        $51,008
5    4    95   Stanford University  Stanford, CA        $44,757
6    4    95 University of Chicago   Chicago, IL        $48,253
  totalEnrollment fall2013AcceptanceRate averageFreshmanRetentionRate
1            8014                  0.074                         0.98
2           19882                  0.058                         0.97
3           12109                  0.069                         0.99
4           23606                  0.069                         0.99
5           18136                  0.057                         0.98
6           12539                  0.088                         0.99
  sixYearGraduationRate
1                  0.97
2                  0.97
3                  0.98
4                  0.96
5                  0.96
6                  0.93
You may change `1:3` to cover more pages, like all 11. :)
Note that on the later pages some cells contain tips (which is annoying), so I have to use XPath expressions such as `td[5]/text()[1]` to ensure that only the first text node is selected.
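A minimal sketch of why `text()[1]` helps, using a made-up table cell. The markup and the `tip` class below are assumptions for illustration, not the actual rankings page, and `read_html()` is the name used by later rvest releases:

```r
library(rvest)

# Hypothetical row: the cell holds a value followed by a nested tip element.
page <- read_html('<table><tbody><tr>
  <td>7.4%<span class="tip">Fall 2013 data</span></td>
</tr></tbody></table>')
row <- html_nodes(page, "table tbody tr")

# Selecting the whole cell returns the value and the tip text run together.
whole_cell <- row %>% html_node(xpath = "td[1]") %>% html_text(trim = TRUE)

# Selecting only the first text node of the cell leaves the tip text out.
first_text <- row %>% html_node(xpath = "td[1]/text()[1]") %>% html_text(trim = TRUE)
```

Here `whole_cell` contains the tip text appended to the percentage, while `first_text` holds only the percentage, which is what `fromPercent` needs.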
from rvest.
Very impressive. I need to go over it line by line to make sure I understand how you did your magic.
Thanks!
Thanks, @ignacio82! If you have any questions about it, just ask here. Let me first point out the basic knowledge you need:
- HTML
- CSS selector
- XPath selector
- Regular expression
You don't have to dive deep; just get to know the very basics. You don't have to be a professional web developer to scrape some web pages.
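As a small illustration of the regular-expression part, here is a sketch of the kind of cleanup done in the scraping code. The raw strings are assumed examples of what a cell might contain, not values copied from the page:

```r
library(stringr)

# Assumed raw cell text; the real page may format ranks differently.
raw_rank <- c("#1", "#4Tie")

# "#(\\d+)(Tie)?" matches "#", captures the digits, and optionally matches "Tie";
# replacing the match with "\\1" keeps only the captured digits.
rank <- as.integer(str_replace(raw_rank, "#(\\d+)(Tie)?", "\\1"))

# The same idea strips a thousands separator or a percent sign before conversion.
enrollment <- as.integer(str_replace("8,014", ",", ""))
acceptance <- as.numeric(str_replace("7.4%", "%", "")) / 100
```

Note that `str_replace` only replaces the first match in each string; `str_replace_all` would be the safer choice for numbers with more than one comma.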
Where can I read about:
- CSS selector
- XPath selector
This is the first time I've heard about that stuff...
http://www.w3schools.com/ offers good, basic tutorials on a wide variety of web topics.
You can quickly go through HTML, CSS and XPath.
Basically speaking,
HTML is the markup language behind web pages; it defines the content and layout of a page. A web page like the ranking is described by a deeply nested collection of tags, expressed in plain text, so that your web browser can receive the text from the server, analyze its structure, and figure out how to render it.
CSS is a language for defining style sheets whose rules match tags or classes in HTML, so that different groups of elements can have different styles (color, border, etc.) without redundant inline style declarations on each element. A CSS selector helps the browser (and us) select a particular group of elements in the web page.
Note that HTML is very close to XML, which is used to store and transmit data between different services. XML has no predefined tags, whereas HTML defines a set of tags so that the browser knows how to interpret an element from its tag name. XPath is a flexible and powerful way to describe a query for a particular set of nodes in XML, and it mostly applies to HTML as well.
So you only need to understand the basic motivation and know-how to get started scraping web pages :)
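To make the difference concrete, here is a minimal sketch that selects the same elements once with a CSS selector and once with XPath. The HTML fragment, the `school` class, and the names in it are invented for illustration:

```r
library(rvest)

# Hypothetical page fragment, not taken from the rankings site.
page <- read_html('<table><tbody>
  <tr><td><a class="school" href="/a">Alpha University</a></td><td>10%</td></tr>
  <tr><td><a class="school" href="/b">Beta College</a></td><td>20%</td></tr>
</tbody></table>')

# CSS selector: every <a> with class "school" inside a table row.
by_css <- page %>% html_nodes("tr a.school") %>% html_text()

# XPath: the same links, addressed by their position in the tree.
by_xpath <- page %>% html_nodes(xpath = "//tr/td[1]/a") %>% html_text()
```

Both calls return the two link texts. CSS selectors tend to be shorter when elements carry useful classes, while XPath can also address things CSS cannot, such as individual text nodes.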