
scraper's Introduction

scraper

Project for scraping content out of pages and/or feeds.

The big idea here is to use a fluent builder to make a simple scraping DSL. For example, the simplest scraping job would look like this:

String pageContent = new Scraper.Builder().url("http://www.apple.com").getResult();

Chaining calls lets you do things like stipulate whether you want just the text, or whether any transforms should be performed. So the above could be changed like this to return just the text (for instance, for a classification engine):

String pageContent = new Scraper.Builder().url("http://www.apple.com").asText().getResult();

Further manipulators can be used to direct the scraper to certain elements. For example, suppose we wanted nothing more than the links from the 3rd table; that would look something like this:

List<String> urls = scraper
	.url(testTableHtmlUrl)
	.extract(scraper.extractor().table(3).links().getResults())
	.getResults();

Here is a more advanced case. We want to get the 3rd table, extract the links from it, and then get the value of the oppId parameter from each link:

List<String> ids = scraper
	.url(testTableHtmlUrl)
	.extract(scraper.extractor().table(3).links().parameter("oppId").getResults())
	.getResults();

Note that the keys are collected by the extractor's getResults() method, but another getResults() call is needed on the scraper because we might have to iterate; in that case each page gets its own extraction and the scraper collects the combined results.

Iteration

In many cases the thing you want to scrape appears in the same form across multiple pages. To support that, we have the notion of an iterator: the scraper calls the iterator each time it is ready for a new page, and all the iterator has to do is construct the URL for the next page. Like this:

	Scraper scraper = new Scraper();
	Iterator pageIterator = new Iterator() {
		@Override
		public URL build(int i) {
			String nextPageUrl = MessageFormat.format("/testpages/ids-page-{0}.html", i + 2);
			log.debug("next page to iterate to: {}", nextPageUrl);
			return TestUtil.getFileAsURL(nextPageUrl);
		}
	};
	List<String> ids = scraper
			.url(testTableHtmlUrl)
			.pages(1)
			.iterator(pageIterator)
			.extract(scraper.extractor().table(3).links().parameter("oppId").getResults())
			.getResults();

	assertThat(ids.size(), is(86));

Notice that we constrain the iteration with the pages method. We probably also want to support open-ended iteration, where the scraper keeps requesting pages until it gets a 404 and then exits; this is necessary because we may not know how many pages there are, and pages may be added at some point. (Implementing this is not very difficult: inside the scraper, it sets up the extractor, gets the results, then checks whether there is an iterator; if there is, it calls the iterator in a loop, collecting all the results.)
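Below is a rough sketch of what that internal loop could look like. It is an assumption about the implementation, not the project's actual code; the fetchAndExtract(URL) helper and the pageIterator and pageLimit fields are hypothetical names.

	// Hypothetical internals (sketch only): extract from the first page, then keep
	// asking the iterator for the next page until the page limit is reached, or
	// (for open-ended iteration) until a page can no longer be fetched.
	private List<String> collectResults(URL firstPage) throws IOException {
		List<String> results = new ArrayList<String>();
		results.addAll(fetchAndExtract(firstPage));
		if (pageIterator != null) {
			int page = 0;
			while (pageLimit < 0 || page < pageLimit) {
				URL next = pageIterator.build(page++);
				try {
					results.addAll(fetchAndExtract(next)); // one extraction per page
				} catch (FileNotFoundException e) {
					break; // e.g. a 404 would end open-ended iteration here
				}
			}
		}
		return results;
	}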

Listing and Detail: Following Links to a Detail Page

Another common scenario is that you have a set of links you have to follow to a detail page, where the content you actually want to scrape lives. That is what this syntax is meant to support. Here is an example:

@Test
public void useIteratedListingAndDetailInterface() throws IOException {
	Scraper scraper = new Scraper();
	Iterator pageIterator = new Iterator() {
		@Override
		public URL build(int i) {
			String nextPageUrl = MessageFormat.format("/testpages/ids-page-{0}.html", i + 2);
			log.debug("next page to iterate to: {}", nextPageUrl);
			return TestUtil.getFileAsURL(nextPageUrl);
		}
	};
	Scraper detailScraper = new Scraper();
	List<Map<String, String>> records = scraper
			.url(testTableHtmlUrl)
			.pages(3)
			.iterator(pageIterator)
			.listing(scraper.extractor().table(3).links().getResults())
			.detail(detailScraper)
			.getRecords();

	assertThat(records.size(), is(greaterThan(0)));
	log.debug("fields = {}", records);

}

Notice that we need a separate scraper for extracting the details.

scraper's People

Contributors

aakture, andrey-chorniy, chorniyn, codeslubber, doaaanwar, finucane, garamsong, mulderbaba, rafikmh


scraper's Issues

New Lines in Fields Needs to be Handled Better

Right now DefaultFieldExtractor removes all new lines (including <BR>s) from field values, because it uses element.getTextExtractor().toString(), e.g. in getValueFieldText().
We should be able to replace BRs with semicolons so that we can parse contact fields better, i.e., keep the name, job title and address lines separate.
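A minimal sketch of how that could look, assuming the raw HTML of the field is available as a string; the fieldHtml variable and the regex-based replacement are illustrative assumptions, not what DefaultFieldExtractor currently does:

	// Illustrative only: swap <br> tags for "; " before extracting the text, so
	// that name, job title and address lines stay separable afterwards.
	String withSeparators = fieldHtml.replaceAll("(?i)<br\\s*/?>", "; ");
	String fieldText = new net.htmlparser.jericho.Source(withSeparators)
			.getTextExtractor()
			.toString();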

OutOfMemoryError in DefaultFieldExtractor

An OutOfMemoryError happened in the DefaultFieldExtractor.getFields() method:

	java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOf(Arrays.java:3332)
	at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
	at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
	at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:622)
	at java.lang.StringBuilder.append(StringBuilder.java:202)
	at java.lang.StringBuilder.append(StringBuilder.java:76)
	at net.htmlparser.jericho.Renderer$HR_ElementHandler.processBlockContent(Renderer.java:1193)
	at net.htmlparser.jericho.Renderer$AbstractBlockElementHandler.process(Renderer.java:1111)
	at net.htmlparser.jericho.Renderer$Processor.appendSegmentProcessingChildElements(Renderer.java:852)
	at net.htmlparser.jericho.Renderer$Processor.appendTo(Renderer.java:823)
	at net.htmlparser.jericho.Renderer.appendTo(Renderer.java:140)
	at net.htmlparser.jericho.CharStreamSourceUtil.toString(CharStreamSourceUtil.java:63)
	at net.htmlparser.jericho.Renderer.toString(Renderer.java:150)
	at com.ontometrics.scraper.extraction.DefaultFieldExtractor.extractFieldsFromUL(DefaultFieldExtractor.java:312)
	at com.ontometrics.scraper.extraction.DefaultFieldExtractor.extractFieldsFromULs(DefaultFieldExtractor.java:204)
	at com.ontometrics.scraper.extraction.DefaultFieldExtractor.getFields(DefaultFieldExtractor.java:65)

The page content on which the failure happens is referenced in the test DefaultFieldExtractorTest.canExtractFieldsWOOutOfMemoryError().
