grabber's Introduction

grabber

Grabber is a concurrent declarative web scraper and downloader.

Features:

Simple tree-like JSON configuration
XPath and Regexp extractors
Parallel parsing and extraction
Parallel download
Ability to bail out early (e.g. for updating)
Fails fast on config errors, tolerates web errors
Follow, every, and single extraction modes
Multiple XPaths or Regexps per stage
Multi-grouped regexps with a separator (e.g. extract to CSV)
It's rather fast

Run grabber -h to see command-line options.

See the examples/ directory and consult the code to learn the config file format.
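
As a rough guide to the shape of a config, the sketch below decodes the sample that appears in the issue report further down this page. The types Target, Step and Action and the function parseTargets are hypothetical: they mirror only the keys visible in that sample (name, url, bail, path, do, command, action, mode, type, args) and are not grabber's real types. Whether stages can nest deeper is not modelled here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Hypothetical types covering only the keys seen in the sample config.
type Action struct {
	Mode string   `json:"mode"`
	Type string   `json:"type"`
	Args []string `json:"args"`
}

type Step struct {
	Command string `json:"command"`
	Action  Action `json:"action"`
}

type Target struct {
	Name string `json:"name"`
	URL  string `json:"url"`
	Bail int    `json:"bail"`
	Path string `json:"path"`
	Do   Step   `json:"do"`
}

// parseTargets decodes a list of targets and rejects unknown keys, so that
// a misspelled key is reported at load time instead of being ignored.
func parseTargets(data string) ([]Target, error) {
	dec := json.NewDecoder(strings.NewReader(data))
	dec.DisallowUnknownFields()
	var targets []Target
	if err := dec.Decode(&targets); err != nil {
		return nil, err
	}
	return targets, nil
}

// sample is the config quoted in the issue report on this page.
const sample = `[{
  "name": "Menzel NOUR",
  "url": "http://www.booking.com/hotel/tn/menzel-nour.html",
  "bail": 10,
  "path": "./",
  "do": {
    "command": "print",
    "action": {
      "mode": "every", "type": "xpath",
      "args": ["//*[@id='hp_hotel_name']"]
    }
  }
}]`

func main() {
	targets, err := parseTargets(sample)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	for _, t := range targets {
		fmt.Println(t.Name, t.Do.Command, t.Do.Action.Type, t.Do.Action.Args)
	}
}
```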

Note that for tumblr.json you'll need to replace every occurrence of {{name}} with the account (subdomain) name, and every occurrence of {{paging}} with the text (as matched by XPath's text()) that your target blog uses for its 'next page' link, or a semantic equivalent. The format is already template-friendly, so you can easily write a script that generates per-blog configs.
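
A script for filling in the placeholders could look roughly like the sketch below. The function fillTemplate and the template fragment in main are made up for illustration; only the {{name}} and {{paging}} placeholders come from the text above, and the real tumblr.json in examples/ will look different.

```go
package main

import (
	"fmt"
	"strings"
)

// fillTemplate substitutes the {{name}} and {{paging}} placeholders.
// The function name is hypothetical; it is not part of grabber.
func fillTemplate(tmpl, name, paging string) string {
	r := strings.NewReplacer("{{name}}", name, "{{paging}}", paging)
	return r.Replace(tmpl)
}

func main() {
	// Made-up fragment standing in for the contents of tumblr.json.
	tmpl := `{"name": "{{name}}", "url": "http://{{name}}.tumblr.com/", "next": "{{paging}}"}`
	fmt.Println(fillTemplate(tmpl, "example", "Next"))
}
```

The substitution is purely textual, so a paging label that contains a double quote or a backslash would have to be JSON-escaped before it is inserted.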

The examples provided are certainly not exhaustive.

Advice:

Build your config iteratively: use the log command to confirm that the current level works as it should before going further.

When downloading:

For the first run, set bail to 0 and use the options -quiet -stdout; you may also wish to pipe the output of the run to tee log. Then inspect the output or log file for any errors. If it looks OK, set bail to something reasonable, e.g. 20 if you have 10 assets per page.

Todo / Bugs

Needs testing 'in the wild'
Better documentation
Ability to use Content-Disposition
Full config parsing and error checking during load
Test suite

Copyright

Absolutely no warranty. See LICENSE.txt for details.

grabber's Issues

panic error on jobs:97

[{  
  "name": "Menzel NOUR",
  "url": "http://www.booking.com/hotel/tn/menzel-nour.html",
  "bail": 10, 
  "path": "./",
  "do": {
    "command": "print",
    "action": {
      "mode": "every", "type": "xpath",
      "args": ["//*[@id='hp_hotel_name']"]
    }   
  }         
}]

2016/11/30 22:13:02 Target: Menzel NOUR
panic: runtime error: cgo argument has Go pointer to Go pointer

goroutine 11 [running]:
panic(0x55b6b368b5c0, 0xc420112170)
	/usr/lib/go-1.7/src/runtime/panic.go:500 +0x1a1
github.com/moovweb/gokogiri/xml._cgoCheckPointer0(0xc4201142a0, 0x0, 0x0, 0x0, 0x2004dc58)
	??:0 +0x59
github.com/moovweb/gokogiri/xml.(*XmlNode).serialize(0xc42010c260, 0x41, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc42010c240, 0x55b6b368be20, ...)
	/home//Projects/go/src/github.com/moovweb/gokogiri/xml/node.go:773 +0x10c
github.com/moovweb/gokogiri/xml.(*XmlNode).ToHtml(0xc42010c260, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc42010c240, 0xc42004dd58, 0x55b6b3429705, ...)
	/home//Projects/go/src/github.com/moovweb/gokogiri/xml/node.go:826 +0x83
github.com/moovweb/gokogiri/xml.(*XmlNode).ToBuffer(0xc42010c260, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/home//Projects/go/src/github.com/moovweb/gokogiri/xml/node.go:833 +0x91
github.com/moovweb/gokogiri/xml.(*XmlNode).String(0xc42010c260, 0x0, 0x0)
	/home//Projects/go/src/github.com/moovweb/gokogiri/xml/node.go:841 +0x4a
main.(*Job).doXPath(0xc4200bea20, 0x55b6b3a57a50, 0x0, 0x0, 0x0, 0x0)
	/home//Projects/go/src/github.com/drbig/grabber/jobs.go:97 +0x2fe
main.parser()
	/home//Projects/go/src/github.com/drbig/grabber/workers.go:45 +0xb38
created by main.main
	/home//Projects/go/src/github.com/drbig/grabber/main.go:89 +0x586
