
andeya / pholcus


Pholcus is a distributed, high-concurrency crawler written in pure Go.

License: Apache License 2.0


pholcus's Issues

centOS 7.1

go build github.com/henrylee2cn/pholcus: /usr/lib/golang/pkg/tool/linux_amd64/link: signal: killed

Does not compile on ARM

src/github.com/henrylee2cn/pholcus/app/downloader/surfer/agent/agent_linux.go:17: cannot use buf.Sysname (type [65]uint8) as type [65]int8 in argument to charsToString
src/github.com/henrylee2cn/pholcus/app/downloader/surfer/agent/agent_linux.go:27: cannot use buf.Release (type [65]uint8) as type [65]int8 in argument to charsToString

On ARM, buf.Release and buf.Sysname are returned as [65]uint8, which causes the errors above.
Would it be appropriate to change charsToString(ca [65]int8) to charsToString(ca [65]uint8)?
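
Changing the parameter to [65]uint8 would fix ARM but break the platforms where these fields are [65]int8. A minimal sketch of one alternative, assuming Go 1.18 or later is available: a generic charsToString that accepts either element type. The helper fill is hypothetical and exists only to build sample input; this is not the project's actual code.

```go
package main

import "fmt"

// charsToString converts a NUL-terminated fixed-size field (as found in
// syscall.Utsname) to a Go string. The type parameter lets the same function
// accept [65]int8 and [65]uint8.
func charsToString[T int8 | uint8](ca [65]T) string {
	buf := make([]byte, 0, len(ca))
	for _, c := range ca {
		if c == 0 {
			break // stop at the first NUL byte
		}
		buf = append(buf, byte(c))
	}
	return string(buf)
}

// fill is a hypothetical helper that builds a NUL-terminated sample array.
func fill[T int8 | uint8](s string) [65]T {
	var a [65]T
	for i := 0; i < len(s) && i < 64; i++ {
		a[i] = T(s[i])
	}
	return a
}

func main() {
	fmt.Println(charsToString(fill[int8]("Linux")))    // element type seen on amd64
	fmt.Println(charsToString(fill[uint8]("4.1.13+"))) // element type seen on arm
}
```

On Go versions before 1.18, the usual approach is two copies of the function in files selected by build constraints, one per architecture family.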

How to get AJAX data in the DOM

Part of the page's data is fetched and inserted dynamically with jQuery AJAX. How can this content be crawled?

compile error : cannot find package "github.com/henrylee2cn/pholcus_lib/jiban"

[root@centos pholcus]# pwd
/root/go/src/github.com/henrylee2cn/pholcus

[root@centos pholcus]# go build
../pholcus_lib/pholcus_lib.go:17:2: cannot find package "github.com/henrylee2cn/pholcus_lib/jiban" in any of:
/usr/lib/golang/src/github.com/henrylee2cn/pholcus_lib/jiban (from $GOROOT)
/root/go/src/github.com/henrylee2cn/pholcus_lib/jiban (from $GOPATH)
[root@centos pholcus]# go install
../pholcus_lib/pholcus_lib.go:17:2: cannot find package "github.com/henrylee2cn/pholcus_lib/jiban" in any of:
/usr/lib/golang/src/github.com/henrylee2cn/pholcus_lib/jiban (from $GOROOT)
/root/go/src/github.com/henrylee2cn/pholcus_lib/jiban (from $GOPATH)

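Can incremental crawling be supported?

Output always uses StartTime as the directory name, so the results of each crawl are saved to a different path.
Is an incremental mode supported, i.e., crawling only new content on top of the previous results and saving it to the same file?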

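Questions about using Pholcus

1. How does Pholcus crawl web pages, how is data extracted, and how is the extracted data stored in a database?
2. A Pholcus database configuration problem: I'm using MySQL. Following the provided demo and the documentation, I tried running the standalone version, e.g. a JD search for iPhone6s, but the result was:
image
Nothing at all.
3. Suppose I have a simple requirement that many people probably share: crawl a website, pick out the information I want, and store it in a database. How should this be done with Pholcus? Could you provide a simple demo?
4. If Pholcus is to be improved as a framework with more people participating, I think good documentation and clear demos matter even more.

Looking forward to your reply. From a developer who loves Go!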

Crash when running a web task

Windows 7 64-bit

The web page on port 9090 is reachable, but clicking RUN crashes with the following error:
2015/12/22 18:05:23 [pholcus] server Running on 0.0.0.0:9090
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x0 pc=0x49dd3c]

goroutine 379 [running]:
sync/atomic.AddUint64(0x1313f634, 0x1, 0x0, 0x33064b18, 0x6f89fb)
    c:/go/src/sync/atomic/asm_386.s:112 +0xc
github.com/henrylee2cn/pholcus/app/scheduler.(*Matrix).Push(0x1313f620, 0x1355e5b0)
    D:/go/workspace/src/github.com/henrylee2cn/pholcus/app/scheduler/scheduler.go:182 +0x10a
github.com/henrylee2cn/pholcus/app/spider.(*Spider).ReqmatrixPush(0x134dd240, 0x1355e5b0)
    D:/go/workspace/src/github.com/henrylee2cn/pholcus/app/spider/spider.go:227 +0x2c
github.com/henrylee2cn/pholcus/app/spider.(*Context).AddQueue(0x13542ae0, 0x1355e5b0, 0x24)
    D:/go/workspace/src/github.com/henrylee2cn/pholcus/app/spider/context.go:78 +0x1e5
github.com/pholcus/spider_lib.glob.func38(0x13542ae0, 0x13542b00, 0x0, 0x0)
    D:/go/workspace/src/github.com/pholcus/spider_lib/jdsearch.go:56 +0x26f
github.com/henrylee2cn/pholcus/app/spider.(*Context).Aid(0x13542ae0, 0x13542b00, 0x13492f5c, 0x1, 0x1, 0x0, 0x0)
    D:/go/workspace/src/github.com/henrylee2cn/pholcus/app/spider/context.go:197 +0x168
github.com/pholcus/spider_lib.glob.func37(0x13542ae0)
    D:/go/workspace/src/github.com/pholcus/spider_lib/jdsearch.go:43 +0x186
github.com/henrylee2cn/pholcus/app/spider.(*Spider).Start(0x134dd240)
    D:/go/workspace/src/github.com/henrylee2cn/pholcus/app/spider/spider.go:170 +0x8f
github.com/henrylee2cn/pholcus/app/crawl.(*crawler).Start(0x134dd4c0)
    D:/go/workspace/src/github.com/henrylee2cn/pholcus/app/crawl/crawl.go:61 +0x43
github.com/henrylee2cn/pholcus/app.(*Logic).goRun.func1(0x133c5340, 0x0, 0x33064af8, 0x134dd4c0)
    D:/go/workspace/src/github.com/henrylee2cn/pholcus/app/app.go:563 +0x74
created by github.com/henrylee2cn/pholcus/app.(*Logic).goRun
    D:/go/workspace/src/github.com/henrylee2cn/pholcus/app/app.go:566 +0xe3

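About the phantom downloader

How is this downloader configured? The official rules that use phantom to download files don't seem to run properly.

Dynamic rule parsing error: is embedding JS in XML a problem? Suggest using plain JS files instead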

package main

import (
    "encoding/xml"
    "fmt"
)

func main() {
    type Spider struct {
        Script string `xml:"Script"`
    }
    result := Spider{Script: "none"}
    // The unescaped "<" inside <Script> makes this document malformed XML.
    data := `
        <Spider>
            <Script>
            1 < 2
            </Script>
        </Spider>
    `
    err := xml.Unmarshal([]byte(data), &result)
    if err != nil {
        fmt.Printf("error: %v\n", err)
        return
    }
    fmt.Printf("Script: %v\n", result.Script)
}

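If the JS code inside the Script element contains a "<" character, xml.Unmarshal fails to parse it. The ">" character works fine; other characters were not tested.

Personally, I find embedding JS in XML not very friendly, and suggest using plain JS files instead.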
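
For rules that stay in XML, one common workaround is to wrap the script in a CDATA section, which Go's encoding/xml decoder reads as plain character data. The sketch below reuses the Spider struct from the report; parseScript and the sample script text are illustrative assumptions, not part of Pholcus.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// Spider mirrors the struct from the report above.
type Spider struct {
	Script string `xml:"Script"`
}

// parseScript is a hypothetical helper: it unmarshals the document and
// returns the trimmed contents of the Script element.
func parseScript(data string) (string, error) {
	var s Spider
	if err := xml.Unmarshal([]byte(data), &s); err != nil {
		return "", err
	}
	return strings.TrimSpace(s.Script), nil
}

func main() {
	// A bare "<" is a syntax error in XML character data.
	raw := `<Spider><Script>1 < 2</Script></Spider>`
	// Inside CDATA the same text is taken literally.
	cdata := `<Spider><Script><![CDATA[ if (1 < 2) { run(); } ]]></Script></Spider>`

	_, err := parseScript(raw)
	fmt.Println("raw parse failed:", err != nil)

	script, err := parseScript(cdata)
	fmt.Println(script, err)
}
```

Escaping the character as `&lt;` also works, but CDATA keeps the embedded script readable. The only sequence that cannot appear inside a CDATA section is `]]>`.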

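How to define custom rules in the web UI

As the title says: I'd like to define crawler rules at the web level. Where should I start?
For example, fill in a rule through a form, click a button, push the message over a socket, and then crawl. That feels more flexible, since no code has to be written each time.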

panic: runtime error: invalid memory address or nil pointer dereference

Environment: Windows 10, Go 1.6, no C compiler
IDE: LiteIDE
It runs fine after compiling, but crashes as soon as Run is clicked.

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x0 pc=0x6121fc]

goroutine 67 [running]:
panic(0xb6b700, 0x11dae030)
    H:/CoderTools/go1.6.windows-386/go/src/runtime/panic.go:464 +0x326
sync/atomic.AddUint64(0x11e64134, 0x1, 0x0, 0x35552068, 0x85109b)
    H:/CoderTools/go1.6.windows-386/go/src/sync/atomic/asm_386.s:112 +0xc
github.com/henrylee2cn/pholcus/app/scheduler.(*Matrix).Push(0x11e64120, 0x11efc000)
    H:/CoderTools/go1.6.windows-386/src/src/github.com/henrylee2cn/pholcus/app/scheduler/scheduler.go:180 +0x10a
github.com/henrylee2cn/pholcus/app/spider.(*Spider).ReqmatrixPush(0x11ec34a0, 0x11efc000)
    H:/CoderTools/go1.6.windows-386/src/src/github.com/henrylee2cn/pholcus/app/spider/spider.go:269 +0x2c
github.com/henrylee2cn/pholcus/app/spider.(*Context).AddQueue(0x11e622a0, 0x11efc000, 0x1d)
    H:/CoderTools/go1.6.windows-386/src/src/github.com/henrylee2cn/pholcus/app/spider/context.go:79 +0x32c
github.com/pholcus/spider_lib.glob.func66(0x11e622a0, 0x11e622c0, 0x0, 0x0)
    H:/CoderTools/go1.6.windows-386/src/src/github.com/pholcus/spider_lib/taobaosearch.go:54 +0x242
github.com/henrylee2cn/pholcus/app/spider.(*Context).Aid(0x11e622a0, 0x11e622c0, 0x12165f48, 0x1, 0x1, 0x0, 0x0)
    H:/CoderTools/go1.6.windows-386/src/src/github.com/henrylee2cn/pholcus/app/spider/context.go:198 +0x168
github.com/pholcus/spider_lib.glob.func65(0x11e622a0)
    H:/CoderTools/go1.6.windows-386/src/src/github.com/pholcus/spider_lib/taobaosearch.go:43 +0x186
github.com/henrylee2cn/pholcus/app/spider.(*Spider).Start(0x11ec34a0)
    H:/CoderTools/go1.6.windows-386/src/src/github.com/henrylee2cn/pholcus/app/spider/spider.go:197 +0x97
github.com/henrylee2cn/pholcus/app/crawl.(*crawler).Start(0x121bc280)
    H:/CoderTools/go1.6.windows-386/src/src/github.com/henrylee2cn/pholcus/app/crawl/crawl.go:59 +0x43
github.com/henrylee2cn/pholcus/app.(*Logic).goRun.func1(0x12151ea0, 0x0, 0x35552048, 0x121bc280)
    H:/CoderTools/go1.6.windows-386/src/src/github.com/henrylee2cn/pholcus/app/app.go:596 +0x74
created by github.com/henrylee2cn/pholcus/app.(*Logic).goRun
    H:/CoderTools/go1.6.windows-386/src/src/github.com/henrylee2cn/pholcus/app/app.go:599 +0xfe
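
Both this trace and the earlier Windows 7 one end in sync/atomic.AddUint64 on a 32-bit (386) toolchain, and the address passed in (0x11e64134 here) is not a multiple of 8. The sync/atomic documentation states that on 386 and ARM the caller must arrange 64-bit alignment for words accessed atomically, and that the first word of an allocated struct can be relied upon to be aligned. A plausible cause, not confirmed in this thread, is therefore a uint64 counter that sits in the middle of the Matrix struct. The sketch below shows the usual layout fix; the matrix type and its fields are hypothetical stand-ins, not the real scheduler code.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"unsafe"
)

// matrix is a hypothetical stand-in for a scheduler struct. The 64-bit
// counter is declared first so that it is 8-byte aligned even on 32-bit
// platforms, where only the first word of an allocated struct is guaranteed
// to be 64-bit aligned.
type matrix struct {
	resCount uint64 // accessed atomically: keep as the first field
	maxPage  int32
	name     string
}

// push increments the counter atomically and returns the new value.
func (m *matrix) push() uint64 {
	return atomic.AddUint64(&m.resCount, 1)
}

// counterOffset reports the byte offset of resCount within matrix.
func counterOffset() uintptr {
	return unsafe.Offsetof(matrix{}.resCount)
}

func main() {
	m := new(matrix)
	m.push()
	fmt.Println("count:", m.push(), "offset:", counterOffset())
}
```

Building for a 64-bit target avoids the alignment restriction altogether, and on Go 1.19 or later the atomic.Uint64 type is aligned automatically regardless of field order.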

How to find child nodes whose class value contains whitespace?

Since a space (' ') in a selector is the descendant combinator, how can I select the node in the code below?

<div class="test abc">
...
</div>

The node's class value contains whitespace; I wonder whether some escaping is needed.
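
In HTML, a class attribute is a whitespace-separated list, so the element above carries two classes, test and abc, rather than one class containing a space. In CSS selector syntax such an element is matched by chaining class selectors with no space between them (div.test.abc) or by an exact attribute match ([class="test abc"]). The sketch below only builds the chained selector string; classSelector is a hypothetical helper, and it assumes the DOM query library in use follows standard CSS selector syntax.

```go
package main

import (
	"fmt"
	"strings"
)

// classSelector builds a CSS selector that matches elements of the given tag
// carrying every class listed in classAttr (the raw class attribute value).
func classSelector(tag, classAttr string) string {
	var b strings.Builder
	b.WriteString(tag)
	for _, c := range strings.Fields(classAttr) {
		b.WriteString(".")
		b.WriteString(c)
	}
	return b.String()
}

func main() {
	fmt.Println(classSelector("div", "test abc")) // prints "div.test.abc"
}
```

The chained form also matches elements whose classes appear in a different order or alongside extra classes, which is usually what is wanted; the attribute form matches only the exact string.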

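Nice project. Is there any documentation for spider rules?

Nice project, and the framework is solidly written.
But some basic documentation seems to be missing; perhaps the experts consider it beneath them.
Requesting basic spider documentation:
GetSpiderLib
How to get the default spiders
I couldn't find these default templates; they would be useful as a reference.

After reading some issues: as for developing rules, there presumably isn't a debugging framework yet?

Could a method like this be added to make debugging easier?

The framework is very pleasant to use, but when writing crawler rules the service has to be restarted for every debugging round.
I hope a call like the following can be provided to make debugging easier.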

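package main

import "github.com/henrylee2cn/pholcus"

func main() {
	// Proposed API: Test is the method being requested in this issue.
	PholcusSpider.Test(&request.Request{
		// Request object
		Url: "http://www.baidu.com",
		// other parameters...
	}, func(ctx *Context) {
		ctx.GetDom()
		// .......

		// Return a ctx object for the given request, to make testing easy.
		// Otherwise the server must be restarted after every change, which makes debugging tedious.
		// Once debugging is done, the code can be copied straight into the program.
	})
}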

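Feature suggestion: add a time-based trigger to batch output (database insertion)

A crawler may target sites that update infrequently, e.g. 5 new articles per day, while output is configured to write to the database once every 10 records.
The data then takes a long time to reach the database and cannot be processed downstream, whereas writing every 1 to 5 records puts considerable load on the database.

Suggestion: allow both a count limit and a time limit for batch output, e.g. flush whenever 10 records have accumulated or every five minutes.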

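How to crawl a site with many list pages?

The problem I'm currently facing:

  • The target site has a single list with over a hundred thousand pages.

  • Problems:

  • The crawled list is not saved to the database, so if the run is interrupted all data is lost; crawling page by page would mean writing out 100,000+ list page URLs, which is not practical either.

  • Content crawling does not start until the list has been fully crawled.

  • Approaches I'd like to use:

  1. One thread crawls the list and another crawls the content pages, kept separate. But there is no example of distributing tasks this way, i.e., how to push tasks to the scheduler thread (as a beginner, I'd really appreciate a demo, thanks).
  2. One thread crawls the list and records it in the database; another thread reads from the database, crawls the content, and marks the crawl status. I'm not sure whether this is feasible.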
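
The first approach above, list crawling and content crawling running side by side, maps onto a plain producer/consumer pipeline in Go. The sketch below is generic and does not use Pholcus's scheduler; crawlPipeline and the two callbacks are hypothetical and stand in for real HTTP requests.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// crawlPipeline walks list pages in one goroutine and hands every discovered
// URL to a pool of content workers, so content crawling starts before the
// list has been fully traversed.
func crawlPipeline(pages, workers int, listPage func(page int) []string, fetchContent func(url string) string) []string {
	urls := make(chan string, 100)
	results := make(chan string, 100)

	// Producer: generate list pages by number instead of spelling out URLs.
	go func() {
		defer close(urls)
		for p := 1; p <= pages; p++ {
			for _, u := range listPage(p) {
				urls <- u
			}
		}
	}()

	// Consumers: fetch content pages as soon as URLs arrive.
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range urls {
				results <- fetchContent(u)
			}
		}()
	}
	go func() {
		wg.Wait()
		close(results)
	}()

	var out []string
	for r := range results {
		out = append(out, r)
	}
	sort.Strings(out) // worker order is nondeterministic
	return out
}

func main() {
	list := func(p int) []string {
		return []string{fmt.Sprintf("/item/%d-a", p), fmt.Sprintf("/item/%d-b", p)}
	}
	fetch := func(u string) string { return "content of " + u }
	fmt.Println(len(crawlPipeline(3, 2, list, fetch)), "pages crawled")
}
```

To survive interruptions, as in the second approach, the producer would additionally persist each list page number once it has been processed, so a restart can resume from the last recorded page.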

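Reduce memory usage

I wrote a simple rule to crawl hotel information, with the hierarchy country -> city -> hotel.
Data volume: 200+ countries, 80,000+ cities, 700,000+ hotels.
So far it feels very inefficient; I'm using standalone + web mode.
After a full day it had crawled only 200+ hotels, and I only take each hotel's name and description.
The system is CentOS; top shows pholcus using almost all of the memory (4 GB), even though the data volume is not large. With memory exhausted, throughput is close to zero...
What I'd like to know: is memory not released during crawling? Does this count as a bug?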

Kafka Error

[E] kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.
2017/08/07 14:25:38 [E] circuit breaker is open
2017/08/07 14:25:38 [E] circuit breaker is open

Hi there, I got this error when using the Kafka output.

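Bug report: goroutines hang once the log reaches a certain number of entries

After crawling images for a long time, memory grew sharply and a large number of goroutines remained alive. I added some logging to trace it and found that, after writing a file, each goroutine blocked at the logging call, so the memory held by the images could not be released.
1. The code, with some added logging:
[image]

2. The log output:
[image]

3. The stack trace:
[image]

4. Root cause: in standalone mode, logs are still pushed to socketlog, which fills up the log channel. A check for client mode should be added here.
[image]

The bl.steal channel is the one a client uses to send logs to the server in client mode. In standalone mode nothing reads from this channel; it is only read in client mode.
[image]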
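
The fix proposed in this report, forwarding logs to the socket channel only in client mode, can be sketched as follows. The type, the mode constants, and the method name are hypothetical; only the channel name steal comes from the report. The non-blocking send is an extra safeguard added in this sketch so that a full channel can never stall a crawler goroutine.

```go
package main

import "fmt"

// Hypothetical run modes; the real project defines its own constants.
const (
	modeStandalone = iota
	modeClient
)

// socketLog is a hypothetical stand-in for the logger that forwards
// entries to the server.
type socketLog struct {
	mode  int
	steal chan string
}

// push forwards msg only in client mode and reports whether it was queued.
func (l *socketLog) push(msg string) bool {
	if l.mode != modeClient {
		return false // nobody reads steal outside client mode
	}
	select {
	case l.steal <- msg:
		return true
	default:
		return false // channel full: drop the entry rather than block
	}
}

func main() {
	standalone := &socketLog{mode: modeStandalone, steal: make(chan string, 1)}
	for i := 0; i < 1000; i++ {
		standalone.push("log line") // never blocks, never queues
	}
	fmt.Println("queued in standalone mode:", len(standalone.steal))

	client := &socketLog{mode: modeClient, steal: make(chan string, 1)}
	fmt.Println("queued in client mode:", client.push("log line"))
}
```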

English documentation please

Hi

I can understand most of what the library does, but I can't explore it in depth because I don't read the language. Is there a plan for an English version of the docs for this wonderful library?

Error when running

./pholcus.go:44: undefined: config.MYSQL_OUTPUT in config.MYSQL_OUTPUT.HOST
./pholcus.go:44: cannot assign to config.MYSQL_OUTPUT.HOST
./pholcus.go:46: undefined: config.MYSQL_OUTPUT in config.MYSQL_OUTPUT.DB
./pholcus.go:46: cannot assign to config.MYSQL_OUTPUT.DB
./pholcus.go:48: undefined: config.MYSQL_OUTPUT in config.MYSQL_OUTPUT.USER
./pholcus.go:48: cannot assign to config.MYSQL_OUTPUT.USER
./pholcus.go:50: undefined: config.MYSQL_OUTPUT in config.MYSQL_OUTPUT.PASSWORD
./pholcus.go:50: cannot assign to config.MYSQL_OUTPUT.PASSWORD
./pholcus.go:52: undefined: config.MYSQL_OUTPUT in config.MYSQL_OUTPUT.MAX_CONNS
./pholcus.go:52: cannot assign to config.MYSQL_OUTPUT.MAX_CONNS
./pholcus.go:52: too many errors

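I commented out everything related to MGO_OUTPUT,
filled in the password,
and left everything else untouched.
The pholcus database has already been created.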

go1.5.1

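Reloadable: the check for non-repeatable downloads is insufficient

When deciding whether Reloadable allows a repeated download, the following check is made: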

func (self *Matrix) Push(req *request.Request) {
	...
	// requests that must not be downloaded again
	if !req.IsReloadable() {
		// return if a success record already exists
		if self.hasHistory(req.Unique()) {
			return
		}
		// add to the temporary history
		self.insertTempHistory(req.Unique())
	}
	...
}

In practice this relies on func (self *Request) Unique() string to decide whether two requests are the same:

// unique identifier of the request
func (self *Request) Unique() string {
	if self.unique == "" {
		block := md5.Sum([]byte(self.Spider + self.Rule + self.Url + self.Method))
		self.unique = hex.EncodeToString(block[:])
	}
	return self.unique
}

If a POST request carries PostData, it is not possible to tell correctly whether two requests are the same:

POST /somewhere

page=1&keyword=XXX

Expected result:

// unique identifier of the request
func (self *Request) Unique() string {
	if self.unique == "" {
		block := md5.Sum([]byte(self.Spider + self.Rule + self.Url + self.Method + self.PostData))
		self.unique = hex.EncodeToString(block[:])
	}
	return self.unique
}

Changing this logic would significantly affect data that has already been stored.
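
The effect of the proposed change can be checked in isolation. The sketch below uses a simplified, hypothetical Request struct with only the fields named in the snippets above and omits the caching of the computed value; it is not the project's real type.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// Request is a simplified stand-in holding only the fields used for the key.
type Request struct {
	Spider, Rule, Url, Method, PostData string
}

// Unique returns the request fingerprint, including PostData as proposed.
func (r *Request) Unique() string {
	sum := md5.Sum([]byte(r.Spider + r.Rule + r.Url + r.Method + r.PostData))
	return hex.EncodeToString(sum[:])
}

func main() {
	page1 := &Request{Spider: "s", Rule: "list", Url: "/somewhere", Method: "POST", PostData: "page=1&keyword=XXX"}
	page2 := &Request{Spider: "s", Rule: "list", Url: "/somewhere", Method: "POST", PostData: "page=2&keyword=XXX"}

	// With PostData in the hash, the two pages are distinct requests.
	fmt.Println("distinct:", page1.Unique() != page2.Unique())
}
```

Because every existing fingerprint was computed without PostData, one way to limit the impact on stored history would be to leave the key unchanged when PostData is empty, which the plain concatenation above already does.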

Compile error on ARM processors

Running go run example_main.go:

app/downloader/surfer/agent/agent_linux.go:17: cannot use buf.Sysname (type [65]uint8) as type [65]int8 in argument to charsToString
app/downloader/surfer/agent/agent_linux.go:27: cannot use buf.Release (type [65]uint8) as type [65]int8 in argument to charsToString

Could you help look into why this happens?
The environment is a Raspberry Pi with an ARM processor.

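Proxy filtering does not work on macOS

Hello. On macOS 10.12.3 (16D32), I created proxy.lib as configured in config.ini and filled in several proxies in the form http://ip:port. After starting web mode, the console always shows 0 available proxies. I have confirmed that the proxies work. How can I resolve this? Thank you for taking the time to help.

How to import the rule library

I downloaded and ran the web version and wanted to try the public rule library directly, but after opening standalone mode there is not a single rule!
I looked at config.go, and it doesn't seem to be configured there. Or does it need to be imported into Mongo first?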

ARM system compatibility

$ uname -a
Linux raspberrypi 4.1.13+ #826 PREEMPT Fri Nov 13 20:13:22 GMT 2015 armv6l GNU/Linux
# A development board called Raspberry Pi, running an ARM system customized from Debian jessie.
# http://mirrordirector.raspbian.org/raspbian/
$ go version
go version go1.7.4 linux/arm
# The install package is go1.7.4.linux-armv6l.tar.gz http://www.golangtc.com/download

Error when building pholcus:

$ go build
# github.com/henrylee2cn/pholcus/app/downloader/surfer/agent

app/downloader/surfer/agent/agent_linux.go:17: cannot use buf.Sysname (type [65]uint8) as type [65]int8 in argument to charsToString
app/downloader/surfer/agent/agent_linux.go:27: cannot use buf.Release (type [65]uint8) as type [65]int8 in argument to charsToString

So I changed the parameter type of charsToString in agent_linux.go from int8 to uint8, after which the build succeeded.
