andeya / pholcus
Pholcus is a distributed high-concurrency crawler software written in pure golang
License: Apache License 2.0
How do I switch cookies from surf to phantom? Is the switch automatic or manual?
go build github.com/henrylee2cn/pholcus: /usr/lib/golang/pkg/tool/linux_amd64/link: signal: killed
src/github.com/henrylee2cn/pholcus/app/downloader/surfer/agent/agent_linux.go:17: cannot use buf.Sysname (type [65]uint8) as type [65]int8 in argument to charsToString
src/github.com/henrylee2cn/pholcus/app/downloader/surfer/agent/agent_linux.go:27: cannot use buf.Release (type [65]uint8) as type [65]int8 in argument to charsToString
In Go, buf.Release and buf.Sysname are returned as uint8 arrays on ARM, which causes this error.
Would it be appropriate to change charsToString(ca [65]int8) to charsToString(ca [65]uint8)?
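A minimal sketch of one way to accept both element types, assuming Go 1.18 or later for type parameters (the reports here predate generics, where a per-architecture file or a type change would be needed instead). The body of charsToString below is an illustration, not the project's actual implementation.

```go
package main

import "fmt"

// charsToString converts a NUL-terminated utsname-style field to a string.
// The type parameter accepts both element types seen in the errors above:
// int8 on some architectures and uint8 on others (such as ARM).
func charsToString[T int8 | uint8](ca [65]T) string {
	buf := make([]byte, 0, len(ca))
	for _, c := range ca {
		if c == 0 {
			break // stop at the terminating NUL
		}
		buf = append(buf, byte(c))
	}
	return string(buf)
}

func main() {
	var signed [65]int8
	var unsigned [65]uint8
	for i, c := range []byte("Linux") {
		signed[i] = int8(c)
		unsigned[i] = c
	}
	fmt.Println(charsToString(signed), charsToString(unsigned))
}
```

With this shape the same call site compiles whether the field is [65]int8 or [65]uint8.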
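Part of the page's data is fetched and added dynamically through jQuery AJAX. How can that content be crawled?
May I ask, did you study networking?
In the mysql.go file,
in the create function,
add DEFAULT CHARSET=utf8.
If it is not specified, some databases have no default charset for the table structure, and inserted Chinese text becomes garbled.
A small issue. Please fix it.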
[root@centos pholcus]# pwd
/root/go/src/github.com/henrylee2cn/pholcus
[root@centos pholcus]# go build
../pholcus_lib/pholcus_lib.go:17:2: cannot find package "github.com/henrylee2cn/pholcus_lib/jiban" in any of:
/usr/lib/golang/src/github.com/henrylee2cn/pholcus_lib/jiban (from $GOROOT)
/root/go/src/github.com/henrylee2cn/pholcus_lib/jiban (from $GOPATH)
[root@centos pholcus]# go install
../pholcus_lib/pholcus_lib.go:17:2: cannot find package "github.com/henrylee2cn/pholcus_lib/jiban" in any of:
/usr/lib/golang/src/github.com/henrylee2cn/pholcus_lib/jiban (from $GOROOT)
/root/go/src/github.com/henrylee2cn/pholcus_lib/jiban (from $GOPATH)
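Output always uses StartTime as the directory name, so the results of every crawl are saved under a different path.
Is an incremental mode supported, that is, crawling only new content on top of the previous results and saving it into the same file?
Many thanks. It would be great if it could spoof IPs.
I see that the project supports phantomjs, but I don't know how to use it.
How do I store the crawled data into a specific table in a MySQL database?
Windows7 64bit
The web page on port 9090 is reachable, but the program crashes when I click RUN. The error is as follows: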
2015/12/22 18:05:23 [pholcus] server Running on 0.0.0.0:9090
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x0 pc=0x49dd3c]
goroutine 379 [running]:
sync/atomic.AddUint64(0x1313f634, 0x1, 0x0, 0x33064b18, 0x6f89fb)
c:/go/src/sync/atomic/asm_386.s:112 +0xc
github.com/henrylee2cn/pholcus/app/scheduler.(*Matrix).Push(0x1313f620, 0x1355e5b0)
D:/go/workspace/src/github.com/henrylee2cn/pholcus/app/scheduler/scheduler.go:182 +0x10a
github.com/henrylee2cn/pholcus/app/spider.(*Spider).ReqmatrixPush(0x134dd240, 0x1355e5b0)
D:/go/workspace/src/github.com/henrylee2cn/pholcus/app/spider/spider.go:227 +0x2c
github.com/henrylee2cn/pholcus/app/spider.(*Context).AddQueue(0x13542ae0, 0x1355e5b0, 0x24)
D:/go/workspace/src/github.com/henrylee2cn/pholcus/app/spider/context.go:78 +0x1e5
github.com/pholcus/spider_lib.glob.func38(0x13542ae0, 0x13542b00, 0x0, 0x0)
D:/go/workspace/src/github.com/pholcus/spider_lib/jdsearch.go:56 +0x26f
github.com/henrylee2cn/pholcus/app/spider.(*Context).Aid(0x13542ae0, 0x13542b00, 0x13492f5c, 0x1, 0x1, 0x0, 0x0)
D:/go/workspace/src/github.com/henrylee2cn/pholcus/app/spider/context.go:197 +0x168
github.com/pholcus/spider_lib.glob.func37(0x13542ae0)
D:/go/workspace/src/github.com/pholcus/spider_lib/jdsearch.go:43 +0x186
github.com/henrylee2cn/pholcus/app/spider.(*Spider).Start(0x134dd240)
D:/go/workspace/src/github.com/henrylee2cn/pholcus/app/spider/spider.go:170 +0x8f
github.com/henrylee2cn/pholcus/app/crawl.(*crawler).Start(0x134dd4c0)
D:/go/workspace/src/github.com/henrylee2cn/pholcus/app/crawl/crawl.go:61 +0x43
github.com/henrylee2cn/pholcus/app.(*Logic).goRun.func1(0x133c5340, 0x0, 0x33064af8, 0x134dd4c0)
D:/go/workspace/src/github.com/henrylee2cn/pholcus/app/app.go:563 +0x74
created by github.com/henrylee2cn/pholcus/app.(*Logic).goRun
D:/go/workspace/src/github.com/henrylee2cn/pholcus/app/app.go:566 +0xe3
Welcome to discuss the development of pholcus v2.0 ...
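How is this downloader configured? The official rules that use phantom to download files don't seem to run properly.
How do I set up scheduled crawling?
On Windows, thanks.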
package main

import (
	"encoding/xml"
	"fmt"
)

func main() {
	type Spider struct {
		Script string `xml:"Script"`
	}
	result := Spider{Script: "none"}
	// The bare "<" inside <Script> is what triggers the parse error.
	data := `
<Spider>
<Script>
1 < 2
</Script>
</Spider>
`
	err := xml.Unmarshal([]byte(data), &result)
	if err != nil {
		fmt.Printf("error: %v\n", err)
		return
	}
	fmt.Printf("Script: %v", result.Script)
}
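If the JS code inside the Script element contains a "<" character, xml.Unmarshal fails to parse it. The ">" character works fine; other characters are untested.
Personally, I think embedding JS in XML is not very friendly; I suggest using plain JS files directly.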
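As a workaround within XML itself, the script body can be wrapped in a CDATA section or the character escaped as &lt;. The sketch below reuses the Spider struct from the report; parseSpider is a helper name introduced here for illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

type Spider struct {
	Script string `xml:"Script"`
}

// parseSpider unmarshals a rule document into a Spider.
func parseSpider(data string) (Spider, error) {
	var s Spider
	err := xml.Unmarshal([]byte(data), &s)
	return s, err
}

func main() {
	// CDATA keeps "<" from being read as the start of a tag.
	data := `<Spider><Script><![CDATA[1 < 2]]></Script></Spider>`
	s, err := parseSpider(data)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("Script: %q\n", s.Script)
}
```

This keeps rules in a single XML file, at the cost of requiring rule authors to remember the CDATA wrapper.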
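As the title says. I would like to define crawler rules at the web layer. Where should I start?
Fill in the rules through a form, click a button, push the information over a socket, then crawl. This seems more flexible, with no need to write code every time.
Environment: win10, golang 1.6, no C compiler
IDE: liteide
It compiles and starts without problems, but it crashes immediately after clicking run.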
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x0 pc=0x6121fc]
goroutine 67 [running]:
panic(0xb6b700, 0x11dae030)
H:/CoderTools/go1.6.windows-386/go/src/runtime/panic.go:464 +0x326
sync/atomic.AddUint64(0x11e64134, 0x1, 0x0, 0x35552068, 0x85109b)
H:/CoderTools/go1.6.windows-386/go/src/sync/atomic/asm_386.s:112 +0xc
github.com/henrylee2cn/pholcus/app/scheduler.(*Matrix).Push(0x11e64120, 0x11efc000)
H:/CoderTools/go1.6.windows-386/src/src/github.com/henrylee2cn/pholcus/app/scheduler/scheduler.go:180 +0x10a
github.com/henrylee2cn/pholcus/app/spider.(*Spider).ReqmatrixPush(0x11ec34a0, 0x11efc000)
H:/CoderTools/go1.6.windows-386/src/src/github.com/henrylee2cn/pholcus/app/spider/spider.go:269 +0x2c
github.com/henrylee2cn/pholcus/app/spider.(*Context).AddQueue(0x11e622a0, 0x11efc000, 0x1d)
H:/CoderTools/go1.6.windows-386/src/src/github.com/henrylee2cn/pholcus/app/spider/context.go:79 +0x32c
github.com/pholcus/spider_lib.glob.func66(0x11e622a0, 0x11e622c0, 0x0, 0x0)
H:/CoderTools/go1.6.windows-386/src/src/github.com/pholcus/spider_lib/taobaosearch.go:54 +0x242
github.com/henrylee2cn/pholcus/app/spider.(*Context).Aid(0x11e622a0, 0x11e622c0, 0x12165f48, 0x1, 0x1, 0x0, 0x0)
H:/CoderTools/go1.6.windows-386/src/src/github.com/henrylee2cn/pholcus/app/spider/context.go:198 +0x168
github.com/pholcus/spider_lib.glob.func65(0x11e622a0)
H:/CoderTools/go1.6.windows-386/src/src/github.com/pholcus/spider_lib/taobaosearch.go:43 +0x186
github.com/henrylee2cn/pholcus/app/spider.(*Spider).Start(0x11ec34a0)
H:/CoderTools/go1.6.windows-386/src/src/github.com/henrylee2cn/pholcus/app/spider/spider.go:197 +0x97
github.com/henrylee2cn/pholcus/app/crawl.(*crawler).Start(0x121bc280)
H:/CoderTools/go1.6.windows-386/src/src/github.com/henrylee2cn/pholcus/app/crawl/crawl.go:59 +0x43
github.com/henrylee2cn/pholcus/app.(*Logic).goRun.func1(0x12151ea0, 0x0, 0x35552048, 0x121bc280)
H:/CoderTools/go1.6.windows-386/src/src/github.com/henrylee2cn/pholcus/app/app.go:596 +0x74
created by github.com/henrylee2cn/pholcus/app.(*Logic).goRun
H:/CoderTools/go1.6.windows-386/src/src/github.com/henrylee2cn/pholcus/app/app.go:599 +0xfe
I suspect that a single IP using a large number of different user-agents would itself be recognized as a crawler...
Since a ' ' in a selector selects descendants, how do I find the node in the code below?
<div class="test abc">
...
</div>
The node's class value contains whitespace; I wonder whether some escaping is needed.
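In HTML, the class attribute is a whitespace-separated list of class names, so class="test abc" carries two classes, test and abc. In standard CSS the element is matched by chaining the classes without a space, as in div.test.abc, and no escaping is involved; this assumes the selector engine in use follows standard CSS. The standard-library sketch below only illustrates that matching rule; hasClasses is a hypothetical helper, not part of pholcus.

```go
package main

import (
	"fmt"
	"strings"
)

// hasClasses reports whether a class attribute value contains every name in
// want, treating the attribute as a whitespace-separated list of class names.
func hasClasses(classAttr string, want ...string) bool {
	have := make(map[string]bool)
	for _, name := range strings.Fields(classAttr) {
		have[name] = true
	}
	for _, name := range want {
		if !have[name] {
			return false
		}
	}
	return true
}

func main() {
	// Mirrors the selector div.test.abc: both classes must be present.
	fmt.Println(hasClasses("test abc", "test", "abc"))
	// "test abc" is not a single class name, so this does not match.
	fmt.Println(hasClasses("test abc", "test abc"))
}
```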
I got an error like this when using the new form for login:
[https://thelookbookwholesale.comhttps://thelookbookwholesale.com/login.php?action=process]
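The doubled host in the URL above suggests that an absolute form action was appended to the site's base URL instead of being resolved against it; that reading is an assumption, since the form code is not shown. A minimal sketch of resolving an action with net/url follows; resolveAction is a name introduced here for illustration.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveAction resolves a form's action attribute against the page URL.
// Absolute actions are returned unchanged; relative ones are joined to the base.
func resolveAction(pageURL, action string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(action)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	page := "https://thelookbookwholesale.com/"
	for _, action := range []string{
		"https://thelookbookwholesale.com/login.php?action=process",
		"/login.php?action=process",
	} {
		resolved, err := resolveAction(page, action)
		fmt.Println(resolved, err)
	}
}
```

Both forms of the action resolve to the same single-host URL.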
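A nice project, and the framework is very well written.
But some basic documentation seems to be missing; I guess the experts don't feel the need for it.
Requesting basic spider documentation.
GetSpiderLib
How do I get the default spiders?
I couldn't find these default templates; they would be useful as a reference.
From the issues I have read, it seems there is no debugging framework yet for developing rules.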
Phantom Downloader
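Some of the cookies obtained are lost; only the first one is retrieved.
The framework is very easy to use, but when writing crawler rules, every debugging round requires restarting the service.
I hope a call like the following can be provided to make debugging easier.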
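package main

import "github.com/henrylee2cn/pholcus"

func main() {
	PholcusSpider.Test(&request.Request{
		// The Request object
		Url: "http://www.baidu.com",
		// Other parameters...
	}, func(ctx *Context) {
		ctx.GetDom()
		// .......
		// Return a ctx object for the given request object, to make testing easy.
		// Restarting the server after every change to a method makes debugging tedious.
		// Once debugging is done, copy the code straight into the program; that would be much more convenient.
	})
}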
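A crawler may target sites that update infrequently, for example 5 new articles a day, while being configured to write to the database once every 10 records.
The data is then held back from the database for a long time and cannot be processed further, whereas setting the output to every 1~5 records puts considerable pressure on the database.
Suggestion: allow both a count limit and a time limit for batched output, for example flushing whenever 10 records have accumulated or every five minutes.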
Hmm... As I see it, you use PhantomJS to solve this problem? But you do not recommend that we do this, so is there any solution if I only use the default Golang client?
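The target site has a single list with more than a hundred thousand pages.
Problems:
The crawled list is not persisted, so if the run is interrupted midway all the data is lost; crawling page by page would require writing out more than a hundred thousand list page URLs, which is not practical either.
Content crawling does not start until the whole list has been crawled.
Hoped-for approach:
I wrote a simple rule to crawl hotel information, with the hierarchy country -> city -> hotel.
Data volume: 200+ countries, 80,000+ cities, 700,000+ hotels.
At the moment the efficiency feels very low; I am using standalone + web mode.
After a full day it had crawled only 200+ hotels, and I only take each hotel's name and description.
The system is CentOS; top shows that pholcus takes up almost all of the memory (4G), although the data volume is not actually large. With memory full, the throughput is almost zero...
What I would like to know is: is memory not released during crawling? Does this count as a bug?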
[E] kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.
2017/08/07 14:25:38 [E] circuit breaker is open
2017/08/07 14:25:38 [E] circuit breaker is open
Hi there, I got this error when using the kafka output.
Hi
I can understand most of what is happening with the lib. But unfortunately, I cannot explore in its depth due to my lack of language understanding. Is there a plan to have an English version of the docs for this wonderful library!?
On line 94 of mysql.go, change
self.sqlCode += ");"
to self.sqlCode += ") default charset=utf8;"
and it works.
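A minimal sketch of the change described above: building a CREATE TABLE statement that ends with an explicit charset. createTableSQL, the table name, and the column names are hypothetical, and every column is TEXT only for brevity; this is not the project's mysql.go.

```go
package main

import (
	"fmt"
	"strings"
)

// createTableSQL builds a CREATE TABLE statement with an explicit charset,
// so Chinese text is stored correctly even when the server default differs.
func createTableSQL(table string, columns []string) string {
	defs := make([]string, len(columns))
	for i, col := range columns {
		defs[i] = "`" + col + "` TEXT"
	}
	return "CREATE TABLE IF NOT EXISTS `" + table + "` (" +
		strings.Join(defs, ", ") + ") DEFAULT CHARSET=utf8;"
}

func main() {
	fmt.Println(createTableSQL("news", []string{"title", "body"}))
}
```

Note that MySQL's utf8 charset stores at most three bytes per character; utf8mb4 may be the safer choice where four-byte characters such as emoji can appear.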
./pholcus.go:44: undefined: config.MYSQL_OUTPUT in config.MYSQL_OUTPUT.HOST
./pholcus.go:44: cannot assign to config.MYSQL_OUTPUT.HOST
./pholcus.go:46: undefined: config.MYSQL_OUTPUT in config.MYSQL_OUTPUT.DB
./pholcus.go:46: cannot assign to config.MYSQL_OUTPUT.DB
./pholcus.go:48: undefined: config.MYSQL_OUTPUT in config.MYSQL_OUTPUT.USER
./pholcus.go:48: cannot assign to config.MYSQL_OUTPUT.USER
./pholcus.go:50: undefined: config.MYSQL_OUTPUT in config.MYSQL_OUTPUT.PASSWORD
./pholcus.go:50: cannot assign to config.MYSQL_OUTPUT.PASSWORD
./pholcus.go:52: undefined: config.MYSQL_OUTPUT in config.MYSQL_OUTPUT.MAX_CONNS
./pholcus.go:52: cannot assign to config.MYSQL_OUTPUT.MAX_CONNS
./pholcus.go:52: too many errors
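I commented out everything related to MGO_OUTPUT,
filled in the password,
and left everything else untouched.
The pholcus database has already been created.
go1.5.1
If storing into MySQL requires tables linked by foreign keys, can pholcus handle that?
Checking Reloadable
Whether a repeated download is allowed is decided by the following check: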
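func (self *Matrix) Push(req *request.Request) {
	...
	// Requests that must not be downloaded again
	if !req.IsReloadable() {
		// Return if a success record already exists
		if self.hasHistory(req.Unique()) {
			return
		}
		// Add to the temporary history
		self.insertTempHistory(req.Unique())
	}
	...
}
This in fact relies on func (self *Request) Unique() string
to decide whether two requests are the same:
// Unique identifier of the request
func (self *Request) Unique() string {
	if self.unique == "" {
		block := md5.Sum([]byte(self.Spider + self.Rule + self.Url + self.Method))
		self.unique = hex.EncodeToString(block[:])
	}
	return self.unique
}
If a POST request carries PostData, the check cannot tell correctly whether two requests are the same, because the body is not part of the hash.
For example, requests that differ only in their body, such as the following with other page values, all get the same identifier: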
POST /somewhere
page=1&keyword=XXX
Expected result:
// Unique identifier of the request
func (self *Request) Unique() string {
	if self.unique == "" {
		block := md5.Sum([]byte(self.Spider + self.Rule + self.Url + self.Method + self.PostData))
		self.unique = hex.EncodeToString(block[:])
	}
	return self.unique
}
Changing this logic would have a significant impact on data that has already been stored.
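A runnable sketch of the proposed behaviour, using a cut-down stand-in for the request type that holds only the fields the hash reads. The struct below is an assumption made for illustration and is not the full request.Request.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// Request is a cut-down stand-in holding only the fields used by Unique.
type Request struct {
	Spider, Rule, Url, Method, PostData string

	unique string // cached identifier
}

// Unique returns the request's identifier, including PostData in the hash
// so that POST requests with different bodies are treated as different.
func (self *Request) Unique() string {
	if self.unique == "" {
		block := md5.Sum([]byte(self.Spider + self.Rule + self.Url + self.Method + self.PostData))
		self.unique = hex.EncodeToString(block[:])
	}
	return self.unique
}

func main() {
	page1 := &Request{Url: "/somewhere", Method: "POST", PostData: "page=1&keyword=XXX"}
	page2 := &Request{Url: "/somewhere", Method: "POST", PostData: "page=2&keyword=XXX"}
	fmt.Println(page1.Unique() != page2.Unique())
}
```

Because identifiers computed with the new formula differ from stored ones, previously recorded history entries would no longer match, which is the impact on stored data noted above.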
How do I set a custom unique index on specific fields, for example "url", for the mgo output? Please tell me how.
Running go run example_main.go gives:
app/downloader/surfer/agent/agent_linux.go:17: cannot use buf.Sysname (type [65]uint8) as type [65]int8 in argument to charsToString
app/downloader/surfer/agent/agent_linux.go:27: cannot use buf.Release (type [65]uint8) as type [65]int8 in argument to charsToString
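Could you help me find out why?
The environment is a Raspberry Pi with an ARM processor.
Hello. On mac 10.12.3 (16D32), I created proxy.lib according to the configuration in config.ini and filled in several proxies in the form http://ip:port. After starting the web UI, the console keeps showing 0 available proxies. I have confirmed that the proxies work normally. How can I solve this? Thank you for taking the time to help.
I downloaded and ran the web version and wanted to try the public rule library directly, but after opening standalone mode there was not a single rule!
I looked at config.go, and it does not seem to be configured there. Or do the rules need to be imported into mongo first?
This package is missing and needs to be added: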
"github.com/henrylee2cn/pholcus/runtime/cache"
[root@dev henrylee2cn]# go get -u github.com/henrylee2cn/pholcus
pholcus/exec/exec_linux.go:18: undefined: cache in cache.Task
As the title says.
$ uname -a
Linux raspberrypi 4.1.13+ #826 PREEMPT Fri Nov 13 20:13:22 GMT 2015 armv6l GNU/Linux
# An ARM system customized from Debian jessie for the Raspberry Pi development board.
#http://mirrordirector.raspbian.org/raspbian/
$ go version
go version go1.7.4 linux/arm
# The install package is go1.7.4.linux-armv6l.tar.gz http://www.golangtc.com/download
Building pholcus fails with:
$ go build
app/downloader/surfer/agent/agent_linux.go:17: cannot use buf.Sysname (type [65]uint8) as type [65]int8 in argument to charsToString
app/downloader/surfer/agent/agent_linux.go:27: cannot use buf.Release (type [65]uint8) as type [65]int8 in argument to charsToString
So I changed the parameter type of charsToString in agent_linux.go from int8 to uint8, and only then did the build succeed.