andeya / pholcus


Pholcus is a distributed, high-concurrency crawler written in pure Go

License: Apache License 2.0

Go 99.98% Shell 0.02%

crowler spider

pholcus's Introduction

Pholcus

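Pholcus (Ghost Spider) is a high-concurrency crawler with distributed support, written in pure Go. It is intended solely for programming study and research.

It supports three run modes (standalone, server, client) and three user interfaces (Web, GUI, command line). Rules are simple and flexible, batch tasks run concurrently, and output options are rich (mysql/mongodb/kafka/csv/excel, etc.). It also supports both horizontal and vertical crawl modes, simulated login, task pause and cancel, and other advanced features.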

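Disclaimer

This software is for academic research only. Users must comply with the laws and regulations of their jurisdiction; do not use it for illegal purposes! For example, news of crawler developers facing lawsuits or violations surfaces frequently in mainland China.
Solemn declaration: users bear sole responsibility for all consequences of illegal or non-compliant use!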

How the Crawler Works

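Framework Features

Gives users with some Go or JS programming background a full-featured, heavyweight crawler tool in which they only need to focus on writing rules;
Supports three run modes: standalone, server, and client;
Three user interfaces: GUI (Windows), Web, and Cmd, selectable via a parameter;
Supports state control such as pause, resume, and stop;
Collection volume can be limited;
The number of concurrent goroutines can be limited;
Supports concurrent execution of multiple collection tasks;
Supports a proxy IP list with a configurable rotation frequency;
Supports random pauses during collection to simulate human behavior;
Provides a custom configuration input interface according to rule requirements;
Six output methods: mysql, mongodb, kafka, csv, excel, and raw file download;
Supports batched output with a configurable batch size;
Supports both static Go and dynamic JS collection rules and both horizontal and vertical crawl modes, with many demos;
Persists success records for automatic deduplication;
Serializes failed requests and supports deserializing them for automatic reloading;
Uses the surfer high-concurrency downloader, which supports the GET/POST/HEAD methods and the http/https protocols. It offers two modes: a fixed UserAgent with cookies saved automatically, or many random UserAgents with cookies disabled. It closely simulates browser behavior and enables features such as simulated login;
Server/client mode uses the Teleport high-concurrency SocketAPI framework with full-duplex persistent connections; the internal data transfer format is JSON.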

Download and Install

go get -u -v github.com/henrylee2cn/pholcus

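Create a Project

package main

import (
    "github.com/henrylee2cn/pholcus/exec"
    // _ "pholcus_lib_pte" // You can likewise freely add your own rule library
)

func main() {
    // Set the default user interface for this run and start running
    // Before running, set the -a_ui parameter to "web", "gui", or "cmd" to choose the interface for this run
    // "gui" is supported on Windows only
    exec.DefaultRun("web")
}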

Compile and Run

Normal build

cd {{replace your gopath}}/src/github.com/henrylee2cn/pholcus
go install    # or: go build

Build that hides the cmd window on Windows

cd {{replace your gopath}}/src/github.com/henrylee2cn/pholcus
go install -ldflags="-H=windowsgui -linkmode=internal"    # or: go build -ldflags="-H=windowsgui -linkmode=internal"

View the available options:

pholcus -h

[Screenshot: Web interface]

[Screenshot: GUI interface, mode-selection screen]

Example of Cmd run parameters:

$ pholcus -_ui=cmd -a_mode=0 -c_spider=3,8 -a_outtype=csv -a_thread=20 -a_dockercap=5000 -a_pause=300
-a_proxyminute=0 -a_keyins="<pholcus><golang>" -a_limit=10 -a_success=true -a_failure=true

Note: on Mac, be sure to obtain root privileges if you use the proxy IP feature; otherwise usable proxies cannot be detected via ping!

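Runtime Directory Layout

├─pholcus executable
│
├─pholcus_pkg runtime file directory
│  ├─config.ini configuration file
│  │
│  ├─proxy.lib proxy IP list file
│  │
│  ├─spiders dynamic rule directory
│  │  └─xxx.pholcus.html dynamic rule file
│  │
│  ├─phantomjs program file
│  │
│  ├─text_out text data output directory
│  │
│  ├─file_out file result output directory
│  │
│  ├─logs log directory
│  │
│  ├─history history record directory
│  │
│  └─cache temporary cache directory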

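Dynamic Rule Example

Characteristics: rules are loaded dynamically with no need to recompile the software; they are simple to write and free to add, which suits lightweight collection projects.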
xxx.pholcus.html

<Spider>
    <Name>HTML动态规则示例</Name>
    <Description>HTML动态规则示例 [Auto Page] [http://xxx.xxx.xxx]</Description>
    <Pausetime>300</Pausetime>
    <EnableLimit>false</EnableLimit>
    <EnableCookie>true</EnableCookie>
    <EnableKeyin>false</EnableKeyin>
    <NotDefaultField>false</NotDefaultField>
    <Namespace>
        <Script></Script>
    </Namespace>
    <SubNamespace>
        <Script></Script>
    </SubNamespace>
    <Root>
        <Script param="ctx">
        console.log("Root");
        ctx.JsAddQueue({
            Url: "http://xxx.xxx.xxx",
            Rule: "登录页"
        });
        </Script>
    </Root>
    <Rule name="登录页">
        <AidFunc>
            <Script param="ctx,aid">
            </Script>
        </AidFunc>
        <ParseFunc>
            <Script param="ctx">
            console.log(ctx.GetRuleName());
            ctx.JsAddQueue({
                Url: "http://xxx.xxx.xxx",
                Rule: "登录后",
                Method: "POST",
                PostData: "[email protected]&amp;password=44444444&amp;login_btn=login_btn&amp;submit=login_btn"
            });
            </Script>
        </ParseFunc>
    </Rule>
    <Rule name="登录后">
        <ParseFunc>
            <Script param="ctx">
            console.log(ctx.GetRuleName());
            ctx.Output({
                "全部": ctx.GetText()
            });
            ctx.JsAddQueue({
                Url: "http://accounts.xxx.xxx/member",
                Rule: "个人中心",
                Header: {
                    "Referer": [ctx.GetUrl()]
                }
            });
            </Script>
        </ParseFunc>
    </Rule>
    <Rule name="个人中心">
        <ParseFunc>
            <Script param="ctx">
            console.log("个人中心: " + ctx.GetRuleName());
            ctx.Output({
                "全部": ctx.GetText()
            });
            </Script>
        </ParseFunc>
    </Rule>
</Spider>

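Static Rule Example

Characteristics: compiled together with the software; more customizable and more efficient, which suits heavyweight collection projects.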
xxx.go

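package pholcus_lib // illustrative package name

// Import paths below are assumed from the pholcus repository layout.
import (
    "net/http"

    "github.com/henrylee2cn/pholcus/app/downloader/request"
    . "github.com/henrylee2cn/pholcus/app/spider"
)

func init() {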
    Spider{
        Name:        "静态规则示例",
        Description: "静态规则示例 [Auto Page] [http://xxx.xxx.xxx]",
        // Pausetime: 300,
        // Limit:   LIMIT,
        // Keyin:   KEYIN,
        EnableCookie:    true,
        NotDefaultField: false,
        Namespace:       nil,
        SubNamespace:    nil,
        RuleTree: &RuleTree{
            Root: func(ctx *Context) {
                ctx.AddQueue(&request.Request{Url: "http://xxx.xxx.xxx", Rule: "登录页"})
            },
            Trunk: map[string]*Rule{
                "登录页": {
                    ParseFunc: func(ctx *Context) {
                        ctx.AddQueue(&request.Request{
                            Url:      "http://xxx.xxx.xxx",
                            Rule:     "登录后",
                            Method:   "POST",
                            PostData: "[email protected]&password=123456&login_btn=login_btn&submit=login_btn",
                        })
                    },
                },
                "登录后": {
                    ParseFunc: func(ctx *Context) {
                        ctx.Output(map[string]interface{}{
                            "全部": ctx.GetText(),
                        })
                        ctx.AddQueue(&request.Request{
                            Url:    "http://accounts.xxx.xxx/member",
                            Rule:   "个人中心",
                            Header: http.Header{"Referer": []string{ctx.GetUrl()}},
                        })
                    },
                },
                "个人中心": {
                    ParseFunc: func(ctx *Context) {
                        ctx.Output(map[string]interface{}{
                            "全部": ctx.GetText(),
                        })
                    },
                },
            },
        },
    }.Register()
}

Proxy IP

Proxy IPs are written in the /pholcus_pkg/proxy.lib file in the following format, one IP per line:

http://183.141.168.95:3128
https://60.13.146.92:8088
http://59.59.4.22:8090
https://180.119.78.78:8090
https://222.178.56.73:8118
http://115.228.57.254:3128
http://49.84.106.160:9000

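To use them, select "proxy IP rotation frequency" in the user interface or set the -a_proxyminute parameter on the command line.
Note: on Mac, be sure to obtain root privileges if you use the proxy IP feature; otherwise usable proxies cannot be detected via ping!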
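
As a rough illustration of the one-entry-per-line format described above, the following sketch reads such a list and keeps only well-formed http/https entries. The function name parseProxyList is hypothetical and is not part of pholcus; how pholcus itself loads proxy.lib is not shown in this document.

```go
package main

import (
	"bufio"
	"fmt"
	"net/url"
	"strings"
)

// parseProxyList is a hypothetical helper: it reads one proxy URL per line,
// skipping blank lines and entries that are not http/https URLs with a host.
func parseProxyList(text string) []*url.URL {
	var proxies []*url.URL
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		u, err := url.Parse(line)
		if err != nil || u.Host == "" || (u.Scheme != "http" && u.Scheme != "https") {
			continue
		}
		proxies = append(proxies, u)
	}
	return proxies
}

func main() {
	list := "http://183.141.168.95:3128\nhttps://60.13.146.92:8088\n\nnot a proxy\n"
	for _, p := range parseProxyList(list) {
		fmt.Println(p.Scheme, p.Host)
	}
}
```

Each parsed entry could then be handed to an HTTP client, for example through http.ProxyURL from the standard library.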

FAQ

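Are duplicate URLs in the request queue deduplicated automatically?

URLs are deduplicated by default, but you can set Request.Reloadable=true to allow duplicates.

If the content of the page a URL points to has been updated, does the framework have a mechanism to detect it?

The framework cannot directly detect page content updates, but users can implement this themselves in their rules.

Is request success determined by the HTTP status code?

No. Success is determined not by the status code but by whether the server returns a response stream. That is, a 404 page also counts as a success.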
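
The same distinction exists in Go's standard HTTP client, which is used here only as an analogy and not as a description of the surfer downloader: http.Get returns a nil error whenever a response arrives, whatever its status code. A minimal sketch using a local test server (fetch and newNotFoundServer are illustrative names):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// newNotFoundServer starts a local test server that answers every request with 404.
func newNotFoundServer() *httptest.Server {
	return httptest.NewServer(http.NotFoundHandler())
}

// fetch returns the status code and body. The error is non-nil only when
// no usable response was received, not when the status code signals a failure.
func fetch(url string) (int, string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return resp.StatusCode, string(body), err
}

func main() {
	srv := newNotFoundServer()
	defer srv.Close()

	code, body, err := fetch(srv.URL)
	fmt.Printf("status=%d err=%v body=%q\n", code, err, body)
}
```

A rule that needs to treat 404 pages as failures would therefore have to inspect the status code itself.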

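What is the retry mechanism after a request fails?

After each URL has been tried the specified number of times and still fails, the request is appended to a special, defer-like queue.  
After the current task ends normally, these requests are automatically added back to the download queue and downloaded again. Any that still fail are saved to the failure history.  
The next time this crawler rule runs, you can choose to inherit the failure history, which automatically adds these failed requests to the defer-like special queue... (the subsequent steps repeat)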
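
The flow described above can be sketched in a few lines. This is a simplified, single-goroutine model under stated assumptions, not the pholcus scheduler: the names crawl, maxTries, and errNoResponse are hypothetical, and persistence of the failure history is reduced to a returned slice.

```go
package main

import (
	"errors"
	"fmt"
)

var errNoResponse = errors.New("no response")

// crawl tries each URL up to maxTries times. URLs that still fail are set
// aside in a deferred queue, retried after the main pass has finished, and
// whatever fails again is returned as the failure history.
func crawl(urls []string, maxTries int, download func(string) error) (done, failed []string) {
	try := func(u string) bool {
		for i := 0; i < maxTries; i++ {
			if download(u) == nil {
				return true
			}
		}
		return false
	}

	var deferred []string
	for _, u := range urls {
		if try(u) {
			done = append(done, u)
		} else {
			deferred = append(deferred, u)
		}
	}
	// Second chance once the normal queue is empty.
	for _, u := range deferred {
		if try(u) {
			done = append(done, u)
		} else {
			failed = append(failed, u)
		}
	}
	return done, failed
}

func main() {
	calls := map[string]int{}
	download := func(u string) error {
		calls[u]++
		if u == "/dead" || (u == "/flaky" && calls[u] <= 3) {
			return errNoResponse
		}
		return nil
	}
	done, failed := crawl([]string{"/ok", "/flaky", "/dead"}, 3, download)
	fmt.Println("done:", done, "failed:", failed)
}
```

On a later run, the returned failure list would be fed back in as the initial content of the deferred queue.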


pholcus's Issues

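Chinese folder names in text_out

Hello,
When LANG=zh_CN.UTF-8, the files are given Chinese names. Could this be changed to numbers?

The web page on port 9090 is reachable, but clicking RUN crashes with the following error: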
2015/12/22 18:05:23 [pholcus] server Running on 0.0.0.0:9090
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x0 pc=0x49dd3c]

goroutine 379 [running]:
sync/atomic.AddUint64(0x1313f634, 0x1, 0x0, 0x33064b18, 0x6f89fb)
c:/go/src/sync/atomic/asm_386.s:112 +0xc
github.com/henrylee2cn/pholcus/app/scheduler.(*Matrix).Push(0x1313f620, 0x1355e5b0)
D:/go/workspace/src/github.com/henrylee2cn/pholcus/app/scheduler/scheduler.go:182 +0x10a
github.com/henrylee2cn/pholcus/app/spider.(*Spider).ReqmatrixPush(0x134dd240, 0x1355e5b0)
D:/go/workspace/src/github.com/henrylee2cn/pholcus/app/spider/spider.go:227 +0x2c
github.com/henrylee2cn/pholcus/app/spider.(*Context).AddQueue(0x13542ae0, 0x1355e5b0, 0x24)
D:/go/workspace/src/github.com/henrylee2cn/pholcus/app/spider/context.go:78 +0x1e5
github.com/pholcus/spider_lib.glob.func38(0x13542ae0, 0x13542b00, 0x0, 0x0)
D:/go/workspace/src/github.com/pholcus/spider_lib/jdsearch.go:56 +0x26f
github.com/henrylee2cn/pholcus/app/spider.(*Context).Aid(0x13542ae0, 0x13542b00, 0x13492f5c, 0x1, 0x1, 0x0, 0x0)
D:/go/workspace/src/github.com/henrylee2cn/pholcus/app/spider/context.go:197 +0x168
github.com/pholcus/spider_lib.glob.func37(0x13542ae0)
D:/go/workspace/src/github.com/pholcus/spider_lib/jdsearch.go:43 +0x186
github.com/henrylee2cn/pholcus/app/spider.(*Spider).Start(0x134dd240)
D:/go/workspace/src/github.com/henrylee2cn/pholcus/app/spider/spider.go:170 +0x8f
github.com/henrylee2cn/pholcus/app/crawl.(*crawler).Start(0x134dd4c0)
D:/go/workspace/src/github.com/henrylee2cn/pholcus/app/crawl/crawl.go:61 +0x43
github.com/henrylee2cn/pholcus/app.(*Logic).goRun.func1(0x133c5340, 0x0, 0x33064af8, 0x134dd4c0)
D:/go/workspace/src/github.com/henrylee2cn/pholcus/app/app.go:563 +0x74
created by github.com/henrylee2cn/pholcus/app.(*Logic).goRun
D:/go/workspace/src/github.com/henrylee2cn/pholcus/app/app.go:566 +0xe3

How can I use a fixed user-agent?

I suspect that a single IP using many different user-agents would be flagged as a crawler...

How to implement scheduled collection

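Feature suggestion: add time-based output to batched output (database writes)

A crawler may scrape sites that update infrequently, for example 5 new articles a day, while being configured to write to the database every 10 records.
As a result, data is not written for a long time and cannot be processed downstream, whereas setting output to every 1 to 5 records puts considerable pressure on the database.

Suggestion: allow setting both a count and a time limit for batched output, for example flushing once 10 records accumulate or every five minutes.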
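
One way to picture the suggestion is a buffer that flushes on whichever limit is reached first. The sketch below is only an illustration of the idea; batcher, newBatcher, add, and poll are hypothetical names, not pholcus APIs, and time is passed in explicitly so the behavior is easy to follow.

```go
package main

import (
	"fmt"
	"time"
)

// epoch is a fixed reference time used by the demo below.
var epoch = time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)

// batcher buffers records and releases them when either maxCount records
// are pending or the oldest pending record is at least maxAge old.
type batcher struct {
	maxCount int
	maxAge   time.Duration
	buf      []string
	first    time.Time // arrival time of the oldest buffered record
}

func newBatcher(maxCount, maxAgeSeconds int) *batcher {
	return &batcher{maxCount: maxCount, maxAge: time.Duration(maxAgeSeconds) * time.Second}
}

// add buffers rec at time now and returns a batch to write if a limit was hit, else nil.
func (b *batcher) add(rec string, now time.Time) []string {
	if len(b.buf) == 0 {
		b.first = now
	}
	b.buf = append(b.buf, rec)
	return b.poll(now)
}

// poll returns the pending batch if either limit is reached. Calling it from
// a timer as well lets quiet periods flush even when no new records arrive.
func (b *batcher) poll(now time.Time) []string {
	if len(b.buf) == 0 {
		return nil
	}
	if len(b.buf) < b.maxCount && now.Sub(b.first) < b.maxAge {
		return nil
	}
	out := b.buf
	b.buf = nil
	return out
}

func main() {
	b := newBatcher(10, 300) // 10 records or 5 minutes, whichever comes first
	fmt.Println(len(b.add("article-1", epoch)))
	fmt.Println(len(b.add("article-2", epoch.Add(time.Minute))))
	fmt.Println(b.poll(epoch.Add(5 * time.Minute)))
}
```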

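Can this spoof the IP? I tried it and was blocked before long, even with a 300 ms pause and concurrency set to 1

Many thanks, if it could spoof the IP

Nice project. Is there any documentation on spider rules?

Nice project, and the framework is really well written.
But some basic documentation seems to be missing; I guess the experts don't bother with it.
Requesting basic spider documentation
GetSpiderLib
How do I get the default spiders?
I couldn't find these default templates; they would be useful for reference.

From reading some issues, it seems there is no debugging framework for rule development yet.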

panic: runtime error: invalid memory address or nil pointer dereference

Environment: win10, golang 1.6, no C compiler
IDE: liteide
It runs fine after compiling, but crashes as soon as I click run

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x0 pc=0x6121fc]

goroutine 67 [running]:
panic(0xb6b700, 0x11dae030)
H:/CoderTools/go1.6.windows-386/go/src/runtime/panic.go:464 +0x326
sync/atomic.AddUint64(0x11e64134, 0x1, 0x0, 0x35552068, 0x85109b)
H:/CoderTools/go1.6.windows-386/go/src/sync/atomic/asm_386.s:112 +0xc
github.com/henrylee2cn/pholcus/app/scheduler.(*Matrix).Push(0x11e64120, 0x11efc000)
H:/CoderTools/go1.6.windows-386/src/src/github.com/henrylee2cn/pholcus/app/scheduler/scheduler.go:180 +0x10a
github.com/henrylee2cn/pholcus/app/spider.(*Spider).ReqmatrixPush(0x11ec34a0, 0x11efc000)
H:/CoderTools/go1.6.windows-386/src/src/github.com/henrylee2cn/pholcus/app/spider/spider.go:269 +0x2c
github.com/henrylee2cn/pholcus/app/spider.(*Context).AddQueue(0x11e622a0, 0x11efc000, 0x1d)
H:/CoderTools/go1.6.windows-386/src/src/github.com/henrylee2cn/pholcus/app/spider/context.go:79 +0x32c
github.com/pholcus/spider_lib.glob.func66(0x11e622a0, 0x11e622c0, 0x0, 0x0)
H:/CoderTools/go1.6.windows-386/src/src/github.com/pholcus/spider_lib/taobaosearch.go:54 +0x242
github.com/henrylee2cn/pholcus/app/spider.(*Context).Aid(0x11e622a0, 0x11e622c0, 0x12165f48, 0x1, 0x1, 0x0, 0x0)
H:/CoderTools/go1.6.windows-386/src/src/github.com/henrylee2cn/pholcus/app/spider/context.go:198 +0x168
github.com/pholcus/spider_lib.glob.func65(0x11e622a0)
H:/CoderTools/go1.6.windows-386/src/src/github.com/pholcus/spider_lib/taobaosearch.go:43 +0x186
github.com/henrylee2cn/pholcus/app/spider.(*Spider).Start(0x11ec34a0)
H:/CoderTools/go1.6.windows-386/src/src/github.com/henrylee2cn/pholcus/app/spider/spider.go:197 +0x97
github.com/henrylee2cn/pholcus/app/crawl.(*crawler).Start(0x121bc280)
H:/CoderTools/go1.6.windows-386/src/src/github.com/henrylee2cn/pholcus/app/crawl/crawl.go:59 +0x43
github.com/henrylee2cn/pholcus/app.(*Logic).goRun.func1(0x12151ea0, 0x0, 0x35552048, 0x121bc280)
H:/CoderTools/go1.6.windows-386/src/src/github.com/henrylee2cn/pholcus/app/app.go:596 +0x74
created by github.com/henrylee2cn/pholcus/app.(*Logic).goRun
H:/CoderTools/go1.6.windows-386/src/src/github.com/henrylee2cn/pholcus/app/app.go:599 +0xfe

English documentation please

I can understand most of what is happening with the lib. But unfortunately, I cannot explore in its depth due to my lack of language understanding. Is there a plan to have an English version of the docs for this wonderful library!?

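Can incremental crawling be supported?

Output always uses StartTime as the directory, so each crawl's results are saved to a different path.
Is incremental mode supported, that is, crawling only new content on top of the previous results and saving it to the same file?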

app\distribute\client_api.go:13: cannot use task2 literal (type task2) as type func(teleport.NetData) *teleport.NetData in map value

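What is this problem?

Can a dynamic proxy API be supported?

Is there a tutorial on using phantomjs in a project?

I see that the project supports phantomjs, but I don't know how to use it.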

ARM system compatibility

$ uname -a
Linux raspberrypi 4.1.13+ #826 PREEMPT Fri Nov 13 20:13:22 GMT 2015 armv6l GNU/Linux
# A customized ARM build of Debian jessie for a development board called Raspberry Pi.
#http://mirrordirector.raspbian.org/raspbian/
$ go version
go version go1.7.4 linux/arm
# The install package is go1.7.4.linux-armv6l.tar.gz http://www.golangtc.com/download

Error when building pholcus

$ go build

github.com/henrylee2cn/pholcus/app/downloader/surfer/agent

app/downloader/surfer/agent/agent_linux.go:17: cannot use buf.Sysname (type [65]uint8) as type [65]int8 in argument to charsToString
app/downloader/surfer/agent/agent_linux.go:27: cannot use buf.Release (type [65]uint8) as type [65]int8 in argument to charsToString

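After I changed the parameter type of charsToString in agent_linux.go from int8 to uint8, the build succeeded.

Can it handle foreign keys?

If MySQL storage requires tables linked by foreign keys, can pholcus handle that?

Install error

An error occurred during installation, as shown in the image...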

centOS 7.1

go build github.com/henrylee2cn/pholcus: /usr/lib/golang/pkg/tool/linux_amd64/link: signal: killed

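Could a compiled GUI build be published in releases?

For Windows, thanks.

Can online editing and testing of crawlers be supported?

As the title says.

Dynamic rule parse error: is embedding JS in XML a problem? Suggest using plain JS files instead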

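package main

import (
    "encoding/xml"
    "fmt"
)

func main() {
    type Spider struct {
        Script string `xml:"Script"`
    }
    result := Spider{Script: "none"}
    // A bare "<" inside <Script> is not well-formed XML, so Unmarshal returns an error.
    data := `
        <Spider>
            <Script>
            1 < 2
            </Script>
        </Spider>
    `
    err := xml.Unmarshal([]byte(data), &result)
    if err != nil {
        fmt.Printf("error: %v\n", err)
        return
    }
    fmt.Printf("Script: %v", result.Script)
}

If the JS code inside the Script element contains a "<" character, xml.Unmarshal fails to parse it; ">" works fine, and other characters were not tested.

Personally, I find embedding JS in XML not very friendly; I suggest using plain JS files instead.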
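
For reference, the parse failure comes from XML well-formedness rules rather than from anything specific to pholcus: a literal "<" in element content must be escaped as &lt; or wrapped in a CDATA section. The sketch below compares the three variants with Go's encoding/xml; parseScript is an illustrative helper, and whether the pholcus dynamic rule loader accepts CDATA is an assumption that would need to be verified.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

type spider struct {
	Script string `xml:"Script"`
}

// parseScript unmarshals a <Spider> document and returns the trimmed script text.
func parseScript(doc string) (string, error) {
	var s spider
	if err := xml.Unmarshal([]byte(doc), &s); err != nil {
		return "", err
	}
	return strings.TrimSpace(s.Script), nil
}

func main() {
	docs := []string{
		"<Spider><Script>1 < 2</Script></Spider>",              // bare "<": not well-formed
		"<Spider><Script><![CDATA[1 < 2]]></Script></Spider>", // CDATA section
		"<Spider><Script>1 &lt; 2</Script></Spider>",          // escaped entity
	}
	for _, doc := range docs {
		script, err := parseScript(doc)
		fmt.Printf("script=%q err=%v\n", script, err)
	}
}
```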

2015-12-06 22:26:28 latest code: the exec module is missing the import of the cache package

This package is missing and needs to be added:
"github.com/henrylee2cn/pholcus/runtime/cache"

[root@dev henrylee2cn]# go get -u github.com/henrylee2cn/pholcus

github.com/henrylee2cn/pholcus/exec

pholcus/exec/exec_linux.go:18: undefined: cache in cache.Task

Build fails on ARM

src/github.com/henrylee2cn/pholcus/app/downloader/surfer/agent/agent_linux.go:17: cannot use buf.Sysname (type [65]uint8) as type [65]int8 in argument to charsToString
src/github.com/henrylee2cn/pholcus/app/downloader/surfer/agent/agent_linux.go:27: cannot use buf.Release (type [65]uint8) as type [65]int8 in argument to charsToString

On ARM, buf.Release and buf.Sysname are returned as [65]uint8, which causes the errors above.
Would changing charsToString(ca [65]int8) to charsToString(ca [65]uint8) be appropriate?
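
Switching the parameter to [65]uint8 would fix ARM but break platforms where the fields are [65]int8, since the element type of syscall.Utsname fields differs by architecture. One possible sketch that accepts both uses a type parameter; it assumes Go 1.18 or later, which is newer than the toolchains mentioned in these issues, and it is not the fix that pholcus adopted.

```go
package main

import "fmt"

// charsToString converts a NUL-terminated, fixed-size C char array into a Go
// string. The type parameter lets the same function accept [65]int8 and [65]uint8.
func charsToString[T int8 | uint8](ca [65]T) string {
	buf := make([]byte, 0, len(ca))
	for _, c := range ca {
		if c == 0 {
			break
		}
		buf = append(buf, byte(c))
	}
	return string(buf)
}

func main() {
	var signed [65]int8
	var unsigned [65]uint8
	for i, c := range []byte("Linux") {
		signed[i] = int8(c)
		unsigned[i] = c
	}
	fmt.Println(charsToString(signed), charsToString(unsigned))
}
```

On older toolchains, the usual alternative is a pair of architecture-specific files selected by build constraints.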

Kafka Error

[E] kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.
2017/08/07 14:25:38 [E] circuit breaker is open
2017/08/07 14:25:38 [E] circuit breaker is open

Hi there, I got this error when using the kafka output.

How do you solve the problem of dynamic JavaScript files?

Hmm... As I see it, you use PhantomJS to solve this problem? But you do not recommend that we do this, so is there any solution if I only use the default Golang client?

How to use NewForm() (Error duplicate domain)

I got an error like this when using NewForm for login:
[https://thelookbookwholesale.comhttps://thelookbookwholesale.com/login.php?action=process]

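Questions about using pholcus

1. How does Pholcus crawl web pages, how is the data extracted, and how is the extracted data stored in a database?
2. A Pholcus database configuration problem: I use mysql. Following the provided demo and documentation, I tried running in standalone mode, for example a JD search for iPhone6s, but the result was nothing at all.
3. Suppose I have a simple requirement, one many people probably share: crawl a website for the information I want to filter and store it in a database. How should this be done with Pholcus? Could you provide a simple demo?
4. If Pholcus is to be improved as a framework with more people taking part, I think good documentation and clear demos matter even more.

Looking forward to your reply. A developer who loves Go!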

Compile error on ARM processors

Running go run example_main.go

app/downloader/surfer/agent/agent_linux.go:17: cannot use buf.Sysname (type [65]uint8) as type [65]int8 in argument to charsToString
app/downloader/surfer/agent/agent_linux.go:27: cannot use buf.Release (type [65]uint8) as type [65]int8 in argument to charsToString

Could you help look into why?
The environment is a Raspberry Pi with an ARM processor.

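Bug report: once the log reaches a certain number of entries, goroutines hang

After collecting images for a long time, I found that memory grew sharply and a large number of goroutines lingered. I added some logging to trace it and found that after a goroutine finished writing a file it blocked at the logging call, so the memory held by the images could not be released.
1. The code is as follows, with some logging added:

2. The log is as follows:

3. The stack trace is as follows:

4. The root cause is that in standalone mode logs are still pushed to socketlog, which fills up the log channel. A check for client mode should be added here.

The bl.steal channel is the one a client uses to send logs to the server in client mode. In standalone mode nothing reads from this channel; it is only read in client mode.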
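
The failure mode and the proposed guard can be modeled in a few lines. This is a hypothetical stand-in, not the pholcus logging code: logger, runMode, and push are illustrative names, and only the channel name steal is taken from the report above. The sketch forwards messages only in client mode and uses a non-blocking send so that a full channel can never stall the caller.

```go
package main

import "fmt"

type runMode int

const (
	standalone runMode = iota
	client
)

// logger is a hypothetical stand-in for a logger with a forwarding channel.
type logger struct {
	mode  runMode
	steal chan string // only drained by a reader in client mode
}

// push forwards msg to the steal channel only in client mode. The send is
// non-blocking: if the buffer is full the message is dropped and false is returned.
func (l *logger) push(msg string) bool {
	if l.mode != client {
		return false
	}
	select {
	case l.steal <- msg:
		return true
	default:
		return false
	}
}

func main() {
	l := &logger{mode: standalone, steal: make(chan string, 2)}
	for i := 0; i < 5; i++ {
		l.push(fmt.Sprintf("line %d", i))
	}
	fmt.Println("queued in standalone mode:", len(l.steal))

	l.mode = client
	sent := 0
	for i := 0; i < 5; i++ {
		if l.push(fmt.Sprintf("line %d", i)) {
			sent++
		}
	}
	fmt.Println("queued in client mode with no reader:", sent)
}
```

Dropping messages when the buffer is full is a design choice of this sketch; a real fix might prefer the mode check alone.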

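Could you write a tutorial on how to customize a pipeline?

bro how to store the spider result into mysql's table?

How do I store the crawled data in a specific table of a MySQL database?

The MySQL connection uses UTF8, but the table is created without specifying a charset, which causes charset errors.

mysql.go file
create function

Add DEFAULT CHARSET=utf8

If it is not specified, then on databases whose default table definition has no charset set, inserted Chinese text becomes garbled.
A small issue; please fix it.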
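
A sketch of what such a statement might look like when built in Go follows. The function createTableSQL, the id column, and the TEXT column type are assumptions made for illustration; the actual create function in mysql.go is not shown in this document.

```go
package main

import (
	"fmt"
	"strings"
)

// createTableSQL is a hypothetical helper that builds a CREATE TABLE statement
// with an explicit table charset, so text columns do not fall back to the
// server default.
func createTableSQL(table string, columns []string) string {
	defs := []string{"`id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY"}
	for _, c := range columns {
		defs = append(defs, fmt.Sprintf("`%s` TEXT", c))
	}
	return fmt.Sprintf("CREATE TABLE IF NOT EXISTS `%s` (%s) ENGINE=InnoDB DEFAULT CHARSET=utf8;",
		table, strings.Join(defs, ", "))
}

func main() {
	fmt.Println(createTableSQL("news", []string{"title", "content"}))
}
```

In MySQL, utf8 stores at most three bytes per character, so utf8mb4 may be the better choice when pages can contain emoji or other four-byte characters.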

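How to define custom rules in the web UI.

As the title says. I want to define crawler rules at the web level. Where should I start?
Fill in the rules through a form, click a button to push the information over a socket, and then collect. This feels more flexible, with no need to write code each time.

Reloadable: the check for non-repeatable downloads is insufficient

When deciding whether Reloadable permits repeated downloads, the following check is made

func (self *Matrix) Push(req *request.Request) {
	...
	// Requests that must not be downloaded repeatedly
	if !req.IsReloadable() {
		// Return if a success record already exists
		if self.hasHistory(req.Unique()) {
			return
		}
		// Add to the temporary record
		self.insertTempHistory(req.Unique())
	}
	...
}

This actually relies on func (self *Request) Unique() string to decide whether two requests are the same

// Unique identifier of the request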
func (self *Request) Unique() string {
	if self.unique == "" {
		block := md5.Sum([]byte(self.Spider + self.Rule + self.Url + self.Method))
		self.unique = hex.EncodeToString(block[:])
	}
	return self.unique
}

If a POST request carries PostData, it cannot be correctly determined whether two requests are the same

POST /somewhere

page=1&keyword=XXX

Expected result:

// Unique identifier of the request
func (self *Request) Unique() string {
	if self.unique == "" {
		block := md5.Sum([]byte(self.Spider + self.Rule + self.Url + self.Method + self.PostData))
		self.unique = hex.EncodeToString(block[:])
	}
	return self.unique
}

Adjusting this logic would significantly affect data that has already been stored.
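
The collision described in this issue can be reproduced in isolation. The sketch below mirrors the two hashing variants quoted above on a simplified request struct; the struct and the function names uniqueWithoutBody and uniqueWithBody are illustrative, not the pholcus types.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// request is a simplified stand-in holding only the fields that feed the hash.
type request struct {
	Spider, Rule, URL, Method, PostData string
}

// uniqueWithoutBody mirrors the current behavior: PostData is ignored.
func uniqueWithoutBody(r request) string {
	sum := md5.Sum([]byte(r.Spider + r.Rule + r.URL + r.Method))
	return hex.EncodeToString(sum[:])
}

// uniqueWithBody mirrors the proposed behavior: PostData is part of the key.
func uniqueWithBody(r request) string {
	sum := md5.Sum([]byte(r.Spider + r.Rule + r.URL + r.Method + r.PostData))
	return hex.EncodeToString(sum[:])
}

func main() {
	page1 := request{"demo", "list", "/somewhere", "POST", "page=1&keyword=XXX"}
	page2 := request{"demo", "list", "/somewhere", "POST", "page=2&keyword=XXX"}

	fmt.Println("same key without body:", uniqueWithoutBody(page1) == uniqueWithoutBody(page2))
	fmt.Println("same key with body:   ", uniqueWithBody(page1) == uniqueWithBody(page2))
}
```

Because the key changes for every stored POST request, previously persisted success records would no longer match, which is the compatibility concern raised above.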

compile error : cannot find package "github.com/henrylee2cn/pholcus_lib/jiban"

[root@centos pholcus]# pwd
/root/go/src/github.com/henrylee2cn/pholcus

[root@centos pholcus]# go build
../pholcus_lib/pholcus_lib.go:17:2: cannot find package "github.com/henrylee2cn/pholcus_lib/jiban" in any of:
/usr/lib/golang/src/github.com/henrylee2cn/pholcus_lib/jiban (from $GOROOT)
/root/go/src/github.com/henrylee2cn/pholcus_lib/jiban (from $GOPATH)
[root@centos pholcus]# go install
../pholcus_lib/pholcus_lib.go:17:2: cannot find package "github.com/henrylee2cn/pholcus_lib/jiban" in any of:
/usr/lib/golang/src/github.com/henrylee2cn/pholcus_lib/jiban (from $GOROOT)
/root/go/src/github.com/henrylee2cn/pholcus_lib/jiban (from $GOPATH)

I guess this thing only works on Windows?

Runtime error

./pholcus.go:44: undefined: config.MYSQL_OUTPUT in config.MYSQL_OUTPUT.HOST
./pholcus.go:44: cannot assign to config.MYSQL_OUTPUT.HOST
./pholcus.go:46: undefined: config.MYSQL_OUTPUT in config.MYSQL_OUTPUT.DB
./pholcus.go:46: cannot assign to config.MYSQL_OUTPUT.DB
./pholcus.go:48: undefined: config.MYSQL_OUTPUT in config.MYSQL_OUTPUT.USER
./pholcus.go:48: cannot assign to config.MYSQL_OUTPUT.USER
./pholcus.go:50: undefined: config.MYSQL_OUTPUT in config.MYSQL_OUTPUT.PASSWORD
./pholcus.go:50: cannot assign to config.MYSQL_OUTPUT.PASSWORD
./pholcus.go:52: undefined: config.MYSQL_OUTPUT in config.MYSQL_OUTPUT.MAX_CONNS
./pholcus.go:52: cannot assign to config.MYSQL_OUTPUT.MAX_CONNS
./pholcus.go:52: too many errors

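I commented out everything related to MGO_OUTPUT,
filled in the password,
and changed nothing else.
The pholcus database has already been created.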

go1.5.1

Suggestions for the development of pholcus v2.0?

Welcome to discuss the development of pholcus v2.0 ...

How to find child nodes whose class contains whitespace?

Since a ' ' in a selector means matching a descendant, how can I find the node in the code below?

<div class="test abc">
...
</div>

The node's class value contains a whitespace, I wonder if there should be some escape operations.
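
In HTML, whitespace inside a class attribute separates class names, so the element above carries the two classes test and abc, and standard CSS matches it with the compound selector div.test.abc without any escaping. The helper below (classSelector is a hypothetical name) builds such a selector from a raw attribute value; it assumes the selector engine in use follows standard CSS selector syntax, which this document does not confirm for pholcus.

```go
package main

import (
	"fmt"
	"strings"
)

// classSelector turns a tag name and a raw class attribute value into a
// compound CSS selector that requires every listed class on the same element.
func classSelector(tag, classAttr string) string {
	var b strings.Builder
	b.WriteString(tag)
	for _, c := range strings.Fields(classAttr) {
		b.WriteByte('.')
		b.WriteString(c)
	}
	return b.String()
}

func main() {
	fmt.Println(classSelector("div", "test abc"))
}
```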

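How to collect when there are many list pages

The problem I currently face:

The target site has a list with more than a hundred thousand pages:
Problems:
The collected list is not written to the database, so if the run breaks midway all the data is lost; collecting page by page would mean writing more than a hundred thousand list page URLs, which is not practical either.
Content crawling does not start until the list has been fully crawled.
Approaches I hope for:

One thread crawls the list and another crawls the content pages, kept separate. But there is no example of distributing tasks, that is, how to push tasks to the scheduler thread (a beginner begging for a demo, thanks).
One thread crawls the list and records it in the database; another thread reads the database, crawls the content, and marks the crawl status. I am not sure whether this is feasible.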
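
The first approach is a classic producer/consumer pipeline. The sketch below shows the pattern in plain Go with goroutines and a channel, independent of pholcus: crawlAll, listPage, and detail are hypothetical names, and the two callbacks stand in for real page downloads. Detail pages start being processed as soon as the first list page has been read.

```go
package main

import (
	"fmt"
	"sync"
)

// crawlAll walks the list pages in one goroutine and hands every discovered
// detail URL to a pool of workers immediately, so content crawling does not
// wait for the whole list to finish.
func crawlAll(pages, workers int, listPage func(int) []string, detail func(string) string) []string {
	urls := make(chan string)
	go func() {
		defer close(urls)
		for p := 1; p <= pages; p++ {
			for _, u := range listPage(p) {
				urls <- u
			}
		}
	}()

	var (
		mu      sync.Mutex
		results []string
		wg      sync.WaitGroup
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range urls {
				r := detail(u)
				mu.Lock()
				results = append(results, r)
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return results
}

func main() {
	listPage := func(p int) []string {
		return []string{fmt.Sprintf("/item/%d-a", p), fmt.Sprintf("/item/%d-b", p)}
	}
	detail := func(u string) string { return "parsed " + u }
	fmt.Println(len(crawlAll(3, 4, listPage, detail)), "items")
}
```

Writing each result to the database inside the worker, rather than collecting a slice, would also address the concern about losing data when a run is interrupted.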

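How come you are so proficient with the net package?

Senior, may I ask whether you studied networking?

Could a method like this be added to make debugging easier?

The framework is very easy to use, but when writing crawler rules every debugging round requires restarting the service.
I hope a call like the following can be provided to make debugging easier.

package main

import "github.com/henrylee2cn/pholcus"

func main() {
	PholcusSpider.Test(&request.Request{
		// Request object
		Url: "http://www.baidu.com",
		// other parameters...
	}, func(ctx *Context) {
		ctx.GetDom()
		// .......

		// Return a ctx object for the given request object, to make testing easier
		// This avoids restarting the server after every change to a method, which makes debugging tedious
		// Once debugging is OK, copy the code straight into the program, which would be much more convenient
	})
}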