Comments (4)
Another approach: give Spider its own Unique method. When Request.Unique is read, Spider.Unique takes priority; if it is not defined, fall back to md5.Sum([]byte(self.Spider + self.Rule + self.Url + self.Method))
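The precedence described above could look roughly like the sketch below. It is not pholcus code: the `Request` type is a pared-down stand-in, and the `spiderUnique` registry (spider name mapped to a key function) is a hypothetical way to reach a spider-level rule from a request that only stores the spider's name.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// Request is a simplified stand-in for pholcus's request.Request.
type Request struct {
	Spider, Rule, Url, Method string
	unique                    string // cached key
}

// spiderUnique is a hypothetical registry: spider name -> custom key function.
var spiderUnique = map[string]func(*Request) string{}

// Unique returns the cached key if present, then the spider-level key if one
// is registered, and otherwise the MD5 of Spider+Rule+Url+Method as hex.
func (r *Request) Unique() string {
	if r.unique != "" {
		return r.unique
	}
	if fn, ok := spiderUnique[r.Spider]; ok {
		r.unique = fn(r)
		return r.unique
	}
	sum := md5.Sum([]byte(r.Spider + r.Rule + r.Url + r.Method))
	r.unique = hex.EncodeToString(sum[:])
	return r.unique
}

func main() {
	// The "custom" spider ignores the URL when building its key.
	spiderUnique["custom"] = func(r *Request) string { return r.Method + " " + r.Rule }

	a := &Request{Spider: "custom", Rule: "Result", Url: "http://www.example.com/?start=42", Method: "GET"}
	b := &Request{Spider: "plain", Rule: "Result", Url: "http://www.example.com/?start=42", Method: "GET"}
	fmt.Println(a.Unique()) // spider-level key
	fmt.Println(b.Unique()) // default MD5 key, 32 hex characters
}
```

A name-keyed registry is only one option; it sidesteps the fact that the request has no pointer back to its spider.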
from pholcus.
I also prefer the second approach.
Give me some time and I will add the Spider.Unique method.
Or you are welcome to submit a PR.
from pholcus.
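Because a Request holds only the Spider name, not a *Spider, adding a Unique method to Spider would be a fairly large change. With compatibility and ease of use in mind, the simplest option is to add a SetUnique method to Request: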
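// SetUnique stores an externally computed unique identifier for the request.
// It only takes effect once: if a value is already set, it returns false.
func (self *Request) SetUnique(s string) bool {
	if self.unique == "" {
		self.unique = s
		return true
	}
	return false
}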
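Use case
// On some sites, the URL of the next page depends on the content of the previous
// page (for example, the ID of the last item becomes the start ID of the next page).
// Because the last item ID differs on every request, Reloadable cannot make an
// accurate determination, and the page cannot be cached.
url := "http://www.example.com/?abcdef...&size=10&start=LAST_ITEMID_OF_LAST_PAGE"
// Use a self-defined rule when computing the URL hash
hashUrl := "http://www.example.com/?page=1"
req := &request.Request{
	Method:     "GET",
	Url:        url,
	Rule:       "Result",
	Reloadable: page == 1,
	Temp:       map[string]interface{}{"url": url, "id": id, "page": page},
}
// Example: compute the unique hash from Method plus Url
block := md5.Sum([]byte(req.Method + hashUrl))
unique := hex.EncodeToString(block[:])
req.SetUnique(unique)
ctx.AddQueue(req)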
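To see the effect in isolation, here is a self-contained sketch under a few assumptions: `Request` is a reduced stand-in for `request.Request`, its default key is simplified to Method plus Url, and `Queue` with its in-memory `seen` set is a hypothetical substitute for the framework's own deduplication. The helper `hashKey` is also illustrative.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// Request is a reduced stand-in for request.Request.
type Request struct {
	Method, Url string
	unique      string
}

// SetUnique writes an externally computed key, but only if none is set yet.
func (r *Request) SetUnique(s string) bool {
	if r.unique == "" {
		r.unique = s
		return true
	}
	return false
}

// Unique returns the key, computing a default from Method+Url when unset.
func (r *Request) Unique() string {
	if r.unique == "" {
		r.unique = hashKey(r.Method, r.Url)
	}
	return r.unique
}

// hashKey is an illustrative helper: MD5 of method+url, hex encoded.
func hashKey(method, url string) string {
	sum := md5.Sum([]byte(method + url))
	return hex.EncodeToString(sum[:])
}

// Queue is a hypothetical dedup queue keyed by Request.Unique.
type Queue struct{ seen map[string]bool }

// Add reports whether the request was new and therefore accepted.
func (q *Queue) Add(r *Request) bool {
	k := r.Unique()
	if q.seen[k] {
		return false
	}
	q.seen[k] = true
	return true
}

func main() {
	q := &Queue{seen: map[string]bool{}}
	// The real URL changes with the last item ID, but both requests are
	// logically "page 1", so they share one stable hash URL.
	for _, lastID := range []string{"1001", "2002"} {
		req := &Request{Method: "GET", Url: "http://www.example.com/?size=10&start=" + lastID}
		req.SetUnique(hashKey("GET", "http://www.example.com/?page=1"))
		fmt.Println(q.Add(req)) // prints true, then false
	}
}
```

Without the `SetUnique` call, the two URLs would hash differently and both requests would be accepted.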
from pholcus.
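When I crawl the list page of a URL and then crawl it again later, the target site's content has been updated, but it is not crawled again.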
from pholcus.