
Comments (4)

zerozh commented on May 18, 2024

Another approach is to give Spider its own Unique method. When computing Request.Unique, prefer Spider.Unique; if it is not defined, fall back to md5.Sum([]byte(self.Spider + self.Rule + self.Url + self.Method)).
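The fallback described above could be sketched roughly as follows. This is only an illustration, not pholcus code: the `Request` struct is a simplified stand-in holding just the four fields named above, and `UniqueFunc` and `uniqueOf` are hypothetical names made up for this example.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// Request is a simplified stand-in for the crawler's request type
// (assumption: only the four fields used in the hash are modeled).
type Request struct {
	Spider, Rule, Url, Method string
}

// UniqueFunc is a hypothetical per-spider hook; nil means "not defined".
type UniqueFunc func(r *Request) string

// uniqueOf prefers the spider's hook and otherwise falls back to the
// hex-encoded MD5 of Spider + Rule + Url + Method.
func uniqueOf(r *Request, hook UniqueFunc) string {
	if hook != nil {
		return hook(r)
	}
	sum := md5.Sum([]byte(r.Spider + r.Rule + r.Url + r.Method))
	return hex.EncodeToString(sum[:])
}

func main() {
	r := &Request{Spider: "demo", Rule: "Result", Url: "http://www.example.com/?page=1", Method: "GET"}

	// No hook defined: the default MD5-based key is used.
	fmt.Println(uniqueOf(r, nil))

	// Hook defined: the spider decides what makes a request unique.
	fmt.Println(uniqueOf(r, func(r *Request) string { return r.Method + " " + r.Url }))
}
```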

from pholcus.

andeya commented on May 18, 2024

I also agree with the second approach.
Give me some time and I will add the Spider.Unique method.
Or you are welcome to submit a PR.


zerozh commented on May 18, 2024

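Because Request only holds the Spider name and not a *Spider, adding a Unique method to Spider would be a fairly large change. Taking compatibility and ease of use into account, the simplest approach is to add a SetUnique method to Request: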

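// SetUnique stores a unique identifier for the request that was computed externally.
// It only takes effect if no identifier has been set yet, and reports whether it was stored.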
func (self *Request) SetUnique(s string) bool {
	if self.unique == "" {
		self.unique = s
		return true
	}
	return false
}

Use case

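// On some sites the URL of the next page depends on the content of the previous page
// (for example, the ID of the last item is used as the start ID of the next page).
// Because that last ID differs on every crawl, Reloadable cannot judge the request accurately and it cannot be cached.
url := "http://www.example.com/?abcdef...&size=10&start=LAST_ITEMID_OF_LAST_PAGE"
// Use a self-defined rule when computing the URL hash
hashUrl := "http://www.example.com/?page=1"
req := &request.Request{
	Method:     "GET",
	Url:        url,
	Rule:       "Result",
	Reloadable: page == 1,
	Temp:       map[string]interface{}{"url": url, "id": id, "page": page},
}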

// For example, compute the unique hash from Method plus Url
block := md5.Sum([]byte(req.Method+hashUrl))
unique := hex.EncodeToString(block[:])

req.SetUnique(unique)
ctx.AddQueue(req)
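To show why a page-based key helps here, the following is a minimal, self-contained sketch. It does not use the pholcus API: `pageKey` and the `seen` set are hypothetical stand-ins for the externally computed hash and for the crawler's deduplication store.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// pageKey hashes the method plus a stable, page-based URL instead of the
// real URL, whose start ID changes from one crawl to the next.
func pageKey(method string, page int) string {
	hashUrl := fmt.Sprintf("http://www.example.com/?page=%d", page)
	sum := md5.Sum([]byte(method + hashUrl))
	return hex.EncodeToString(sum[:])
}

// seen is a toy stand-in for a crawler's set of already-queued keys.
type seen map[string]bool

// add records key and reports whether it was new.
func (s seen) add(key string) bool {
	if s[key] {
		return false
	}
	s[key] = true
	return true
}

func main() {
	s := seen{}
	fmt.Println(s.add(pageKey("GET", 2))) // true: page 2 is queued
	fmt.Println(s.add(pageKey("GET", 2))) // false: same page, even if the real URL's start ID differs
	fmt.Println(s.add(pageKey("GET", 3))) // true: a different page
}
```

With a key like this, two requests for the same logical page are treated as duplicates regardless of which item ID appears in the real URL.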


fengdou902 commented on May 18, 2024

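When I crawl the list at a given URL and then crawl it again later, the target site's content has been updated, but it is not crawled again.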

