
viciousstar / bitcointalkspider


Uses Scrapy to crawl data from www.bitcointalk.org and store it in MongoDB; the data can also be plotted with pylab.

Python 100.00%

bitcointalkspider's People

Contributors

xivid


bitcointalkspider's Issues

Switch the post crawling source to the Print Page

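Analyzing the print page shows that the content of each post (each "floor" of the thread) sits inside a div whose style is "margin: 0 5ex;", so posts = response.xpath('//div[@style="margin: 0 5ex;"]') extracts all of the content.
But how do we get the author, topic, and time of each post?
The key to solving this is that all three attributes are marked up with b tags, as in the following: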
Title: Program for automatic portfolio rebalancing?!
Post by: timgri on February 18, 2015, 10:52:34 PM

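Above each post there are three b tags, corresponding in order to the topic, the author, and the time. We only need to extract all the b tags and group them in threes to get the information for each post.
One might worry: what if the post content itself contains bold text? I ran an experiment, replying to a thread myself with all kinds of formatting, and found that the formatting does show up in the print page as well. This does not affect our extraction of the b tags, though, because any b tags that may occur in the content are inside the content div, and we only need to extract the ones outside that div.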
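
A minimal sketch of the "only take b tags outside the content div" idea, using only the standard library's html.parser rather than Scrapy. The class name HeaderBoldCollector and the SAMPLE markup are hypothetical; the sample is modeled on the Title / Post by lines quoted above, and the real print page may be laid out differently. In Scrapy, an XPath such as //b[not(ancestor::div[@style="margin: 0 5ex;"])]/text() would presumably serve the same purpose.

```python
from html.parser import HTMLParser

# Style of the per-post content div, as observed on the print page.
CONTENT_STYLE = "margin: 0 5ex;"


class HeaderBoldCollector(HTMLParser):
    """Collect the text of <b> tags that are NOT inside a post-content div."""

    def __init__(self):
        super().__init__()
        self.content_depth = 0  # > 0 while inside a content div (counts nested divs)
        self.in_bold = False    # True while inside a header <b> tag
        self.infos = []         # collected header strings, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.content_depth > 0:
                self.content_depth += 1   # a div nested inside the content
            elif dict(attrs).get("style") == CONTENT_STYLE:
                self.content_depth = 1    # entering a content div
        elif tag == "b" and self.content_depth == 0:
            self.in_bold = True
            self.infos.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.content_depth > 0:
            self.content_depth -= 1
        elif tag == "b":
            self.in_bold = False

    def handle_data(self, data):
        if self.in_bold:
            self.infos[-1] += data


# Hypothetical markup for one post; the bold word inside the div must be ignored.
SAMPLE = (
    'Title: <b>Program for automatic portfolio rebalancing?!</b><br />'
    'Post by: <b>timgri</b> on <b>February 18, 2015, 10:52:34 PM</b>'
    '<div style="margin: 0 5ex;">Is there <b>any</b> such program?</div>'
)

collector = HeaderBoldCollector()
collector.feed(SAMPLE)
print(collector.infos)
```

The bold word inside the content div is skipped, so the collected list still contains exactly three strings per post.
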
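First, n = len(posts) is the total number of posts (including the original post), and infos = response.xpath('//b/text()') holds all of the bold text. For the i-th post (counting the original post as post 0), infos[i*3], infos[i*3+1], and infos[i*3+2] give that post's topic, author, and time.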
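
The indexing above can be sketched as a small helper. This is an illustration only: group_post_info is a hypothetical name, infos is assumed to be a plain list of strings (for example the result of calling .extract() on the selector), and the second post in the sample data is made up.

```python
def group_post_info(infos, n):
    """Split a flat list of bold strings into n records of topic, user, time."""
    if len(infos) < 3 * n:
        raise ValueError(
            "expected at least %d bold strings, got %d" % (3 * n, len(infos))
        )
    return [
        {
            "topic": infos[i * 3],     # first b tag above post i
            "user": infos[i * 3 + 1],  # second b tag above post i
            "time": infos[i * 3 + 2],  # third b tag above post i
        }
        for i in range(n)
    ]


# The first three strings come from the example above; the reply is hypothetical.
infos = [
    "Program for automatic portfolio rebalancing?!",
    "timgri",
    "February 18, 2015, 10:52:34 PM",
    "Re: Program for automatic portfolio rebalancing?!",
    "example_user",
    "February 19, 2015, 08:00:00 AM",
]

records = group_post_info(infos, n=2)
print(records[0]["user"], "/", records[1]["topic"])
```

The length check is a design choice of this sketch: if the page yields fewer bold strings than expected, it fails loudly instead of silently misaligning authors and times.
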

Optimize the item models for posts

The model currently used for post items is as follows:

Thread (represents a top-level thread)

  1. topic
  2. content
  3. user
  4. time
  5. url(= response.url)
  6. ofBoard: [boardname1, boardname2, ...]

Post (represents one reply under a thread)

  1. user
  2. topic
  3. time
  4. content

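When a thread is crawled, a Thread is created first; it stores the thread's topic, the original poster's nickname (user), the creation time (time), the boards it belongs to (ofBoard), and the url, with content set to an empty list. Then, for every post in the thread (including the original post), a Post is created holding that post's author, title, time, and content, and the Post is appended to the Thread's content list.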
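
The nested layout described above can be sketched as follows. This uses standard-library dataclasses instead of Scrapy Item classes so that it runs on its own; the field names follow the lists above, while build_thread, the URL, the board name, and the reply data are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Post:
    user: str
    topic: str
    time: str
    content: str


@dataclass
class Thread:
    topic: str
    user: str
    time: str
    url: str
    ofBoard: List[str]
    content: List[Post] = field(default_factory=list)  # starts as an empty list


def build_thread(url, boards, floors):
    """Build a Thread from (user, topic, time, content) tuples, original post first."""
    first_user, first_topic, first_time, _ = floors[0]
    thread = Thread(topic=first_topic, user=first_user, time=first_time,
                    url=url, ofBoard=list(boards))
    # Every post, including the original one, becomes a Post inside content.
    for user, topic, time, content in floors:
        thread.content.append(Post(user=user, topic=topic, time=time, content=content))
    return thread


# Hypothetical input; only the first user, topic, and time come from the example above.
floors = [
    ("timgri", "Program for automatic portfolio rebalancing?!",
     "February 18, 2015, 10:52:34 PM", "Is there any such program?"),
    ("example_user", "Re: Program for automatic portfolio rebalancing?!",
     "February 19, 2015, 08:00:00 AM", "Not that I know of."),
]
thread = build_thread("https://example.invalid/topic", ["Example Board"], floors)
print(len(thread.content), thread.content[1].user)
```
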

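This design makes it hard to respond flexibly to the various queries users may issue during data analysis, such as how many topics were created in a given period, how many replies, how many posts overall (topics plus replies), or how many replies a given topic has. It also makes it hard to locate a particular reply, because a reply is really a subdocument under the content key of a Thread.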

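The Post model is now removed and only the Thread model is used. Topics and replies are treated as peers, and an added flag key indicates whether a given Thread is a topic or a reply. In MongoDB, each year-month (such as 201504) is a collection, and all topics and replies newly created in that month are documents of equal standing ("equal citizens") in that collection. This makes queries more flexible and convenient.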
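
A rough sketch of the flat layout, with an in-memory dict of lists standing in for MongoDB collections so that it runs without a database. The flag values "topic" and "reply", the helper names, and all sample data apart from the first post's time are assumptions for illustration; the issue itself only says that a flag key distinguishes the two kinds.

```python
from collections import defaultdict
from datetime import datetime


def collection_name(posted_at):
    """Name of the monthly collection, e.g. 201504 for April 2015."""
    return posted_at.strftime("%Y%m")


def make_document(flag, topic, user, posted_at, content, url, boards):
    """One flat document; flag is 'topic' or 'reply' (assumed values)."""
    return {"flag": flag, "topic": topic, "user": user, "time": posted_at,
            "content": content, "url": url, "ofBoard": list(boards)}


def store(collections, doc):
    """Append the document to the collection of the month it was posted in."""
    collections[collection_name(doc["time"])].append(doc)


def count(collections, month, flag=None):
    """Count documents in a month, optionally restricted to one flag value."""
    docs = collections.get(month, [])
    return sum(1 for d in docs if flag is None or d["flag"] == flag)


collections = defaultdict(list)  # stand-in for a MongoDB database
url = "https://example.invalid/topic"  # hypothetical
store(collections, make_document("topic", "Program for automatic portfolio rebalancing?!",
                                 "timgri", datetime(2015, 2, 18, 22, 52, 34),
                                 "Is there any such program?", url, ["Example Board"]))
store(collections, make_document("reply", "Re: Program for automatic portfolio rebalancing?!",
                                 "example_user", datetime(2015, 2, 19, 8, 0, 0),
                                 "Not that I know of.", url, ["Example Board"]))
store(collections, make_document("reply", "Re: Program for automatic portfolio rebalancing?!",
                                 "example_user", datetime(2015, 4, 1, 9, 30, 0),
                                 "Found one.", url, ["Example Board"]))

print(count(collections, "201502", "topic"),
      count(collections, "201502", "reply"),
      count(collections, "201502"),
      count(collections, "201504"))
```

With pymongo, the same idea would presumably map to db[collection_name(t)].insert_one(doc) for storage and count_documents({"flag": "reply"}) on the monthly collection for counting.
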
