anjavon-vv / articlespider

A simple search engine in Python, crawler part (computes PageRank values for related articles within a site)

Home Page: https://blog.csdn.net/sinat_41135487/article/details/106456764

Python 100.00%
scrapy python spider tgbus

articlespider's Introduction

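A simple search engine in Python, crawler part (computes PageRank values for related articles within a site)

       The crawler fetches a subset of articles from TGBUS (电玩巴士) as backend data and computes a PR (PageRank) value for each article from the related articles listed on its page. Both the crawling and the computation are kept simple, with no attention paid to complexity, so runtime is long on large data sets and needs improvement.

       The project follows the course taught by bobby on Mooc网; a personal summary and a tutorial will be written later. (A bold flag to plant.)

        Link to the search engine setup project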

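Tech stack

  • Python3
    • virtualenv, virtualenvwrapper (not required, but recommended; installation guide)
  • Crawler framework scrapy: pip install scrapy
  • Search engine backend elasticsearch:
    • jdk8+
    • elasticsearch-rtf: a community-built distribution adapted for Chinese
    • elasticsearch-head: data visualization
    • kibana: not needed to run the project, but recommended when learning ES
    • Python interface package elasticsearch_dsl_py: pip install elasticsearch-dsl
  • PageRank matrix computation with numpy: pip install numpy
  • redis: pip install redis
    • on Windows, redis-windows must be installed
    • used to record the total number of crawled pages and pass it to the search engine (not essential; the related code can simply be commented out)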

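Running

       The Python code was written in a virtual machine while ES ran on the physical host, so the connection settings were changed in several places.

             Fix: replace every occurrence of 192.168.1.106 with localhost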
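
The replacement can be done in any editor; as a minimal sketch, the script below does it in bulk. The function name `replace_host` and the choice to touch only `*.py` files are assumptions for illustration, not part of the repository.

```python
from pathlib import Path

OLD_HOST = "192.168.1.106"  # address hard-coded in the repository, per the note above
NEW_HOST = "localhost"


def replace_host(root, old=OLD_HOST, new=NEW_HOST, pattern="*.py"):
    """Rewrite files under `root` that contain `old`; return the changed paths.

    Hypothetical helper: only files matching `pattern` are read, and files
    without the old address are left untouched.
    """
    changed = []
    for path in sorted(Path(root).rglob(pattern)):
        text = path.read_text(encoding="utf-8")
        if old in text:
            path.write_text(text.replace(old, new), encoding="utf-8")
            changed.append(path)
    return changed
```

Calling `replace_host(".")` from the project root would rewrite the affected files in place, so it is worth committing or backing up first.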

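  • Run main.py to start the crawler (by default it crawls 500 pages, which takes about half an hour; the limit can be changed in tgbus.py)

  • Run pagerank.py to compute the PR values

    • Known issue: pages are scanned repeatedly; not yet fixed.
    • The program runs slowly, mainly where it writes to and queries ES, and in the algorithm that filters related content.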
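
For readers unfamiliar with the PR computation, here is a minimal sketch of PageRank by power iteration. It is not the code in pagerank.py: the project uses numpy matrices, while this version uses plain dictionaries so it runs without dependencies. The input shape (`links`, mapping each article to its related articles), the function name, and the damping factor 0.85 are all assumptions.

```python
def pagerank(links, damping=0.85, tol=1e-9, max_iter=100):
    """Return a dict page -> rank for a link graph given as page -> [targets].

    Assumed input shape: `links` maps each article to the related
    articles found on its page.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank held by pages without outgoing links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new[t] += share
        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:  # stop once the ranks no longer change
            break
    return rank


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, score in sorted(pagerank(graph).items()):
        print(page, round(score, 4))
```

Each iteration touches every link once, so the cost per iteration is proportional to the number of links rather than to the square of the number of pages, which may help with the slow runtime on larger crawls.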

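P.S. If the site exposes data such as view counts, likes, or favorites, these can serve as a basis for page weights, and the approach can be improved with other algorithms (e.g. HITS, TrustRank).

Corrections and discussion are welcome!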
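
One possible way to fold such data in, sketched under assumptions: keep the PageRank iteration but bias the random-jump (teleport) distribution toward pages with more engagement. This is neither HITS nor TrustRank itself, although TrustRank biases the jump in a similar way toward a trusted seed set. The function name, the `weights` input (for example view counts per page), and the damping factor are hypothetical.

```python
def weighted_pagerank(links, weights, damping=0.85, iterations=100):
    """PageRank whose random jump favors pages with larger `weights`.

    `links` maps page -> [targets]; `weights` maps page -> non-negative
    number such as a view count (assumed data, not collected by the project).
    """
    pages = sorted(set(links) | set(weights)
                   | {t for ts in links.values() for t in ts})
    total = sum(weights.get(p, 0.0) for p in pages)
    if total <= 0:
        raise ValueError("at least one page needs a positive weight")
    teleport = {p: weights.get(p, 0.0) / total for p in pages}
    rank = dict(teleport)
    for _ in range(iterations):
        # Rank from pages without outgoing links follows the jump distribution.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1.0 - damping + damping * dangling) * teleport[p]
               for p in pages}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new[t] += share
        rank = new
    return rank
```

With equal weights this reduces to ordinary PageRank; with unequal weights, two pages that link only to each other no longer tie, and the one with more views ranks higher.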