
hhy5277 / thesaurusspider


This project forked from wulc/thesaurusspider


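A Python crawler that downloads lexicon files for the Sogou, Baidu, and QQ input methods. It can be used to build vocabulary lists for different industries.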

License: MIT License

Python 100.00%

thesaurusspider's Introduction

Sogou, Baidu, and QQ input method lexicon crawler

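A crawler written in Python that downloads the lexicon files of the Sogou, Baidu, and QQ input methods. The contents of each folder are as follows: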

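Each input method has both a single-threaded and a multi-threaded implementation of the crawler. The multi-threaded version is much faster than the single-threaded one. A thread count of 5 to 10 is recommended, or keep the default of 5.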
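The multi-threaded approach can be sketched as a fixed pool of worker threads pulling URLs from a shared queue. This is a minimal Python 3 sketch under stated assumptions, not the repository's code: the names `fetch` and `download_all`, the `fetch_fn` parameter, and the return shape are all illustrative.

```python
import queue
import threading
import urllib.request


def fetch(url, timeout=10):
    """Download one URL and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_all(urls, num_threads=5, fetch_fn=fetch):
    """Fetch every URL with a pool of worker threads.

    Returns (results, failures): results maps url -> bytes,
    failures maps url -> the exception that was raised.
    """
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    results, failures = {}, {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # no work left, let the thread exit
            try:
                data = fetch_fn(url)
                with lock:
                    results[url] = data
            except Exception as exc:  # record the failure and keep going
                with lock:
                    failures[url] = exc

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, failures
```

Because the work is dominated by network waits, a handful of threads is usually enough to keep several downloads in flight at once, which is consistent with the 5 to 10 range suggested above.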

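The crawler is built on Python 2 standard-library modules such as urllib2, Queue, re, and threading, with no third-party dependencies. To use it, set baseDir in the main function of singleThreadDownload.py (single-threaded download) or multiThreadDownload.py (multi-threaded download) to your own download path, then run the script. Note that baseDir must not end with a trailing /.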

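If any file fails to download or any page fails to parse, a download log is created in the download root directory. It records the URLs of those files and pages to make debugging easier.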
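A failure log of this kind might look like the following sketch. The file name `download.log`, the function name `log_failure`, and the tab-separated line format are assumptions for illustration; the repository's actual log name and format are not stated here.

```python
import os
import time


def log_failure(base_dir, url, reason, log_name="download.log"):
    """Append one failed URL and the reason to a log in the download root."""
    log_path = os.path.join(base_dir, log_name)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "a", encoding="utf-8") as log:
        log.write("%s\t%s\t%s\n" % (stamp, url, reason))
    return log_path
```

Opening the file in append mode means repeated runs accumulate entries, so the log can later be used as a list of URLs to retry.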

For implementation details, see this article.

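The downloaded lexicon files are not plain text. Each input method uses its own custom binary format. For decoding the lexicon files and converting them to text, see this repository.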

Update 2017.01.13

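The page layout of the Baidu input method lexicon site has been redesigned. Download links are now obtained through JavaScript, and the site applies some anti-crawler measures (it returns 500 and 502 errors). Status codes 500 and 502 normally indicate server-side errors (Internal Server Error and Bad Gateway), but some sites also return them to deter crawlers, and the Baidu lexicon site is one of them.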

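Solution:

1. Although the download link is obtained through JavaScript, inspecting the Request URL in the HTTP request headers sent when the download link is clicked shows that the actual download link is still a static URL: https://shurufa.baidu.com/dict_innerid_download?innerid=. The value after innerid= is the ID of the lexicon file, which can be found in the page.

2. The 500 and 502 responses are handled by re-sending the request. The Baidu lexicon site returns a 200 after returning a 500 or 502, so the server is not actually failing. The error codes appear to be returned with some probability as an anti-crawler measure.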
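The two steps above can be sketched as follows: build the static download URL from a lexicon ID, then retry when the server answers 500 or 502. This is a minimal Python 3 sketch, not the repository's implementation; the names `build_download_url` and `fetch_with_retry`, the retry count, the delay, and the `opener` parameter are assumptions.

```python
import time
import urllib.error
import urllib.request

# Static download endpoint quoted in the text above.
DOWNLOAD_URL = "https://shurufa.baidu.com/dict_innerid_download?innerid="


def build_download_url(inner_id):
    """Return the static download link for one lexicon ID."""
    return DOWNLOAD_URL + str(inner_id)


def fetch_with_retry(url, max_retries=5, delay=1.0,
                     opener=urllib.request.urlopen):
    """Fetch a URL, retrying when the server answers 500 or 502."""
    for attempt in range(1, max_retries + 1):
        try:
            with opener(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # Only 500/502 are treated as retryable; anything else is
            # re-raised, as is the error from the final attempt.
            if err.code not in (500, 502) or attempt == max_retries:
                raise
            time.sleep(delay)
```

Capping the number of retries keeps a genuinely failing URL from looping forever; such a URL can then be written to the download log instead.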

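Note: because the Baidu input method site applies anti-crawler measures, the request user-agent is no longer fixed, which reduces the chance of 502 and 500 errors. It is generated with the third-party library user-agent, which must be installed before use with easy_install user-agent or pip install user-agent.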
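Attaching a freshly generated User-Agent to each request might look like the sketch below. It assumes the package is importable as `user_agent` and exposes `generate_user_agent()`; the helper `build_request` and the fallback list of agent strings are hypothetical and exist only so the example runs without the package installed.

```python
import random
import urllib.request

try:
    # Third-party package mentioned above: pip install user-agent
    from user_agent import generate_user_agent
except ImportError:
    # Hypothetical fallback so the sketch still runs without the package.
    _FALLBACK_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) "
        "Gecko/20100101 Firefox/115.0",
        "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) "
        "Gecko/20100101 Firefox/115.0",
    ]

    def generate_user_agent():
        return random.choice(_FALLBACK_AGENTS)


def build_request(url):
    """Create a request carrying a freshly generated User-Agent header."""
    headers = {"User-Agent": generate_user_agent()}
    return urllib.request.Request(url, headers=headers)
```

The returned object can be passed to `urllib.request.urlopen` in place of a plain URL string.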

thesaurusspider's People

Contributors

efeiefei, wulc
