Crawler for the mobile site of Sina Weibo, a popular Chinese social media platform. It is often used to build unstructured and image datasets.


weibo-spider

Introduction

This is a crawler for the mobile site of Sina Weibo, one of the most popular social media platforms in mainland China. We clean and organize the crawled data, which can then be used to generate word-cloud figures.
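As a rough illustration of the cleaning step that precedes a word cloud, the sketch below strips URLs and @mentions from post text and counts the remaining tokens. The function names and regular expressions are hypothetical and not taken from this repository. It also assumes the text is already split into whitespace-separated tokens; Chinese text would need a word segmenter first.

```python
import re
from collections import Counter

# Hypothetical cleaning rules: drop links and @mentions.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\S+")


def clean(text):
    """Remove URLs and @mentions, then collapse runs of whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return " ".join(text.split())


def word_frequencies(posts, stopwords=frozenset()):
    """Count lowercase tokens across all posts, skipping stopwords."""
    counts = Counter()
    for post in posts:
        for token in clean(post).lower().split():
            if token not in stopwords:
                counts[token] += 1
    return counts


if __name__ == "__main__":
    sample = ["Hello world http://t.cn/abc", "hello @user again"]
    print(word_frequencies(sample).most_common(3))
```

A frequency mapping like this can be passed to the `wordcloud` package's `WordCloud.generate_from_frequencies` to render the figure.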

Code Structure

Running scrapy startproject [yourproject] creates a Scrapy project.

scrapy.cfg is the configuration file for the project.

settings.py sets the request parameters, the proxy, and how the crawled data is saved to file.
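For orientation, a settings file along these lines might look like the sketch below. The setting names in upper case are standard Scrapy settings except `PROXY_LIST`, which is a custom name invented here. The project name, module paths, class names, and all values are assumptions rather than the repository's actual configuration.

```python
# settings.py (sketch; project name and module paths are hypothetical)

BOT_NAME = "weibo_spider"

# Request parameters
DOWNLOAD_DELAY = 2          # seconds to wait between requests to the same site
CONCURRENT_REQUESTS = 8
COOKIES_ENABLED = True
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
}

# Custom setting, not built into Scrapy: a proxy pool for a middleware to read.
PROXY_LIST = ["http://127.0.0.1:8080"]

# Enable the project's middleware and pipeline (class names are assumptions).
DOWNLOADER_MIDDLEWARES = {
    "weibo_spider.middlewares.RotationMiddleware": 543,
}
ITEM_PIPELINES = {
    "weibo_spider.pipelines.MongoPipeline": 300,
}

# Save crawled items to a JSON file through Scrapy's feed exports.
FEEDS = {
    "output/tweets.json": {"format": "json", "encoding": "utf8"},
}
```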

/spider/sinaSpider.py contains the main crawler code.

middlewares.py holds the middleware that processes Scrapy's requests. It mainly rotates User-Agent strings, cookies, and proxies.
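The rotation described above could be sketched as a downloader middleware like the one below. This is a minimal illustration rather than the repository's code: the class name, the three pools, and their placeholder values are all hypothetical. It relies only on Scrapy's convention that a downloader middleware exposes `process_request(request, spider)` and that a proxy is selected through `request.meta["proxy"]`.

```python
import random

# Hypothetical pools; a real project would load its own values.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 12_0 like Mac OS X) AppleWebKit/605.1.15",
]
COOKIES = [{"SUB": "placeholder-a"}, {"SUB": "placeholder-b"}]
PROXIES = ["http://127.0.0.1:8080", "http://127.0.0.1:8081"]


class RotationMiddleware:
    """Pick a random User-Agent, cookie set, and proxy for every request."""

    def __init__(self, user_agents=None, cookies=None, proxies=None):
        self.user_agents = user_agents or USER_AGENTS
        self.cookies = cookies or COOKIES
        self.proxies = proxies or PROXIES

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)
        request.cookies = random.choice(self.cookies)
        request.meta["proxy"] = random.choice(self.proxies)
        return None  # returning None lets Scrapy continue processing the request
```

To take effect, a class like this would need to be listed in `DOWNLOADER_MIDDLEWARES` with a priority below that of Scrapy's built-in cookies middleware, so that the chosen cookies are set before that middleware reads them.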

items.py defines the data structures to be extracted.
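To make the idea of an item definition concrete, here is a minimal sketch that uses a dataclass, which Scrapy accepts as an item type since version 2.2. The class name and every field are hypothetical. The repository may well define its items with `scrapy.Item` and `scrapy.Field` and use different fields.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class TweetItem:
    """One crawled post; all field names are illustrative assumptions."""

    tweet_id: str
    user_id: str
    content: str
    image_urls: List[str] = field(default_factory=list)
    posted_at: Optional[str] = None


if __name__ == "__main__":
    item = TweetItem(tweet_id="1", user_id="42", content="hello")
    print(asdict(item))
```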

pipelines.py further processes the extracted items; the connection to MongoDB is also set up here.
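A pipeline of this kind could look roughly like the sketch below. It is not the repository's implementation: the class name, database and collection names, connection URI, and the `tweet_id` and `content` fields are assumptions. It assumes dict-like items and imports `pymongo` lazily, so a collection object can be injected instead of opening a real connection.

```python
class MongoPipeline:
    """Tidy each item and upsert it into a MongoDB collection."""

    def __init__(self, uri="mongodb://localhost:27017", db_name="weibo",
                 collection=None):
        self.uri = uri
        self.db_name = db_name
        self.collection = collection  # may be injected, e.g. for testing
        self.client = None

    def open_spider(self, spider):
        # Connect only if no collection was injected.
        if self.collection is None:
            import pymongo  # third-party dependency, imported lazily
            self.client = pymongo.MongoClient(self.uri)
            self.collection = self.client[self.db_name]["tweets"]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        doc = dict(item)  # copy, so the original item is left untouched
        doc["content"] = doc.get("content", "").strip()
        # Upsert keyed on the post id, so re-crawled posts do not duplicate.
        self.collection.update_one(
            {"_id": doc["tweet_id"]}, {"$set": doc}, upsert=True
        )
        return item
```

Keying the upsert on a post identifier is one design choice among several; inserting every item as a new document would be simpler but would store duplicates when the same post is crawled twice.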

Libraries

Scrapy is an application framework for crawling websites and extracting structured data. It provides the basic crawler components out of the box and can be customized extensively.

Selenium is a tool for testing web applications. Selenium tests run directly in the browser, just as real users do. We use Selenium mainly to simulate a user logging in to Weibo and to collect the resulting cookies.
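Once a Selenium-driven browser has logged in, its cookies have to be handed to the crawler. The helper below is a small sketch of that hand-off, with a hypothetical function name. It relies on Selenium's `driver.get_cookies()`, which returns a list of dictionaries containing `name` and `value` keys, and flattens the list into the plain name-to-value mapping that a Scrapy request accepts. The login steps themselves are omitted because they depend on the login page.

```python
def cookies_for_scrapy(driver):
    """Convert Selenium's cookie list into a {name: value} dict."""
    return {cookie["name"]: cookie["value"] for cookie in driver.get_cookies()}


if __name__ == "__main__":
    class FakeDriver:
        """Stand-in for a logged-in Selenium WebDriver."""

        def get_cookies(self):
            return [
                {"name": "SUB", "value": "abc", "domain": ".weibo.cn"},
                {"name": "SSOLoginState", "value": "123", "domain": ".weibo.cn"},
            ]

    print(cookies_for_scrapy(FakeDriver()))
```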

PhantomJS is a headless, scriptable WebKit browser. It natively supports several web standards: DOM manipulation, CSS selectors, JSON, Canvas, etc.

Reference

web_scraping_with_python
