#Afscraper

##Presentation

Afscraper is a small Node.js tool designed to crawl aufeminin's forums in order to gather relevant information for social science research. It browses every thread of a given forum and searches it for a set of keywords. Threads that match the keywords are then archived, either in a MongoDB database or as JSON files.

##Installation

To install Afscraper, simply clone it on your computer and install its dependencies.

    git clone https://github.com/Yomguithereal/afscrap.git
    cd afscrap
    sudo npm install

##Workflow

Afscraper's workflow is divided into two or three steps:

###1. Fetch the thread links from a particular forum

The first step is to choose a forum to scrape and get the URL of its summary page. The tool then walks through the forum and collects the URLs of its threads.

The crawler starts from the current date and stops once it has covered a full year.

You can also limit the search to threads with at least a given number of posts.

The result is a JSON file that you will have to pass to step 2 to begin the real crawl.

Example:

    node afscrap_forum -u/--url [url-of-forum-summary]

    options : -m/--minimum [minimum-of-posts]
              -o/--output [output-directory], default: 'forum_lists/'
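
The one-year window and the minimum-posts filter of this step can be pictured with the sketch below. It is only an illustration under stated assumptions, not Afscraper's actual code: the thread objects with `url`, `date` and `posts` fields and the `selectThreads` helper are hypothetical, and the real format of the JSON list is not described in this document.

```javascript
// Hypothetical thread shape: { url: string, date: Date, posts: number }.
// Neither the field names nor selectThreads come from Afscraper itself.
function selectThreads(threads, options) {
  const now = options.now || new Date();
  const minimum = options.minimum || 0;

  // Threads older than one year before `now` fall outside the crawl window.
  const limit = new Date(now.getTime());
  limit.setFullYear(limit.getFullYear() - 1);

  return threads
    .filter((thread) => thread.date >= limit && thread.date <= now)
    .filter((thread) => thread.posts >= minimum)
    .map((thread) => thread.url);
}

const sample = [
  { url: 'http://example.org/thread-a', date: new Date('2013-06-01'), posts: 12 },
  { url: 'http://example.org/thread-b', date: new Date('2013-05-01'), posts: 2 },
  { url: 'http://example.org/thread-c', date: new Date('2011-01-01'), posts: 40 }
];

// Only thread-a is recent enough and has enough posts.
const kept = selectThreads(sample, { now: new Date('2013-07-01'), minimum: 5 });
console.log(JSON.stringify(kept, null, 2));
```

Passing `now` explicitly keeps the example deterministic; a real crawl would simply use the current date.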

###2. Fetch the relevant threads

The second step is the actual crawl: the threads are fetched and the relevant ones are kept.

The tool must be fed with a JSON file created by step 1 and a configuration file containing the keywords you want to match.

The results can be written either to a local MongoDB database or to JSON files.

You can use more than one process to go faster, but be advised that you will reach aufeminin's request limits very quickly. Only increase the number of processes for small thread lists. Note also that the URL fetcher pauses for 3 to 5 seconds between URLs to avoid being kicked by aufeminin.
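
The pause between requests can be sketched as follows. This is an illustration rather than the tool's own fetcher: `randomDelay`, `sleep`, `fetchSequentially` and the `fetchOne` callback are hypothetical names, and the delay bounds are parameters so that the demo runs quickly.

```javascript
// Pick a random whole number of milliseconds between minMs and maxMs (both inclusive).
function randomDelay(minMs, maxMs) {
  return minMs + Math.floor(Math.random() * (maxMs - minMs + 1));
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Fetch urls one at a time, pausing between consecutive requests.
// fetchOne is a caller-supplied function returning a promise (hypothetical).
async function fetchSequentially(urls, fetchOne, minMs, maxMs) {
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    results.push(await fetchOne(urls[i]));
    if (i < urls.length - 1) {
      await sleep(randomDelay(minMs, maxMs));
    }
  }
  return results;
}

// Demo with a fake fetcher and a short pause. The 3 to 5 second pause
// described above would correspond to bounds of 3000 and 5000.
const fakeFetch = (url) => Promise.resolve('body of ' + url);
fetchSequentially(['a', 'b'], fakeFetch, 10, 20).then((bodies) => {
  console.log(bodies);
});
```

Randomizing the pause rather than using a fixed interval makes the request pattern less regular, which is one plausible reason for a 3 to 5 second range.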

Fetching can be interrupted at any time, since all results are cached. It is therefore possible to collect your results over several runs.
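
Resuming an interrupted crawl amounts to skipping whatever is already cached. The sketch below shows the idea only: how Afscraper actually stores its cache is not documented here, so the cache is modelled as a plain object keyed by URL and `remainingUrls` is a hypothetical helper.

```javascript
// Hypothetical cache: an object mapping each fetched URL to its result.
function remainingUrls(allUrls, cache) {
  return allUrls.filter(
    (url) => !Object.prototype.hasOwnProperty.call(cache, url)
  );
}

const list = ['thread-1', 'thread-2', 'thread-3'];

// First run: interrupted after one thread, the cache is serialized.
const saved = JSON.stringify({ 'thread-1': { matched: true } });

// Second run: reload the cache and only fetch what is missing.
const cache = JSON.parse(saved);
const todo = remainingUrls(list, cache);
console.log(todo);
```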

Example:

    node afscrap_thread -l/--list [path-to-thread-list] -k/--keywords [path-to-keywords]

    options : -f/--format [output-format], either json or mongo, default: mongo
              -o/--output [output-directory], default: 'output/' (if json)
              -p/--processes [nb-of-processes], default: 1

###3. (Optional) Compile results to text

If you decided to store your results in a mongo database, you may want to obtain only textual information from the threads you gathered.

The third tool will therefore parse a result database and create text files from it.

Example:

    node afscrap_compile -d/--database [name-of-database]

    options : -o/--output [output-directory], default: name of your database
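
Turning one stored thread into plain text might look like the sketch below. The document shape used here (`title`, `posts`, `author`, `text`) is purely an assumption for illustration, as is the `threadToText` helper. Afscraper's real schema and output layout may differ.

```javascript
// Hypothetical document shape, not Afscraper's documented schema:
// { title: string, posts: [{ author: string, text: string }] }
function threadToText(thread) {
  const lines = [thread.title, ''];
  thread.posts.forEach((post) => {
    lines.push(post.author + ':');
    lines.push(post.text.trim());
    lines.push('');
  });
  return lines.join('\n');
}

const text = threadToText({
  title: 'Sample thread',
  posts: [
    { author: 'alice', text: ' First message. ' },
    { author: 'bob', text: 'A reply.' }
  ]
});
console.log(text);
```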

##Keywords

To register your keywords, you must create a JSON file in the following format:

    [
    	"keyword-1"
    	,"keyword-2"
    	,"keyword-n"
    ]
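
As a rough idea of how such a list could be applied to a thread, here is a minimal sketch. The matching rule (case-insensitive substring search) and the `matchedKeywords` helper are assumptions. Afscraper's own matching may handle accents or word boundaries differently.

```javascript
// The keywords file is a JSON array of strings, as in the format above.
const keywords = JSON.parse('["keyword-1", "keyword-2", "keyword-n"]');

// Hypothetical matcher: returns the keywords found in the text,
// ignoring case. This is not taken from Afscraper's source.
function matchedKeywords(text, keywordList) {
  const haystack = text.toLowerCase();
  return keywordList.filter((keyword) =>
    haystack.includes(keyword.toLowerCase())
  );
}

const found = matchedKeywords('A post mentioning Keyword-2 once.', keywords);
console.log(found);
```

Under this rule, a thread would be kept as soon as the returned array is non-empty.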

##Proxy Configuration

If you want to configure a proxy, refer to config/config.example.json: specify your proxy there, then drop the .example part of the file name.

##Dependencies

* node (>= 0.10.9)
* request
* cheerio
* colors
* commander
* mongoose
* async
* moment
