This project forked from lansehaixing/ache

License: GNU General Public License v2.0

ACHE focused crawler

Introduction

ACHE is an implementation of a focused crawler. A focused crawler is a web crawler that collects web pages satisfying a specific property, by carefully prioritizing the crawl frontier and managing the hyperlink exploration process [1].

Installation

Download with Conda

You can download ache from Binstar [2] with Conda [3] by running:

  conda install -c memex ache

Build from source with Gradle

To build ache from source, you can run the following commands in your terminal:

  git clone https://github.com/chdoig/ache.git
  cd ache
  ./gradlew clean installApp

which will generate an installation package under build/install/ (relative to the project directory).

Alternatively, you can build a zip archive:

  git clone https://github.com/chdoig/ache.git
  cd ache
  ./gradlew clean distZip

which will generate a zip file of your project under build/distributions/ (relative to the project directory).

Learn more about Gradle: http://www.gradle.org/documentation.

Running

Building a model

To run the ache crawler, you'll first need to build a model.

ache buildModel <target storage config path> <training data path> <output path>

<target storage config path> is the path to the configuration of the target storage.

<training data path> is the path to the directory containing positive and negative examples.

<output path> is the new directory where you want to save the generated model files: pageclassifier.model and pageclassifier.features.

For example:

ache buildModel conf/sample_crawl/target_storage.cfg training_data models/sample_model/
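Before building a model, it can help to confirm that the training data directory actually contains both kinds of examples. The following is a minimal sketch, not part of ache itself: the function name check_training_data is hypothetical, and the subdirectory names positive and negative are an assumption, since this README only says the directory holds positive and negative examples.

```shell
# Hypothetical helper, not part of ache.
# Assumption: examples live in "positive" and "negative" subdirectories
# of the training data directory (this README does not name them).
check_training_data() {
  local dir
  local label
  dir="$1"
  for label in positive negative; do
    if [ ! -d "$dir/$label" ]; then
      echo "missing directory: $dir/$label" >&2
      return 1
    fi
    # ls -A prints nothing when the directory is empty
    if [ -z "$(ls -A "$dir/$label")" ]; then
      echo "no examples in: $dir/$label" >&2
      return 1
    fi
  done
  echo "ok: $dir"
}

# Possible usage with the example above:
#   check_training_data training_data && \
#     ache buildModel conf/sample_crawl/target_storage.cfg training_data models/sample_model/
```

The check only looks at the directory layout; it does not validate the contents of the example files.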

Running a crawl

To start a crawl, run:

ache startCrawl <data output path> <config path> <seed path> <model path> <lang detect profile path>

<data output path> is the path to the directory where you want to store the crawl output.

<config path> is the path to the config directory.

<seed path> is the path to the seed list file.

<model path> is the path to the model directory (containing pageclassifier.model and pageclassifier.features).

<lang detect profile path> is the path to the language detection profile. Note: we are currently refactoring the code, and the profile will be available under resources in the near future. For now, you can download it here: https://code.google.com/p/language-detection/wiki/Downloads.

For example:

ache startCrawl sample_crawl conf/sample_config seeds/sample_crawl.seeds models/sample_model/ libs/profiles/
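Since startCrawl takes five positional paths, a small pre-flight check can catch a missing file before the crawl starts. The sketch below is hypothetical and not part of ache: the function name check_crawl_inputs is made up for illustration, and it only tests what the argument descriptions above state (a config directory, a seed list file, a model directory containing pageclassifier.model and pageclassifier.features, and a language detection profile path, assumed here to be a directory).

```shell
# Hypothetical pre-flight check, not part of ache.
# Arguments: <config path> <seed path> <model path> <lang detect profile path>
check_crawl_inputs() {
  local config_dir seed_file model_dir profile_dir f
  config_dir="$1"
  seed_file="$2"
  model_dir="$3"
  profile_dir="$4"

  [ -d "$config_dir" ] || { echo "config directory not found: $config_dir" >&2; return 1; }
  [ -f "$seed_file" ] || { echo "seed file not found: $seed_file" >&2; return 1; }

  # The model directory must hold both files produced by buildModel.
  for f in pageclassifier.model pageclassifier.features; do
    [ -f "$model_dir/$f" ] || { echo "model file not found: $model_dir/$f" >&2; return 1; }
  done

  # Assumption: the language detection profile path is a directory.
  [ -d "$profile_dir" ] || { echo "profile directory not found: $profile_dir" >&2; return 1; }

  echo "inputs look complete"
}

# Possible usage with the example above:
#   check_crawl_inputs conf/sample_config seeds/sample_crawl.seeds models/sample_model/ libs/profiles/ && \
#     ache startCrawl sample_crawl conf/sample_config seeds/sample_crawl.seeds models/sample_model/ libs/profiles/
```

The data output path is left unchecked on purpose, because this README does not say whether it must exist before the crawl starts.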

Requirements

To use ache, you'll need the following:

  • JDK 1.6+
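To check the JDK requirement from a script, one option is to compare the version string reported by java -version against 1.6. This is a rough sketch under stated assumptions: the function name jdk_version_ok is hypothetical, and it assumes version strings of the form 1.N.x (older JDKs, such as 1.6.0_45) or N.x.y (newer JDKs, such as 11.0.2).

```shell
# Hypothetical helper, not part of ache.
# Returns success when the given version string is 1.6 or newer.
jdk_version_ok() {
  local version major minor rest
  version="$1"
  major="${version%%.*}"
  rest="${version#*.}"
  if [ "$rest" = "$version" ]; then
    minor=0                      # no dot at all, e.g. "9"
  else
    minor="${rest%%.*}"
  fi
  # Drop any non-numeric suffix such as "-ea" or "_45".
  major="${major%%[!0-9]*}"
  minor="${minor%%[!0-9]*}"
  [ -n "$major" ] || return 1
  if [ "$major" -gt 1 ]; then
    return 0                     # new-style versions (9, 11, 17, ...)
  fi
  [ "$major" -eq 1 ] && [ "${minor:-0}" -ge 6 ]
}

# Possible usage (java -version writes to stderr):
#   version="$(java -version 2>&1 | sed -n 's/.* version "\([^"]*\)".*/\1/p' | head -n 1)"
#   jdk_version_ok "$version" || echo "ache needs JDK 1.6 or newer" >&2
```

The function only compares version numbers; it does not tell a full JDK apart from a JRE.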

Contributors

chdoig, aterrel
