This project forked from teehanlax/node-carrot2

Carrot2 Document Clustering Server implementation for Node.js

License: MIT License

node-carrot2 - Carrot2 DCS implementation for Node.js

This library requires the Carrot2 Document Clustering Server (DCS), an open source clustering engine available at http://project.carrot2.org/index.html. Installation and configuration instructions can be found at http://project.carrot2.org/documentation.html. Carrot2 was originally designed for clustering search results from web queries, and thus uses a "search result" metaphor (which we've upheld), but it can also be used for any small collection of documents (up to a few thousand).

Install the package:

npm install carrot2

Basic Use

The basic use of node-carrot2 involves providing a set of documents to the cluster server and receiving a SearchResult object through a callback. For a complete example, refer to examples/basic.js.

Step 1: Include the package

var carrot2 = require('carrot2');

Step 2: Create an instance of the DCS interface

DocumentClusteringServer can accept an optional parameter object with host and port properties.

var dcs = new carrot2.DocumentClusteringServer(params);
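The params object is not defined in the snippet above. A minimal sketch of building it, assuming a DCS running locally on port 8080; the helper name buildDcsParams and both default values are assumptions for illustration, not part of node-carrot2:

```javascript
// Hypothetical helper: fills in host/port for the DCS constructor.
// The defaults below are assumptions; use whatever your DCS instance listens on.
function buildDcsParams(overrides) {
    var defaults = { host: 'localhost', port: 8080 };
    var params = {};
    Object.keys(defaults).forEach(function (key) {
        params[key] = defaults[key];
    });
    Object.keys(overrides || {}).forEach(function (key) {
        params[key] = overrides[key];
    });
    return params;
}

var params = buildDcsParams({ port: 9090 });
// With the real package installed:
// var dcs = new carrot2.DocumentClusteringServer(params);
```

Keeping the defaults in one place makes it easy to point the same code at a different DCS instance per environment.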

Step 3: Create a SearchResult object and populate it with documents

Each document contains an id, title, url, snippet, and optional custom parameters:

var sr = new carrot2.SearchResult();
sr.addDocument("ID", "Title", "http://www.site.com/", "This is a snippet.", {my_key1: "my_value1", my_key2: "my_value2"});
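When the documents already live in an array, the addDocument call can be wrapped in a loop. The sketch below is only an illustration: addAllDocuments and the record field names (id, title, url, snippet, custom) are hypothetical, and the fallback to the array index as an id is an assumption about what the library accepts.

```javascript
// Hypothetical helper: copies an array of plain records into a SearchResult.
// `sr` only needs the addDocument(id, title, url, snippet, custom) method shown above.
function addAllDocuments(sr, records) {
    records.forEach(function (rec, i) {
        // Assumption: fall back to the array index when a record has no id of its own.
        var id = rec.id !== undefined ? rec.id : i;
        sr.addDocument(id, rec.title, rec.url, rec.snippet, rec.custom);
    });
    return sr;
}

// Usage with the real package:
// var sr = addAllDocuments(new carrot2.SearchResult(), [
//     { id: 'a1', title: 'Title', url: 'http://www.site.com/', snippet: 'This is a snippet.' }
// ]);
```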

Step 4: Call the cluster method

dcs.cluster(sr, {algorithm:'lingo'}, [ 
        {key:"LingoClusteringAlgorithm.desiredClusterCountBase", value:10},
        {key:"LingoClusteringAlgorithm.phraseLabelBoost", value:1.0}
], function(err, sr) {
    if (err) return console.log(err);
    var clusters = sr.clusters;
});

For a complete list of customizable Carrot2 attributes, refer to the Component documentation: http://download.carrot2.org/head/manual/index.html#chapter.components.
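The attribute list passed to cluster is an array of {key, value} pairs. If it is more convenient to keep attributes in a plain object, a small conversion like the following may help; toAttributeList is a hypothetical name, not part of node-carrot2:

```javascript
// Hypothetical helper: turns { "Attr.key": value } into [{ key: ..., value: ... }],
// the shape used for the attribute argument in the cluster call above.
function toAttributeList(attrs) {
    return Object.keys(attrs).map(function (key) {
        return { key: key, value: attrs[key] };
    });
}

var attributes = toAttributeList({
    'LingoClusteringAlgorithm.desiredClusterCountBase': 10,
    'LingoClusteringAlgorithm.phraseLabelBoost': 1.0
});
// With the real package:
// dcs.cluster(sr, { algorithm: 'lingo' }, attributes, callback);
```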

NOTE: Currently the DCS parameters object supports algorithm, ids (the set of document IDs to use; defaults to all), and max (the maximum number of documents to supply). Possible algorithms are:

  • lingo — Lingo Clustering (default)
  • stc — Suffix Tree Clustering
  • kmeans — Bisecting k-means
  • url — By URL Clustering
  • source — By Source Clustering
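A sketch of assembling the DCS parameters object with a check against the algorithm names listed above. buildClusterParams is a hypothetical helper; the values of ids and max are passed through unchanged because their exact expected format is not specified here.

```javascript
// Algorithm names copied from the list above.
var ALGORITHMS = ['lingo', 'stc', 'kmeans', 'url', 'source'];

// Hypothetical helper: builds the DCS parameters object and rejects unknown algorithms.
function buildClusterParams(options) {
    options = options || {};
    var algorithm = options.algorithm || 'lingo'; // lingo is the documented default
    if (ALGORITHMS.indexOf(algorithm) === -1) {
        throw new Error('Unknown algorithm: ' + algorithm);
    }
    var params = { algorithm: algorithm };
    if (options.ids !== undefined) params.ids = options.ids; // subset of document IDs
    if (options.max !== undefined) params.max = options.max; // cap on documents supplied
    return params;
}

// Usage with the real package:
// dcs.cluster(sr, buildClusterParams({ algorithm: 'stc', max: 50 }), [], callback);
```

Failing early on a misspelled algorithm name avoids a round trip to the server for a request that cannot succeed.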

External Use

Alternatively, you can cluster results from an external search engine by supplying a query string instead of a SearchResult to the cluster method. For a complete example, refer to examples/external.js.

dcs.cluster('my query', {algorithm:'lingo', source:"bing-web"}, [
        {key:"LingoClusteringAlgorithm.desiredClusterCountBase", value:10},
        {key:"LingoClusteringAlgorithm.phraseLabelBoost", value:1.0}
], function(err, sr) {
    if (err) return console.log(err);
    var clusters = sr.clusters;
});

NOTE: For external sources, the DCS parameters object also supports source (the search engine to use) and results (the number of search results to fetch from the source). Possible external sources include:

  • etools — eTools Metasearch Engine
  • bing-web — Bing Search
  • boss-web — Yahoo Web Search
  • wiki — Wikipedia Search (with Yahoo Boss)
  • boss-images — Yahoo Image Search
  • boss-news — Yahoo Boss News Search
  • pubmed — PubMed medical database
  • indeed — Jobs from indeed.com
  • xml — XML
  • google-desktop — Google Desktop search
  • solr — Solr Search Engine
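Both call forms (a SearchResult or a query string) use a Node-style callback. For code that prefers Promises, the call can be wrapped as sketched below. clusterAsync is a hypothetical wrapper, and passing an empty array when no attributes are given is an assumption about what the library accepts.

```javascript
// Hypothetical wrapper: adapts dcs.cluster(input, params, attributes, callback)
// to a Promise. `input` is either a SearchResult or a query string.
function clusterAsync(dcs, input, params, attributes) {
    return new Promise(function (resolve, reject) {
        // Assumption: an empty attribute list is acceptable when none are given.
        dcs.cluster(input, params, attributes || [], function (err, sr) {
            if (err) return reject(err);
            resolve(sr);
        });
    });
}

// Usage with the real package:
// clusterAsync(dcs, 'my query', { algorithm: 'lingo', source: 'bing-web', results: 50 })
//     .then(function (sr) { console.log(sr.clusters.length); })
//     .catch(function (err) { console.log(err); });
```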

Results

A SearchResult object returned in a cluster callback looks like:

{ query: 'seattle',
  cap: 100,
  id_increment: 0,
  documents: [ ... ],
  documentHash: { ... },
  idHash: {},
  clusters:
   [ { id: '[\'Washington\']',
       size: 13,
       score: 39.551955526331575,
       phrases: [ 'Washington' ],
       documents:
        [ { id: 1 },
          { id: 4 },
          { id: 25 },
          { id: 26 },
          { id: 36 },
          { id: 39 },
          { id: 45 },
          { id: 47 },
          { id: 64 },
          { id: 71 },
          { id: 73 },
          { id: 75 },
          { id: 95 } ],
       attributes: { score: 39.551955526331575 } },

     ...

   ],
  clusterHash:
   { '[\'Washington\']':
      { id: '[\'Washington\']',
        size: 13,
        score: 39.551955526331575,
        phrases: [ 'Washington' ],
        documents:
         [ { id: 1 },
           { id: 4 },
           { id: 25 },
           { id: 26 },
           { id: 36 },
           { id: 39 },
           { id: 45 },
           { id: 47 },
           { id: 64 },
           { id: 71 },
           { id: 73 },
           { id: 75 },
           { id: 95 } ],
        attributes: { score: 39.551955526331575 } },

     ...

   }
}

For detailed documentation on the Carrot2 JSON output, see http://download.carrot2.org/head/manual/index.html#section.architecture.output-json.
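Each cluster lists its members only as { id } references, so turning a result into something readable usually means joining those ids back to the documents array. The sketch below shows one way to do that. summarizeClusters is a hypothetical helper, and it assumes every entry in sr.documents carries an id and a title property (the contents of documents are elided in the output above).

```javascript
// Hypothetical helper: one summary row per cluster, highest score first.
// Assumes every entry in sr.documents has the `id` and `title` it was added with.
function summarizeClusters(sr) {
    var byId = {};
    sr.documents.forEach(function (doc) {
        byId[doc.id] = doc;
    });
    return sr.clusters
        .slice() // copy so the caller's array is not reordered
        .sort(function (a, b) { return b.score - a.score; })
        .map(function (cluster) {
            return {
                label: cluster.phrases.join(', '),
                size: cluster.size,
                titles: cluster.documents.map(function (ref) {
                    var doc = byId[ref.id];
                    return doc ? doc.title : undefined;
                })
            };
        });
}

// Usage inside a cluster callback:
// summarizeClusters(sr).forEach(function (row) {
//     console.log(row.label + ' (' + row.size + ')');
// });
```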

License

See the file

Contributors

pnitsch
