a4tunado / google-all-pairs-similarity-search

Automatically exported from code.google.com/p/google-all-pairs-similarity-search

License: Apache License 2.0

Makefile 1.91% C++ 98.09%

google-all-pairs-similarity-search's Introduction

Welcome to the Google all pairs similarity search package!

This package provides a bare-bones implementation of the
"All-Pairs-Binary" algorithm described in the following paper:

R. J. Bayardo, Yiming Ma, Ramakrishnan Srikant. Scaling Up All-Pairs
Similarity Search. In Proc. of the 16th Int'l Conf. on World Wide Web,
131-140, 2007. (download from: http://www.bayardo.org/ps/www2007.pdf)

===============================================================================
BUILDING

*** For Linux & similar platforms ***

Simply typing "make" in the directory containing this file will build
the executable.  This code depends on the google-sparsehash package
(http://code.google.com/p/google-sparsehash/). If the build fails, you
may have to update the CFLAGS within the Makefile to specify the
location of the sparsehash header files.

If you'd rather not use the google sparsehash library, it is
straightforward to modify the algorithm to use STL hash_map instead.

*** For Win32 / Microsoft VC++ v9 & a command shell ***

NOTE: The build under Windows does not use (and hence does not
require) the Google sparsehash library. Instead it uses the regular
STL hash_map, which isn't quite as fast but seems to work well
enough. To build:

C:\google-all-pairs-similarity-search>"%VS90COMNTOOLS%vsvars32.bat"

C:\google-all-pairs-similarity-search>nmake -f Makefile.w32

===============================================================================
DATASET FORMAT

The dataset format expected by the algorithm is "apriori binary."  In
an apriori binary encoded dataset, each vector has the following
format where each component is encoded as a raw 4-byte integer:

<record id> <number of features> <fid 1> <fid 2> ... <fid n>

(Endianness of the integers should match that of your platform,
e.g. little-endian for Intel x86 architectures.)
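
As an illustration of the layout above, here is a minimal C++ sketch that
writes and reads one record as raw 4-byte integers in native byte order.
The names Record, WriteRecord and ReadRecord are hypothetical and are not
part of this package, and treating the integers as unsigned 32-bit values
is an assumption; the package's own reader may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical in-memory form of one dataset vector.
struct Record {
  uint32_t id;
  std::vector<uint32_t> features;
};

// Writes <record id> <number of features> <fid 1> ... <fid n> as raw
// 4-byte integers in the platform's native byte order.
bool WriteRecord(std::FILE* out, const Record& r) {
  assert(r.features.size() <= 0xFFFFFFFFu);  // count must fit in 4 bytes
  const uint32_t header[2] = {r.id, static_cast<uint32_t>(r.features.size())};
  if (std::fwrite(header, sizeof(uint32_t), 2, out) != 2) return false;
  const std::size_t n = r.features.size();
  return n == 0 ||
         std::fwrite(r.features.data(), sizeof(uint32_t), n, out) == n;
}

// Reads the next record; returns false at end of file or on a short read.
bool ReadRecord(std::FILE* in, Record* r) {
  uint32_t header[2];
  if (std::fread(header, sizeof(uint32_t), 2, in) != 2) return false;
  r->id = header[0];
  r->features.resize(header[1]);
  const std::size_t n = r->features.size();
  return n == 0 ||
         std::fread(r->features.data(), sizeof(uint32_t), n, in) == n;
}
```

Open the file in binary mode ("wb" / "rb") so that no newline translation
is applied on Windows.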

Record ids can be arbitrary integers. Feature ids should be assigned
such that feature id "i" corresponds to the "ith" least frequently
occurring feature in the dataset.  For example, the feature whose id
is "1" should be the least frequently occurring feature in the
dataset.  Feature ids within a vector should then appear in
increasing order of their id.

Records in the dataset should appear in order of increasing vector
size (a vector's size is its number of features).
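
The ordering rules above can be applied to raw data with a preprocessing
pass along these lines. This is only a sketch under stated assumptions:
Record and Canonicalize are hypothetical names, ties between equally
frequent features are broken by the old feature id, and repeated feature
ids within one vector are dropped (the input requirements do not allow
them).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical in-memory form of one dataset vector.
struct Record {
  uint32_t id;
  std::vector<uint32_t> features;
};

// Renumbers features so that id 1 is the rarest, sorts each vector's
// feature ids in increasing order, and orders records by increasing size.
void Canonicalize(std::vector<Record>* records) {
  assert(records != nullptr);
  // Sort and drop repeated features so each counts once per record.
  for (Record& r : *records) {
    std::sort(r.features.begin(), r.features.end());
    r.features.erase(std::unique(r.features.begin(), r.features.end()),
                     r.features.end());
  }
  // Count the number of records in which each feature occurs.
  std::unordered_map<uint32_t, uint32_t> freq;
  for (const Record& r : *records)
    for (uint32_t f : r.features) ++freq[f];
  // Rank features by (count, old id); the rarest feature gets new id 1.
  std::vector<std::pair<uint32_t, uint32_t>> by_freq;
  for (const auto& kv : freq) by_freq.emplace_back(kv.second, kv.first);
  std::sort(by_freq.begin(), by_freq.end());
  std::unordered_map<uint32_t, uint32_t> new_id;
  for (std::size_t i = 0; i < by_freq.size(); ++i)
    new_id[by_freq[i].second] = static_cast<uint32_t>(i + 1);
  // Apply the new ids and restore increasing order within each vector.
  for (Record& r : *records) {
    for (uint32_t& f : r.features) f = new_id[f];
    std::sort(r.features.begin(), r.features.end());
  }
  // Shortest vectors first; equal sizes keep their input order.
  std::stable_sort(records->begin(), records->end(),
                   [](const Record& a, const Record& b) {
                     return a.features.size() < b.features.size();
                   });
}
```

This version holds the whole dataset in memory, which may not be practical
for very large inputs.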

It is easy to extend the algorithm to read CSV formatted data if
desired.

You can download the apriori-binary little-endian encoded dblp dataset
from here: http://www.bayardo.org/bin/dblp_le.bin.gz

===============================================================================
RUNNING THE ALGORITHM

To run the algorithm under Linux/Unix:

./ap <sim_threshold> <dataset_path>

Under Windows:

ap <sim_threshold> <dataset_path>

For example, to mine all pairs of vectors with .9 or higher cosine
similarity from the dblp dataset on a Linux-type system, you might
type:

[~/google-all-pairs-similarity-search]: ./ap .9 dblp_le.bin > some_file.txt
; User specified similarity threshold: 0.9
; Found 35022 similar pairs.
; Candidates considered: 9274991
; Vector intersections performed: 2063790
; Total running time: 5 seconds
[~/google-all-pairs-similarity-search]:

The output of the algorithm is in text format. Each line contains a
pair of vector ids that were found to be similar, followed by the
actual cosine similarity score.
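
For binary vectors, cosine similarity reduces to the number of shared
features divided by the square root of the product of the two vector
sizes. The sketch below computes it directly with a merge over two sorted
feature lists. It is not the package's optimized algorithm, and
BinaryCosine is a hypothetical name, but it may be handy for spot-checking
individual output lines.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Cosine similarity of two binary vectors given as strictly increasing
// feature-id lists: |intersection| / sqrt(|x| * |y|).
double BinaryCosine(const std::vector<uint32_t>& x,
                    const std::vector<uint32_t>& y) {
  assert(std::is_sorted(x.begin(), x.end()));
  assert(std::is_sorted(y.begin(), y.end()));
  if (x.empty() || y.empty()) return 0.0;
  std::size_t i = 0, j = 0, shared = 0;
  // Merge-style walk over both lists, counting ids present in both.
  while (i < x.size() && j < y.size()) {
    if (x[i] < y[j]) {
      ++i;
    } else if (y[j] < x[i]) {
      ++j;
    } else {
      ++shared;
      ++i;
      ++j;
    }
  }
  return static_cast<double>(shared) /
         std::sqrt(static_cast<double>(x.size()) *
                   static_cast<double>(y.size()));
}
```

For example, two vectors of four features each that share three features
score 3 / sqrt(4 * 4) = 0.75, which would not pass a .9 threshold.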

NOTE: The algorithm is currently configured to use no more than ~1GB
of main memory. The algorithm will enter an "out of core" mode should
the dataset require more than this amount of RAM to process in a
single pass. The constant bounding RAM usage can be changed in main.cc
as desired.

google-all-pairs-similarity-search's People

Contributors

roberto-bayardo

google-all-pairs-similarity-search's Issues

How to use two data sets to compute their intersection?

Roberto: Sorry for the late reply but for whatever reason, the first
notification about your Jan 2nd question got lost in my spam filter.
Since you closed the original ticket I am opening a new one with
clarifications.

What I meant is the ability to provide as input not one dataset but two
datasets.

In this setting, one dataset would be a "reference" and the second
dataset a "query" dataset.
The goal would be to find all items in the "query" set that are similar to
items in the "reference" dataset above a certain threshold: basically
returning the similarity intersection between the two sets, as opposed to
the current setting where only pairs within the same set are considered. I
guess one way could be to merge the sets and discard pairs returned from
the same set, though that does seem pretty naive.

Original issue reported on code.google.com by [email protected] on 26 Jan 2010 at 6:59

dblp_le.bin dataset has sets containing duplicate values

The dblp_le.bin from the downloads section uploaded Aug 2007 does not satisfy 
the input requirements of the all-pairs implementation: there are a handful 
of vectors which contain duplicate features. This leads to a few strange 
results.

Original issue reported on code.google.com by [email protected] on 8 Jan 2010 at 12:28
