pombredanne / set-similarity-search-benchmarks

This project forked from ekzhu/set-similarity-search-benchmarks

Benchmark Datasets for Set Similarity Search

License: Apache License 2.0

Set Similarity Search Benchmarks

Benchmark data sets for set similarity search algorithms.

| Data set | Note | Number of sets | Number of tokens | File size | Papers |
|---|---|---|---|---|---|
| BMS-POS (Source) | A set is a purchase in a shop; a token is a product category in that purchase | 515,597 | 1,657 | 3.8 MB | 1 |
| Kosarak (Source) | A set is a user; a token is a link clicked by the user | 990,002 | 41,270 | 13 MB | 1 |
| Flickr | A set is a photo; a token is a tag or a word from the title | 1,680,490 | 810,660 | 29 MB | 1, 4 |
| Netflix (Source) | A set is a user; a token is a movie rated by the user | 480,189 | 17,770 | 166 MB | 1 |
| Orkut (Source) | A set is a user; a token is a group membership of the user | 1,853,285 | 15,293,693 | 378 MB | 1 |
| Canada-US-UK Open Data (Query Benchmark 1k, Query Benchmark 10k, Query Benchmark 100k) | A set is a table column; a token is a data value | 745,414 | 562,320,456 | 2.52 GB | 2 |
| WDC Web Table 2015, English Relational-Only (Query Benchmark 100, Query Benchmark 1k, Query Benchmark 10k) | A set is a table column; a token is a data value | 163,510,917 | 184,644,583 | 4.32 GB | 2, 3 |

All data sets follow the same format:

  • Each file is compressed with gzip.
  • The first line of the main file is <number of sets> <number of tokens>, optionally followed by a third number, <sum of all set sizes>.
  • Every other line is <set size>\t<1>,<2>,<3>,..., where \t is a tab separator and <1>, <2>, and so on are the tokens of the set.
  • All tokens are integers, transformed from the original strings using a global ascending frequency order.
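
The format described above can be read with a few lines of Python. The following is a minimal sketch, not code from this repository: the function names `read_sets` and `read_sets_gz` are hypothetical, and it assumes the header fields are separated by whitespace and that the files are plain text once decompressed. It loads every set into memory, which is reasonable for the smaller data sets but probably not for the multi-gigabyte ones.

```python
import gzip


def read_sets(fileobj):
    """Parse the benchmark format from a text-mode file object.

    Returns (num_sets, num_tokens, total_size, sets), where total_size is
    None when the optional third header number is absent.
    """
    # Header: <number of sets> <number of tokens> [<sum of all set sizes>]
    # Assumption: the header fields are whitespace-separated.
    header = fileobj.readline().split()
    num_sets, num_tokens = int(header[0]), int(header[1])
    total_size = int(header[2]) if len(header) > 2 else None

    sets = []
    for line in fileobj:
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate a trailing blank line
        # Each line: <set size>\t<token>,<token>,...
        size_field, _, tokens_field = line.partition("\t")
        tokens = [int(t) for t in tokens_field.split(",") if t]
        if len(tokens) != int(size_field):
            raise ValueError(
                f"declared size {size_field} but found {len(tokens)} tokens"
            )
        sets.append(tokens)
    return num_sets, num_tokens, total_size, sets


def read_sets_gz(path):
    """Open a gzip-compressed benchmark file and parse it."""
    with gzip.open(path, "rt") as f:
        return read_sets(f)
```

Calling `read_sets_gz` with the path of a downloaded file returns the header counts together with a list of token lists. For the larger data sets, a generator that yields one set at a time would likely be a better fit than building the full list.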

Papers on set similarity search that use the above data sets:

  1. An Empirical Evaluation of Set Similarity Join Techniques, VLDB 2016
  2. LSH Ensemble: Internet Scale Domain Search, VLDB 2016
  3. JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes, SIGMOD 2019 (To Appear)
  4. Spatio-Textual Similarity Joins, VLDB 2012
