This project forked from ekzhu/set-similarity-search-benchmarks

Benchmark Datasets for Set Similarity Search

License: Apache License 2.0

Set Similarity Search Benchmarks

Benchmark data sets for set similarity search algorithms.

Data set	Note	Number of sets	Number of tokens	File size	Papers
BMS-POS (Source)	A set is a purchase in a shop; a token is a product category in that purchase	515,597	1,657	3.8 MB	1
Kosarak (Source)	A set is a user; a token is a link clicked by the user	990,002	41,270	13 MB	1
Flickr	A set is a photo; a token is a tag or a word from the title	1,680,490	810,660	29 MB	1,4
Netflix (Source)	A set is a user; a token is a movie rated by the user	480,189	17,770	166 MB	1
Orkut (Source)	A set is a user; a token is a group membership of the user	1,853,285	15,293,693	378 MB	1
Canada-US-UK Open Data (query benchmarks: 1k, 10k, 100k)	A set is a table column; a token is a data value	745,414	562,320,456	2.52 GB	2
WDC Web Table 2015, English Relational-Only (query benchmarks: 100, 1k, 10k)	A set is a table column; a token is a data value	163,510,917	184,644,583	4.32 GB	2,3

All data sets follow the same format:

Compressed using gzip.
The first line of the main file is <number of sets> <number of tokens>, optionally followed by a third number, <sum of all set sizes>.
Every other line is <set size>\t<1>,<2>,<3>,..., where \t is a tab separator and <1>, <2>, and so on are the tokens of one set, separated by commas.
All tokens are integers, transformed from the original strings using a global ascending frequency order.
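
The format described above can be read with a few lines of Python. The following is a minimal sketch, not part of the benchmark distribution: the function names `read_header` and `read_sets` are hypothetical, and it assumes the header fields are separated by whitespace and that the declared set size equals the number of tokens on the line.

```python
import gzip
from typing import Iterator, List, Tuple


def read_header(path: str) -> Tuple[int, ...]:
    """Return the integers on the first line:
    (number of sets, number of tokens[, sum of all set sizes])."""
    with gzip.open(path, "rt") as f:
        return tuple(int(x) for x in f.readline().split())


def read_sets(path: str) -> Iterator[List[int]]:
    """Yield each set as a list of integer tokens, skipping the header."""
    with gzip.open(path, "rt") as f:
        next(f, None)  # skip the header line
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Each line is "<set size>\t<token>,<token>,..."
            size_field, _, token_field = line.partition("\t")
            tokens = [int(t) for t in token_field.split(",")] if token_field else []
            # Assumption: the declared size matches the token count.
            if int(size_field) != len(tokens):
                raise ValueError(
                    f"declared size {size_field} != {len(tokens)} tokens"
                )
            yield tokens
```

Because `read_sets` is a generator, the larger files (several gigabytes compressed) can be streamed one set at a time instead of being loaded into memory.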

Papers in set similarity search using the above data sets:

1. An Empirical Evaluation of Set Similarity Join Techniques, VLDB 2016
2. LSH Ensemble: Internet Scale Domain Search, VLDB 2016
3. JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes, SIGMOD 2019 (To Appear)
4. Spatio-textual similarity joins, VLDB 2012
