dynamicer's Introduction

End-to-end Task Based Parallelization for Entity Resolution on Dynamic Data

This is the source code for the framework proposed in the paper

L. Gazzarri and M. Herschel. "End-to-end Task Based Parallelization for Entity Resolution on Dynamic Data." ICDE 2021.

Datasets

All the datasets have been downloaded from the JedAI repository. The datasets for 'cora', 'cddb', and 'amazon-google' are in data/dirtyErDatasets. The dataset for 'movies' is in data/cleanCleanErDatasets. The larger dataset 'dbpedia' can be downloaded from this Mendeley repository, which hosts the data used to assess JedAI performance.

  • dataset 'dbpedia': file newDBPedia.tar.xz in Mendeley's Real Clean-Clean ER data

To download additional datasets from JedAI you can run the data/download.sh script from inside the data/ directory (svn required).

Requirements

The framework is written in Scala (version 2.13.1) and requires SBT and OpenJDK to build and run.

Library dependencies are listed in the SBT configuration file build.sbt.

Installation

To install and download library dependencies:

sbt publishLocal

Run

Clean-Clean ER. To run the sequential program for the 'movies' dataset (imdb-dbpedia).

sbt "runMain SequentialCCMain -d1 imdb -d2 dbpedia -gt movies  -bc 0.05  -fi 0.05 -o movies.csv"

Dirty ER. To run the sequential program for the 'cddb' dataset (2M).

sbt "runMain SequentialDirtyMain -d1 cddb -gt cddb -bc 0.05  -fi 0.05 -o cdb.csv"

Parallel Clean-Clean ER. To run the parallel program (PP) for the 'dbpedia' dataset.

sbt "runMain AkkaPipelineNoSplitCCMain -d1 DBPedia1 -d2 DBPedia2 -gt DBPedia  -bc 0.005 -fi 0.05 -nb 2 -nc 6 -nw 12 -o dbpedia.csv"

Parallel Clean-Clean ER. To run the parallel program (MPP) for the 'dbpedia' dataset.

sbt "runMain AkkaPipelineMicroBatchOptimizedNoSplitCCMain -d1 DBPedia1 -d2 DBPedia2 -gt DBPedia  -bc 0.005 -fi 0.05 -nb 2 -nc 6 -nw 12 -o dbpedia.csv"

About the options:

  • '-d1' specifies the first dataset.
  • '-d2' specifies the second dataset (for Clean-Clean ER).
  • '-gt' specifies the groundtruth file.
  • '-bc' and '-fi' specify the parameters for block pruning and block ghosting. For the dataset 'dbpedia' set -bc 0.005.
  • '-nb' specifies the number of threads performing comparison generation.
  • '-nc' specifies the number of threads performing comparison cleaning.
  • '-nw' specifies the number of threads performing the pairwise comparison step.

For the parallel solutions, the total number of threads is 5+nb+nc+nw.
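As a quick check of this formula, the shell sketch below computes 5+nb+nc+nw for the values used in the parallel 'dbpedia' commands (-nb 2 -nc 6 -nw 12). The function name total_threads is a hypothetical helper for illustration and is not part of the framework.

```shell
# Hypothetical helper (not part of the framework): total threads used by
# the parallel programs, following the formula 5 + nb + nc + nw.
total_threads() {
  nb="$1"; nc="$2"; nw="$3"
  echo $((5 + nb + nc + nw))
}

# Values from the parallel 'dbpedia' commands: -nb 2 -nc 6 -nw 12
total_threads 2 6 12   # prints 25
```

Comparing this number with the cores available on the machine may help when choosing values for -nb, -nc, and -nw.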

For larger datasets, consider increasing the heap size. For example, for 'dbpedia':

export SBT_OPTS="-Xmx40G"
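If SBT_OPTS may already be set in your environment, the export can be guarded so that an existing value is kept. This is a minimal sketch assuming a POSIX shell. The function name heap_opts is hypothetical, and 40G is the value suggested above for 'dbpedia'.

```shell
# Hypothetical helper: build the JVM maximum-heap option for a given size.
heap_opts() {
  echo "-Xmx$1"
}

# Keep an existing SBT_OPTS if one is set; otherwise fall back to 40G,
# the heap size suggested for 'dbpedia'.
export SBT_OPTS="${SBT_OPTS:-$(heap_opts 40G)}"
```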

Contact

For any problems, contact me at [email protected]
