dblink: Distributed End-to-End Bayesian Entity Resolution

dblink is a Spark package for performing unsupervised entity resolution (ER) on structured data. It's based on a Bayesian model called blink (Steorts, 2015), with extensions proposed by Marchant et al. (2021). Unlike many ER algorithms, dblink approximates the full posterior distribution over clusterings of records (into entities). This allows uncertainty to be propagated to post-ER analysis, and provides a framework for answering probabilistic queries about entity membership.

dblink approximates the posterior using Markov chain Monte Carlo. It writes samples (of clustering configurations) to disk in Parquet format. Diagnostic summary statistics are also written to disk in CSV format; these are useful for assessing convergence of the Markov chain.
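As a rough sketch of how one might locate these outputs after a run, the shell function below lists the Parquet and CSV files under an output directory. The function name `dblink_list_outputs` is hypothetical, and the assumption that the files carry `.parquet` and `.csv` extensions is not confirmed by the text above.

```shell
# Hypothetical helper (not part of dblink): list Parquet and CSV files
# found anywhere under the given output directory, in sorted order.
# Assumes the output files use .parquet and .csv extensions.
dblink_list_outputs() {
  if [ ! -d "$1" ]; then
    printf 'not a directory: %s\n' "$1" >&2
    return 1
  fi
  find "$1" -type f \( -name '*.parquet' -o -name '*.csv' \) | sort
}
```

Calling `dblink_list_outputs` with the output directory named in your config file prints one path per line, which can help confirm that both samples and diagnostics were written.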

Documentation

The step-by-step guide includes information about building dblink from source and running it locally on a test data set. Further details about configuration options for dblink are provided here.

Example: RLdata

Two synthetic data sets, RLdata500 and RLdata10000, are included in the examples directory as CSV files. These data sets were extracted from the RecordLinkage R package and have been used as benchmark data sets in the entity resolution literature. Both contain 10 percent duplicates and are non-trivial to link due to added distortion. Because unique ids are provided in the files, standard entity resolution metrics can be computed.

Config files for these data sets are included in the examples directory: see RLdata500.conf and RLdata10000.conf. To run these examples locally (in Spark pseudocluster mode), ensure you've built or obtained the JAR according to the instructions above, then change into the source code directory and run the following command:

$SPARK_HOME/bin/spark-submit \
  --master "local[*]" \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  --conf "spark.driver.extraClassPath=./target/scala-2.11/dblink-assembly-0.2.0.jar" \
  ./target/scala-2.11/dblink-assembly-0.2.0.jar \
  ./examples/RLdata500.conf

(To run with RLdata10000 instead, replace RLdata500.conf with RLdata10000.conf.) Note that the config file specifies that output will be saved in the ./examples/RLdata500_results/ (or ./examples/RLdata10000_results/) directory.
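Switching between the two examples can be wrapped in a few shell functions. This is only a sketch: the function names are hypothetical, while the paths and spark-submit flags are copied from the command and note above.

```shell
# Hypothetical helpers (not part of dblink) for the two bundled examples.

# Print the config file path for an example data set; fail on unknown names.
dblink_example_conf() {
  case "$1" in
    RLdata500|RLdata10000) printf './examples/%s.conf\n' "$1" ;;
    *) printf 'unknown example data set: %s\n' "$1" >&2; return 1 ;;
  esac
}

# Print the directory where that example's config saves its output.
dblink_example_results() {
  case "$1" in
    RLdata500|RLdata10000) printf './examples/%s_results\n' "$1" ;;
    *) printf 'unknown example data set: %s\n' "$1" >&2; return 1 ;;
  esac
}

# Submit one example locally. Must be called from the source code directory,
# with SPARK_HOME set and the assembly JAR already built.
dblink_run_example() {
  jar="./target/scala-2.11/dblink-assembly-0.2.0.jar"
  conf="$(dblink_example_conf "$1")" || return 1
  "$SPARK_HOME/bin/spark-submit" \
    --master "local[*]" \
    --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
    --conf "spark.driver.extraClassPath=$jar" \
    "$jar" \
    "$conf"
}
```

For instance, `dblink_run_example RLdata10000` would submit the larger example, and `dblink_example_results RLdata10000` prints where to look for its output.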

How to: Add dblink as a project dependency

Note: This won't work yet. Waiting for project to be accepted.

Maven:

<dependency>
  <groupId>com.github.cleanzr</groupId>
  <artifactId>dblink</artifactId>
  <version>0.2.0</version>
</dependency>

sbt:

libraryDependencies += "com.github.cleanzr" % "dblink" % "0.2.0"

How to: Build a fat JAR

You can build a fat JAR using sbt by running the following command from within the project directory:

$ sbt assembly

This should output a JAR file at ./target/scala-2.11/dblink-assembly-0.2.0.jar relative to the project directory. Note that the JAR file does not bundle Spark or Hadoop, but it does include all other dependencies.
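Before submitting a job, it can help to confirm that the build actually produced the JAR. The check below is a minimal sketch; `dblink_require_jar` is a hypothetical name, and the default path is the one quoted above.

```shell
# Hypothetical helper (not part of dblink): fail early if the assembly JAR
# is missing. Takes an optional path; defaults to the path quoted above.
dblink_require_jar() {
  jar="${1:-./target/scala-2.11/dblink-assembly-0.2.0.jar}"
  if [ -f "$jar" ]; then
    printf 'found %s\n' "$jar"
  else
    printf 'missing %s; run "sbt assembly" from the project directory\n' "$jar" >&2
    return 1
  fi
}
```

A typical use is `dblink_require_jar && echo "ready to submit"`, run from the project directory.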

Contact

If you encounter problems, please open an issue on GitHub. You can also contact the main developer by email: <GitHub username> <at> gmail.com.

License

GPL-3

Citing the package

Marchant, N. G., Kaplan, A., Elazar, D. N., Rubinstein, B. I. P. and Steorts, R. C. (2021). d-blink: Distributed End-to-End Bayesian Entity Resolution. Journal of Computational and Graphical Statistics, 30(2), 406–421. DOI: 10.1080/10618600.2020.1825451 arXiv: 1909.06039.

