
This project forked from dimkar121/lshdb


LSHDB


LSHDB is a parallel and distributed data engine, which relies on the locality-sensitive hashing (LSH) technique and noSQL systems, for performing record linkage (including privacy-preserving record linkage - PPRL) and similarity search tasks. Parallelism lies at the core of its mechanism, since queries are executed in parallel using a pool of threads.

The relevant demo paper "LSHDB: A Parallel and Distributed Engine for Record Linkage and Similarity Search" by Dimitrios Karapiperis (HoU), Aris Gkoulalas-Divanis (IBM), and Vassilios S. Verykios (HoU) was presented at IEEE ICDM 2016, held in Barcelona, Spain.

The main features of LSHDB are:

  • Easy extensibility: Any noSQL data store or LSH technique can be plugged in by extending or implementing the respective abstract classes or interfaces.
  • Support for both online query-driven and offline batch record linkage: LSHDB works in two modes. The first resolves submitted queries in real time, while the second is the traditional offline mode, which reports the results after the record linkage task has completed.
  • Support for the PPRL mode: In PPRL, each participating party, also termed a data custodian, may send its previously masked records to a Trusted Third Party (TTP). The TTP configures and uses LSHDB to perform the linkage task and eventually sends the results back to the respective data custodians.
  • Ease of use: Queries can be submitted against a data store using just four lines of code.
  • Similarity sliding: The user can specify the desired level of similarity between the query and the returned values by using the similarity sliding feature.
  • Polymorphism of the response: The result set can be returned either as Java objects or in JSON format for interoperability purposes.
  • Support for distributed queries: A query can be forwarded to multiple instances of LSHDB to support data stores that have been horizontally partitioned across multiple compute nodes.
  • Support for storing structured and semi-structured data: A data store may contain homogeneous or heterogeneous data.

The Maven dependency for downloading the jar (version 1.0) from the Maven Central repository is:

<dependency>
    <groupId>gr.eap.LSHDB</groupId>
    <artifactId>LSHDB</artifactId>
    <version>1.0</version>
</dependency>

Stores created by LSHDB can be accessed either in-line or using sockets. In the in-line mode, using a simple initialization code snippet of the following form:

String folder = "/home/LSHDB/stores";
String storeName = "dblp";
String engine = "gr.eap.LSHDB.LevelDB";
HammingLSHStore lsh = new HammingLSHStore(folder, storeName, engine);

or using the compact factory call (provided that the configuration is given in an XML file -- see below):

HammingLSHStore.open("dblp");

one opens a data store named dblp, which has been serialized under /home/LSHDB/stores and was created using Hamming LSH as the underlying LSH implementation and LevelDB (accessed through a native interface) as the noSQL engine.

In the following, using the DBLP database, we will showcase how one can insert some records, and submit similarity queries either by using Java objects or by performing asynchronous AJAX requests.

Inserting records into a store

Assume a store that contains the titles of the publications in DBLP along with the name of each publication's first author. To support queries on these names, we have to specify a keyed field, from which specialized data structures will be constructed and persisted. To also submit queries using the titles of the publications, simply add a second keyed field.

Key key1 = new HammingKey("author");
Key key2 = new HammingKey("title");
HammingConfiguration hc = new HammingConfiguration(folder, storeName, engine, new Key[]{key1, key2}, true);
hc.saveConfiguration();
HammingLSHStore lsh = new HammingLSHStore(folder, storeName, engine, hc, true);

// iterate the records from a relational db or from a text file
Record record = new Record();
record.setId(id);  // this value uniquely identifies an author
record.set("author", fullName);
record.set("author"+Key.TOKENS, new String[]{surname}); 
record.set("title", title);
// extract some important keywords from the title
record.set("title"+Key.TOKENS, keywords); // keywords should be a String array
lsh.insert(record);

lsh.close();

The object record may store any kind of field, depending on the running application; a publication may refer to a conference, record.set("conference", conferenceInfo);, or to a journal, record.set("journal", journalInfo);.
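To give a rough feel for the kind of structure a keyed field might build, the sketch below embeds a string into a fixed-size bit vector of character bigrams and derives one blocking key per hash table by sampling a few bit positions. This illustrates Hamming LSH in general and is not LSHDB's actual code; the class name HammingLshSketch, the vector size, and the values of K and L are all assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Random;

/** Illustrative Hamming LSH sketch; NOT LSHDB's implementation. */
final class HammingLshSketch {
    static final int SIZE = 128; // length of the bit vector (assumed)
    static final int K = 8;      // sampled bit positions per hash table (assumed)
    static final int L = 4;      // number of hash tables (assumed)

    private final int[][] positions = new int[L][K];

    HammingLshSketch(long seed) {
        Random rnd = new Random(seed);
        for (int[] table : positions) {
            for (int j = 0; j < K; j++) {
                table[j] = rnd.nextInt(SIZE);
            }
        }
    }

    /** Sets one bit per character bigram of the lower-cased value. */
    static BitSet embed(String value) {
        BitSet bits = new BitSet(SIZE);
        String s = value.toLowerCase();
        for (int i = 0; i + 1 < s.length(); i++) {
            bits.set(Math.floorMod(s.substring(i, i + 2).hashCode(), SIZE));
        }
        return bits;
    }

    /** Builds one blocking key per hash table from the sampled bits. */
    List<String> blockingKeys(BitSet bits) {
        List<String> keys = new ArrayList<>();
        for (int t = 0; t < L; t++) {
            StringBuilder key = new StringBuilder("T" + t + "_");
            for (int p : positions[t]) {
                key.append(bits.get(p) ? '1' : '0');
            }
            keys.add(key.toString());
        }
        return keys;
    }

    /** Number of bit positions in which the two vectors differ. */
    static int hammingDistance(BitSet a, BitSet b) {
        BitSet x = (BitSet) a.clone();
        x.xor(b);
        return x.cardinality();
    }

    public static void main(String[] args) {
        HammingLshSketch lsh = new HammingLshSketch(42L);
        BitSet john = embed("John");
        BitSet johnen = embed("Johnen");
        System.out.println(lsh.blockingKeys(john));
        System.out.println(lsh.blockingKeys(johnen));
        System.out.println("distance = " + hammingDistance(john, johnen));
    }
}
```

Records whose bit vectors are close in Hamming distance tend to share at least one blocking key, so a query only needs to inspect the buckets its own keys point to.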

Querying a store

The following snippet submits a similarity query against the dblp store, using keyed fields.

QueryRecord query = new QueryRecord(n); // n denotes the max number of the returned records.
query.setKeyedField("author", new String[]{"John"},1.0,true);
Result result = lsh.query(query);
result.prepare();
ArrayList<Record> arr = result.getRecords();

A one-liner that both opens a data store and submits a query is as follows:

HammingLSHStore.open("dblp").query(q).asList();

Running the above query against the dblp store returns entries such as the following:

  • M. R. Stalin John An investigation of ball burnishing process on CNC lathe using finite element analysis
  • Christian John Transformation und ontologische Formulierung multikriterieller Problemstellungen
  • Benjamin Johnen A Dynamic Time Warping algorithm for industrial robot
  • Donghee Yvette Wohn Understanding Perceived Social Support through Communication Time Frequency and Media Multiplexity
  • Colette Johnen Memory Efficient Self-stabilizing Distance-k Independent Dominating Set Construction etc.

By sliding the threshold to the left (tightening), query.setKeyedField("author", new String[]{"John"},.8,true);, we narrow the results, which get closer to the query value ("John"):

  • Aaron Johnson Computational Objectivity in Depression Assessment for Unstructured Large Datasets
  • M. R. Stalin John An investigation of ball burnishing process on CNC lathe using finite element analysis
  • Christian John Transformation und ontologische Formulierung multikriterieller Problemstellungen
  • Michael Johnson Unifying Set-Based Delta-Based and Edit-Based Lenses
  • Rachel St. John Spatially explicit forest harvest scheduling with difference equations etc.
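One way to picture similarity sliding is as a multiplier on a base distance threshold: a smaller sliding value admits only candidates that lie closer to the query. The sketch below follows that reading with bigram bit vectors and Hamming distance. It is an assumption about the mechanism, not LSHDB's implementation; SimilaritySlidingSketch, BASE_THRESHOLD, and the bigram embedding are hypothetical names and choices.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Illustrative similarity sliding; NOT LSHDB's implementation. */
final class SimilaritySlidingSketch {
    static final int SIZE = 128;         // bit-vector length (assumed)
    static final int BASE_THRESHOLD = 8; // max distance at sliding value 1.0 (assumed)

    /** Sets one bit per character bigram of the lower-cased value. */
    static BitSet embed(String value) {
        BitSet bits = new BitSet(SIZE);
        String s = value.toLowerCase();
        for (int i = 0; i + 1 < s.length(); i++) {
            bits.set(Math.floorMod(s.substring(i, i + 2).hashCode(), SIZE));
        }
        return bits;
    }

    /** Hamming distance between the embeddings of the two strings. */
    static int distance(String a, String b) {
        BitSet x = embed(a);
        x.xor(embed(b));
        return x.cardinality();
    }

    /** Keeps the candidates whose distance is at most slide * BASE_THRESHOLD. */
    static List<String> filter(String query, List<String> candidates, double slide) {
        double maxDistance = slide * BASE_THRESHOLD;
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (distance(query, candidate) <= maxDistance) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> names = List.of("John", "Johnen", "Johnson", "Jonathan", "Wohn");
        System.out.println("slide 1.0: " + filter("John", names, 1.0));
        System.out.println("slide 0.8: " + filter("John", names, 0.8));
    }
}
```

Under this reading, every result returned for a tighter sliding value would also qualify under a looser one; only the cut-off moves.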

Running LSHDB as a server instance

To run LSHDB as a server instance, provide at least the following configuration:

<LSHDB>
   <alias>local</alias>
   <port>4443</port>
   <stores>
     <store>
       <name>dblp</name>  
       <target>/home/LSHDB/stores</target>
       <engine>gr.eap.LSHDB.LevelDB</engine>
       <LSHStore>gr.eap.LSHDB.HammingLSHStore</LSHStore>
       <LSHConfiguration>gr.eap.LSHDB.HammingConfiguration</LSHConfiguration>    
     </store>
   </stores>    
</LSHDB>

Save the above snippet as config.xml into the src/main/resources directory of your project and then simply run:

mvn exec:java -Dexec.mainClass="gr.eap.LSHDB.Server"

This starts an LSHDB instance that hosts a single store and listens on all network interfaces of the local machine on port 4443.
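For readers who want to sanity-check a configuration file before starting the server, the sketch below reads the port and the store names from a configuration of the shape shown above, using only the JDK's DOM parser. It does not reflect how LSHDB itself loads config.xml; ConfigSketch and its helper methods are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Reads a minimal LSHDB-style configuration; NOT LSHDB's own loader. */
final class ConfigSketch {
    static final String SAMPLE =
          "<LSHDB>"
        + "  <alias>local</alias>"
        + "  <port>4443</port>"
        + "  <stores>"
        + "    <store>"
        + "      <name>dblp</name>"
        + "      <target>/home/LSHDB/stores</target>"
        + "      <engine>gr.eap.LSHDB.LevelDB</engine>"
        + "    </store>"
        + "  </stores>"
        + "</LSHDB>";

    /** Parses the XML text, wrapping any parser failure in an unchecked exception. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid configuration", e);
        }
    }

    /** Text of the first descendant with the given tag, in document order. */
    static String text(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent().trim();
    }

    static int port(Document doc) {
        return Integer.parseInt(text(doc.getDocumentElement(), "port"));
    }

    static List<String> storeNames(Document doc) {
        List<String> names = new ArrayList<>();
        NodeList stores = doc.getElementsByTagName("store");
        for (int i = 0; i < stores.getLength(); i++) {
            names.add(text((Element) stores.item(i), "name"));
        }
        return names;
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        System.out.println("port = " + port(doc));       // prints "port = 4443"
        System.out.println("stores = " + storeNames(doc)); // prints "stores = [dblp]"
    }
}
```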

The corresponding client application should specify the server and port through a client object and then submit the query.

Client client = new Client(server, port);
Result result = client.queryServer(query);

Note that the query object holds the name of the store that will be queried. LSHDB does not maintain any server-side persistent connections.

In all the above listings, the handling of any checked thrown exceptions (such as StoreInitException, ConnectException, UnknownHostException etc.) is omitted for brevity.

Performing asynchronous AJAX requests

Assuming a fully functional instance running on localhost at port 4443 and hosting the dblp store, submitting the URL http://localhost:4443/JSON/dblp?author_Query=John through a web browser returns the results in JSON format. A more advanced option is to use jQuery as follows:

$.ajax({
    url: "http://" + server + ":" + port + "/JSON/dblp",
    type: "get",
    data: {author_Query: $('#authorText').val()},
    dataType: 'jsonp',
    success: function (json) {
        var out; // HTML fragment rendered into #container
        if (json.error) {
            out = "Error: " + json.errorMessage;
        } else {
            // one table row per returned record
            out = "<table>";
            for (var i = 0; i < json.length; i++) {
                out += "<tr><td>" + (i + 1) + ".</td><td>" + json[i].author + "</td><td>" +
                    json[i].title + "</td><td>" + json[i].year + "</td></tr>";
            }
            out += "</table>";
        }
        $('#container').html(out);
    }
});
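The same JSON endpoint can also be called from Java without the LSHDB client classes. The sketch below builds the query URL in the form shown above and, only when asked to, sends it with the JDK's java.net.http.HttpClient (Java 11 or later). The class JsonQuerySketch and its method names are hypothetical, and the response body is returned as raw text because the exact JSON layout is not specified here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

/** Calls the JSON endpoint of a running LSHDB instance; names are illustrative. */
final class JsonQuerySketch {
    /** Builds http://server:port/JSON/store?field_Query=value with the value URL-encoded. */
    static URI buildUri(String server, int port, String store, String field, String value) {
        String encoded = URLEncoder.encode(value, StandardCharsets.UTF_8);
        return URI.create("http://" + server + ":" + port + "/JSON/" + store
            + "?" + field + "_Query=" + encoded);
    }

    /** Sends a GET request and returns the raw response body. */
    static String fetch(URI uri) {
        try {
            HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
            return HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        URI uri = buildUri("localhost", 4443, "dblp", "author", "John");
        System.out.println(uri); // prints "http://localhost:4443/JSON/dblp?author_Query=John"
        // Pass --send only when a server is actually listening on that port.
        if (args.length > 0 && args[0].equals("--send")) {
            System.out.println(fetch(uri));
        }
    }
}
```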

Distributed settings

To showcase the distributed extensions of LSHDB, assume that records of the dblp store have been horizontally partitioned to three compute nodes, namely n1, n2, and n3, where n2 and n3 have been registered as remote nodes to n1. Subsequently, a client may submit a query to n1, which forwards that query to n2 and n3 in parallel using a pool of threads. Upon completion of the local and remote queries, n1 sends the results back to the client. The following snippet registers n2 and n3 to n1.

<remote_nodes>
    <remote_node>
        <alias>n2</alias>
        <url>some ip or fqdn</url>
        <port>4443</port>
        <enabled>true</enabled>
    </remote_node>
    <remote_node>
        <alias>n3</alias>
        <url>some ip or fqdn</url>
        <port>4443</port>
        <enabled>true</enabled>
    </remote_node>
</remote_nodes>

We also have to denote which of these server aliases support our specified stores. This is achieved by adding the following snippet to the corresponding store tags.

<remote_stores>
    <remote_store>
        <alias>n2</alias>
    </remote_store>
    <remote_store>
        <alias>n3</alias>
    </remote_store>
</remote_stores>
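The fan-out described in this section, where n1 answers locally while forwarding the query to n2 and n3 through a pool of threads, can be sketched with a plain ExecutorService. The code below is a simplified stand-in and not LSHDB's server code: each node is modelled as an in-memory function that filters its own partition by substring match, and FanOutSketch, queryAll, and inMemoryNode are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Parallel fan-out over partitions; a stand-in, NOT LSHDB's server code. */
final class FanOutSketch {
    /** Runs the query on every node in parallel and concatenates the partial results. */
    static List<String> queryAll(String query, List<Function<String, List<String>>> nodes) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, nodes.size()));
        try {
            List<Callable<List<String>>> tasks = new ArrayList<>();
            for (Function<String, List<String>> node : nodes) {
                tasks.add(() -> node.apply(query));
            }
            List<String> merged = new ArrayList<>();
            // invokeAll waits for all tasks and returns futures in task order
            for (Future<List<String>> partial : pool.invokeAll(tasks)) {
                merged.addAll(partial.get());
            }
            return merged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    /** Stand-in for a node: filters an in-memory partition by substring match. */
    static Function<String, List<String>> inMemoryNode(String alias, List<String> partition) {
        return query -> {
            List<String> hits = new ArrayList<>();
            for (String record : partition) {
                if (record.contains(query)) {
                    hits.add(alias + ": " + record);
                }
            }
            return hits;
        };
    }

    public static void main(String[] args) {
        List<Function<String, List<String>>> nodes = List.of(
            inMemoryNode("n1", List.of("Christian John", "Aaron Smith")),
            inMemoryNode("n2", List.of("Colette Johnen")),
            inMemoryNode("n3", List.of("Michael Johnson", "Jane Doe")));
        // prints "[n1: Christian John, n2: Colette Johnen, n3: Michael Johnson]"
        System.out.println(queryAll("John", nodes));
    }
}
```

In a real deployment the remote entries would issue network calls to the registered nodes instead of scanning a local list; the merging step stays the same.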

Evaluating LSHDB

Test data sets, built using Hamming LSH and LevelDB, have been uploaded to Harvard Dataverse.

Specifically, this repository includes:

  • the query file Q.txt, extracted from the NCVR list,
  • the queried data set A.txt, and
  • the corresponding LevelDB data store, which includes the records of A.txt.

To build A.txt, we generated 10 records for each record of Q.txt by applying four edit distance operations (edit, delete, or insert), chosen at random. You may use the client Swing application TestApp_NCVR.java, which is included in the apps package. Do not forget to change the variables therein that hold the physical location of the data store (e.g., /home/user/LEVELDB) and of the query file Q.txt (e.g., /home/user/Q.txt). First, compile the package (mvn compile) and then run the application by issuing the following command:

mvn exec:java -Dexec.mainClass="gr.eap.LSHDB.apps.TableApp_NCVR"
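For readers who want to build a similar test set from their own data, the following sketch shows one way to derive perturbed variants of a value by applying a fixed number of random edit operations. It is not the script used to produce A.txt: PerturbSketch, the uppercase alphabet, the seed handling, and the reading of "edit" as a character substitution are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Generates perturbed copies of a value; NOT the tool used to build A.txt. */
final class PerturbSketch {
    static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"; // assumed character set

    /** Applies the given number of random operations (substitute, delete, insert). */
    static String perturb(String value, int ops, Random rnd) {
        StringBuilder sb = new StringBuilder(value);
        for (int i = 0; i < ops; i++) {
            char c = ALPHABET.charAt(rnd.nextInt(ALPHABET.length()));
            // an empty string can only grow, so force an insert in that case
            int op = sb.length() == 0 ? 2 : rnd.nextInt(3);
            if (op == 0) {
                sb.setCharAt(rnd.nextInt(sb.length()), c);  // substitute one character
            } else if (op == 1) {
                sb.deleteCharAt(rnd.nextInt(sb.length()));  // delete one character
            } else {
                sb.insert(rnd.nextInt(sb.length() + 1), c); // insert one character
            }
        }
        return sb.toString();
    }

    /** Produces `count` variants of the value, each perturbed by `ops` operations. */
    static List<String> variants(String value, int count, int ops, long seed) {
        Random rnd = new Random(seed);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            out.add(perturb(value, ops, rnd));
        }
        return out;
    }

    public static void main(String[] args) {
        // 10 variants per query record, four operations each, as described above
        for (String variant : variants("JOHNSON", 10, 4, 7L)) {
            System.out.println(variant);
        }
    }
}
```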

References

For the interested reader, we suggest the following research papers:

Image owned by Sega sai [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0) or GFDL (http://www.gnu.org/copyleft/fdl.html)], via Wikimedia Commons
