shahrukhsaba / duke Goto Github PK

Automatically exported from code.google.com/p/duke

Python 3.90% Java 95.96% HTML 0.14%

duke's Introduction

  ========
    DUKE
  ========


Duke is a fast deduplication and record linkage engine written in
Java, based on Lucene. No documentation is included in the
distribution. To see how to use it, see

  http://code.google.com/p/duke/wiki/GettingStarted

instead.

You may also want to look at the examples in doc/example-data,
particularly dogfood.xml and countries.xml.


For a description of what's new in release 1.0, see
  http://code.google.com/p/duke/wiki/ReleaseNotes


--- EXAMPLES

In the doc/examples directory are two examples. One finding duplicates
and one doing record linkage.


dogfood.ntriples contains data about papers presented at Semantic Web
conferences, with some inadvertent duplicates. Running

java no.priv.garshol.duke.Duke --testdebug --testfile=dogfood-test.txt dogfood-sparql.xml

shows the results of running deduplication.


countries-mondial.csv and countries-dbpedia.csv both contain basic
data about countries. Running countries.xml makes Duke pair each
country from one file with the corresponding country in the other.
Run:

java no.priv.garshol.duke.Duke --testdebug --testfile=countries-test.txt countries.xml

duke's People

Watchers

duke's Issues

Should be possible for data sources to assert difference

It should be possible to assert A owl:differentFrom B in a data source, and for 
this to prevent Duke from ever claiming that A owl:sameAs B.

The JDBCLinkDatabase component in Duke already supports this. If a row (A, B, 
DIFFERENT, ASSERTED) were to appear in the database, Duke would never add an 
owl:sameAs between A and B. However, Duke cannot now get this information from 
the UMIC and into the link database.

To add support for that we'd need to:
  * Add a Collection<Link> getLinks() method to the Record interface, so that records can arrive in Duke with pre-known link information.
  * Add support for populating this data to individual data sources.

Oh, and we also need to update the code so that this gets written correctly to 
the link database.

Original issue reported on code.google.com by [email protected] on 1 Nov 2011 at 10:23

Lookup property analysis must consider maybeThreshold

If the maybe threshold is set, that should be used for lookup property analysis 
rather than the certain threshold.

Original issue reported on code.google.com by [email protected] on 10 Jun 2011 at 7:51

Formalize and document API

Need to do some refactoring to ensure that the API for embedding Duke is 
optimal. Ideally the driving loop should be implemented only once. Also, the 
Deduplicator and Database should be merged. And the code to retrieve a fully 
functional database should be simpler.

Original issue reported on code.google.com by [email protected] on 21 May 2011 at 8:05

Support writing links to SPARQL endpoint

We'll need to handle different dialects, but that should be doable.

Original issue reported on code.google.com by [email protected] on 25 Aug 2011 at 7:26

Configure Database connection via JNDI

We have a web application were all database connections are configured via 
JNDI. This allows us, for example, to set up different database connections for 
the test and our production system without maintaining different war files. 

A datasource configuration could look like this:

<jdbc>
    <param name="jndi-path" value="java:comp/env/jdbc/CONNECTION_NAME"/>

    <column name=.../>
</jdbc>

The actual data source would be configured within the context of the web 
application's servlet container.

In any case: nice project!

Original issue reported on code.google.com by [email protected] on 4 Nov 2011 at 8:21

Upgrade to Lucene 3.3

Should also test the performance difference in doing so.

Original issue reported on code.google.com by [email protected] on 25 Aug 2011 at 7:31

Implement LinkDatabase interface persistently

Either on top of H2, or on top of something simpler. Could be just a b-tree 
implementation, really.

Original issue reported on code.google.com by [email protected] on 7 Apr 2011 at 2:05

Blocking: #8

Support for SPARQL link database

That is, use the LinkDatabase interface to maintain same-as statements in a 
triple store via SPARQL and SPARQL Update.

Original issue reported on code.google.com by [email protected] on 4 Sep 2011 at 1:59

Implement proper escaping in NTriplesWriter

This writer is now being used for real, so we need to complete it and add full 
support for escaping etc

Original issue reported on code.google.com by [email protected] on 31 Aug 2011 at 10:09

Blocking: #18, #18

Fuzzy search in Lucene

Apparently it's possible to do fast fuzzy searches in Lucene 3.x. Need to find 
out how. Keywords are "ngram index" and "spellcheck". Haven't found anything 
yet, but need to see if there is a way to do this.

Original issue reported on code.google.com by [email protected] on 24 Aug 2011 at 8:52

Must handle all properties having the same probability

It may happen that the user sets all properties to the same probability. We 
must handle this case correctly. (Whatever correctly might be. Not sure yet.)

Original issue reported on code.google.com by [email protected] on 10 Jun 2011 at 7:50

Better configuration options for handling link output

It should be possible to choose an output for the links, whether it be 
JDBCLinkDatabase or one of the other alternatives. Need to consider how, though.

Original issue reported on code.google.com by [email protected] on 25 Aug 2011 at 7:27

Make Maven repository for Duke

There should be a Maven repository for Duke where people can get the .jar

Original issue reported on code.google.com by [email protected] on 10 Jun 2011 at 7:51

Support SPARQL test files

That is, we should be able to query a SPARQL endpoint for definitive link 
information.

Original issue reported on code.google.com by [email protected] on 4 Sep 2011 at 1:40

Can we add support for TFIDF matching?

Term frequency matching is generally considered the best string matching 
approach, but requires a source of information about term frequencies. Can we 
come up with some way to add this?

Original issue reported on code.google.com by [email protected] on 25 Aug 2011 at 10:09

Finalize Jaro-Winkler implementation

The current implementation does not include the final three JW adjustments 
described by Yancey. Also, it does not include the full battery of tests that 
the LingPipe people documented. Get all this in.

Original issue reported on code.google.com by [email protected] on 20 May 2011 at 1:29

Open a discussion forum

Is there an open discussion forum for duke?

Original issue reported on code.google.com by [email protected] on 4 Nov 2011 at 9:05

Formalize the LinkDatabase concept

Need to lift the whole LinkDatabase concept up to a higher lever, formalize it, 
and integrate it properly into the whole architecture. Need more LinkDatabase 
implementations, too.

Original issue reported on code.google.com by [email protected] on 4 Nov 2011 at 10:18

Add persistent store management actions to command-line client

The command-line client needs to be able to implement at least the "reindex" 
action. Other actions may be needed, too.

Original issue reported on code.google.com by [email protected] on 2 Apr 2011 at 10:24

Blocked on: #1

Build command-line client

We need to build a command-line client that supports CSV and JDBC data sources 
so that we can try out the basic engine and configuration to ensure that 
performance is acceptable.

Original issue reported on code.google.com by [email protected] on 2 Apr 2011 at 10:06

Blocking: #6, #6
Blocked on: #3, #5

PersonNameComparator: handling of short words

I may be wrong but it seems the PersonNameComparator has a couple of bugs:

1. The execution does not really reach the else if responsible for short tokens 
handling:
line 88
} else if (t1[ix].length() + t2[ix].length() <= 4)
          // it's not an initial, so if the strings are 4 characters
          // or less, we quadruple the edit dist
          d = d * 4;
        else


2. In line 72, t1.length needs to be t2.length? As t1 is always the longer 
token.
} else if (d > 1 && (ix + 1) <= t1.length)

What do you think?

Original issue reported on code.google.com by [email protected] on 28 Oct 2011 at 12:06

Design data source API

Design the data source API and implement two (CSV & JDBC) data sources so that 
we can try it out. Also make sure it will support the RDF push use case.

Original issue reported on code.google.com by [email protected] on 2 Apr 2011 at 10:10

Blocking: #1, #5

Test file should be in CSV format

Having a special test file format is idiotic. We need to change over to using 
just plain CSV files.

Original issue reported on code.google.com by [email protected] on 4 Sep 2011 at 1:39

Character encoding in NTriples parser

We need to fix this parser so that it (a) only allows US-ASCII character 
literals (as per spec) and (b) interprets character escapes correctly.

Original issue reported on code.google.com by [email protected] on 11 May 2011 at 1:08

Turn Database into an interface

After all, we could conceivably create more than just one Database backend. For 
example, for smaller datasets we could simply keep all records and do full n x 
n matching.

Original issue reported on code.google.com by [email protected] on 18 Sep 2011 at 11:37

Save/Retrieve multiple property values in the lucene database

Hello,

 I would like to indexing a record in which a property has >1 values.
At the moment I can see that only 1st value gets saved in the lucene database:

String value = record.getValue(propname);

 So that when a candidate is retrieved from the database all other values are lost and the comparison is not what I'd like it to be.

 Would it be possible to correctly save/retrieve the whole collection of property values in the database?

Thanks

Original issue reported on code.google.com by [email protected] on 13 Jun 2011 at 6:32

Support exporting link files to NTriples format

We need to be able to write link files in NTriples format, to make it easier to 
work with Linked Data.

Original issue reported on code.google.com by [email protected] on 24 May 2011 at 2:51

Blocked on: #33

Support for Hadoop processing

Longer-term we should be able to farm out processing work to Hadoop clusters.

Original issue reported on code.google.com by [email protected] on 4 Sep 2011 at 1:25

Match entities against fix index

For our application I'd like to build once a day an index of our entity 
database and match new entities (online) against this index.

Is this setting supported by duke?

Original issue reported on code.google.com by [email protected] on 4 Nov 2011 at 9:17

Duke should be version-stamped

The META-INF file should have the version number, as well as methods in the 
Duke API. The command-line client should also be able to print the version 
number.

Original issue reported on code.google.com by [email protected] on 9 Sep 2011 at 6:27

The fixed-size search result loses matches

At the moment we never retrieve more than 50 search results from Lucene. We 
need to change this to use a variable-size result set that gets sized 
dynamically.

Original issue reported on code.google.com by [email protected] on 22 May 2011 at 9:29

Support NTriples test files

We should support loading NTriples test files.

Original issue reported on code.google.com by [email protected] on 4 Sep 2011 at 1:39

Design and implement XML configuration language

We need to be able to write up configurations in an XML format, and to be able 
to load these configurations.

Original issue reported on code.google.com by [email protected] on 2 Apr 2011 at 10:16

Blocking: #1
Blocked on: #3

Validation of config file

We need to do some kind of structural validation of the config file to help 
users get the format right. The best choice is to use RELAX-NG, but this 
involves an extra dependency (Jing). We could try using a DTD, if we can 
convince the crap standard parser to load our DTD and not listen to the 
document.

Original issue reported on code.google.com by [email protected] on 21 May 2011 at 9:22

Try opening reader directly from the writer

One of the costliest operations we perform right now is IndexWriter.commit(), 
and in fact we introduced the whole troublesome batching concept specifically 
to be able to live with this limitation. It's possible to open a special reader 
from a writer to get "near real-time" searching, and we should try out whether 
this works better.

http://lucene.apache.org/java/3_3_0/api/core/org/apache/lucene/index/IndexReader
.html#open(org.apache.lucene.index.IndexWriter, boolean)

Original issue reported on code.google.com by [email protected] on 25 Aug 2011 at 7:30

Support for multithreaded processing

We should be able to use threads to make use of all the processor cores in 
modern machines. Below is an outline of how it might be done.

one thread runs the data source and collects records from there into
a queue.

another set of threads collects records from the queue and indexes
them. it seems that multiple threads doing indexes should work.
http://darksleep.com/lucene/ once indexed the records are stuffed
into a second queue.

a pool of threads picks records from the second queue and does the
matching on them

Original issue reported on code.google.com by [email protected] on 4 Sep 2011 at 1:22

Implement SDshare server

Plug in as backend to some SDshare server framework, and build it on top of the 
LinkDatabase.

Original issue reported on code.google.com by [email protected] on 7 Apr 2011 at 2:13

Blocked on: #7

Implement SPARQL data source

We need a SPARQL data source so that we can process data from SPARQL stores. 
Note that this needs to work in two different modes: (1) batch mode where we 
get all the data (probably with some kind of paging), and (2) incremental mode 
where an outside source tells us what resources to process.

Original issue reported on code.google.com by [email protected] on 7 Apr 2011 at 2:18

Blocking: #10

Set up a standard performance test

Need this in order to be able to judge the performance impact of various 
changes.

Original issue reported on code.google.com by [email protected] on 4 Sep 2011 at 1:21

Proper command-line syntax parsing

Need this in order to add useful control switches to the command-line client.

Original issue reported on code.google.com by [email protected] on 9 Apr 2011 at 9:03

Build SDshare client backend

Need to implement a backend to the Ontopia SDshare client which will receive 
snapshots and fragments and use these to trigger processing via the SPARQL data 
source.

Original issue reported on code.google.com by [email protected] on 7 Apr 2011 at 2:22

Blocked on: #9

Proper tests for handling of null and ""

In RecordImpl and elsewhere.

Original issue reported on code.google.com by [email protected] on 5 Sep 2011 at 12:41

SPARQL data source hangs

What steps will reproduce the problem?

1. java no.priv.garshol.duke.Duke --showmatches dogfood.xml


What is the expected output? What do you see instead?

When I do a kill -SIGQUIT {PID} I get the following trace:


2011-05-21 15:39:25
Full thread dump Java HotSpot(TM) 64-Bit Server VM (19.1-b02-334 mixed mode):

"Low Memory Detector" daemon prio=5 tid=10184e800 nid=0x108b69000 runnable 
[00000000]
   java.lang.Thread.State: RUNNABLE

"CompilerThread1" daemon prio=9 tid=10184d000 nid=0x108a66000 waiting on 
condition [00000000]
   java.lang.Thread.State: RUNNABLE

"CompilerThread0" daemon prio=9 tid=10184b800 nid=0x108963000 waiting on 
condition [00000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=9 tid=10184a800 nid=0x108860000 waiting on 
condition [00000000]
   java.lang.Thread.State: RUNNABLE

"Surrogate Locker Thread (CMS)" daemon prio=5 tid=101849000 nid=0x10875d000 
waiting on condition [00000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=8 tid=101830000 nid=0x108643000 in Object.wait() 
[108642000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <7f3001300> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
    - locked <7f3001300> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
    at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=10 tid=10182f000 nid=0x108532000 in 
Object.wait() [108531000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <7f30011d8> (a java.lang.ref.Reference$Lock)
    at java.lang.Object.wait(Object.java:485)
    at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
    - locked <7f30011d8> (a java.lang.ref.Reference$Lock)

"main" prio=5 tid=101801800 nid=0x100501000 runnable [100500000]
   java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:129)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
    - locked <7f3e12df0> (a java.io.BufferedInputStream)
    at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
    at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
    - locked <7f3e10418> (a sun.net.www.protocol.http.HttpURLConnection)
    at no.priv.garshol.duke.SparqlClient.getResponse(SparqlClient.java:52)
    at no.priv.garshol.duke.SparqlClient.execute(SparqlClient.java:30)
    at no.priv.garshol.duke.SparqlDataSource$SparqlIterator.fetchNextPage(SparqlDataSource.java:106)
    at no.priv.garshol.duke.SparqlDataSource$SparqlIterator.next(SparqlDataSource.java:92)
    at no.priv.garshol.duke.SparqlDataSource$SparqlIterator.next(SparqlDataSource.java:43)
    at no.priv.garshol.duke.Duke.main(Duke.java:82)

"VM Thread" prio=9 tid=10182a000 nid=0x10842f000 runnable 

"Gang worker#0 (Parallel GC Threads)" prio=9 tid=101804800 nid=0x1007c7000 
runnable 

"Gang worker#1 (Parallel GC Threads)" prio=9 tid=101805800 nid=0x1017cc000 
runnable 

"Concurrent Mark-Sweep GC Thread" prio=9 tid=101808000 nid=0x1080b6000 runnable 
"VM Periodic Task Thread" prio=10 tid=101850800 nid=0x108c6c000 waiting on 
condition 

"Exception Catcher Thread" prio=10 tid=101802800 nid=0x100604000 runnable 
JNI global references: 1704

Heap
 par new generation   total 19136K, used 14785K [7f3000000, 7f44c0000, 7f44c0000)
  eden space 17024K,  86% used [7f3000000, 7f3e707e8, 7f40a0000)
  from space 2112K,   0% used [7f40a0000, 7f40a0000, 7f42b0000)
  to   space 2112K,   0% used [7f42b0000, 7f42b0000, 7f44c0000)
 concurrent mark-sweep generation total 63872K, used 0K [7f44c0000, 7f8320000, 7fae00000)
 concurrent-mark-sweep perm gen total 21248K, used 8748K [7fae00000, 7fc2c0000, 800000000)






What version of the product are you using? On what operating system?

Using duke-0.2-SNAPSHOT.jar built from source with Java version "1.6.0_24" on 
Mac OS X 10.5.8

Original issue reported on code.google.com by Michael.Hausenblas on 21 May 2011 at 2:46

Weighted Levenshtein

We need to be able to treat some edits as larger than others. Particularly 
edits involving numbers are important.

Original issue reported on code.google.com by [email protected] on 1 Oct 2011 at 6:46

Support for more complex string matching

At the moment we only support string equality matching. Need to extend this 
with support for tokenized matching and Levenshtein distance matching.

Original issue reported on code.google.com by [email protected] on 2 Apr 2011 at 10:07

Normalizer should strip accents

One way to do it is given here: 
http://blog.smartkey.co.uk/2009/10/how-to-strip-accents-from-strings-using-java-
6/

Original issue reported on code.google.com by [email protected] on 22 Aug 2011 at 9:10

Add MatchListener method for debugging record scores

Would it be possible to add a method in the MatchListener which would be called 
regardless of whether a match has been identified or not. For debugging and 
fine-tuning purposes it would be nice to see what kind of probabilities each 
record scores. It would be even better  if each property showed what 
probability it scored on its own.

Thanks

Original issue reported on code.google.com by [email protected] on 13 Jun 2011 at 6:50

Why not just extend Lucene scoring?

Just curious. You seem to be doing Bayes calculation after getting results from 
Lucene. Why not implement your own scoring instead? Wouldn't that work? Like - 
https://issues.apache.org/jira/browse/LUCENE-2091

Original issue reported on code.google.com by [email protected] on 18 Jun 2011 at 1:00

Support for qgrams matching

Need to implement a qgrams comparator.

Original issue reported on code.google.com by [email protected] on 25 Aug 2011 at 10:10

Merge matching records and use this for improved matching

Once two records have been found to match this can be exploited to go back and 
see if adding more information about the entity allows us to find more matches.

Original issue reported on code.google.com by [email protected] on 2 Apr 2011 at 10:15

shahrukhsaba / duke Goto Github PK

duke's Introduction

duke's People

Watchers

duke's Issues

Recommend Projects

Recommend Topics

Recommend Org

Jobs