Comments (27)

mick commented on May 30, 2024

@nilesh-c

I really like your enthusiasm for this project; I think it could be very useful.

I think it's great that you're already thinking about how you might contribute to dedupe! For your proposal, you should be clear about what you plan to do this summer for GSoC and make sure that it's feasible. Try to be specific about what you plan to work on.

fgregg commented on May 30, 2024

Wow! This would be very exciting for us, the dev team behind dedupe. Mark Huberty has implemented a bunch of edit distances in the dev branch, but we can definitely use more. Parallel processing would be really great and feasible to tackle over the summer (IMHO).

nilesh-c commented on May 30, 2024

Thank you @dthompson and @fgregg! I'm pretty excited. Sorry about the delay; I got caught up with university lab examinations and assignments. I'll submit my initial proposal via Melange in a day and let you know in this thread, so that you can give me feedback on how I can improve it.

nilesh-c commented on May 30, 2024

@dthompson,
I have a question regarding the proposal template: is it preferred that I keep the template structure exactly as it is (like on this page: https://google-melange.appspot.com/gsoc/org/google/gsoc2013/codeforamerica) and just "fill in" my details? Or may I take the liberty of modifying it a bit here and there to polish it, while of course ensuring that all the requested information is provided?

nilesh-c commented on May 30, 2024

@fgregg Please check my post on the open-source-deduplication group here: https://groups.google.com/forum/?fromgroups=#!topic/open-source-deduplication/y_NUg3-k6jk

I'm facing some problems with the examples.

nilesh-c commented on May 30, 2024

@fgregg, I have a question about the different phases (as seen from a high level) of de-duplication in dedupe. I need to be sure, because parallelizing each phase will need a different approach. Am I correct in saying that the following are the basic phases (which can be processed in parallel)?

i) Pre-processing/cleaning (easy to run in parallel)

ii) Active learning (can be run in parallel; a similar implementation is available. How expensive is this, memory-wise and CPU-wise? Do you think this phase may become a bottleneck if we let it run sequentially on, say, 10M records?)

iii) Learning good blocking rules (how expensive is the greedy algorithm for this, memory-wise and CPU-wise? Any ideas on how to do it with MapReduce? Do you think this phase may become a bottleneck if we let it run sequentially on, say, 10M records?)

iv) Blocking (planned; easy to run in parallel - see the sketch below)

v) Applying string similarity functions to individual blocks (planned; can be run in parallel)

vi) Clustering (can be run in parallel; a similar implementation is available)

Did I miss anything in the de-duplication flow?
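
To make the blocking phase concrete, here is a rough sketch of the map/reduce-style parallelism I have in mind (my own made-up predicate and function names, not dedupe's actual API):

```python
# Hypothetical sketch of parallel blocking (not dedupe's actual API).
# Each worker applies a blocking predicate to its own chunk of records;
# the per-worker block tables are then merged in a reduce step.
import multiprocessing
from collections import defaultdict

def block_chunk(records):
    # made-up predicate: block on the lowercased first word of the name
    blocks = defaultdict(list)
    for record_id, record in records:
        key = record["name"].split()[0].lower()
        blocks[key].append(record_id)
    return dict(blocks)

def parallel_block(records, n_workers=4):
    chunks = [records[i::n_workers] for i in range(n_workers)]
    pool = multiprocessing.Pool(n_workers)
    partials = pool.map(block_chunk, chunks)  # map step
    pool.close()
    pool.join()
    merged = defaultdict(list)  # reduce step: merge per-worker tables
    for partial in partials:
        for key, ids in partial.items():
            merged[key].extend(ids)
    return dict(merged)

if __name__ == "__main__":
    records = [(1, {"name": "Acme Corp"}),
               (2, {"name": "acme corporation"}),
               (3, {"name": "Widget Co"})]
    print(parallel_block(records, n_workers=2))
```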

fgregg commented on May 30, 2024

Those are the big parts.

Right now, comparing strings is the biggest bottleneck and promises the largest benefit from parallelization. For the learning stages, it doesn't make a lot of sense to think about parallelization, because the bottleneck is user labeling, which won't allow the training set to ever get that big.

Just a note: although I would love to have more folks working on dedupe, you need to make sure your proposal responds to Code for America's needs, as they are the sponsoring org. Make sure @dthompson thinks this is the right track; it seems like CfA wants to build facilities that make dedupe easier to use, while the main benefit of parallelization would be to make dedupe more powerful.

nilesh-c commented on May 30, 2024

Thanks, Forest.

I have studied a few research papers and have picked a method with which I can immediately start working on (iv) and (v); it's solid and I expect good results. Since string matching is the bottleneck, I guess it should be focused on first. I have almost finished my initial proposal; I just need to make some changes once I confirm with @dthompson the important point that you mentioned.

In my proposal, I have three objectives: 1. parallelizing single-dataset de-duplication with dedupe; 2. implementing de-duplication (also parallel) across two datasets, making a few changes to 1; and 3. making a REST-based API and a web interface/frontend for 1 and 2.

@dthompson, do you think this is okay? If you feel that the current focus for GSoC should be making dedupe easier to use and adding support for dual-dataset de-duplication only, I can omit the parallelization bit. I have already added all three of these to my proposal; depending on your reply, I shall modify it and post.

nilesh-c commented on May 30, 2024

I also feel that being able to run the backend of the data matcher over multiple cores or machines would make it truly scalable: capable of processing huge datasets and making the changes visible on the frontend in a reasonable amount of time.

mick commented on May 30, 2024

@nilesh-c if you want to include working on making dedupe more powerful, that sounds great. You should make sure to work with @fgregg on how that can best be accomplished.

nilesh-c commented on May 30, 2024

@dthompson - awesome. Thanks.

nilesh-c commented on May 30, 2024

@dthompson @fgregg - I have submitted my initial proposal. Please let me know if there are any changes I can make to improve it, and do ask if you need any clarifications.

nilesh-c commented on May 30, 2024

I just updated my proposal - the link to the "Efficient Parallel Set-Similarity Joins Using MapReduce" paper was broken. I fixed it. :)

nilesh-c commented on May 30, 2024

@fgregg Please check out this page and scroll down to the section on performance comparison results:
http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/

The test map-reduce code works on the Google n-grams dataset - textual data. Java+Hadoop seems to be 50% faster than Python+Hadoop Streaming. Dedupe's code would be pretty much CPU-bound and textual (there shouldn't be any need to add binary data types in the future, I suppose? Hadoop Streaming doesn't perform well on binary data), so we can assume that, all things being equal, Java may be 50% faster than Python. Then again, mixing Java and Python has its own overheads. So the feasible options before us are:

i) Python with Hadoop Streaming (which I think I might go with)

ii) Java

What do you think?
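
For reference, a Hadoop Streaming job is just a pair of executables that read stdin and write tab-separated key/value lines. A minimal, hypothetical sketch, where the mapper emits a made-up blocking key and the reducer counts records per block:

```python
#!/usr/bin/env python
# mapper.py - emit a (blocking_key, record) pair per input line;
# the "first four characters" key is a made-up stand-in predicate
import sys

for line in sys.stdin:
    record = line.strip()
    if record:
        print("%s\t%s" % (record[:4].lower(), record))
```

```python
#!/usr/bin/env python
# reducer.py - Hadoop sorts mapper output by key, so records sharing
# a blocking key arrive adjacent; collect each block and emit its size
import sys
from itertools import groupby

pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
for key, group in groupby(pairs, key=lambda kv: kv[0]):
    block = [record for _, record in group]
    print("%s\t%d" % (key, len(block)))
```

The same scripts can be tested locally with `cat data.txt | ./mapper.py | sort | ./reducer.py` before handing them to the streaming jar.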

fgregg commented on May 30, 2024

First, I would want an implementation of parallelism that could take advantage of multiple cores on the same machine, probably using the multiprocessing library. Of course, we should design this so that it's easy to extend to multiple machines.

The actual string comparators are written, and will continue to be written, in C (actually Cython that's compiled to C), so any solution will have to support that. I would need a very compelling reason to bring Java into the mix.
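
To illustrate the constraint, a minimal sketch (mine, not dedupe's code): the worker function handed to multiprocessing must live at module top level so it can be pickled, and its body is free to call into a compiled extension. A pure-Python stand-in takes the place of a Cython comparator here:

```python
# Sketch: multiprocessing workers calling a compiled comparator.
# compare_pair must be defined at module top level so Pool can pickle
# it; in real dedupe code its body would call a Cython/C comparator.
import multiprocessing

def compare_pair(pair):
    a, b = pair
    # stand-in metric; imagine a C edit-distance function here
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

if __name__ == "__main__":
    pairs = [("forest", "forrest"), ("gregg", "greg"), ("dedupe", "dedupe")]
    pool = multiprocessing.Pool()
    print(pool.map(compare_pair, pairs))
    pool.close()
    pool.join()
```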

nilesh-c commented on May 30, 2024

I understand. Doing it with Apache Hadoop would let it take advantage of multiple cores on a single machine, and it could be run in a distributed manner on multiple machines (clusters) seamlessly. I was just making sure which language to use with Hadoop. I will stick to Python.

Now, another question:

Would it be better if we made the parallel dedupe (with a dependency on Apache Hadoop) a separate project/repo altogether? If someone just wants to use dedupe quickly on a small dataset on their laptop and doesn't feel like installing Hadoop, they can just clone the normal dedupe. If parallelism is needed, one can clone the Hadoop-based repo, install the dependencies using a script I will provide, and proceed.

The problem with the above idea is that the developers would have to maintain two separate codebases. Alternatively, we could keep both implementations in the current repo and let the user choose before running. That could get a tad complicated, I suppose, but is probably worth it. What do you suggest?

fgregg commented on May 30, 2024

I would prefer to see how far we can go with Python's multiprocessing library, which is part of the standard library, before accepting that we need to use Hadoop.

If multiprocessing is acceptable, the same codebase can be used for one CPU or hundreds.

If multiprocessing alone is unacceptable, then I would still prefer we go with a Python solution: http://wiki.python.org/moin/ParallelProcessing

nilesh-c commented on May 30, 2024

Alright. If we eliminate Hadoop for now (though even with Hadoop we wouldn't have needed to write Java code anyway) and look at pure-Python options, we've got a few to choose from:

i) A raw map-reduce implementation on top of the standard library's multiprocessing.Pool (useful for initial testing, but not good for much else; no cluster or distributed file system)

ii) octo.py or mincemeat.py - very simple map-reduce implementations built on the Python standard library. These just work, but don't compare with high-quality implementations. The main problems: (a) there is no distributed file system; (b) all data is stored in memory, so huge datasets won't fit; (c) all data is sent over the network between machines, so these frameworks don't scale well to huge datasets.

iii) Disco is a much more complete implementation, and the one I'd be inclined to choose. It has lots of documentation and is well developed. It's mainly Python, with a bit of Erlang driving the distributed file system. It's a nice framework and, as benchmarks suggest, probably in the same performance ballpark as Hadoop.

Once we decide what to use, I may start with a bit of prototyping.
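
For a feel for Disco's API, this is its canonical word-count example, lightly adapted (it assumes a running Disco master; the input URL is the one from Disco's own docs):

```python
# Disco's classic word-count job, adapted from the project's docs.
from disco.core import Job, result_iterator

def fun_map(line, params):
    # map: emit (word, 1) for every token in the line
    for word in line.split():
        yield word, 1

def fun_reduce(iter, params):
    # reduce: sum the counts for each word
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == "__main__":
    job = Job().run(input=["http://discoproject.org/media/text/chekhov.txt"],
                    map=fun_map,
                    reduce=fun_reduce)
    for word, count in result_iterator(job.wait(show=True)):
        print("%s %d" % (word, count))
```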

nilesh-c commented on May 30, 2024

@fgregg - Shall I go with Disco? I think it's in line with what you want. It's also on the Python ParallelProcessing page you mentioned.

nilesh-c commented on May 30, 2024

@fgregg - Please check my forked repo of dedupe here: https://github.com/nilesh-c/dedupe. I have added support for running core.scoreDuplicates in parallel using multiprocessing.Pool and Pool.map(). I tried this out to measure the difference in execution time; the benchmark is here: https://github.com/nilesh-c/dedupe/blob/master/parallel_report.md

The basic Pool.map() is good for testing out prototype map-reduce code (probably not a good idea for serious work with big datasets). I can later integrate it with Disco to support distributed processing and the works.
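
In outline, the change looks something like this (a condensed sketch, not the exact code in my fork; the scoring function is a stand-in for dedupe's learned model):

```python
# Sketch: scoring candidate record pairs in parallel with Pool.map.
import multiprocessing

def score_pair(pair):
    # stand-in scorer; the real code applies dedupe's learned
    # field-distance model to each candidate pair
    (id_a, rec_a), (id_b, rec_b) = pair
    score = 1.0 if rec_a == rec_b else 0.0
    return (id_a, id_b, score)

def score_duplicates_parallel(candidate_pairs, processes=4):
    pool = multiprocessing.Pool(processes=processes)
    try:
        # a large chunksize keeps inter-process overhead low when
        # there are millions of small, cheap tasks
        return pool.map(score_pair, candidate_pairs, chunksize=1000)
    finally:
        pool.close()
        pool.join()

if __name__ == "__main__":
    pairs = [((1, "acme corp"), (2, "acme corp")),
             ((1, "acme corp"), (3, "widget co"))]
    print(score_duplicates_parallel(pairs, processes=2))
```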

fgregg commented on May 30, 2024

This is really cool! I'm waiting until Code for America decides who they want to go with. They should have their decision soon.

Best,

Forest

nilesh-c commented on May 30, 2024

Thanks Forest! OK, sounds good.

webmaven commented on May 30, 2024

Was this work ever completed?

fgregg commented on May 30, 2024

Which part?

webmaven commented on May 30, 2024

The parallelization, mostly.

But I am also curious as to whether any of his work on the fork at https://github.com/nilesh-c/dedupe was ever merged back.

fgregg commented on May 30, 2024

Yup, parallelization has been implemented: http://dedupe.readthedocs.org/en/v0.5.3.0/API-documentation.html#Dedupe
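
For anyone landing here later, roughly how that looks in use (a hypothetical sketch; check the linked API page for the exact signature in your version - in recent dedupe releases the parallelism knob is the num_cores argument):

```python
# Hypothetical usage sketch; the field-definition format and the
# exact constructor signature vary by dedupe version - see the
# linked API documentation.
import dedupe

fields = {'name': {'type': 'String'},
          'address': {'type': 'String'}}

# num_cores tells dedupe how many worker processes to use for the
# CPU-bound record-pair scoring stage
deduper = dedupe.Dedupe(fields, num_cores=4)
```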

nilesh-c commented on May 30, 2024

Ah, I wanted to see it happen too. Pretty cool. :)
