Dear all, I have a document set that returns a duplicate according to unique():

Different results from unique() and difference of deduplicated set (litstudy, open issue)

larsgrobe commented on August 10, 2024

Different results from unique() and difference of deduplicated set

from litstudy.

Comments (2)

stijnh commented on August 10, 2024

Hi! Thanks for the bug report.

I've given this some careful thought, and although this behavior might seem counterintuitive, it is indeed correct.

The - operator relies on "fuzzy" matching to determine which documents from the left-hand set should be excluded, based on the right-hand set. In the case you described, where there are two identical documents, docset - docset.unique() results in an empty set. This happens because the "fuzzy" matching treats both copies as matching the single copy kept in docset.unique() (likely due to a matching DOI), so every document on the left-hand side is excluded.
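To make the explanation above concrete, here is a minimal toy sketch in plain Python. It does not use litstudy and is not its implementation: the helpers `same_document`, `unique`, and `difference` are hypothetical stand-ins, and "fuzzy" matching is reduced to comparing DOIs purely as an assumption for illustration.

```python
# Toy model only (NOT litstudy's code): documents are plain dicts and two
# documents count as "the same" when their DOIs are equal (an assumption).

def same_document(a, b):
    """Hypothetical stand-in for fuzzy matching: compare DOIs."""
    return a["doi"] == b["doi"]

def unique(docs):
    """Keep the first copy of each document, drop later matches."""
    kept = []
    for doc in docs:
        if not any(same_document(doc, k) for k in kept):
            kept.append(doc)
    return kept

def difference(left, right):
    """Keep documents from `left` that match nothing in `right`."""
    return [d for d in left if not any(same_document(d, r) for r in right)]

docs = [
    {"title": "A study", "doi": "10.1000/example.1"},
    {"title": "A Study", "doi": "10.1000/example.1"},  # duplicate by DOI
    {"title": "Another paper", "doi": "10.1000/example.2"},
]

deduplicated = unique(docs)
removed = difference(docs, deduplicated)

# Both copies of the duplicated document match the copy that was kept,
# so the difference is empty even though one document was dropped.
print(len(docs), len(deduplicated), len(removed))
```

In this toy model the difference is computed by matching, not by object identity, which is why the dropped copy does not show up in it.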

Nonetheless, I can see how it is odd that there is no way to retrieve which documents were removed by unique().

Would it work for you if we were to add a duplicates() method? This method would specifically return the duplicate documents, ensuring that len(docset) == len(docset.unique()) + len(docset.duplicates()).
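The invariant proposed above can be sketched in plain Python. This is only an illustration under stated assumptions: duplicates() is a proposal in this thread rather than an existing method, and `split_duplicates` and its `key` parameter are hypothetical names that use exact key equality in place of fuzzy matching.

```python
# Hypothetical sketch (NOT litstudy's API): split a list of documents into
# first occurrences and later duplicates in a single pass.

def split_duplicates(docs, key):
    """Return (unique_docs, duplicate_docs) based on `key(doc)`."""
    seen = set()
    unique_docs = []
    duplicate_docs = []
    for doc in docs:
        k = key(doc)
        if k in seen:
            # A document with this key was already kept: it is a duplicate.
            duplicate_docs.append(doc)
        else:
            seen.add(k)
            unique_docs.append(doc)
    return unique_docs, duplicate_docs

docs = [
    {"title": "A study", "doi": "10.1000/example.1"},
    {"title": "A Study", "doi": "10.1000/example.1"},  # duplicate by DOI
    {"title": "Another paper", "doi": "10.1000/example.2"},
]

unique_docs, duplicate_docs = split_duplicates(docs, key=lambda d: d["doi"])

# Every document lands in exactly one of the two lists, so the counts add up.
assert len(docs) == len(unique_docs) + len(duplicate_docs)
print([d["title"] for d in duplicate_docs])
```

Because each document is appended to exactly one of the two lists, the length identity holds by construction, whatever matching rule is plugged in for `key`.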

larsgrobe commented on August 10, 2024

Hi, yes, that would help. That was exactly the idea: I just wanted to see which documents had been identified as duplicates. Best, Lars.