Comments (2)
Hi! Thanks for the bug report.
I've given this some careful thought, and although this behavior might seem counterintuitive, it is indeed correct.
The `-` operator relies on "fuzzy" matching to determine which documents from the left-hand set should be excluded, based on the right-hand set. In the case you described, where there are two identical documents, `docset - docset.unique()` results in an empty set. This happens because the "fuzzy" matching treats the same document as present in both sets (likely due to matching DOI).
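To make the reasoning above concrete, here is a minimal sketch in plain Python. It does not use litstudy itself: the dict records and the helpers `matches`, `difference`, and `unique` are hypothetical stand-ins, and matching on an exactly equal DOI is an assumption about what the "fuzzy" matching does in this case.

```python
# Hypothetical stand-ins: plain dicts, not litstudy's real document classes.
def matches(a, b):
    """Assumed matching rule: two records match when their DOIs are equal."""
    return a["doi"] == b["doi"]


def difference(left, right):
    """Keep each record of `left` that matches no record of `right`."""
    return [a for a in left if not any(matches(a, b) for b in right)]


def unique(docs):
    """Keep the first record of each group of matching records."""
    kept = []
    for doc in docs:
        if not any(matches(doc, k) for k in kept):
            kept.append(doc)
    return kept


docs = [
    {"title": "Paper A", "doi": "10.1000/a"},
    {"title": "Paper A (second copy)", "doi": "10.1000/a"},
]

deduplicated = unique(docs)
remaining = difference(docs, deduplicated)
print(len(deduplicated))  # 1
print(remaining)          # []
```

Both copies match the single record kept by `unique`, so the difference removes both of them, which is why nothing is left over.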
Nonetheless, I can see how it is odd that there is no way to retrieve which documents were removed by `unique`.
Would it work for you if we were to add a `duplicates()` method? This method would specifically return the duplicate documents, ensuring that `len(docset) == len(docset.unique()) + len(docset.duplicates())`.
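As a rough illustration of the proposed invariant, the sketch below splits a list of records into the ones kept and the ones dropped as duplicates. It is not litstudy code: `split_duplicates` is a hypothetical helper, the records are plain dicts, and matching on an equal DOI is an assumption.

```python
def split_duplicates(docs):
    """Split records into (kept, dropped), matching on an equal DOI (assumed rule)."""
    kept, dropped = [], []
    for doc in docs:
        if any(doc["doi"] == k["doi"] for k in kept):
            dropped.append(doc)  # matches an earlier record: a duplicate
        else:
            kept.append(doc)     # first record with this DOI
    return kept, dropped


docs = [
    {"title": "Paper A", "doi": "10.1000/a"},
    {"title": "Paper A (second copy)", "doi": "10.1000/a"},
    {"title": "Paper B", "doi": "10.1000/b"},
]

kept, dropped = split_duplicates(docs)

# Every record lands in exactly one of the two lists.
assert len(docs) == len(kept) + len(dropped)
print([d["title"] for d in dropped])  # ['Paper A (second copy)']
```

Because each record goes into exactly one list, the length identity holds by construction, and the dropped list is what a `duplicates()` method would expose.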
Hi, yes, that would help. That was exactly the idea: I just wanted to see which documents had been identified as duplicates. Best, Lars.