Incorrect Docs
At the top of https://pyjedai.readthedocs.io/en/latest/tutorials/DirtyER.html
an attribute list is defined for the data: attr = ['Entity Id','author', 'title']
(As an aside, IMHO it does not make sense to include 'Entity Id': it is always different for each entity, so it only lowers the similarity score of otherwise identical entities. I would suggest removing 'Entity Id' from the attr list.)
Later, entity matching is instantiated without specifying an attribute list:
em = EntityMatching(
    metric='jaccard',
    similarity_threshold=0.0
)
This, however, results in all attributes of the entities being compared, because EntityMatching does not fall back to the attributes specified in the Data object; see:
self.attributes: list = attributes
The constructor stores the provided attributes, or None if none are given. My suggestion is to either update the tutorial:
em = EntityMatching(
    metric='jaccard',
    similarity_threshold=0.0,
    attributes=attr
)
or, even better, fall back to using data.attributes in the em.predict method if self.attributes is None.
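The fallback could look roughly like the sketch below. Both classes are hypothetical stand-ins written for this issue, not the pyjedai classes: the real predict method takes more arguments and does the actual matching, and only the None check is the point here.

```python
from typing import Optional


class DataSketch:
    """Hypothetical stand-in for Data; only the attributes field is modelled."""

    def __init__(self, attributes: Optional[list] = None):
        self.attributes = attributes


class EntityMatchingSketch:
    """Hypothetical stand-in for EntityMatching, reduced to the fallback logic."""

    def __init__(self, metric: str = "jaccard",
                 similarity_threshold: float = 0.0,
                 attributes: Optional[list] = None):
        self.metric = metric
        self.similarity_threshold = similarity_threshold
        self.attributes = attributes

    def predict(self, data: DataSketch):
        # Use the attributes declared on the data when none were
        # passed to the constructor.
        if self.attributes is None:
            self.attributes = data.attributes
        # The real method would go on to compare entities; the sketch
        # only returns the attribute list that would be used.
        return self.attributes
```

With this in place, the tutorial code would compare only the attributes in attr without any change to the EntityMatching call.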
Issues regarding similarity calculation
As I understand the _similarity method, attributes can be a dict, a list, or None. To reflect the dict use case, self.attributes should be allowed to be a dict, for example by changing its type annotation to Any here:
self.attributes: list = attributes
More severe is that the similarity calculation is currently only correct if no attributes are specified.
For the dict case, the if should be elif here:
if isinstance(self.attributes, list):
Currently, the final else branch overwrites the similarity already calculated for the dict case.
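A minimal reproduction of the control flow as I read it. Everything here is a simplified assumption for illustration: toy_metric replaces the real metric, entities are plain dicts, and the dict form is assumed to map attribute names to weights.

```python
def toy_metric(a: str, b: str) -> float:
    """Hypothetical metric: 1.0 for identical strings, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


def similarity_with_if(attributes, e1: dict, e2: dict) -> float:
    similarity = 0.0
    if isinstance(attributes, dict):
        for attribute, weight in attributes.items():
            similarity += weight * toy_metric(e1[attribute], e2[attribute])
    if isinstance(attributes, list):  # plain if: starts a new if/else chain
        for attribute in attributes:
            similarity += toy_metric(e1[attribute], e2[attribute])
        similarity /= len(attributes)
    else:
        # Also reached when attributes is a dict, so the weighted
        # result computed above is overwritten.
        similarity = toy_metric(" ".join(e1.values()), " ".join(e2.values()))
    return similarity


def similarity_with_elif(attributes, e1: dict, e2: dict) -> float:
    similarity = 0.0
    if isinstance(attributes, dict):
        for attribute, weight in attributes.items():
            similarity += weight * toy_metric(e1[attribute], e2[attribute])
    elif isinstance(attributes, list):  # elif: else now means "no attributes"
        for attribute in attributes:
            similarity += toy_metric(e1[attribute], e2[attribute])
        similarity /= len(attributes)
    else:
        similarity = toy_metric(" ".join(e1.values()), " ".join(e2.values()))
    return similarity


e1 = {"id": "1", "title": "x"}
e2 = {"id": "2", "title": "x"}
weights = {"title": 1.0}
print(similarity_with_if(weights, e1, e2))    # whole-row comparison wins
print(similarity_with_elif(weights, e1, e2))  # weighted title comparison
```

In the if version the two entities differ in id, so the whole-row comparison replaces the weighted title score; with elif the weighted score is returned.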
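For the list case, the division should happen outside the loop, not inside. So this line:
similarity /= len(self.attributes)
should be de-indented one level; otherwise the running sum is divided by len(self.attributes) on every iteration instead of once at the end, and the result is smaller than the intended mean.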
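To illustrate the effect, here is a small standalone sketch, assuming the division currently sits inside the for loop as described. The function names are made up for this issue, and the per-attribute scores are passed in directly instead of being computed by a metric.

```python
def mean_divide_inside(scores: list) -> float:
    """Division inside the loop: the running sum shrinks on every pass."""
    similarity = 0.0
    for score in scores:
        similarity += score
        similarity /= len(scores)
    return similarity


def mean_divide_outside(scores: list) -> float:
    """Division after the loop: the plain mean of the scores."""
    similarity = 0.0
    for score in scores:
        similarity += score
    similarity /= len(scores)
    return similarity


# Two attributes that both match perfectly.
print(mean_divide_inside([1.0, 1.0]))   # 0.75
print(mean_divide_outside([1.0, 1.0]))  # 1.0
```

Two identical entities should score 1.0, but with the division inside the loop they only reach 0.75, and the gap grows with the number of attributes.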
best