snowblink14 / smatch
Smatch tool: evaluation of AMR semantic structures
License: MIT License
I noticed that the smatch tool sometimes produces different results on the same input. Is this a known behavior?
Here is a simple example. I have the following two AMRs:
gold.amr
(v1 / res
:tar (v2 / rere
:rest (v3 / rest
:pay (v4 / cc
:na "cname1"))
:prt (v5 / and
:op1 (v6 / per
:qua "num1")
:op2 (v7 / chi
:qua "num2"))))
pred.amr
(v1 / se
:tar (v2 / rest
:ham (v3 / and
:op1 (v4 / mq
:qua "num1")
:op2 (v5 / per
:qua "num1"))))
I'm running the smatch tool like this:
./smatch.py -f gold.amr pred.amr --pr
Most of the time, the result is this:
Precision: 0.29
Recall: 0.42
F-score: 0.34
But from time to time, I also get this result:
Precision: 0.24
Recall: 0.33
F-score: 0.28
This issue is spun off of #10 as it's really a different issue despite also being about inverted roles.
Smatch compares triples after deinversion, so (a ... :ARG0-of (b ... becomes the triple (ARG0, b, a) (in smatch's (role, source, target) order). If this deinversion results in a constant becoming the source, the triple is not considered by smatch at all, resulting in inflated scores. While these triples may not be considered valid AMR, they are nevertheless observed in the outputs of automatic systems and should be considered during evaluation.
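The deinversion step can be sketched as follows (a simplified illustration of the behavior described above, not smatch's actual code; is_constant is a hypothetical helper):

```python
def deinvert(role, source, target):
    """Normalize an inverted role: (ROLE-of, a, b) becomes (ROLE, b, a)."""
    if role.endswith("-of"):
        return (role[:-3], target, source)
    return (role, source, target)

def is_constant(node):
    # Hypothetical check: AMR variables are identifiers (a, b, n2, ...);
    # anything else (numbers, quoted strings) is treated as a constant.
    return not node.isidentifier()

# Deinverting (ARG0-of, a, 1) yields (ARG0, 1, a): the constant "1" is now
# the source, which is exactly the shape of triple that smatch drops.
triple = deinvert("ARG0-of", "a", "1")
print(triple)                  # ('ARG0', '1', 'a')
print(is_constant(triple[1]))  # True
```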
Consider the following hypothetical AMRs as gold and as the outputs of three different systems. The triples that smatch computes (including the top triple) are in comments.
$ cat gold # original: (TOP, a, alpha) (instance, a, alpha) (ARG0, a, 1)
(a / alpha
:ARG0 1)
$ cat a # wrong value: (TOP, a, alpha) (instance, a, alpha) (ARG0, a, 2)
(a / alpha
:ARG0 2)
$ cat b # missing edge: (TOP, a, alpha) (instance, a, alpha)
(a / alpha)
$ cat c # wrong direction: (TOP, a, alpha) (instance, a, alpha) (ARG0, 1, a)
(a / alpha
:ARG0-of 1)
Now we compare these to the gold. The raw counts of matching triples used to compute precision and recall are in comments.
$ python smatch.py --pr -f a gold # P=2/3 R=2/3
Precision: 0.67
Recall: 0.67
F-score: 0.67
$ python smatch.py --pr -f b gold # P=2/2 R=2/3
Precision: 1.00
Recall: 0.67
F-score: 0.80
$ python smatch.py --pr -f c gold # P=2/2 R=2/3 (Note: P != 2/3)
Precision: 1.00
Recall: 0.67
F-score: 0.80
Smatch considers three types of triples: instances (e.g., (instance, s, see-01)), attributes (e.g., (polarity, a, -)), and edges (e.g., (ARG0, s, b)). In all cases, the source is the variable of some node. When the source is a constant, the triple doesn't fit into any of these three categories.
The straightforward fix is to ensure that inverted triples whose sources are constants (such as the (ARG0, 1, a) triple of c) are counted in the denominators for P and R, perhaps grouped in an "extra triples" category if necessary. When the role is :domain-of or :mod-of, however, the behavior may need to differ (see #10), but those cases can perhaps be resolved when decoding the AMR rather than during the smatch computation.
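The effect of the proposed fix can be sketched with plain triple sets (a sketch of the suggested scoring behavior, not a patch against smatch): if constant-source triples count toward the denominators, system c above gets P = 2/3 instead of 2/2.

```python
def precision_recall(matched, test_triples, gold_triples):
    """P and R computed over ALL triples, including "extra" triples
    whose deinverted source is a constant."""
    return matched / len(test_triples), matched / len(gold_triples)

gold = [("TOP", "a", "alpha"), ("instance", "a", "alpha"), ("ARG0", "a", "1")]
# System c after deinversion: (ARG0, 1, a) has a constant source, but it
# still counts in the denominator of precision under the proposed fix.
sys_c = [("TOP", "a", "alpha"), ("instance", "a", "alpha"), ("ARG0", "1", "a")]

matched = len(set(gold) & set(sys_c))  # the two matching triples
p, r = precision_recall(matched, sys_c, gold)
print(round(p, 2), round(r, 2))  # 0.67 0.67
```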
I don't need this fixed, but maybe it's of interest:
I noticed that this smatch version returns a score of 1.00 for the following two different graphs:
(s / see-01
:ARG0 (p / person
:name (n / name
:op1 "Hans")))
and
(s / see-01
:ARG0 (p / person
:name (n / name
:op1 "Hans_")))
Wondering if there's a bug, or if there is some reason for this? It might be some non-obvious preprocessing that happens here. Since AMRs may also contain things like https links, which can include characters such as "_", removing characters does not seem sensible.
Edit:
It can be even more severe. The score is also 1.00 for the very different graphs:
(s / see-01
:ARG0 (p / person
:name (n / name
:op1 "Hans Meier")))
and
(s / see-01
:ARG0 (p / person
:name (n / name
:op1 "Hans")))
While the first two graphs could be explained by some pre-processing quirk, the second pair clearly looks like a bug.
Hello,
I have the following AMR format, but it raises warnings and errors:
# ::tok Hallmark could make a fortune off of this guy .
# ::alignments 0-1.1.1.2.1 1-1 2-1.1 4-1.1.2 6-1.1.2.1.r 7-1.1.2.1.1 8-1.1.2.1
(p / possible-01~e.1
:ARG1 (m / make-05~e.2
:ARG0 (c / company :wiki "Hallmark_Cards"
:name (n / name :op1 "Hallmark"~e.0))
:ARG1 (f / fortune~e.4
:source~e.6 (g / guy~e.8
:mod (t / this~e.7)))))
These are the errors:
File 1 has error/warning message:
*** Line 3 - Ignoring unexpected tokens: (p / possible-01~e.1
*** Line 4 - Ignoring unexpected token: :ARG1
*** Line 4 - Ignoring unexpected tokens: (m / make-05~e.2
*** Line 5 - Ignoring unexpected token: :ARG0
*** Line 6 - Ignoring unexpected token: "Hallmark"~e.0
*** Line 7 - Ignoring unexpected token: :ARG1
*** Line 7 - Ignoring unexpected tokens: (f / fortune~e.4
*** Line 8 - Ignoring unexpected token: :source~e.6
*** Line 8 - Ignoring unexpected tokens: (g / guy~e.8
*** Line 9 - Ignoring unexpected token: :mod
*** Line 9 - Ignoring unexpected tokens: (t / this~e.7
*** Line 9 - Non-matching close parenthesis.
*** Line 9 - Non-matching close parenthesis.
*** Line 9 - Non-matching close parenthesis.
*** Line 9 - Non-matching close parenthesis.
*** Line 9 - Non-matching close parenthesis.
11 errors and 5 warnings in 9 lines.
File 2 has error/warning message:
*** Line 3 - Ignoring unexpected tokens: (p / possible-01~e.1
*** Line 4 - Ignoring unexpected token: :ARG1
*** Line 4 - Ignoring unexpected tokens: (m / make-05~e.2
*** Line 5 - Ignoring unexpected token: :ARG0
*** Line 6 - Ignoring unexpected token: "Hallmark"~e.0
*** Line 7 - Ignoring unexpected token: :ARG1
*** Line 7 - Ignoring unexpected tokens: (f / fortune~e.4
*** Line 8 - Ignoring unexpected token: :source~e.6
*** Line 8 - Ignoring unexpected tokens: (g / guy~e.8
*** Line 9 - Ignoring unexpected token: :mod
*** Line 9 - Ignoring unexpected tokens: (t / this~e.7
*** Line 9 - Non-matching close parenthesis.
*** Line 9 - Non-matching close parenthesis.
*** Line 9 - Non-matching close parenthesis.
*** Line 9 - Non-matching close parenthesis.
*** Line 9 - Non-matching close parenthesis.
11 errors and 5 warnings in 9 lines.
Is the AMR in the correct format? Is something missing within Smatch?
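The parse errors are most likely triggered by the ~e.N alignment markers, which this version of smatch does not appear to handle. A possible workaround (assuming the alignments are not needed for scoring) is to strip them before calling smatch:

```python
import re

# Matches ISI/JAMR-style alignment markers such as "~e.6" or "~e.2,3".
ALIGN_RE = re.compile(r"~e\.\d+(?:,\d+)*")

def strip_alignments(amr_text):
    """Remove alignment markers so smatch sees plain AMR tokens."""
    return ALIGN_RE.sub("", amr_text)

line = ':ARG1 (f / fortune~e.4 :source~e.6 (g / guy~e.8))'
print(strip_alignments(line))  # :ARG1 (f / fortune :source (g / guy))
```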
Two related questions: what is the version of the current software, and what is considered the "official" fork (if there is one)? Some points to consider:
setup.py says "1.0", but the version on PyPI is 1.0.1 (the only differences between the version on PyPI and the current master branch are the changes introduced by #20 and #21).
@danielhers and @snowblink14 are listed as maintainers on PyPI, so what do you think? Should we bump the current version to, say, 2.0.3 so it's at least numerically consistent with the SemEval 2016 release? And perhaps the README could state whether this is the official (or at least preferred) Smatch code or whether another repository should be used instead. I'm happy to do the work and provide a PR for these things, but I cannot make the decision.
I tried evaluating a single sentence against itself and I got a Smatch score greater than one (!!). Any idea why?
Thank you.
Details below:
python smatchnew/smatch/smatch.py -f q3.txt q3.txt
F-score: 1.11
cat q3.txt
# ::snt How many white settlers were living in Kenya in the 1950's ?
(l / live-01
:ARG0 (p / person
:ARG1-of (s / settle-03
:ARG1 p
:ARG4 c)
:ARG1-of (w / white-02)
:quant (a / amr-unknown))
:location (c / country :name "Kenya")
:time (d / date-entity :decade 1950))
Right now the only way to use smatch is as a command-line script.
I would like to use it from my Python code, so I do something like:
from smatch import f1_score
print(f1_score('(b / bark-01 :ARG0 (d / dog))', '(w / walk-01 :ARG0 (m / man))'))
Hi,
I pass two AMR strings with the same meaning but do not get a score of 1. The only difference between the two strings is that one has ARG2 where the other has ARG2-of. I found that this results in different "TOP" attribute triples, and thus the computed smatch score is not 1. I am wondering why the "TOP" attribute triple is added at all, and how to fix this problem.
Below are the two strings:
(e / except-01 :ARG2 (c/ change-01 :ARG1 (n/ nothing)) :ARG1 (p / pass-01 :ARG2 (l / law :name (n2 / name :op1 "Obaminationcare") :wiki "Patient_Protection_and_Affordable_Care_Act")))
(c / change-01:ARG1 (n / nothing):ARG2-of (e / except-01:ARG1 (p / pass-01:ARG2 (l / law :wiki "Patient_Protection_and_Affordable_Care_Act":name (n2 / name :op1 "Obaminationcare")))))
I use AMR.parse_AMR_line to parse the above two strings and get the following triples:
instance triple
('instance', 'e', 'except-01')
('instance', 'c', 'change-01')
('instance', 'n', 'nothing')
('instance', 'p', 'pass-01')
('instance', 'l', 'law')
('instance', 'n2', 'name')
attribute triple
('TOP', 'e', 'except-01')
('wiki', 'l', 'Patient_Protection_and_Affordable_Care_Act_')
('op1', 'n2', 'Obaminationcare_')
relation triple
('ARG2', 'e', 'c')
('ARG1', 'e', 'p')
('ARG1', 'c', 'n')
('ARG2', 'p', 'l')
('name', 'l', 'n2')
instance triple
('instance', 'c', 'change-01')
('instance', 'n', 'nothing')
('instance', 'e', 'except-01')
('instance', 'p', 'pass-01')
('instance', 'l', 'law')
('instance', 'n2', 'name')
attribute triple
('TOP', 'c', 'change-01')
('wiki', 'l', 'Patient_Protection_and_Affordable_Care_Act_')
('op1', 'n2', 'Obaminationcare_')
relation triple
('ARG1', 'c', 'n')
('ARG2', 'e', 'c')
('ARG1', 'e', 'p')
('ARG2', 'p', 'l')
('name', 'l', 'n2')
AMRs with missing role names are accepted by Smatch and translated into triples:
(a0 / watch
: (a1 / boy)
:ARG1 (a2 / tv))
Triples by the smatch demo:
instance(a0,boy) ^ instance(a1,tv) ^ TOP(a0,boy) ^ ARG1(a0,a1)
This has several undesired consequences, such as licensing ill-formed AMRs that may still receive high scores.
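A simple pre-check could reject such AMRs before scoring (a sketch with a hypothetical validator, not part of smatch):

```python
import re

# A role token is ":" followed by at least one letter, e.g. :ARG1, :op1.
ROLE_RE = re.compile(r"^:[A-Za-z][A-Za-z0-9-]*$")

def malformed_roles(amr_text):
    """Return tokens that look like roles but have no role name."""
    return [t for t in amr_text.split()
            if t.startswith(":") and not ROLE_RE.match(t)]

# The bare ":" from the example above is flagged; :ARG1 passes.
print(malformed_roles("(a0 / watch : (a1 / boy) :ARG1 (a2 / tv))"))  # [':']
```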
Python 2 was retired at the beginning of the year, and it doesn't even ship with the latest Ubuntu LTS (20.04) by default. I think smatch should remove explicit support for Python 2. Mainly this means that any workarounds for Python 2 are removed and that it is no longer listed in setup.py, tox.ini, and .travis.yml as a supported version. Users who absolutely need Python 2 should pin their smatch version to the current or a previous version.
I've already prepared a branch for a pull request, but I created this issue in case there is any discussion about what to do.
Please add encoding='utf8' as an argument to open in the following line:
Line 880 in ad7e655
I found two quirks in the following example. Perhaps something is wrong with my calculate_smatch function, but I do not think so. (It is modified from score_amr_pairs.)
from typing import List

import smatch

def calculate_smatch(refs_penman: List[str], preds_penman: List[str]):
    total_match_num = total_test_num = total_gold_num = 0
    n_invalid = 0
    for sentid, (ref_penman, pred_penman) in enumerate(zip(refs_penman, preds_penman), 1):
        best_match_num, test_triple_num, gold_triple_num = smatch.get_amr_match(
            ref_penman, pred_penman, sent_num=sentid
        )
        total_match_num += best_match_num
        total_test_num += test_triple_num
        total_gold_num += gold_triple_num
        # clear the matching triple dictionary for the next AMR pair
        smatch.match_triple_dict.clear()
    score = smatch.compute_f(total_match_num, total_test_num, total_gold_num)
    return {
        "smatch_precision": score[0],
        "smatch_recall": score[1],
        "smatch_fscore": score[2],
        "ratio_invalid_amrs": n_invalid / len(preds_penman) * 100,
    }
s = """(r / result-01
:ARG1 (c / compete-01
:ARG0 (w / woman)
:mod (p / preliminary)
:time (t / today)
:mod (p2 / polo
:mod (w2 / water)))
:ARG2 (a / and
:op1 (d / defeat-01
:ARG0 (t2 / team
:mod (c2 / country
:wiki +
:name (n / name
:op1 "Hungary")))
:ARG1 (t3 / team
:mod (c3 / country
:wiki +
:name (n2 / name
:op1 "Canada")))
:quant (s / score-entity
:op1 13
:op2 7))
:op2 (d2 / defeat-01
:ARG0 (t4 / team
:mod (c4 / country
:wiki +
:name (n3 / name
:op1 "France")))
:ARG1 (t5 / team
:mod (c5 / country
:wiki +
:name (n4 / name
:op1 "Brazil")))
:quant (s2 / score-entity
:op1 10
:op2 9))
:op3 (d3 / defeat-01
:ARG0 (t6 / team
:mod (c6 / country
:wiki +
:name (n5 / name
:op1 "Australia")))
:ARG1 (t7 / team
:mod (c7 / country
:wiki +
:name (n6 / name
:op1 "Germany")))
:quant (s3 / score-entity
:op1 10
:op2 8))
:op4 (d4 / defeat-01
:ARG0 (t8 / team
:mod (c8 / country
:wiki +
:name (n7 / name
:op1 "Russia")))
:ARG1 (t9 / team
:mod (c9 / country
:wiki +
:name (n8 / name
:op1 "Netherlands")))
:quant (s4 / score-entity
:op1 7
:op2 6))
:op5 (d5 / defeat-01
:ARG0 (t10 / team
:mod (c10 / country
:wiki +
:name (n9 / name
:op1 "United"
:op2 "States")))
:ARG1 (t11 / team
:mod (c11 / country
:wiki +
:name (n10 / name
:op1 "Kazakhstan")))
:quant (s5 / score-entity
:op1 10
:op2 5))
:op6 (d6 / defeat-01
:ARG0 (t12 / team
:mod (c12 / country
:wiki +
:name (n11 / name
:op1 "Italy")))
:ARG1 (t13 / team
:mod (c13 / country
:wiki +
:name (n12 / name
:op1 "New"
:op2 "Zealand")))
:quant (s6 / score-entity
:op1 12
:op2 2))))
"""
if __name__ == "__main__":
    for _ in range(5):
        smatch_score = calculate_smatch([s], [s])
        print(smatch_score)
Output
{'smatch_precision': 0.8866666666666667, 'smatch_recall': 0.8866666666666667, 'smatch_fscore': 0.8866666666666667, 'ratio_invalid_amrs': 0.0}
{'smatch_precision': 0.88, 'smatch_recall': 0.88, 'smatch_fscore': 0.88, 'ratio_invalid_amrs': 0.0}
{'smatch_precision': 0.8666666666666667, 'smatch_recall': 0.8666666666666667, 'smatch_fscore': 0.8666666666666667, 'ratio_invalid_amrs': 0.0}
{'smatch_precision': 0.9266666666666666, 'smatch_recall': 0.9266666666666666, 'smatch_fscore': 0.9266666666666666, 'ratio_invalid_amrs': 0.0}
{'smatch_precision': 0.8533333333333334, 'smatch_recall': 0.8533333333333334, 'smatch_fscore': 0.8533333333333335, 'ratio_invalid_amrs': 0.0}
The non-determinism is very worrying to me. If an evaluation metric is not deterministic, how can we compare systems to each other fairly? A difference of 0.92 vs. 0.87 is massive for the same input/output.
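For context on where the variance comes from: smatch searches for the best variable mapping with greedy hill climbing from random initial mappings, so each run can end in a different local optimum; raising the number of restarts (the CLI's -r flag) makes the best-of score more stable. The effect can be illustrated with a toy hill climber on a deliberately bumpy function (pure Python, unrelated to smatch's internals):

```python
import random

# A 1-D landscape with several local maxima; the global maximum is 9.
landscape = [1, 3, 2, 5, 4, 1, 6, 2, 7, 3, 2, 8, 9, 4, 2]

def hill_climb(start):
    """Greedy ascent: move to a strictly better neighbor until stuck."""
    i = start
    while True:
        nbrs = [j for j in (i - 1, i + 1) if 0 <= j < len(landscape)]
        best = max(nbrs, key=lambda j: landscape[j])
        if landscape[best] <= landscape[i]:
            return landscape[i]
        i = best

def best_of(restarts, rng):
    """Best score over several random restarts, analogous to smatch's -r."""
    return max(hill_climb(rng.randrange(len(landscape)))
               for _ in range(restarts))

rng = random.Random(0)
print([best_of(1, rng) for _ in range(5)])   # noisy: varies between runs
print([best_of(20, rng) for _ in range(5)])  # almost always the global 9
```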
Hi,
I'm using smatch 1.0.4 (installed via pip) and I found a case where the precision (and F-measure) are greater than 100%.
gold:
# ::snt 22/02/2010 16:42
(d / date-entity
:time "16:42"
:day 22
:month 2
:year 2010)
predicted:
# ::snt 22/02/2010 16:42
(d / date-entity
:time "16:42"
:year 2010
:month 2
:day 22
:day 22)
evaluation:
$ smatch.py --pr -f gold predicted
Precision: 1.17
Recall: 1.00
F-score: 1.08
Thanks !
Dear authors,
Why don't you add support for parallel computation? Do you have any plans to extend this scorer?
Thanks
Currently Smatch is not aware that mod is the inverse role of domain. It also has a simplistic treatment of inverse roles: ROLE-of is always converted to ROLE. In some cases this rule produces non-existent roles; for example, consist-of, prep-on-behalf-of, and prep-out-of are primary roles, and their reduced versions are not AMR roles.
See the AMR issue
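A safer deinversion could carry an exception list (a sketch; the list below only contains the roles named above and is surely incomplete):

```python
# Roles that end in "-of" but are primary AMR roles, not inversions.
# This set only covers the examples above and is surely incomplete.
NON_INVERTED = {"consist-of", "prep-on-behalf-of", "prep-out-of"}

def normalize_role(role, source, target):
    """Deinvert ROLE-of to ROLE unless the role is a genuine primary role.
    (Awareness that :mod is the inverse of :domain would need a similar
    special case and is omitted here.)"""
    if role.endswith("-of") and role not in NON_INVERTED:
        return (role[:-3], target, source)
    return (role, source, target)

print(normalize_role("ARG0-of", "a", "b"))     # ('ARG0', 'b', 'a')
print(normalize_role("consist-of", "a", "b"))  # ('consist-of', 'a', 'b')
```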
The current implementation has an issue when parsing an AMR from a line:
>>> import amr
>>> amr.AMR.parse_AMR_line("(z0 /chapter :mod 1)")
Node 0 z0
Value: chapter
Relations:
Attribute: TOP value top
The parsing script misses the :mod 1 attribute. This issue was fixed in Damonte's script: https://github.com/mdtux89/amr-evaluation
>>> import amr
>>> amr.AMR.parse_AMR_line("(z0 /chapter :mod 1)")
Node 0 z0
Value: chapter
Relations:
Attribute: mod value 1
Attribute: TOP value chapter
Background: +1 for #5 (make it a PyPI package). That will make smatch an easily accessible library.
It would be nice to replace prints with log statements (so that library users can easily control output). Mappings:
Error -> logger.error
Verbose -> logger.info
VeryVerbose -> logger.debug
Then we can set the log level from CLI args for backward compatibility:
default level = WARNING
When the Verbose flag is enabled, level = INFO
When the VeryVerbose flag is enabled, level = DEBUG
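The mapping could look like this (a sketch; the flag names follow the issue text, and the wiring into smatch's CLI is left open):

```python
import logging

logger = logging.getLogger("smatch")

def configure_logging(verbose=False, very_verbose=False):
    """Map the old print flags onto standard logging levels."""
    if very_verbose:
        level = logging.DEBUG    # VeryVerbose
    elif verbose:
        level = logging.INFO     # Verbose
    else:
        level = logging.WARNING  # default
    logging.basicConfig(level=level)
    logger.setLevel(level)

configure_logging(verbose=True)
logger.info("library users see this only when verbose is on")
logger.error("errors are always shown")
```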
The current process for pushing a release to PyPI could be more direct. As @danielhers said in #22:
As for releases, a small detail is that they are currently actually created automatically only when a tag is created on my fork, because the PyPI token is encrypted with the public key associated with it:
Line 12 in a4f2e28
This means the process for a release is
- @snowblink14 creates a release (and thereby automatically a tag) on https://github.com/snowblink14/smatch
- I replicate this tag on https://github.com/danielhers/smatch and a Travis CI job automatically deploys to PyPI
To get rid of step (2), @snowblink14 would need to enable Travis CI for https://github.com/snowblink14/smatch, and then update the encrypted token in
.travis.yml
. I'll be glad to advise how to do that but I don't mind doing step (2) myself so I think the current situation is fine.
While I appreciate @danielhers's willingness to do the extra step, I think it creates an unnecessary barrier. Let's make it so PyPI is updated whenever a release is made in this repo.
If @snowblink14 does not want to set up Travis CI, then GitHub Actions are a good alternative. I've had a good experience using the python-publish workflow (see here), and the Python Packaging Authority (PyPA) also has its own version (see here).
Hi, thanks for your nice work.
For some reason, I need to compute smatch directly from triples such as:
[('b', ':instance', 'believe-01'),
('b', ':ARG1', 'c8'),
('c8', ':instance', 'capable-01'),
('c8', ':ARG2', 'i'),
('i', ':instance', 'innovate-01'),
('i', ':ARG0', 'p'),
('p', ':instance', 'person'),
('p', ':mod', 'e2'),
('e2', ':instance', 'each'),
('c8', ':ARG1', 'p'),
('b', ':ARG0', 'p2'),
('p2', ':instance', 'person')]
Could you tell me how to do this using smatch?
Thanks a lot!
When comparing Smatch results with the scorer from the 2019 CoNLL Shared Task on Cross-Framework Meaning Representation Parsing (MRP), we discovered that Smatch will only consider the TOP property correct if the node labels also match. This appears to double-penalize label mismatches and is maybe not the intended behavior? For more technical detail and a minimal test case, please see the MRP mtool issue.
This seems like a bug similar to the solved one, which can cause Smatch scores to be artificially high by double-counting edges and eventually reach values above 100%. The only way of knowing whether a Smatch score below 100% is real or suffers from this bug is to compute the Smatch of a file with itself (henceforth self-Smatch).
An example:
# ::tok Uh ... Do you have legislative power or enforcement power ? <ROOT>
(h / have-03
:ARG0 (y / you)
:ARG1 (o / or
:op1 (p / power
:instrument-of (l / legislate-01))
:op2 (p2 / power
:instrument-of (e / enforce-01)))
:mod (u / uh
:mode expressive)
:mode interrogative)
has the expected self-Smatch of 100%, while
# ::tok Uh ... Do you have legislative power or enforcement power ? <ROOT>
(h / have-03
:ARG0 (y / you)
:ARG1 (o / or
:op1 (p / power
:instrument-of (l / legislate-01))
:op2 (p2 / power
:instrument-of (e / enforce-01)))
:mod (u / uh
:mode expressive)
:mode interrogative
:mode interrogative)
has a self-Smatch of 110.5% due to the repeated :mode interrogative. The score can grow ad infinitum just by repeating it further.
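Until the underlying bug is fixed, repeated triples can at least be detected up front (a sketch over plain triple lists, not smatch internals):

```python
from collections import Counter

def duplicate_triples(triples):
    """Return triples that occur more than once; duplicates such as a
    repeated (mode, h, interrogative) are what inflate smatch past 100%."""
    return [t for t, n in Counter(triples).items() if n > 1]

triples = [
    ("instance", "h", "have-03"),
    ("mode", "h", "interrogative"),
    ("mode", "h", "interrogative"),  # the repeated attribute
]
print(duplicate_triples(triples))  # [('mode', 'h', 'interrogative')]
```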
I came across a difference in AMR graphs which is not detected by smatch.
Comparing these two AMR graphs outputs an R/P/F of 1.00/1.00/1.00 (I am aware that :mod expressive is not valid AMR, but it is nevertheless what my AMR parser produced, and smatch should detect the difference):
(d / do-02
:ARG0 (ii / i)
:ARG1 (a / about
:op1 (d2 / disease
:name (n / name
:op1 "OCD")))
:location (p / psychology)
:time (t / today))
(d / do-02
:ARG0 (ii / i)
:ARG1 (a / about
:op1 (d2 / disease
:name (n / name
:op1 "OCD")))
:location (p / psychology)
:time (t / today)
:mod expressive)
I think this is due to the special treatment of :mod/:domain. Any other relation is detected by the current version of smatch: for instance, changing :mod expressive into :op1 expressive or even :toto expressive makes smatch detect the error and output an F-score of 0.9677. My first guess is that :mod is replaced by :domain (with start and end points swapped) in amr.py.
There is another AMR difference not detected by smatch:
# ::id ENG_NA_020001_20161020_G0023FSVB_0001.4
# ::snt we are dyin of thirst in MARTISAN 25BIS its***medicine without frontier they brought food an water for us..
(m / multi-sentence
:snt1 (d / die-01
:ARG1 (w / we)
:ARG1-of (c / cause-01
:ARG0 (t / thirst-01
:ARG0 w))
:location (s / street-address-91
:ARG1 "25 bis"
:ARG2 (r / road
:name (n / name
:op1 "Martisan"))))
:snt2 (b / bring-01
:ARG0 (o / organization
:name (n2 / name
:op1 "Doctors"
:op2 "without"
:op3 "Frontiers"))
:ARG1 (a2 / and
:op1 (f / food)
:op2 (w2 / water))
:ARG2 (w3 / we)))
and (note the :ARG1 "25", which is :ARG1 "25 bis" in the graph above):
(m / multi-sentence
:snt1 (d / die-01
:ARG1 (w / we)
:ARG1-of (c / cause-01
:ARG0 (t / thirst-01
:ARG0 w))
:location (s / street-address-91
:ARG1 "25"
:ARG2 (r / road
:name (n / name
:op1 "Martisan"))))
:snt2 (b / bring-01
:ARG0 (o / organization
:name (n2 / name
:op1 "Doctors"
:op2 "without"
:op3 "Frontiers"))
:ARG1 (a2 / and
:op1 (f / food)
:op2 (w2 / water))
:ARG2 (w3 / we)))