
A Comprehensive Assessment of Dialog Evaluation Metrics

This repository contains the source code for the following paper:

A Comprehensive Assessment of Dialog Evaluation Metrics

Prerequisites

We use conda to manage environments for the different metrics.

Each directory in conda_envs holds an environment specification. Please install all of them before starting the next step.

For example, to install conda_envs/eval_base, run

conda env create -f conda_envs/eval_base/environment.yml
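If you prefer, all environments can be created in one pass. A minimal sketch, assuming every subdirectory of conda_envs contains an environment.yml that names its own environment:

for env_dir in conda_envs/*/; do
    conda env create -f "${env_dir}environment.yml"
done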

Note that some packages cannot be installed this way.

If you find that packages such as bleurt, nlg-eval, or the models downloaded by spaCy are missing, please install them by following their official instructions.

We apologize for any inconvenience.

Data Preparation

Each quality-annotated dataset has its own directory under data, together with a data_loader.py for parsing it.

Please follow the instructions below to download each dataset, place it in the corresponding directory, and run data_loader.py directly to check that you are using the correct data.

DSTC6 Data

Download human_rating_scores.txt from https://www.dropbox.com/s/oh1trbos0tjzn7t/dstc6_t2_evaluation.tgz .
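For example, one possible way to fetch the archive and check it with the data loader (the extraction path and the data/dstc6_data directory name are assumptions; adjust them to the repository layout):

wget https://www.dropbox.com/s/oh1trbos0tjzn7t/dstc6_t2_evaluation.tgz
tar -xzf dstc6_t2_evaluation.tgz
cp dstc6_t2_evaluation/human_rating_scores.txt data/dstc6_data/   # adjust the extracted path if needed
cd data/dstc6_data && python data_loader.py                       # sanity check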

DSTC9 Data

Download the data directory from https://github.com/ictnlp/DialoFlow/tree/main/FlowScore/data and place it into data/dstc9_data.

Engage Data

Download https://github.com/PlusLabNLP/PredictiveEngagement/blob/master/data/Eng_Scores_queries_gen_gtruth_replies.csv and rename it to engage_all.csv.

Fed Data

Download http://shikib.com/fed_data.json .

Grade Data

Download and place each directory in https://github.com/li3cmz/GRADE/tree/main/evaluation/eval_data as data/grade_data/[convai2|dailydialog|empatheticdialogues].

Also download human_score.txt from https://github.com/li3cmz/GRADE/tree/main/evaluation/human_score into the corresponding data/grade_data/[convai2|dailydialog|empatheticdialogues] directory.

Holistic Data

Download context_data_release.csv and fluency_data_release.csv from https://github.com/alexzhou907/dialogue_evaluation .

USR Data

Download TopicalChat and PersonaChat data from http://shikib.com/usr

Metric Installation

For the baseline metrics, we use nlg-eval. Please follow its instructions to install it.
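As a rough guide, the installation typically looks like the following; please defer to the nlg-eval README for the current commands:

pip install git+https://github.com/Maluuba/nlg-eval.git@master
nlg-eval --setup   # downloads the data and models nlg-eval needs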

For each dialog metric, please follow the instructions in the README in the corresponding directory.

Running Notes for Specific Metrics

bert-as-service

PredictiveEngage, BERT-RUBER, and PONE require a running bert-as-service.

If you want to evaluate them, please install and run bert-as-service following the instructions here.

We also provide the script we used to run bert-as-service, run_bert_as_service.sh; feel free to use it.
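If you prefer not to use our script, a generic invocation of bert-as-service looks roughly like this (the checkpoint path and worker count are placeholders; run_bert_as_service.sh records the exact settings we used):

pip install bert-serving-server bert-serving-client
bert-serving-start -model_dir /path/to/uncased_L-12_H-768_A-12 -num_worker=2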

Running USR and FED

We used a web server for running USR and FED in our experiments.

Please modify the paths in usr_fed/usr/usr_server.py and usr_fed/fed/fed_server.py to start the servers, and modify the path in usr_fed_metric.py accordingly.
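For example, after editing the paths, the two servers can be started with (ports and extra arguments depend on the scripts themselves):

python usr_fed/usr/usr_server.py &
python usr_fed/fed/fed_server.py &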

How to evaluate

  1. After you download all datasets, run gen_data.py to transform them into the input format of every metric. If you only want to evaluate a single metric on a single dataset, run gen_data.py --source_data [dataset] --target_format [metric] (see the end-to-end example after the sample output below).

  2. Modify the path in run_eval.sh as specified in the script, since we need to activate the Conda environments when running it. Then run eval_metrics.sh to evaluate all quality-annotated data.

  3. Some metrics produce output in their own formats. Therefore, run read_result.py to read the results of those metrics and transform them into outputs. As in step 1, you can specify the metric and dataset with read_result.py --metric [metric] --eval_data [dataset].

  4. outputs/METRIC/DATA/results.json holds the prediction scores of each metric (METRIC) on each quality-annotated dataset (DATA), while running data_loader.py directly in each data directory also generates the corresponding human scores. You can perform any analysis with these data (the Jupyter notebook used in our analysis will be released).

For example, outputs/grade/dstc9_data/results.json could look like this:


{
    'GRADE':  # the metric name
    [
        0.2568123,  # the score of the first sample
        0.1552132,
        ...
        0.7812346
    ]
}
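Putting the steps together, a minimal end-to-end run for a single metric/dataset pair might look like the following; the identifiers grade and dstc9_data are illustrative only and should be replaced by the names gen_data.py and read_result.py actually accept:

python gen_data.py --source_data dstc9_data --target_format grade
bash eval_metrics.sh   # after editing the paths in run_eval.sh (step 2)
python read_result.py --metric grade --eval_data dstc9_data
cat outputs/grade/dstc9_data/results.json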

Results

All values are statistically significant at p < 0.05, unless marked with *. In the tables below, P and S denote Pearson and Spearman correlations, and Turn, Dialog, and Sys denote turn-level, dialog-level, and system-level correlations, respectively.
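As a reference for step 4 above, the correlations can be computed from results.json and the human scores with scipy. A minimal sketch, assuming the human scores have been dumped to a hypothetical human_scores.json in the same sample order:

import json
from scipy.stats import pearsonr, spearmanr

with open('outputs/grade/dstc9_data/results.json') as f:
    metric_scores = json.load(f)['GRADE']   # per-sample metric scores

with open('human_scores.json') as f:        # hypothetical dump of the data_loader.py scores
    human_scores = json.load(f)             # list of floats, same order as the samples

p_corr, p_value = pearsonr(metric_scores, human_scores)
s_corr, s_value = spearmanr(metric_scores, human_scores)
print(f'Pearson {p_corr:.3f} (p={p_value:.3g}), Spearman {s_corr:.3f} (p={s_value:.3g})')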

USR Data

USR-TopicalChat (TC) and USR-PersonaChat (PC):

| Metric | TC Turn P | TC Turn S | TC Sys P | TC Sys S | PC Turn P | PC Turn S | PC Sys P | PC Sys S |
|---|---|---|---|---|---|---|---|---|
| BLEU-4 | 0.216 | 0.296 | 0.874* | 0.900 | 0.135 | 0.090* | 0.841* | 0.800* |
| METEOR | 0.336 | 0.391 | 0.943 | 0.900 | 0.253 | 0.271 | 0.907* | 0.800* |
| ROUGE-L | 0.275 | 0.287 | 0.814* | 0.900 | 0.066* | 0.038* | 0.171* | 0.000* |
| ADEM | -0.060* | -0.061* | 0.202* | 0.700* | -0.141 | -0.085* | 0.523* | 0.400* |
| BERTScore | 0.298 | 0.325 | 0.854* | 0.900 | 0.152 | 0.122* | 0.241* | 0.000* |
| BLEURT | 0.216 | 0.261 | 0.630* | 0.900 | 0.065* | 0.054* | -0.125* | 0.000* |
| QuestEval | 0.300 | 0.338 | 0.943 | 1.000 | 0.176 | 0.236 | 0.885* | 1.000 |
| RUBER | 0.247 | 0.259 | 0.876* | 1.000 | 0.131 | 0.190 | 0.997 | 1.000 |
| BERT-RUBER | 0.342 | 0.348 | 0.992 | 0.900 | 0.266 | 0.248 | 0.958 | 0.200* |
| PONE | 0.271 | 0.274 | 0.893 | 0.500* | 0.373 | 0.375 | 0.979 | 0.800* |
| MAUDE | 0.044* | 0.083* | 0.317* | -0.200* | 0.345 | 0.298 | 0.440* | 0.400* |
| DEB | 0.180 | 0.116 | 0.818* | 0.400* | 0.291 | 0.373 | 0.989 | 1.000 |
| GRADE | 0.200 | 0.217 | 0.553* | 0.100* | 0.358 | 0.352 | 0.811* | 1.000 |
| DynaEval | -0.032* | -0.022* | -0.248* | 0.100* | 0.149 | 0.171 | 0.584* | 0.800* |
| USR | 0.412 | 0.423 | 0.967 | 0.900 | 0.440 | 0.418 | 0.864* | 1.000 |
| USL-H | 0.322 | 0.340 | 0.966 | 0.900 | 0.495 | 0.523 | 0.969 | 0.800* |
| DialogRPT | 0.120 | 0.105* | 0.944 | 0.600* | -0.064* | -0.083* | 0.347* | 0.800* |
| Deep AM-FM | 0.285 | 0.268 | 0.969 | 0.700* | 0.228 | 0.219 | 0.965 | 1.000 |
| HolisticEval | -0.147 | -0.123 | -0.919 | -0.200* | 0.087* | 0.113* | 0.051* | 0.000* |
| PredictiveEngage | 0.222 | 0.310 | 0.870* | 0.900 | -0.003* | 0.033* | 0.683* | 0.200* |
| FED | -0.124 | -0.135 | 0.730* | 0.100* | -0.028* | -0.000* | 0.005* | 0.400* |
| FlowScore | 0.095* | 0.082* | -0.150* | 0.400* | 0.118* | 0.079* | 0.678* | 0.800* |
| FBD | - | - | 0.916 | 0.100* | - | - | 0.644* | 0.800* |

GRADE Data

GRADE-ConvAI2 (CA), GRADE-DailyDialog (DD), and GRADE-EmpatheticDialogue (ED):

| Metric | CA Turn P | CA Turn S | CA Sys P | CA Sys S | DD Turn P | DD Turn S | DD Sys P | DD Sys S | ED Turn P | ED Turn S | ED Sys P | ED Sys S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BLEU-4 | 0.003* | 0.128 | 0.034* | 0.000* | 0.075* | 0.184 | 1.000* | 1.000 | -0.051* | 0.002* | 1.000* | 1.000 |
| METEOR | 0.145 | 0.181 | 0.781* | 0.600* | 0.096* | 0.010* | -1.000* | -1.000 | 0.118 | 0.055* | 1.000* | 1.000 |
| ROUGE-L | 0.136 | 0.140 | 0.209* | 0.000* | 0.154 | 0.147 | 1.000* | 1.000 | 0.029* | -0.013* | 1.000* | 1.000 |
| ADEM | -0.060* | -0.057* | -0.368* | -0.200* | 0.064* | 0.071* | 1.000* | 1.000 | -0.036* | -0.028* | 1.000* | 1.000 |
| BERTScore | 0.225 | 0.224 | 0.918* | 0.800* | 0.129 | 0.100* | -1.000* | -1.000 | 0.046* | 0.033* | 1.000* | 1.000 |
| BLEURT | 0.125 | 0.120 | -0.777* | -0.400* | 0.176 | 0.133 | 1.000* | 1.000 | 0.087* | 0.051* | 1.000* | 1.000 |
| QuestEval | 0.279 | 0.319 | 0.283* | 0.400* | 0.020* | 0.006* | -1.000* | -1.000 | 0.201 | 0.272 | 1.000* | 1.000 |
| RUBER | -0.027* | -0.042* | -0.458* | -0.400* | -0.084* | -0.094* | -1.000* | -1.000 | -0.078* | -0.039* | 1.000* | 1.000 |
| BERT-RUBER | 0.309 | 0.314 | 0.885* | 1.000 | 0.134 | 0.128 | -1.000* | -1.000 | 0.163 | 0.148 | 1.000* | 1.000 |
| PONE | 0.362 | 0.373 | 0.816* | 0.800* | 0.163 | 0.163 | -1.000* | -1.000 | 0.177 | 0.161 | 1.000* | 1.000 |
| MAUDE | 0.351 | 0.304 | 0.748* | 0.800* | -0.036* | -0.073* | 1.000* | 1.000 | 0.007* | -0.057* | 1.000* | 1.000 |
| DEB | 0.426 | 0.504 | 0.995 | 1.000 | 0.337 | 0.363 | 1.000* | 1.000 | 0.356 | 0.395 | 1.000* | 1.000 |
| GRADE | 0.566 | 0.571 | 0.883* | 0.800* | 0.278 | 0.253 | -1.000* | -1.000 | 0.330 | 0.297 | 1.000* | 1.000 |
| DynaEval | 0.138 | 0.131 | -0.996 | -1.000 | 0.108* | 0.120 | -1.000* | -1.000 | 0.146 | 0.141 | -1.000* | -1.000 |
| USR | 0.501 | 0.500 | 0.995 | 1.000 | 0.057* | 0.057* | -1.000* | -1.000 | 0.264 | 0.255 | 1.000* | 1.000 |
| USL-H | 0.443 | 0.457 | 0.971 | 1.000 | 0.108* | 0.093* | -1.000* | -1.000 | 0.293 | 0.235 | 1.000* | 1.000 |
| DialogRPT | 0.137 | 0.158 | -0.311* | -0.600* | -0.000* | 0.037* | -1.000* | -1.000 | 0.211 | 0.203 | 1.000* | 1.000 |
| Deep AM-FM | 0.117 | 0.130 | 0.774* | 0.400* | 0.026* | 0.022* | 1.000* | 1.000 | 0.083* | 0.058* | 1.000* | 1.000 |
| HolisticEval | -0.030* | -0.010* | -0.297* | -0.400* | 0.025* | 0.020* | 1.000* | 1.000 | 0.199 | 0.204 | -1.000* | -1.000 |
| PredictiveEngage | 0.154 | 0.164 | 0.601* | 0.600* | -0.133 | -0.135 | -1.000* | -1.000 | -0.032* | -0.078* | 1.000* | 1.000 |
| FED | -0.090 | -0.072* | -0.254* | 0.000* | 0.080* | 0.064* | 1.000* | 1.000 | -0.014* | -0.044* | 1.000* | 1.000 |
| FlowScore | - | - | - | - | - | - | - | - | - | - | - | - |
| FBD | - | - | -0.235* | -0.400* | - | - | -1.000* | -1.000 | - | - | -1.000* | -1.000 |

DSTC6 Data

| Metric | Turn P | Turn S | Sys P | Sys S |
|---|---|---|---|---|
| BLEU-4 | 0.131 | 0.298 | -0.064* | 0.050* |
| METEOR | 0.307 | 0.323 | 0.633 | 0.084* |
| ROUGE-L | 0.332 | 0.326 | 0.487 | 0.215* |
| ADEM | 0.151 | 0.118 | 0.042* | 0.347* |
| BERTScore | 0.369 | 0.337 | 0.671 | 0.265* |
| BLEURT | 0.326 | 0.294 | 0.213* | 0.426* |
| QuestEval | 0.188 | 0.242 | -0.215* | 0.206* |
| RUBER | 0.114 | 0.092 | -0.074* | 0.104* |
| BERT-RUBER | 0.204 | 0.217 | 0.825 | 0.093* |
| PONE | 0.208 | 0.200 | 0.608 | 0.235* |
| MAUDE | 0.195 | 0.128 | 0.739 | 0.217* |
| DEB | 0.211 | 0.214 | -0.261* | 0.492 |
| GRADE | 0.119 | 0.122 | 0.784 | 0.611 |
| DynaEval | 0.286 | 0.246 | 0.342* | -0.050* |
| USR | 0.184 | 0.166 | 0.432* | 0.147* |
| USL-H | 0.217 | 0.179 | 0.811 | 0.298* |
| DialogRPT | 0.170 | 0.155 | 0.567 | 0.334* |
| Deep AM-FM | 0.326 | 0.295 | 0.817 | 0.674 |
| HolisticEval | 0.001* | -0.004* | 0.010 | -0.002 |
| PredictiveEngage | 0.043 | 0.004* | -0.094* | -0.409* |
| FED | -0.106 | -0.083 | 0.221* | 0.322* |
| FlowScore | 0.064 | 0.095 | 0.352* | 0.362* |
| FBD | - | - | -0.481 | -0.234* |

PredictiveEngage-DailyDialog

| Metric | Turn P | Turn S |
|---|---|---|
| QuestEval | 0.296 | 0.341 |
| MAUDE | 0.104 | 0.060* |
| DEB | 0.516 | 0.580 |
| GRADE | 0.600 | 0.622 |
| DynaEval | 0.167 | 0.160 |
| USR | 0.582 | 0.640 |
| USL-H | 0.688 | 0.699 |
| DialogRPT | 0.489 | 0.533 |
| HolisticEval | 0.368 | 0.365 |
| PredictiveEngage | 0.429 | 0.414 |
| FED | 0.164 | 0.159 |
| FlowScore | - | - |
| FBD | - | - |

HolisticEval-DailyDialog

| Metric | Turn P | Turn S |
|---|---|---|
| QuestEval | 0.285 | 0.260 |
| MAUDE | 0.275 | 0.364 |
| DEB | 0.584 | 0.663 |
| GRADE | 0.678 | 0.697 |
| DynaEval | -0.023* | -0.009* |
| USR | 0.589 | 0.645 |
| USL-H | 0.486 | 0.537 |
| DialogRPT | 0.283 | 0.332 |
| HolisticEval | 0.670 | 0.764 |
| PredictiveEngage | -0.033* | 0.060* |
| FED | 0.485 | 0.507 |
| FlowScore | - | - |
| FBD | - | - |

FED Data

| Metric | Turn P | Turn S | Dialog P | Dialog S |
|---|---|---|---|---|
| QuestEval | 0.037* | 0.093* | -0.032* | 0.080* |
| MAUDE | 0.018* | -0.094* | -0.047* | -0.280 |
| DEB | 0.230 | 0.187 | -0.130* | 0.006* |
| GRADE | 0.134 | 0.118 | -0.034* | -0.065* |
| DynaEval | 0.319 | 0.323 | 0.503 | 0.547 |
| USR | 0.114 | 0.117 | 0.093* | 0.062* |
| USL-H | 0.201 | 0.189 | 0.073* | 0.152* |
| DialogRPT | -0.118 | -0.086* | -0.221 | -0.214 |
| HolisticEval | 0.122 | 0.125 | -0.276 | -0.304 |
| PredictiveEngage | 0.024* | 0.094* | 0.026* | 0.155* |
| FED | 0.120 | 0.095 | 0.222 | 0.320 |
| FlowScore | -0.065* | -0.055* | -0.073* | -0.003* |
| FBD | - | - | - | - |

DSTC9 Data

| Metric | Dialog P | Dialog S | Sys P | Sys S |
|---|---|---|---|---|
| QuestEval | 0.026* | 0.043 | 0.604 | 0.527* |
| MAUDE | 0.059 | 0.042* | 0.224* | 0.045* |
| DEB | 0.085 | 0.131 | 0.683 | 0.473* |
| GRADE | -0.078 | -0.070 | -0.674 | -0.482* |
| DynaEval | 0.093 | 0.101 | 0.652 | 0.727 |
| USR | 0.019* | 0.020* | 0.149* | 0.127* |
| USL-H | 0.105 | 0.105 | 0.566* | 0.755 |
| DialogRPT | 0.076 | 0.069 | 0.685 | 0.555* |
| HolisticEval | 0.015* | 0.002* | -0.019* | -0.100* |
| PredictiveEngage | 0.114 | 0.115 | 0.809 | 0.664 |
| FED | 0.128 | 0.120 | 0.559* | 0.391* |
| FlowScore | 0.147 | 0.140 | 0.907 | 0.900 |
| FBD | - | - | -0.669 | -0.627 |

How to Add New Dataset

Let the name of the new dataset be sample

Create a directory data/sample_data and write a function load_sample_data as follows:

def load_sample_data(base_dir: str):
    '''
    Args: 
        base_dir: the absolute path to data/sample_data
    Return:
        Dict:
        {
            # the required items
            'contexts' : List[List[str]], # dialog context. We split each dialog context by turns. Therefore one dialog context is in type List[str].
            'responses': List[str], # dialog response.
            'references': List[str], # dialog references. If no reference in the data, please still give a dummy reference like "NO REF".
            "scores": List[float] # human scores.
            # add any customized items
            "Customized Item": List[str] # any additional info in the data. 
        }
    '''
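For instance, a minimal implementation for a hypothetical sample.csv with context, response, reference, and score columns (the file name and column names are purely illustrative) could be:

import csv
import os
from typing import Dict

def load_sample_data(base_dir: str) -> Dict:
    contexts, responses, references, scores = [], [], [], []
    # sample.csv is a hypothetical annotation file; adapt this to the real format
    with open(os.path.join(base_dir, 'sample.csv')) as f:
        for row in csv.DictReader(f):
            contexts.append(row['context'].split('\n'))   # one turn per line
            responses.append(row['response'])
            references.append(row.get('reference', 'NO REF'))
            scores.append(float(row['score']))
    return {'contexts': contexts, 'responses': responses,
            'references': references, 'scores': scores}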

Import the function in gen_data.py, and run with python gen_data.py --source_data sample

How to Add New Metrics

Let the name of the new metric be metric

Write a function gen_metric_data that transforms the data and writes it into the metric's directory:

# input format 1
def gen_metric_data(data: Dict, output_path: str):
    '''
    Args:
        data: the return value of load_data functions e.g. {'contexts': ...}
        output_path: path to the output file
    '''

# input format 2
def gen_metric_data(data: Dict, base_dir: str, dataset: str):
    '''
    Args:
        data: the return value of load_data functions e.g. {'contexts': ...}
        base_dir: path to the output directory
        dataset: name of the dataset
    '''

We support two input formats; just follow whichever is easier for you.
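As an illustration, a sketch of input format 1 that writes one JSON object per line (the output format your metric actually consumes is up to you):

import json
from typing import Dict

def gen_metric_data(data: Dict, output_path: str):
    # one JSON line per sample: context turns, response, and reference
    with open(output_path, 'w') as f:
        for ctx, resp, ref in zip(data['contexts'], data['responses'], data['references']):
            f.write(json.dumps({'context': ctx, 'response': resp, 'reference': ref}) + '\n')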

Import the function in gen_data.py and follow comments in the code to add the metric.

Then write a function read_metric_result to read the metric's predictions:

def read_metric_data(data_path: str):
    '''
    Args:
        data_path: path to the prediction file
    
    Return:
        # You can choose to return list or dict
        List: metric scores e.g. [0.2, 0.3, 0.4, ...]
        or 
        Dict: {'metric': List # metric scores}
    '''
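For example, if the metric wrote one score per line to its output file (an assumption about the output format), the reader could be as simple as:

from typing import List

def read_metric_data(data_path: str) -> List[float]:
    # one floating-point score per line, in the same order as the input samples
    with open(data_path) as f:
        return [float(line.strip()) for line in f if line.strip()]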

Import the function in read_result.py and follow comments in the code to add the metric.

Then just follow the previous evaluation instructions to evaluate the metric.


dialevalmetrics's Issues

Unable to find "run_eval.sh"

Modify the path in run_eval.sh as specified in the script, since we need to activate the Conda environments when running it. Then run eval_metrics.sh to evaluate all quality-annotated data.

I can't find run_eval.sh. Does anyone know where it is?

conda install: Found conflicts! Looking for incompatible packages.

When I run "conda env update --name eval_base --file conda_envs/eval_base/environment.yml" on macOS, it reports the following error:
Collecting package metadata (repodata.json): done
Solving environment:
Found conflicts! Looking for incompatible packages.
This can take several minutes. Press CTRL-C to abort.
failed
Solving environment:
Found conflicts! Looking for incompatible packages.
This can take several minutes. Press CTRL-C to abort.
failed

UnsatisfiableError: The following specifications were found to be incompatible with a past
explicit spec that is not an explicit spec in this operation (libgcc-ng):

Unable to re-run the GRADE experiments

Thanks for your work!

I have obtained the results for grade_convai and grade_dailydialog; both the Pearson and Spearman scores match yours. However, for grade_empatheticdialogues I get different results with the same computation and evaluation method.

  1. First, I used nlg_eval to compute the baseline metrics for all three datasets.
  2. I modified the human_correlation.py script in data/grade_data to compute the scores for all three datasets; as described above, two of them match but the last one differs.

Looking forward to your suggestions, thanks!

How to get the BERT-RUBER and PONE scores

Hi, what is the difference between BERT-RUBER and PONE?
I found that when I use gen_data.py to create the data for BERT-RUBER, the generated data is placed in the pone directory. After running the corresponding run script, three files are produced: refer_score_dailydialog_sampled.json, ruber_score_dailydialog_sampled.json, and unrefer_score_dailydialog_sampled.json. What does each of these three files represent?
Which one is the BERT-RUBER score, and which is the PONE score?

Unable to reproduce results

I'm unable to get the desired results by following the steps given in the README, due to multiple issues such as:

  1. Environment creation failures for some of them.
  2. gen_data.py gives errors for DSTC6 and Engage.
  3. Transformers version issues for FBD and MAUDE (eval_base environment), TensorFlow version issues for run_bert_as_service.sh, and some others as well.
  4. Various files, such as restore/ensemble.yml and ./data/DailyDialog/keyword.vocab, are not present.

So I request that the repository be updated with the latest code and files, if possible, to help reproduce the results highlighted in your paper.

regr.pkl file not available

When I run "python usr_fed/usr/usr_server.py", the following error appears:
Traceback (most recent call last):
File "usr_server.py", line 1, in
import usr
File "/home/zper/DialEvalMetrics/usr_fed/usr/usr.py", line 6, in
import regression
File "/home/zper/DialEvalMetrics/usr_fed/usr/regression.py", line 6, in
with open('regr.pkl','rb')as f:
FileNotFoundError: [Errno 2] No such file or directory: 'regr.pkl'

After I wget https://github.com/Shikib/usr/blob/master/examples/regr.pkl from the original USR model repository, another error appears:
pickle invalid load key, '\x0a'.
I wonder if you could provide your pkl file, thanks.

Turn-level DSTC9 annotation data

I noticed that only dialog-level DSTC9 data is used in your work. May I ask whether the turn-level DSTC9 annotation data has been made public? How can I get it? Thank you very much!
