budzianowski / multiwoz
Source code for the end-to-end dialogue model from the MultiWOZ paper (Budzianowski et al. 2018, EMNLP)
License: MIT License
Hi, what does pre_invalid mean in the book field? How is it related to invalid? Thanks!
Natural Language Generation
Here the Baseline (Budzianowski et al. 2018, https://pdfs.semanticscholar.org/47d0/1eb59cd37d16201fcae964bd1d2b49cfb55e.pdf) model got a BLEU score of 0.632.
However, when I read the experiment records in the paper, I can find such a BLEU score nowhere.
Some conversations have no dialog_act annotation in MultiWOZ 2.1:
PMUL4707.json
PMUL2245.json
PMUL4776.json
PMUL3872.json
PMUL4859.json
Hi, thanks a lot for sharing the data!
I found that some slot labels in MultiWOZ 2.2 are incomplete, as follows.
In PMUL0698.json, turn 6, the user says "I am leaving from Cambridge and going to Norwich." to book a train. This turn carries train-departure and train-destination values, but there are no labels in the slots field. Is the data not yet completely labeled?
The following is the part of MultiWOZ2.2/dev/dialogues_001.json
{
"frames": [
{
"actions": [],
"service": "restaurant",
"slots": [],
"state": {
"active_intent": "NONE",
"requested_slots": [],
"slot_values": {
"restaurant-area": [
"centre"
],
"restaurant-food": [
"chinese"
]
}
}
},
{
"actions": [],
"service": "train",
"slots": [],
"state": {
"active_intent": "find_train",
"requested_slots": [],
"slot_values": {
"train-day": [
"sunday"
],
"train-departure": [
"cambridge"
],
"train-destination": [
"norwich" # in slot-values but not in slots, and no slots fields contains it before this turn.
],
"train-leaveat": [
"16:15"
]
}
}
},
{
"actions": [],
"service": "taxi",
"slots": [],
"state": {
"active_intent": "NONE",
"requested_slots": [],
"slot_values": {}
}
},
{
"actions": [],
"service": "bus",
"slots": [],
"state": {
"active_intent": "NONE",
"requested_slots": [],
"slot_values": {}
}
},
{
"actions": [],
"service": "police",
"slots": [],
"state": {
"active_intent": "NONE",
"requested_slots": [],
"slot_values": {}
}
},
{
"actions": [],
"service": "hotel",
"slots": [],
"state": {
"active_intent": "NONE",
"requested_slots": [],
"slot_values": {}
}
},
{
"actions": [],
"service": "attraction",
"slots": [],
"state": {
"active_intent": "NONE",
"requested_slots": [],
"slot_values": {}
}
},
{
"actions": [],
"service": "hospital",
"slots": [],
"state": {
"active_intent": "NONE",
"requested_slots": [],
"slot_values": {}
}
}
],
"speaker": "USER",
"turn_id": "6",
"utterance": "I am leaving from Cambridge and going to Norwich."
}
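A quick way to scan a 2.2 frame for such cases (a minimal sketch; `find_unlabeled_slots` is a hypothetical helper, and some of these gaps may be intentional, since the frame-level slots list only carries spans for the current utterance):

```python
def find_unlabeled_slots(frame):
    """Slot names present in state.slot_values but absent from the
    frame-level 'slots' span annotations."""
    annotated = {s.get("slot") for s in frame.get("slots", [])}
    slot_values = frame.get("state", {}).get("slot_values", {})
    return sorted(name for name in slot_values if name not in annotated)

# Toy frame mirroring the train frame above:
frame = {
    "service": "train",
    "slots": [],
    "state": {
        "active_intent": "find_train",
        "requested_slots": [],
        "slot_values": {"train-destination": ["norwich"]},
    },
}
print(find_unlabeled_slots(frame))  # ['train-destination']
```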
Hi!
Your preprocessing script create_delex_data.py is a good starting point for working with MultiWOZ, but it was implemented in Python 2, which is deprecated. I refactored this part and the dependent code to be compatible with Python 3 for a research project, and I think it could help others. May I push these changes and open a pull request?
Thanks!
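For context, the kind of mechanical change such a port usually involves (an illustrative sketch, not the actual diff):

```python
# Python 2: print "processing", name / d.iteritems() / unicode(s)
# Python 3: print("processing", name) / d.items()     / str(s)

def lowercase_keys(mapping):
    # dict.items() is valid in both versions; iteritems() is Python-2-only
    return {key.lower(): value for key, value in mapping.items()}

print(lowercase_keys({"Hotel-Inform": 1}))  # {'hotel-inform': 1}
```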
The 2.2 dataset doesn't appear to have any system action annotations at all. The json format for it is convenient, but it isn't useful to me without the action annotations. Will they be added soon?
Hi,
May I ask whether the three MultiWOZ datasets in your data/ folder are the same as those downloaded from https://www.repository.cam.ac.uk/handle/1810/294507
Thanks for your feedback.
I want to understand the storage structure of this dataset, so I ran create_delex_data.py and read all of the source code in that script, but I am still confused about the "db" and "bs" variables. Could you explain what these two variables are? Thanks in advance!
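If I read the baseline correctly, "bs" is a binary encoding of the belief state and "db" is a per-domain vector encoding how many database entities match the current constraints. A sketch of the bucketing idea (the exact bucket boundaries in dbPointer.py may differ):

```python
def db_pointer(n_matches, n_buckets=6):
    """One-hot bucket for a database match count: 0, 1, 2, 3, 4, >=5."""
    vec = [0] * n_buckets
    vec[min(n_matches, n_buckets - 1)] = 1
    return vec

print(db_pointer(0))   # [1, 0, 0, 0, 0, 0]
print(db_pointer(17))  # [0, 0, 0, 0, 0, 1]
```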
Hi Paweł,
Do you know if there are mirrors of the dataset anywhere? The Cambridge website has been reporting that it's under maintenance for a couple of weeks now.
Cheers,
Stephen
Line 267 in 9fd409f
In the Multi-WOZ 2.2 "data.json", line 7419586, dialog_id : MUL1382.json:
"text": "We've narrowed it down to 3. kihinoor, the gandhi, and mahal of cambridge. Would you like me to make a reservation for you?"
"text": "Yes please make a reservation for 3 people at 16:00 on Saturday at any of those choices."
"text": "I was able to book at Kohinoor for 16:00 on Saturday for 3 people. Your reference number is NTJ52ASI. The table will be held for 15 minutes."
Actually, in the database for restaurant domain, there is no restaurant named "kihinoor" but there is one restaurant named "kohinoor". And based on the next two utterances, I believe the first restaurant name in the first utterance should be "kohinoor".
Firstly, thanks for the launch of the MultiWOZ 2.2 dataset. I really appreciate the contribution and the corrections.
I found 15 errors in the system act annotations in MultiWOZ 2.2; please find the details below. For every annotation error, I show the dialogue_id, turn_id, span_info, and the corresponding system response. Hopefully it helps. Thanks a lot.
MUL0963.json
13
['Taxi-Inform', 'arriveby', '9:15', 19, 19]
Ok, a white audi will pick you up at cafe jello gallery and bring you to Ali baba by 19:15. You can contact the driver at 07646811518. Anything else?
MUL1382.json
3
['Restaurant-Inform', 'name', 'kihinoor', 29, 37]
We've narrowed it down to 3. kohinoor, the gandhi, and mahal of cambridge. Would you like me to make a reservation for you?
PMUL0363.json
9
['Restaurant-Inform', 'food', 'French', 35, 41]
Restaurant Restaurant Two Two is an expensive French restaurant in the north with wonderful food. Would you like to book a table?
PMUL0363.json
9
['Restaurant-Inform', 'area', 'north', 60, 65]
Restaurant Restaurant Two Two is an expensive French restaurant in the north with wonderful food. Would you like to book a table?
PMUL0363.json
9
['Restaurant-Inform', 'pricerange', 'expensive', 25, 34]
Restaurant Restaurant Two Two is an expensive French restaurant in the north with wonderful food. Would you like to book a table?
PMUL0363.json
9
['Restaurant-Inform', 'name', 'Two Two', 11, 18]
Restaurant Restaurant Two Two is an expensive French restaurant in the north with wonderful food. Would you like to book a table?
PMUL2368.json
11
['Booking-Book', 'ref', '9Z58HWE1,general-reqmore:', 10, 10]
I have you booked at Charlie Chan on Saturday at 20:00 for 5 people. Your reference number is 9Z58HWE1. They hold the table for 15 minutes. Is there anything else?
PMUL2584.json
11
['Taxi-Inform', 'leaveat', '19:00,general-reqmore:', 11, 11]
A grey skoda will pick you up at the hotel by 19:00 to take you to the Castle Galleries. Your contact number is 07375156908. Will there be anything else today? : 07375156908
PMUL3093.json
5
['Train-Inform', 'arriveby', '1:54', 3, 3]
TR8659 leaves at 10:09 and arrives at 11:54, will that work for you?
PMUL3382.json
11
['Train-Inform', 'leaveat', '11:50', 2, 2]
TR0767 leaves at11:50 on Friday morning, arriving 12:07. Price is 4.40 pounds. Would you like me to book a seat?
PMUL4077.json
13
['Taxi-Inform', 'arriveby', '5:15', 5, 5]
Ok you will arrive at 15:15 in a yellow skoda Contact number :07710839987
PMUL4115.json
5
['Train-Inform', 'leaveat', '19:39', 2, 2]
TR3197 leaves atb19:39 and costs 13:39 pounds. is that fine with you?
PMUL4385.json
3
['Train-Inform', 'leaveat', '9:29', 20, 20]
You have a few options available if you're traveling from bishops stortford to cambridge. There is a train leaving at 09:29, Does that work for you?
SNG01733.json
5
['Train-Inform', 'leaveat', '5:40', 6, 6]
Train TR7213 departing from cambridge at 05:40 and arriving at stansted airport at 06:08 will be the best option for you.
SNG1041.json
9
['Hotel-Inform', 'type', 'guesthouse,general-reqmore:', 10, 11]
I remind you that you can check-in in this guesthouse after 3:00 pm. You can leave your suitcases anytime.
NLTK breaks when trying to run create_delex_data.py, resulting in syntax errors. The solution would be to either pin the exact versions of the packages in requirements.txt or update the code to Python 3.
I want to annotate similar data for a different language, so I want to build a web-based annotation tool. Can you share the code of your annotation tool, or any suggestions?
Thanks.
Hi, I have a question about requestable slots for success rate.
Line 210 in d5f0a56
Hi,
I don't seem to understand the inform metric very well. What exactly do you mean by providing the right entity, and why is the inform rate not 100% even with the oracle belief state? Does this mean that dialogue state prediction systems must do better than the oracle in order to improve the inform rate?
I noticed that some of the results listed in the Benchmarks differ from those claimed in the original articles. For example, the SimpleTOD article reports a Joint Accuracy of 56.45, while your list shows 55.72.
How did you get these results? Is there a shared script for everyone, or did you rerun their models and report the results you got?
from nlp import normalize appears in dbPointer.py. What is this nlp library? Is there a link to install it?
I just ran convert_to_multiwoz_format.py in MultiWOZ 2.2 and got the following error:
File "convert_to_multiwoz_format.py", line 85, in main
clean_dialogue = clean_data[dialogue_id]
KeyError: 'SNG01862.json'
It seems there is no 'SNG01862.json' entry in the dialogue_acts.json file. How can I fix it? @XiaoxueZang
Thanks in advance.
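Until the file is fixed, one workaround is to skip (and log) dialogues that have no dialogue_acts entry instead of crashing; a sketch with hypothetical names:

```python
def split_by_acts(dialogues, acts):
    """Partition dialogue ids into (covered, missing) depending on whether
    dialogue_acts.json has an entry for them."""
    covered = [d for d in dialogues if d in acts]
    missing = [d for d in dialogues if d not in acts]
    return covered, missing

dialogues = {"SNG01862.json": {}, "MUL1382.json": {}}
acts = {"MUL1382.json": {}}
print(split_by_acts(dialogues, acts))
# (['MUL1382.json'], ['SNG01862.json'])
```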
Line 335 in 42a1ff2
Hello,
We have just published our work, which reaches a new SoTA on the policy optimization and end-to-end generation tasks.
https://www.aclweb.org/anthology/2020.coling-main.41.pdf
Policy optimization
Match/Success/BLEU
MW 2.0: 97.50/94.80/0.12
MW 2.1: 96.39/83.57/0.14
End-to-End generation
Match/Success/BLEU
MW 2.0: 91.80/81.80/0.12
Would it be possible to include this work on the page? Thank you!
Hey
Could you please upload the data for MultiWOZ 2.2 here as well?
Hi, the hospital-dbase.db and tax-dbase.db files in the db folder are empty.
Hi, I'd like to know whether there is an official evaluation tool for dialog belief tracking. The evaluation code built into TRADE is confusing...
Is it okay to use dialog_acts.json that was used in version 2.0?
Or does version 2.1 not require dialog_act.json?
Dear All,
we recently submitted a pre-print of our soon-to-be-published paper on a new model for DST on MultiWOZ 2.1, where we achieve a JGA of 55.3%.
https://arxiv.org/abs/2005.02877
Thank you & Best regards
Hi Budzianowski,
Have you run the code on GPUs? I cannot get it to work. May I know which version of PyTorch you are using?
Can you add the license for the baseline in case people want to use it? Thanks!
What do these two fields mean in the book domain in MultiWOZ 2.0?
Hello,
I'm conducting some experiments with TRADE on MultiWOZ 2.1. I simply replaced the MultiWOZ 2.0 dataset that TRADE experiments on with 2.1, keeping the hyperparameters unchanged. However, this only reached an accuracy of approximately 35%, which is much lower than the result the paper reported; I guess it may be an issue with the TRADE model's hyperparameters.
However, I'm not able to find any reference on this; I don't even know whether the hyperparameters of TRADE, or other models, change between these two datasets. Would it be possible to get the specific hyperparameter values of these models on MultiWOZ 2.1, so that I can reproduce the results? Thanks in advance.
I have started to collect a new dataset for a new domain, but I don't know how to annotate the dataset.
Should I annotate them manually? Or is there a helpful tool to do it?
Hi Paweł ,
We just released a paper last week: SOLOIST: Few-shot Task-Oriented Dialog with A Single Pre-trained Auto-regressive Model. SOLOIST is a pretraining-finetuning approach to building task-oriented dialog systems at scale with limited training examples and annotation effort. Details can be found at https://arxiv.org/pdf/2005.05298.pdf, and a project website is also available.
We have updated numbers for the context-to-response and end-to-end evaluation settings. @budzianowski could you please help update the leaderboard?
Context-to-response using MultiWOZ 2.0
Inform: 89.60
Success: 79.30
BLEU: 18.03
End-to-end Evaluation using MultiWOZ 2.0:
Inform: 85.50
Success: 72.90
BLEU : 16.54
We have released our new results on arxiv
A Simple Language Model for Task-Oriented Dialogue
TL;DR: SimpleTOD is a simple approach to task-oriented dialogue that uses a single causal language model trained on all sub-tasks recast as a single sequence prediction problem.
https://arxiv.org/abs/2005.00796
Belief Tracking:
version joint acc
2.1 55.72
Policy Optimization:
version Inform Success Bleu
2.0 84.4 70.1 15.01
2.1 85 70.5 15.23
@budzianowski will you update the leaderboard, or should we open a PR?
cc @bmccann
Hi,
I found an error in the "span_info" for the text "The city centre north b and b has parking and wifi. It is in the north area. Would you like to book this hotel?". I think the index of the value 'north' should point to the second 'north'; the first 'north' is part of the name 'city centre north b and b'. Could you fix that?
Details in the following:
"text": "The city centre north b and b has parking and wifi. It is in the north area. Would you like to book this hotel?",
"metadata": {
"taxi": {
"book": {
"booked": []
},
"semi": {
"leaveAt": "",
"destination": "",
"departure": "",
"arriveBy": ""
}
},
"police": {
"book": {
"booked": []
},
"semi": {}
},
"restaurant": {
"book": {
"booked": [
{
"name": "nandos city centre",
"reference": "LYIENP77"
}
],
"people": "4",
"day": "wednesday",
"time": "15:00"
},
"semi": {
"food": "not mentioned",
"pricerange": "not mentioned",
"name": "nandos city centre",
"area": "not mentioned"
}
},
"hospital": {
"book": {
"booked": []
},
"semi": {
"department": ""
}
},
"hotel": {
"book": {
"booked": [],
"people": "",
"day": "",
"stay": ""
},
"semi": {
"name": "not mentioned",
"area": "not mentioned",
"parking": "yes",
"pricerange": "not mentioned",
"stars": "0",
"internet": "yes",
"type": "guesthouse"
}
},
"attraction": {
"book": {
"booked": []
},
"semi": {
"type": "",
"name": "",
"area": ""
}
},
"train": {
"book": {
"booked": [],
"people": ""
},
"semi": {
"leaveAt": "",
"destination": "",
"day": "",
"arriveBy": "",
"departure": ""
}
}
},
"dialog_act": {
"Booking-Inform": [
[
"none",
"none"
]
],
"Hotel-Inform": [
[
"Name",
"city centre north b and b"
],
[
"Area",
"north"
],
[
"Internet",
"none"
],
[
"Parking",
"none"
]
]
},
"span_info": [
[
"Hotel-Inform",
"Name",
"city centre north b and b",
1,
6
],
[
"Hotel-Inform",
"Area",
"north",
3,
3
]
]
},
The training data contains goals asking for hospital name, postcode and address, but the hospital database only contains departments and phone numbers. Any idea where the complete hospital database could be found? It must exist for the training data to have been created. For example: "Addenbrookes Hospital on Hills Rd".
The db pointer vector contains information on whether booking is available.
Is there a way, using predicted belief states, to compute booking availability for the retrieved entities?
Example:
For the attached example with the following gold belief for the restaurant domain, the retrieved entity has booking=available in the db pointer vector, but there is no booking information in the restaurant db:
belief : {'pricerange': 'cheap', 'area': 'centre', 'name': 'dojo noodle bar'}
retrieved restaurant: ('19225', '40210 Millers Yard City Centre', 'centre', 'asian oriental', 'dojo noodle bar serves a variety of japanese chinese vietnamese korean and malaysian dishes to eat in or take away sister restaurant to touzai', 'dojo noodle bar', '01223363471', 'cb21rq', 'cheap', 'NULL', 'restaurant')
Hi. When I use test_dials to test my DST model, I found that some slot values for train-arriveby are not updated in the ground truth.
For example, MUL2294:
"transcript": "i need to travel on saturday from cambridge to london kings cross and need to leave after 18:30",
"system_transcript": "train tr0427 leaves at 19:00 on saturday and will get you there by 19:51. the cost is 18.88 pounds. want me to book it?",
"transcript": "yes please book the train for 1 person and provide the reference number"
From the above we can see that the slot value train-arriveby changes during the dialog, but why is this slot value not changed in the ground truth?
This confuses me a lot; I hope you can have a look. Thank you!
Hi~ I'm Tianbao from Harbin Institute of Technology. I recently loaded the db from your JSON files to carry out some research on dialogue systems, and I noticed db/taxi_db.json couldn't be converted to a list of dicts by the Python json package, since
Line 3 in e87b0a3
Hello, I am trying to figure out whether there is a way to convert a value that appears in a slot to a canonical form suitable for querying the database.
For example, according to the database, there exists an attraction named sheep's green and lammas land park fen causeway, which appears in various forms in the annotation, namely:
sheep's green and lammas land park
sheeps green and lammas land park fen
sheep's green
sheeps green
lammas land park
I believe that I need to convert all those names to the original form to successfully query the database. Is that true? And is there code that can do this for me (or a mapping, a normalization script, etc.)?
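As far as I know the repository ships no official mapping, but fuzzy matching against the database names recovers many of these variants. A heuristic sketch (the CANONICAL list and the cutoff value are my own assumptions):

```python
import difflib

CANONICAL = ["sheep's green and lammas land park fen causeway",
             "kohinoor", "dojo noodle bar"]

def canonicalise(name, cutoff=0.5):
    """Map a surface form to the closest database name, ignoring
    apostrophes and case; returns None when nothing is close enough."""
    query = name.lower().replace("'", "")
    keys = {c.lower().replace("'", ""): c for c in CANONICAL}
    hits = difflib.get_close_matches(query, keys, n=1, cutoff=cutoff)
    return keys[hits[0]] if hits else None

print(canonicalise("sheeps green and lammas land park fen"))
# sheep's green and lammas land park fen causeway
```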
Hello,
In these two lines,
https://github.com/budzianowski/multiwoz/blob/master/model/evaluator.py#L212
https://github.com/budzianowski/multiwoz/blob/master/model/evaluator.py#L362
"venue_offered[domain]" is a string, so [0] will give you just the token "[". I think "venue_offered[domain][0] in goal_venues" does not do the logic you intend.
Please check whether this causes any mistakes in your evaluation. Thanks.
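To illustrate the report: if venue_offered[domain] is indeed a string, indexing it yields a single character, so the membership test can never match a venue name:

```python
venue_offered = {"restaurant": "[restaurant_name]"}
goal_venues = ["kohinoor", "the gandhi"]

first = venue_offered["restaurant"][0]
print(first)                 # '[' -- a single character, not a venue
print(first in goal_venues)  # False, for any realistic goal list

# The intended check is presumably on the whole string:
print(venue_offered["restaurant"] in goal_venues)
```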
Can the model run on gpu? I got errors when I set no_cuda to False.
There is no instruction in the README on how to run preprocessing for 2.2. Instead, it implies that delexicalization is only compatible with earlier versions. Can someone confirm whether that's the case? Thanks.
The line I refer to means that the final informed values (from dialog actions) are checked against the ground-truth values. See the code: the informed_value will be set to none if the value from the dialog action is not equal to the ground-truth value.
Originally posted by @HuangLK in #64 (comment)
Hi.
Line 133 in a24d299
Line 73 in a24d299
But the time values for trains (leaveAt, arriveBy) are also normalized to [value_time].
In this case, I think we cannot check whether the values meet the user's goal, right?
Is this intended?
Hello, I have a hard time evaluating my model.
First, the score for DAMD in the end-to-end modeling table should be 16.6 (as described in their paper) and not 18.6.
Second, I found out that the way I tokenize my responses highly affects the resulting BLEU score. I checked the systems from the end-to-end modeling table that have an open implementation, and I am afraid that the numbers are not comparable: they separate the tokens . , ! ? : and 's with spaces, split on whitespace, and use the resulting tokens for the BLEU score. I evaluated my own outputs using different tokenization approaches; with . , ! ? : and 's separated by spaces, for example, I get 16.9. I think this shows that the evaluation script in this repository should be modified so that it first normalizes the input strings (for example using tokenization and immediate detokenization with the Moses tokenizer), somehow resolves the delexicalized spans (removes spaces, removes [ and ]) and does the tokenization on its own. I would really appreciate a standalone script that would be able to output the score from the delexicalized responses with corresponding dialogue and turn ids (provided in a file in a predefined format).
Or at least a guide to the preferred tokenization would be highly appreciated (for future generations).
Similarly, it would also be very nice to have a standalone script for computing inform and success rates that would accept just a file with delexicalized responses (taking into account that domain names do not have to be present in the spans) and the corresponding dialogue states in .json.
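As one concrete possibility (my own assumption, not the repository's official scheme), a normaliser could strip the delexicalisation brackets and force a single punctuation convention before scoring:

```python
import re

def normalise(text):
    """One possible canonicalisation before BLEU: strip [ ] from
    delexicalised spans and put spaces around . , ! ? : and 's.
    Note: this also splits times such as 16:00 -- a real script
    would need to special-case them."""
    text = re.sub(r"\[([^\]]+)\]", r"\1", text)        # [value_time] -> value_time
    text = re.sub(r"\s*([.,!?:])\s*", r" \1 ", text)   # space out punctuation
    text = re.sub(r"'s\b", " 's", text)                # split possessive 's
    return text.split()

print(normalise("The [restaurant_name]'s table is booked."))
# ['The', 'restaurant_name', "'s", 'table', 'is', 'booked', '.']
```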
I spotted some randomness in evaluation code. For example,
Line 142 in e4922d6
Wouldn't this make the match and success rates differ even if we evaluate the same model twice?
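If the randomness comes from a random choice among equally matching venues (an assumption about the line in question), pinning a seed via a dedicated random.Random instance makes repeated evaluations select the same entity:

```python
import random

def pick_venue(candidates, seed=None):
    """Choose a venue reproducibly: with a fixed seed, repeated
    evaluation runs make the same choice (a sketch, not the repo's code)."""
    return random.Random(seed).choice(candidates)

venues = ["kohinoor", "the gandhi", "mahal of cambridge"]
print(pick_venue(venues, seed=0) == pick_venue(venues, seed=0))  # True
```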
In the example PMUL1848.json, the fail_book field for the hotel domain is missing. The prompt message, on the other hand, does include instructions on what to do if the booking fails:
If the booking fails how about <span class='emphasis'>friday</span>
Am I right that this is an inconsistency in the data, or am I missing something?
Here is the complete JSON goal:
'police': {},
'hospital': {},
'hotel': {'info': {'area': 'east',
'internet': 'yes',
'type': 'guesthouse',
'parking': 'yes'},
'fail_info': {},
'book': {'people': '5', 'day': 'thursday', 'invalid': True, 'stay': '5'}},
'topic': {'taxi': False,
'police': False,
'restaurant': False,
'hospital': False,
'hotel': False,
'general': False,
'attraction': False,
'train': False,
'booking': False},
'attraction': {},
'train': {'info': {'destination': 'cambridge',
'day': 'friday',
'arriveBy': '14:00',
'departure': 'stansted airport'},
'fail_info': {},
'book': {'invalid': True, 'people': '5'},
'fail_book': {}},
'message': ['You are planning your trip in Cambridge',
"You are looking for a <span class='emphasis'>place to stay</span>. The hotel should <span class='emphasis'>include free parking</span> and should <span class='emphasis'>include free wifi</span>",
"The hotel should be in the type of <span class='emphasis'>guesthouse</span> and should be in the <span class='emphasis'>east</span>",
"Once you find the <span class='emphasis'>hotel</span> you want to book it for <span class='emphasis'>5 people</span> and <span class='emphasis'>5 nights</span> starting from <span class='emphasis'>thursday</span>",
"If the booking fails how about <span class='emphasis'>friday</span>",
"Make sure you get the <span class='emphasis'>reference number</span>",
"You are also looking for a <span class='emphasis'>train</span>. The train should <span class='emphasis'>arrive by 14:00</span> and should be on <span class='emphasis'>the same day as the hotel booking</span>",
"The train should depart from <span class='emphasis'>stansted airport</span> and should go to <span class='emphasis'>cambridge</span>",
"Once you find the train you want to make a booking for <span class='emphasis'>the same group of people</span>",
"Make sure you get the <span class='emphasis'>reference number</span>"],
'restaurant': {}}
In the hotel domain, I saw some parking=none annotations; what does that mean?
Is it possible to generate a dataset for the SLU task, i.e. a sequence labeling dataset? I need to obtain all possible slot values in each turn (whether or not the values appear in the DST results).
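For the span-annotated values at least, the span_info entries can be projected onto BIO labels for sequence labelling. A sketch assuming the [act, slot, value, start, end] layout with inclusive token indices (worth double-checking against the data); note it only covers values actually mentioned in the utterance:

```python
def bio_tags(tokens, span_info):
    """Convert MultiWOZ span_info entries into BIO labels, one per token."""
    tags = ["O"] * len(tokens)
    for act, slot, value, start, end in span_info:
        tags[start] = f"B-{slot}"
        for i in range(start + 1, end + 1):
            tags[i] = f"I-{slot}"
    return tags

tokens = "It is in the north area".split()
spans = [["Hotel-Inform", "Area", "north", 4, 4]]
print(bio_tags(tokens, spans))  # ['O', 'O', 'O', 'O', 'B-Area', 'O']
```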