dialogue-absa's Introduction

Dialogue ABSA

  • DiaASQ
  • Exploring the inference results for DiaASQ on several models
    • ChatGPT (k-shot)
    • T5 (fine-tuned)
    • LLaMA 2 (fine-tuned)

dialogue-absa's Issues

[data prep] DiaASQ

Dataset Preprocessing

  • full-dialogue inference
  • speaker-specific inference
  • [-] reply-thread inference > pending

Data counts

train: 800, valid: 100 documents (threads)
for both zh and en

An experiment idea

Will separating the dialogue into sub-threads help the LLM do the task?
DiaASQ does model the reply relation in its reply mask $M^{Rp}$. A rough sketch of thread splitting is below.
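
A rough sketch of what sub-thread splitting could look like, assuming each example carries a replies list where replies[i] is the index of the sentence that sentence i replies to, with the root marked by a negative index (the field name and the root convention are assumptions about the released JSON):

# Rough sketch, not the repo's code: split one dialogue into reply sub-threads.
def split_into_threads(example):
    replies = example['replies']        # assumed: replies[i] = parent sentence index, < 0 for the root post
    sentences = example['sentences']
    children = {i: [] for i in range(len(replies))}
    roots = []
    for i, parent in enumerate(replies):
        if parent < 0:
            roots.append(i)
        else:
            children[parent].append(i)

    threads = []
    for root in roots:
        for head in children[root]:                    # each direct reply to the opening post starts a sub-thread
            stack, thread = [head], [sentences[root]]  # keep the opening post for context
            while stack:
                node = stack.pop()
                thread.append(sentences[node])
                stack.extend(reversed(children[node]))
            threads.append(thread)
    return threads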

Understanding data structure

(figure omitted: screenshot of a DiaASQ example's JSON structure)

Understanding data indices

sentences = train_example['sentences']
full_text = ' '.join(sentences)
tokens = full_text.split()   # quad indices are token-level offsets into this list
# find the quads
# (I'm being sneaky here)
quads = train_example['triplets']
for quad in quads:
    assert len(quad) == 10
    target_s, target_t = quad[0], quad[1]
    asp_s, asp_t = quad[2], quad[3]
    opn_s, opn_t = quad[4], quad[5]
    pol = quad[6]
    aspect_string = quad[7]
    target_string = quad[8]
    opn_string = quad[9]
    print(f'pol: {pol}')
    print(f'aspect_string: {aspect_string}')
    print(f'target_string: {target_string}')
    print(f'opn_string: {opn_string}')

    print(tokens[target_s:target_t])
    print(tokens[asp_s:asp_t])
    print(tokens[opn_s:opn_t])
    print('-----------------')

pol: other
aspect_string: 13promax
target_string: 信号
opn_string: 是硬伤吗?
['13', 'promax']
['信', '号']
['是', '硬', '伤', '吗', '?']

Full-Dialogue DiaASQ zh

  1. Every example is very long, so good delimiters or prompt cues are needed: without adding "Example" or a leading phrase here, ChatGPT treats the in-context examples as inputs it should also run inference on. (A prompt-formatting sketch is at the end of this section.)
    full_dialogue_dataset.py

  2. ChatGPT results

INFO:__main__:Found 22 files to be concatenated ...
INFO:dataset.diaasq.full_dialogue_dataset:Legal pool (k-examples pool) size: 409
INFO:__main__:Sanity check passed!
INFO:__main__:Writing inference file to output/diaasq/gpt-full-dialog-zh/FullDiaAsqDataset_gpt_eval.csv ...
INFO:__main__:Starting evaluation on valid (100 examples)...
{'aspect_f1': 0.4953959483845313,
 'iden_f1': 0.12240553480962135,
 'opinion_f1': 0.21926105385779768,
 'pair_ao_f1': 0.14426229503281143,
 'pair_ta_f1': 0.2990881458468997,
 'pair_to_f1': 0.11902231663541006,
 'quad_micro_f1': 0.084703537569097,
 'target_f1': 0.5363457759829979} 
  3. Paper statistics
(figure omitted: statistics reported in the DiaASQ paper)
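
A minimal sketch of the delimiter idea from point 1 (not the actual prompt code in full_dialogue_dataset.py): each in-context example is wrapped with an explicit marker so ChatGPT does not treat it as another dialogue to annotate.

# Hypothetical prompt builder for illustration only.
def build_prompt(instruction, ic_examples, test_dialogue):
    parts = [instruction]
    for i, (dialogue, answer) in enumerate(ic_examples, 1):
        parts.append(f'### Example {i}\nDialogue:\n{dialogue}\nAnswer:\n{answer}')
    parts.append('### Now annotate this dialogue (the examples above are for reference only)\n'
                 f'Dialogue:\n{test_dialogue}\nAnswer:')
    return '\n\n'.join(parts)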

⏰ [exp] DiaASQ

Discussion / Stop and Think

What's left to do for this dataset?

  1. DiaASQ Full Dialogue, en version (ChatGPT only)
  2. DiaASQ zh
    • #15
    • Speaker-Specific DiaASQ zh
  3. Dynamic / not-fixed in-context examples?
    • Following "What Makes Good In-Context Examples for GPT-3?"
    • A different in-context example for each test example, e.g. in the speaker-specific version, a Speaker A in-context example for a Speaker A test example (see the retrieval sketch below).
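
A minimal sketch of retrieval-based in-context example selection in the spirit of that paper (not implemented in this repo; the sentence-transformers model name is an arbitrary choice for illustration):

# kNN retrieval of in-context examples: pick the pool examples most similar to the test input.
from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

def pick_in_context(test_text, pool_texts, k=1):
    test_vec = encoder.encode([test_text])              # (1, dim)
    pool_vecs = encoder.encode(pool_texts)               # (n, dim)
    test_vec = test_vec / np.linalg.norm(test_vec, axis=1, keepdims=True)
    pool_vecs = pool_vecs / np.linalg.norm(pool_vecs, axis=1, keepdims=True)
    sims = (pool_vecs @ test_vec.T).squeeze(-1)          # cosine similarity to the test example
    return np.argsort(-sims)[:k]                         # indices of the k nearest pool examples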

Evaluate if it is possible to move the whole research direction toward dialogue SA

  • We still have CASA as another dialogue SA choice.
  • However, CASA is not easy to model: its subtasks are mention, opinion, and polarity identification, so the dataset is formulated differently from triplet-extraction tasks and would need extra effort to convert.

[exp] DiaASQ T5 compute_metrics in Trainer

compute_metrics

I wrote a compute_metrics but needed to know what the EvalPrediction object contains.
Solution: pickle it out and study it until I can recover strings from it. The code below is used for that testing. Oddly, a tokenizer initialized with the same model name seems to give a different pad_token_id.

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    data_collator=collate_fn,
    compute_metrics=calc_sentiment_scores(tokenizer),
)
#%%
import pickle
pred_path = './preds.pkl'
label_path = './labels.pkl'
evalpred_path = './evalprediction.pkl'



def load_pickle(path):
    with open(path, 'rb') as f:
        return pickle.load(f)

evalpred = load_pickle(evalpred_path)
from transformers import AutoTokenizer
model_name = 'allenai/tk-instruct-base-def-pos'
tokenizer = AutoTokenizer.from_pretrained(model_name)
#%%
import numpy as np
p = evalpred.predictions[0]
p = np.argmax(p, axis=-1)
print('p:', p.shape)
p = tokenizer.batch_decode(p, skip_special_tokens=True, clean_up_tokenization_spaces=True)
print('p:', p)
#%%
l = evalpred.label_ids
l = np.where(l != -100, l, tokenizer.pad_token_id)  # swap the -100 ignore index for pad ids so decoding works
l = tokenizer.batch_decode(l, skip_special_tokens=True, clean_up_tokenization_spaces=True)
print('l:', l)
# print('preds:', preds)
# print('len(preds):', len(preds))          # 2 (why 2??)
# print('shape(preds[0]):', preds[0].shape)
# batch_size * max_output_len * vocab_size.
# shape(preds[1]): (290, 512, 768)

# %%
# labels shape: labels.shape (290, 183)
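
Based on the shapes inspected above, a compute_metrics could look roughly like this (a sketch, not necessarily the repo's calc_sentiment_scores; parse_and_score is a hypothetical helper that parses the generated tuples and returns the F1 dict):

import numpy as np

def calc_sentiment_scores(tokenizer):
    def compute_metrics(eval_pred):
        preds = eval_pred.predictions
        if isinstance(preds, tuple):       # predictions[0] holds the logits; predictions[1] looks like encoder states
            preds = preds[0]
        if preds.ndim == 3:                # (batch, max_output_len, vocab_size) logits -> token ids
            preds = np.argmax(preds, axis=-1)
        labels = np.where(eval_pred.label_ids != -100,
                          eval_pred.label_ids, tokenizer.pad_token_id)
        pred_str = tokenizer.batch_decode(preds, skip_special_tokens=True)
        label_str = tokenizer.batch_decode(labels, skip_special_tokens=True)
        return parse_and_score(pred_str, label_str)
    return compute_metrics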

[experiment] llm + diaASQ [en]

ChatGPT

  • To compare with the T5 version, use the same instruction.
  • Experiment with en first, then zh (the Chinese version).
  • Following 易庭's suggestion, switch to writing the data (zh) with a dataclass.

Configs

# reference: https://tsmatz.wordpress.com/2022/11/25/huggingface-japanese-summarization/
# Note : Do not use FP16 precision in mT5 fine-tuning.
seed: 42
data:
#   data_root: 'data/diaasq/speaker_dataset'
#   train_split_name: 'train'
#   test_split_name: 'valid'
  lang_src : 'en'

# proc_data and dataset follows the diaasq-t5-speaker-spec-en.yaml for experiment comparison
proc_data:
  type: 'speaker'
  data_root: 'data/diaasq/speaker_dataset/proc'
  train_ic_name: 't5_in_context' # use t5/create_kshot_dataset_split.py
  t5_train_split_name: 't5_train'
  test_ic_name: 't5_in_context'  # use the same in-context examples as in training
  t5_test_split_name: 't5_valid'

dataset:
  name: 'diaasq-speaker-spec-en'
  k: 1
  prompt_path: 'prompt/experiment/diaasq-speaker-spec-en-t5'
  in_context_strategy: None

model:
  model_name: 'gpt-3.5-turbo'
  max_tokens: 256 # t5: generation_max_length: 256 # t5: # max_length: 512
  temperature: 0

# private keys
envfile: './envs/.env'
output_dir: './results'
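
A rough sketch of how this config might drive the ChatGPT call (not the repo's runner; the config path, the python-dotenv usage, and the OPENAI_API_KEY variable name are assumptions), using the legacy openai<1.0 ChatCompletion API:

import os, yaml, openai
from dotenv import load_dotenv

cfg = yaml.safe_load(open('configs/diaasq-gpt-speaker-spec-en.yaml'))   # hypothetical config path
load_dotenv(cfg['envfile'])
openai.api_key = os.environ['OPENAI_API_KEY']           # assumed key name in the .env file

def query(prompt: str) -> str:
    resp = openai.ChatCompletion.create(
        model=cfg['model']['model_name'],                # gpt-3.5-turbo
        messages=[{'role': 'user', 'content': prompt}],
        max_tokens=cfg['model']['max_tokens'],           # 256, matching the t5 generation_max_length
        temperature=cfg['model']['temperature'],         # 0 for deterministic outputs
    )
    return resp['choices'][0]['message']['content']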

[survey] DiaASQ

DiaASQ

2023 paper

Method summary

  1. A base encoder learns base contextual representations.
  2. Multi-view interaction layers apply three feature masks (thread mask, speaker mask, reply mask) and max-pool over the masked attentions (toy sketch below).
  3. RoPE (rotary position embedding).
  4. Grid tagging for decoding.
    (figure omitted: model architecture from the paper)
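
A toy sketch of the multi-view masked-attention idea in step 2 (not the authors' implementation; a real layer would add projections, multiple heads, and RoPE):

import torch
import torch.nn.functional as F

def multi_view_attention(h, thread_mask, speaker_mask, reply_mask):
    # h: (batch, seq_len, dim); each mask: (batch, seq_len, seq_len) with 1 = may attend.
    # Assumes every mask allows at least self-attention (diagonal = 1), otherwise softmax would produce NaNs.
    scores = h @ h.transpose(1, 2) / h.size(-1) ** 0.5
    views = []
    for mask in (thread_mask, speaker_mask, reply_mask):
        masked = scores.masked_fill(mask == 0, float('-inf'))
        views.append(F.softmax(masked, dim=-1) @ h)
    return torch.amax(torch.stack(views), dim=0)         # max pooling over the three masked views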

Results

  • All F1 scores are strict F1 (spans and elements must match exactly).
    (figure omitted: main results table from the paper)

Ablation

  • removing all feature masks
    (figure omitted: ablation results from the paper)

[exp] t5 + diaASQ [en]

T5-Generation Fine-tuning Task

Data Preparation

  • Can only use the specially preprocessed speaker-specific dataset, because the max input length for T5 is 512.
  • Preprocessing
    1. lib/create_speaker_data.py --cfg=configs/diaasq-t5-en.yaml creates the speaker-specific data. Note: do not
      use the speaker-spec configs here. The resulting data is saved to /home/nanaeilish/projects/research/sentiment-llm/data/diaasq/speaker_dataset/jsons_en (taking en as lang_src for example; modify the config if needed).
    2. lib/create_kshot_dataset_split.py --cfg=configs/diaasq-t5-speaker-spec-en.yaml --is_speaker_dataset creates the k-shot in-context examples for the speaker-specific data. The script reuses the data created above and saves the result to data/diaasq/speaker_dataset/proc/jsons_en.
    3. Don't forget to set speaker in step 2 so that the in-context examples all contain the same speaker.
  • Prompts need extra design
    • To be precise, "in-context example design".
    1. Filter the train set, keeping the speaker data examples with 3 sentiment tuples (complicated and demonstrative enough).
    2. Sort the data examples by number of triplets and choose the k shots with the fewest. T5 has a very strict input length limit (512), which barely fits a single data example (hence k = 1 in the experiment below). A selection sketch follows this list.
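
A minimal sketch of the selection logic described above (field names such as 'speaker' and 'triplets' are assumptions; the actual code is in lib/create_kshot_dataset_split.py and may differ):

def select_in_context_examples(train_examples, speaker, k=1, min_tuples=3):
    # keep only this speaker's examples that carry at least 3 sentiment tuples (demonstrative enough)
    pool = [ex for ex in train_examples
            if ex['speaker'] == speaker and len(ex['triplets']) >= min_tuples]
    # prefer the shortest demonstrations: T5's 512-token budget is tight,
    # so take the k candidates with the fewest triplets
    pool.sort(key=lambda ex: len(ex['triplets']))
    return pool[:k]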

Eval metrics (micro)

  • target, aspect, opinion F1
  • $\text{pair}_{t\text{-}a}$, $\text{pair}_{t\text{-}o}$, $\text{pair}_{a\text{-}o}$ F1
  • quadruple F1
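
Since all of these are strict micro F1 scores, a minimal sketch of how such a score can be computed over exact-match tuples (not the repo's evaluation code):

def strict_micro_f1(pred_tuples, gold_tuples):
    # pred_tuples / gold_tuples: sets of (doc_id, element_1, ..., element_n) over the whole eval set,
    # e.g. (doc_id, target, aspect, opinion, polarity) for the quadruple score
    pred, gold = set(pred_tuples), set(gold_tuples)
    tp = len(pred & gold)                 # a tuple counts only if every element matches exactly
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0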

Loss

The order (of the generated tuples) does not matter for evaluation, but (needs inspection) it does seem to matter for the seq2seq training loss:
loss = loss_fct(lm_logits.view(-1, lm_logits.size(-1)), labels.view(-1))
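
For context, a self-contained toy version of that loss (it mirrors the token-level cross entropy HuggingFace uses for T5, with -100 marking ignored label positions; the shapes are dummies for illustration):

import torch
import torch.nn as nn

batch_size, tgt_len, vocab_size = 2, 5, 32128          # dummy shapes
lm_logits = torch.randn(batch_size, tgt_len, vocab_size)
labels = torch.randint(0, vocab_size, (batch_size, tgt_len))
labels[:, -1] = -100                                   # e.g. padded label positions are ignored

loss_fct = nn.CrossEntropyLoss(ignore_index=-100)
loss = loss_fct(lm_logits.view(-1, lm_logits.size(-1)), labels.view(-1))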

[test] DiaASQ dataset

Tests in the sense of unit tests / pytest.

  • Since I am constantly refactoring the code, writing tests takes too much time and they become outdated too quickly.
