riow1983 / kaggle-coleridge-initiative Goto Github PK

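Whether it is spaCy or NERDA, both are ultimately wrappers, so they may not suit competitions, where in the end you cannot tune everything by hand. In that sense, huggingface + PyTorch may be the standard route.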
https://www.reddit.com/r/LanguageTechnology/comments/lnca2q/some_questions_about_spacy_vs_hugging_face/

Found a Colab notebook that fine-tunes huggingface's pre-trained BERT on an NER task with PyTorch XLA (TPU).
https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_ner.ipynb

On top of that, the training data is a Kaggle dataset: https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus
(not from a competition, but a dataset created more than 4 years ago)

Originally posted by @riow1983 in #6 (comment)

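[Solutions] Prepare competition overview material

Summarize the following three points in one or two slides:

Competition overview
Dataset description
Evaluation metric description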

[Solutions] 43rd (Ours!)

[Discussion]
{TITLE}

[Code]
{TITLE}

[In a nutshell]
{SAY SOMETHING}

[Related techniques]

{NAME}
{TITLE}

[Solutions] 4th

[Discussion]
4th place solution - LB probing, acronym detection, and NER

[Twitter]
https://twitter.com/osciiart/status/1407602369156698116

[Code]
210622_det1_NERu_train_govt

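[In a nutshell]
Dataset-name candidates detected by rules, with false positives (FP) removed + string matching against external data and detection of spelling variants with spaCy NER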

[Related techniques]

Correcting spelling variants with NER (?)
[{NAME}]({URL})

[GitHub]
vlomme/Find-dataset-from-text

[In a nutshell]
Uses regular expressions and BERT NER instead of string matching.

[Related knowledge]

Offset mapping for BERT's word-piece tokenization
🤗 Fine-tuning with custom datasets
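
Offset mapping is what lets character-level annotations be converted into token-level NER labels. A minimal sketch, assuming the tokenizer has already returned one (start, end) character offset per token (Hugging Face fast tokenizers do this with `return_offsets_mapping=True`). The function name, the `DATASET` tag, and the hand-written offsets are hypothetical and only for illustration:

```python
from typing import List, Tuple


def bio_labels_from_offsets(
    offsets: List[Tuple[int, int]],
    entity_spans: List[Tuple[int, int]],
    tag: str = "DATASET",
) -> List[str]:
    """Assign B-/I-/O labels to tokens from character-level entity spans."""
    labels = []
    for tok_start, tok_end in offsets:
        label = "O"
        if tok_start != tok_end:  # (0, 0) marks special tokens such as [CLS]
            for ent_start, ent_end in entity_spans:
                # A token belongs to the entity if it lies fully inside the span.
                if tok_start >= ent_start and tok_end <= ent_end:
                    label = ("B-" if tok_start == ent_start else "I-") + tag
                    break
        labels.append(label)
    return labels


text = "We used the ADNI dataset."
# Hand-written offsets for: [CLS] we used the ad ##ni dataset . [SEP]
offsets = [(0, 0), (0, 2), (3, 7), (8, 11), (12, 14), (14, 16),
           (17, 24), (24, 25), (0, 0)]
entity_spans = [(12, 16)]  # character span of "ADNI"
labels = bio_labels_from_offsets(offsets, entity_spans)
```

Here the second word piece of "ADNI" receives an I- label; labelling only the first sub-token and masking the rest is an equally common choice.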

Look for a notebook that uses Big Bird

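Create a Group KFold CV using the categorized cleaned_label as groups (on train.csv)

The target is the unprocessed train.csv
Run Group KFold with the categorized cleaned_label as the group
- Which variable (column) to use as the target label is undecided (anything will do?)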
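
The split described above can be sketched in plain Python as follows. This is only an illustration under assumptions: the sample labels stand in for the cleaned_label column, the helper name `group_kfold_assign` is hypothetical, and the greedy balancing is one simple way to keep every group inside a single fold. In practice scikit-learn's `GroupKFold` would likely be used instead.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def group_kfold_assign(groups: Sequence[Hashable], n_splits: int) -> List[int]:
    """Return a fold index per row such that each group stays in one fold."""
    sizes = Counter(groups)
    if n_splits > len(sizes):
        raise ValueError("n_splits cannot exceed the number of distinct groups")
    fold_load = [0] * n_splits
    group_to_fold: Dict[Hashable, int] = {}
    # Greedy balancing: largest groups first, each into the currently lightest fold.
    for group, size in sorted(sizes.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        fold = fold_load.index(min(fold_load))
        group_to_fold[group] = fold
        fold_load[fold] += size
    return [group_to_fold[g] for g in groups]


# Hypothetical rows standing in for the cleaned_label column of train.csv.
cleaned_label = ["adni", "adni", "adni", "ibtracs", "ibtracs", "slosh model", "cccsl"]
# "Categorize" the labels: integer codes in order of first appearance.
codes = {label: i for i, label in enumerate(dict.fromkeys(cleaned_label))}
groups = [codes[label] for label in cleaned_label]
folds = group_kfold_assign(groups, n_splits=2)
```

Rows whose fold index equals k form the validation set of fold k. Because the split depends only on the groups, the choice of target column does not affect it.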

[Reference: all 130 cleaned_label values from train.csv]

0 national education longitudinal study
1 noaa tidal station
2 slosh model
3 noaa c cap
4 aging integrated database agid 
5 alzheimers disease neuroimaging initiative
6 aging integrated database
7 noaa national water level observation network
8 noaa water level station
9 baltimore longitudinal study of aging blsa 
10 national water level observation network
11 arms farm financial and crop production practices
12 beginning postsecondary student
13 noaa sea lake and overland surges from hurricanes
14 noaa tide gauge
15 the national institute on aging genetics of alzheimer s disease data storage site
16 national center for education statistics common core of data
17 national science foundation survey of industrial research and development
18 baccalaureate and beyond
19 noaa international best track archive for climate stewardship
20 agricultural resource management survey
21 national teacher and principal survey
22 international best track archive for climate stewardship
23 nsf higher education research and development survey
24 national science foundation survey of earned doctorates
25 school survey on crime and safety
26 the national institute on aging genetics of alzheimer s disease data storage site niagads 
27 national oceanic and atmospheric administration world ocean database
28 beginning postsecondary students longitudinal study
29 nces common core of data
30 program for the international assessment of adult competencies
31 survey of earned doctorates
32 baltimore longitudinal study of aging
33 early childhood longitudinal study
34 adni
35 national science foundation survey of graduate students and postdoctorates in science and engineering
36 trends in international mathematics and science study
37 national oceanic and atmospheric administration c cap
38 nsf survey of earned doctorates
39 noaa tide station
40 education longitudinal study
41 optimum interpolation sea surface temperature
42 national oceanic and atmospheric administration optimum interpolation sea surface temperature
43 alzheimer s disease neuroimaging initiative adni 
44 baccalaureate and beyond longitudinal study
45 agricultural resources management survey
46 beginning postsecondary students
47 ibtracs
48 coastal change analysis program
49 survey of graduate students and postdoctorates in science and engineering
50 national assessment of education progress
51 sea surface temperature optimum interpolation
52 high school longitudinal study
53 nsf survey of graduate students and postdoctorates in science and engineering
54 national science foundation survey of doctorate recipients
55 survey of doctorate recipients
56 coastal change analysis program land cover
57 survey of industrial research and development
58 world ocean database
59 rural urban continuum codes
60 noaa optimum interpolation sea surface temperature
61 noaa world ocean database
62 common core of data
63 higher education research and development survey
64 noaa storm surge inundation
65 national weather service nws storm surge risk
66 survey of science and engineering research facilities
67 nsf survey of industrial research and development
68 national science foundation survey of science and engineering research facilities
69 national science foundation higher education research and development survey
70 national center for science and engineering statistics survey of earned doctorates
71 national center for science and engineering statistics survey of science and engineering research facilities
72 national center for science and engineering statistics survey of graduate students and postdoctorates in science and engineering
73 national center for science and engineering statistics survey of doctorate recipients
74 national center for science and engineering statistics survey of industrial research and development
75 national center for science and engineering statistics higher education research and development survey
76 nsf survey of science and engineering research facilities
77 ffrdc research and development survey
78 nsf ffrdc research and development survey
79 survey of state government research and development
80 ncses survey of doctorate recipients
81 ncses survey of graduate students and postdoctorates in science and engineering
82 anss comprehensive earthquake catalog
83 anss comprehensive catalog
84 advanced national seismic system anss comprehensive catalog comcat 
85 advanced national seismic system comprehensive catalog
86 census of agriculture
87 usda census of agriculture
88 nass census of agriculture
89 north american breeding bird survey
90 north american breeding bird survey bbs 
91 usgs north american breeding bird survey
92 covid 19 open research dataset cord 19 
93 covid 19 open research dataset
94 covid open research dataset
95 covid 19 open research data
96 complexity science hub covid 19 control strategies list cccsl 
97 complexity science hub covid 19 control strategies list
98 cccsl
99 our world in data covid 19 dataset
100 our world in data covid 19
101 our world in data
102 jh crown registry
103 characterizing health associated risks and your baseline disease in sars cov 2 charybdis 
104 characterizing health associated risks and your baseline disease in sars cov 2
105 covid 19 death data
106 sars cov 2 genome sequence
107 sars cov 2 genome sequences
108 covid 19 genome sequence
109 covid 19 genome sequences
110 2019 ncov genome sequence
111 2019 ncov genome sequences
112 sars cov 2 full genome sequence
113 sars cov 2 full genome sequences
114 sars cov 2 complete genome sequence
115 sars cov 2 complete genome sequences
116 2019 ncov complete genome sequences
117 genome sequences of sars cov 2
118 genome sequence of sars cov 2
119 genome sequence of covid 19
120 genome sequences of covid 19
121 genome sequence of 2019 ncov
122 genome sequences of 2019 ncov
123 covid 19 image data collection
124 rsna international covid 19 open radiology database ricord 
125 rsna international covid 19 open radiology database
126 rsna international covid open radiology database
127 cas covid 19 antiviral candidate compounds dataset
128 cas covid 19 antiviral candidate compounds data set
129 cas covid 19 antiviral candidate compounds data
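
The entries above look like the output of a normalization that lowercases the text and collapses every run of non-alphanumeric characters into a single space (note the trailing space where a closing parenthesis used to be). A sketch under that assumption; the raw title strings are guesses for illustration and are not taken from train.csv:

```python
import re


def clean_text(txt: str) -> str:
    """Lowercase and replace each run of non-alphanumeric characters with one space."""
    return re.sub(r"[^A-Za-z0-9]+", " ", str(txt).lower())


# Hypothetical raw titles, chosen so the result can be compared with the list above.
raw_adni = "Alzheimer's Disease Neuroimaging Initiative (ADNI)"
raw_cord = "COVID-19 Open Research Dataset (CORD-19)"
cleaned_adni = clean_text(raw_adni)
cleaned_cord = clean_text(raw_cord)
```

Under this assumption the two results match entries 43 and 92, including the trailing space.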

reference

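[Submission] How to choose the final submissions

Two submissions:
- One is simply the submission with the highest Public LB score
- The other is a submission that does not appear to be overfitting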

[Solutions] 1st

[Discussion]
1st place solution: Metric learning and GPT

1st solution: Matching the : Context Similarity via Deep Metric Learning and Beyond

[Code]

[GitHub]

https://github.com/suicao/coleridge-gpt

[In a nutshell]
Two approaches, each prepared separately:

GPT (QA task) + beam search (private LB = 0.565 (0.594 w/o labels obtained by scispaCy's abbreviation detector)) by Khoi Nguyen
metric learning with usual MLM backbones (private LB = 0.576 (0.588 w/o labels obtained by scispaCy's abbreviation detector)) by Nguyen Quan Anh Minh

[Related knowledge]

Source code for pytorch_transformers.modeling_gpt2
torch.utils.data.TensorDataset
https://schemer1341.hatenablog.com/entry/2019/01/06/024605
Is it okay to create a Loss function within the forward method?
ArcFace loss
Modern deep metric learning methods: SphereFace, CosFace, ArcFace
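
As a rough illustration of the ArcFace idea listed above: the angle between an embedding and its target class center gets an additive margin before the softmax, which makes the target class harder to satisfy during training. This is a simplified sketch, not the 1st place implementation. The values of `s` and `m` and the cosine inputs are illustrative assumptions, and the handling of angles that exceed pi after the margin is omitted.

```python
import math
from typing import List


def arcface_logits(cosines: List[float], target: int,
                   s: float = 30.0, m: float = 0.5) -> List[float]:
    """Add angular margin m to the target class angle, then scale all logits by s."""
    logits = []
    for idx, cos_theta in enumerate(cosines):
        cos_theta = max(-1.0, min(1.0, cos_theta))  # guard acos against rounding
        if idx == target:
            theta = math.acos(cos_theta)
            cos_theta = math.cos(theta + m)
        logits.append(s * cos_theta)
    return logits


def cross_entropy(logits: List[float], target: int) -> float:
    """Softmax cross-entropy for a single example, computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


# Hypothetical cosine similarities between one embedding and three class centers.
cosines = [0.8, 0.1, -0.3]
plain = arcface_logits(cosines, target=0, m=0.0)
with_margin = arcface_logits(cosines, target=0, m=0.5)
loss_plain = cross_entropy(plain, target=0)
loss_margin = cross_entropy(with_margin, target=0)
```

With the margin, the target logit shrinks while the other logits stay the same, so the loss for the same embedding is larger and the model is pushed toward tighter class clusters.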

QA-based approaches seem to be failing (NER seems to be working better)

My score is quite bad as well. I suspect the Question Answering approach is not really adapted for the task. Time to move to NER :)
https://www.kaggle.com/theoviel/bert-for-question-answering-baseline-training/data#1273611
