kaggle-coleridge-initiative's Issues
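[Solutions] Summary of business value and technical value
Summarize the following two perspectives on a single slide:
- Business value: how the results could be applied in business
- Technical value: which techniques we learned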
[Solutions] 2nd
[Discussion]
2nd place solution overview
[Code]
{TITLE}
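[In a nutshell]
Extracted dataset candidates of the form "LONG-FORM (ACRONYM)" with the training-free Schwartz-Hearst algorithm, then used a roberta-base binary classifier to predict whether each candidate is a dataset.
[Related techniques]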
- Schwartz-Hearst algorithm
A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text
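As a rough illustration of the "LONG-FORM (ACRONYM)" extraction step, here is a minimal sketch in the spirit of the Schwartz-Hearst algorithm. It is a simplified assumption-laden version, not the 2nd place team's code: the function names `best_long_form` and `extract_pairs`, the parenthesis regex, and the uppercase filter are illustrative choices, and several validity checks from the original paper are omitted.

```python
import re


def best_long_form(short_form, candidate):
    """Match short-form characters right to left against the candidate text.

    Returns the shortest suffix of `candidate` that contains every
    alphanumeric character of `short_form` in order, where the first
    character must start a word. Returns None if no match exists.
    """
    s_idx = len(short_form) - 1
    l_idx = len(candidate) - 1
    while s_idx >= 0:
        ch = short_form[s_idx].lower()
        if not ch.isalnum():
            s_idx -= 1
            continue
        # Move left until the character matches; the first short-form
        # character must additionally sit at the beginning of a word.
        while l_idx >= 0 and (
            candidate[l_idx].lower() != ch
            or (s_idx == 0 and l_idx > 0 and candidate[l_idx - 1].isalnum())
        ):
            l_idx -= 1
        if l_idx < 0:
            return None
        l_idx -= 1
        s_idx -= 1
    start = candidate.rfind(" ", 0, l_idx + 1) + 1
    return candidate[start:]


def extract_pairs(text):
    """Return (long form, short form) pairs for 'Long Form (SF)' patterns."""
    pairs = []
    # Illustrative pattern: a 2-10 character token inside parentheses.
    for m in re.finditer(r"\(([A-Za-z0-9][A-Za-z0-9\-]{1,9})\)", text):
        short = m.group(1)
        if not any(c.isupper() for c in short):
            continue  # skip things like "(2019)" or "(see)"
        words = text[: m.start()].split()
        # Search window size used in the paper: min(|A| + 5, |A| * 2) words.
        max_words = min(len(short) + 5, len(short) * 2)
        candidate = " ".join(words[-max_words:])
        long_form = best_long_form(short, candidate)
        if long_form:
            pairs.append((long_form, short))
    return pairs


if __name__ == "__main__":
    sample = "We used the Alzheimer's Disease Neuroimaging Initiative (ADNI) data."
    print(extract_pairs(sample))
```

The extracted long forms would then be the candidates passed to a downstream classifier such as the roberta-base binary classifier mentioned above.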
[Solutions] 13th
[Discussion]
13th place solution
[Code]
coleridge_dataset_extractor
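[In a nutshell]
Without using any trained model, extracted words with spaCy's Sentencizer class, which provides functionality comparable to regular expressions.
[Related techniques]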
- spaCy’s sentencizer
Sentencizer
[Solutions] 44th
[Reason for adding]
The author is a Kaggle Grandmaster (GM), and the use of Dice loss is interesting.
[Discussion]
44th place solution
[Code]
Coleridge NER Inference
[In a nutshell]
RoBERTa NER model w/ Dice loss + post-processing to reduce false positives (FP)
[Related techniques]
- Dice loss
Dice Loss for Data-imbalanced NLP Tasks
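For readers unfamiliar with Dice loss, the sketch below shows the plain soft Dice loss for binary labels. It is only a minimal illustration: the function name `soft_dice_loss` and the `smooth` default are assumptions, the referenced paper describes further variants, and the 44th place code may well differ.

```python
def soft_dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss for binary labels.

    probs:   predicted probabilities of the positive class, one per token
    targets: gold labels (0 or 1), one per token
    smooth:  smoothing term that keeps the ratio defined when both are empty
    """
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    dice = (2.0 * intersection + smooth) / (denominator + smooth)
    return 1.0 - dice


if __name__ == "__main__":
    gold = [1, 0, 0, 0, 0, 0, 0, 0]            # one entity token among many "O"s
    print(soft_dice_loss([1.0] + [0.0] * 7, gold))  # perfect prediction
    print(soft_dice_loss([0.0] * 8, gold))          # predicting all negatives
```

Because the loss is built from the overlap between predictions and positives rather than from per-token accuracy, predicting "all negative" is penalized even when positive tokens are rare, which is the usual motivation for using it on imbalanced NER data.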
[Solutions] 6th
[Discussion]
6th place solution (lucky novices!)
[Code]
{TITLE}
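[In a nutshell]
Sentence classification with spaCy 2 (whether or not a sentence contains a dataset mention), followed by token classification on the sentences that do.
[Related techniques]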
- Abbreviation detector
abbreviation detector
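[Solutions] Proofreading the entire slide deck (checking from the viewpoint of someone unfamiliar with the competition)
- Ask each owner to revise any part whose explanation seems hard to follow (issue reopen)
- Tidying up the presentation:
  - Structure check
  - Unify fonts, font sizes, and design
  - Section division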
[Solutions] 47th
[Reason for adding]
Interesting because it uses neither fine-tuning nor string matching.
[Discussion]
47th place solution - no training, no dataset label string matching
[Code]
coleridge_regex_electra
[In a nutshell]
Combined regex-based string matching with answers from a QA model (huggingface's pre-trained ELECTRA, used as-is without any additional fine-tuning).
[Related techniques]
- Machine Reading Comprehension (MRC) for NER
A Unified MRC Framework for Named Entity Recognition
- ELECTRA (QA model)
ELECTRA_large_discriminator language model fine-tuned on SQuAD2.0
[Merge] Try replacing the spaCy model with a huggingface model and submitting
Sharing a version with the spaCy part commented out and the unnecessary comments removed.
https://www.kaggle.com/ti110106/fork-of-ex-data-patern-spacy3-tr-comment-out?scriptVersionId=65886474
[Solutions] 17th
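[Reason for adding]
I found it unique that, within dataset-mention spans, every word other than the "neutral" ones is masked with a special token from {[TITLE], [UPPER], [MIXED]}, which can be assigned by rules, to keep the NER model from overfitting.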
[Discussion]
17th Place Solution - SpaCy 3 (EntityRuler) and NER CRF model
[Code]
{TITLE}
[In a nutshell]
Dataset candidates proposed by either the rule-based approach or the NER model are filtered with LightGBM.
[Related techniques]
- spaCy Entity Ruler
EntityRuler
- spaCy abbreviation detector
allenai/scispacy
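The special-token masking mentioned for the 17th place solution might look roughly like the sketch below. The exact rules are not given here, so everything in it is an assumption: words without any uppercase letter are treated as "neutral" and kept, and the remaining words are replaced by a token describing their casing. The names `shape_token` and `mask_mention` are hypothetical.

```python
def shape_token(word):
    """Map a word to a casing token, or keep it if it is 'neutral'.

    Assumption: 'neutral' means the word has no uppercase letter
    (e.g. "of", "and", "the"); the actual rules may differ.
    """
    if not any(c.isupper() for c in word):
        return word            # neutral word, kept as-is
    if word.isupper():
        return "[UPPER]"       # e.g. "NOAA"
    if word.istitle():
        return "[TITLE]"       # e.g. "National"
    return "[MIXED]"           # e.g. "IBTrACS"


def mask_mention(words):
    """Apply the casing mask to every word of a dataset-mention span."""
    return [shape_token(w) for w in words]


if __name__ == "__main__":
    print(mask_mention(["National", "Survey", "of", "NOAA", "IBTrACS"]))
```

The idea is that the model then sees only the shape of a dataset name rather than its surface form, so it cannot simply memorize the dataset titles that appear in train.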
[Data processing] Convert nb009 to .spacy format
[Solutions] 14th
[Discussion]
14th Place Solution (with notebooks)
[Code]
- Additional hand-labeled dataset titles (from train only): https://www.kaggle.com/lichena/coleridgehandlabeled
- Preprocessing: https://www.kaggle.com/lichena/coleridge-pre-processing-hand
- Training: https://www.kaggle.com/lichena/coleridge-ner?scriptVersionId=66238738
- Inference: https://www.kaggle.com/lichena/coleridge-ner-inference
[In a nutshell]
Fine-tuned bert-base-cased + CRF on BIO-tagged chunks of ~200-400 words, using "hand labels" picked out of train by visual inspection.
[Related techniques]
- PyTorch CRF
pytorch-crf
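To make "BIO-tagged chunks" concrete, the following is a minimal sketch of how a tokenized document could be tagged and split. It assumes whitespace tokens, a single mention label, and case-insensitive exact matching; `bio_tag` and `chunk` are hypothetical helpers, not functions from the 14th place notebooks.

```python
def bio_tag(tokens, mention_tokens):
    """Tag every occurrence of the mention with B (first token) and I (rest)."""
    tags = ["O"] * len(tokens)
    n = len(mention_tokens)
    if n == 0:
        return tags
    lowered = [t.lower() for t in tokens]
    target = [t.lower() for t in mention_tokens]
    for i in range(len(tokens) - n + 1):
        if lowered[i:i + n] == target:
            tags[i] = "B"
            for j in range(i + 1, i + n):
                tags[j] = "I"
    return tags


def chunk(tokens, tags, size=300):
    """Split tokens and tags into consecutive chunks of at most `size` words.

    Note: this naive split can cut a mention in half at a chunk boundary.
    """
    return [
        (tokens[i:i + size], tags[i:i + size])
        for i in range(0, len(tokens), size)
    ]


if __name__ == "__main__":
    words = "We use the World Ocean Database for analysis".split()
    tags = bio_tag(words, "World Ocean Database".split())
    print(list(zip(words, tags)))
    print(len(chunk(words, tags, size=4)))
```

The default chunk size of 300 is simply a value inside the ~200-400 word range quoted above.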
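Build an NER model on the CoNLL Corpora (2003) (using huggingface + PyTorch)
Build an NER model on the CoNLL data, which could be called the MNIST of NER, and carry the resulting framework over to this competition. Since time is short, skip the TPU implementation.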
reference:
[1] https://medium.com/analytics-vidhya/ner-tensorflow-2-2-0-9f10dcf5a0a
[2] https://medium.com/analytics-vidhya/fine-tuning-bert-for-ner-on-conll-2003-dataset-with-tf-2-2-0-2f242ca2ce06
huggingface transformers + PyTorch for NER task fine-tuning
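Whether spaCy or NERDA, they are ultimately wrappers, and the fact that you cannot push them all the way in the end may make them a poor fit for competitions. In that sense, huggingface + PyTorch may be the standard route.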
https://www.reddit.com/r/LanguageTechnology/comments/lnca2q/some_questions_about_spacy_vs_hugging_face/
Found a Colab notebook that fine-tunes huggingface's pre-trained BERT on an NER task with PyTorch XLA (TPU).
https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_ner.ipynb
What's more, the training data is a Kaggle dataset: https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus
(not from a competition, though; it is a dataset created more than four years ago)
Originally posted by @riow1983 in #6 (comment)
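[Solutions] Create competition overview slides
Summarize the following three points in one or two slides:
- Competition overview
- Dataset description
- Evaluation metric description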
[Solutions] 43rd (Ours!)
[Solutions] 4th
[Discussion]
4th place solution - LB probing, acronym detection, and NER
[Twitter]
https://twitter.com/osciiart/status/1407602369156698116
[Code]
210622_det1_NERu_train_govt
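[In a nutshell]
Rule-based detection of dataset-name candidates with FPs removed + external-data string matching and detection of name variants with spaCy NER
[Related techniques]
- Correcting name variants with NER (?)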
[{NAME}]({URL}]
Exploring fine-tuning with NERDA
NERDA = a wrapper around language models such as BERT (huggingface, PyTorch based)
[Official documentation]
https://ebanalyse.github.io/NERDA/
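Survey solutions from past competitions, especially for NER tasks
Test linking issues to commits
Create CV data (5 folds) with sequence length = {512, 4096}
Create it using the implementation in Data preparation for NER.
Find a baseline notebook that uses BERT
string matching a.k.a. LB probing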
[Solutions] 5th
[Discussion]
5th place solution
[Code]
Coleridge Initiative 5th place
[GitHub]
vlomme/Find-dataset-from-text
[In a nutshell]
Used regular expressions and BERT NER, without string matching.
[Related knowledge]
- Offset mapping for BERT's word-piece tokenization
🤗 Fine-tuning with custom datasets
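Offset mapping is what lets character-level annotations be projected onto word-piece tokens. The sketch below illustrates the idea with hand-written offsets standing in for the `offset_mapping` a huggingface fast tokenizer returns when called with `return_offsets_mapping=True`. The token split shown in the comments is hypothetical, and `span_to_token_labels` is an illustrative helper rather than code from the 5th place solution.

```python
def span_to_token_labels(offsets, span_start, span_end):
    """Project one character span [span_start, span_end) onto tokens.

    offsets: list of (start, end) character offsets, one pair per token.
             Special tokens are assumed to carry an empty (0, 0) offset.
    Returns "B"/"I"/"O" per token, or None for special tokens.
    """
    labels = []
    started = False
    for start, end in offsets:
        if start == end:
            labels.append(None)                    # special token, ignored in the loss
        elif start < span_end and end > span_start:
            labels.append("I" if started else "B")  # token overlaps the span
            started = True
        else:
            labels.append("O")
    return labels


if __name__ == "__main__":
    text = "the ADNI dataset"
    # Hypothetical word-piece split: [CLS] the AD ##NI dataset [SEP]
    offsets = [(0, 0), (0, 3), (4, 6), (6, 8), (9, 16), (0, 0)]
    start = text.index("ADNI")
    print(span_to_token_labels(offsets, start, start + len("ADNI")))
```

In a training pipeline the None entries are typically mapped to an ignore index so that special tokens do not contribute to the loss.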
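Find a notebook that uses Big Bird
Create a Group K-fold CV using categorized cleaned_label values as groups (on train.csv)
- Target: the unprocessed train.csv
- Run Group K-fold with the categorized cleaned_label as the group
- Which variable (column) to use as the target label is undecided (anything will do?)
[Reference: all 130 cleaned_label values from train.csv]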
0 national education longitudinal study
1 noaa tidal station
2 slosh model
3 noaa c cap
4 aging integrated database agid
5 alzheimers disease neuroimaging initiative
6 aging integrated database
7 noaa national water level observation network
8 noaa water level station
9 baltimore longitudinal study of aging blsa
10 national water level observation network
11 arms farm financial and crop production practices
12 beginning postsecondary student
13 noaa sea lake and overland surges from hurricanes
14 noaa tide gauge
15 the national institute on aging genetics of alzheimer s disease data storage site
16 national center for education statistics common core of data
17 national science foundation survey of industrial research and development
18 baccalaureate and beyond
19 noaa international best track archive for climate stewardship
20 agricultural resource management survey
21 national teacher and principal survey
22 international best track archive for climate stewardship
23 nsf higher education research and development survey
24 national science foundation survey of earned doctorates
25 school survey on crime and safety
26 the national institute on aging genetics of alzheimer s disease data storage site niagads
27 national oceanic and atmospheric administration world ocean database
28 beginning postsecondary students longitudinal study
29 nces common core of data
30 program for the international assessment of adult competencies
31 survey of earned doctorates
32 baltimore longitudinal study of aging
33 early childhood longitudinal study
34 adni
35 national science foundation survey of graduate students and postdoctorates in science and engineering
36 trends in international mathematics and science study
37 national oceanic and atmospheric administration c cap
38 nsf survey of earned doctorates
39 noaa tide station
40 education longitudinal study
41 optimum interpolation sea surface temperature
42 national oceanic and atmospheric administration optimum interpolation sea surface temperature
43 alzheimer s disease neuroimaging initiative adni
44 baccalaureate and beyond longitudinal study
45 agricultural resources management survey
46 beginning postsecondary students
47 ibtracs
48 coastal change analysis program
49 survey of graduate students and postdoctorates in science and engineering
50 national assessment of education progress
51 sea surface temperature optimum interpolation
52 high school longitudinal study
53 nsf survey of graduate students and postdoctorates in science and engineering
54 national science foundation survey of doctorate recipients
55 survey of doctorate recipients
56 coastal change analysis program land cover
57 survey of industrial research and development
58 world ocean database
59 rural urban continuum codes
60 noaa optimum interpolation sea surface temperature
61 noaa world ocean database
62 common core of data
63 higher education research and development survey
64 noaa storm surge inundation
65 national weather service nws storm surge risk
66 survey of science and engineering research facilities
67 nsf survey of industrial research and development
68 national science foundation survey of science and engineering research facilities
69 national science foundation higher education research and development survey
70 national center for science and engineering statistics survey of earned doctorates
71 national center for science and engineering statistics survey of science and engineering research facilities
72 national center for science and engineering statistics survey of graduate students and postdoctorates in science and engineering
73 national center for science and engineering statistics survey of doctorate recipients
74 national center for science and engineering statistics survey of industrial research and development
75 national center for science and engineering statistics higher education research and development survey
76 nsf survey of science and engineering research facilities
77 ffrdc research and development survey
78 nsf ffrdc research and development survey
79 survey of state government research and development
80 ncses survey of doctorate recipients
81 ncses survey of graduate students and postdoctorates in science and engineering
82 anss comprehensive earthquake catalog
83 anss comprehensive catalog
84 advanced national seismic system anss comprehensive catalog comcat
85 advanced national seismic system comprehensive catalog
86 census of agriculture
87 usda census of agriculture
88 nass census of agriculture
89 north american breeding bird survey
90 north american breeding bird survey bbs
91 usgs north american breeding bird survey
92 covid 19 open research dataset cord 19
93 covid 19 open research dataset
94 covid open research dataset
95 covid 19 open research data
96 complexity science hub covid 19 control strategies list cccsl
97 complexity science hub covid 19 control strategies list
98 cccsl
99 our world in data covid 19 dataset
100 our world in data covid 19
101 our world in data
102 jh crown registry
103 characterizing health associated risks and your baseline disease in sars cov 2 charybdis
104 characterizing health associated risks and your baseline disease in sars cov 2
105 covid 19 death data
106 sars cov 2 genome sequence
107 sars cov 2 genome sequences
108 covid 19 genome sequence
109 covid 19 genome sequences
110 2019 ncov genome sequence
111 2019 ncov genome sequences
112 sars cov 2 full genome sequence
113 sars cov 2 full genome sequences
114 sars cov 2 complete genome sequence
115 sars cov 2 complete genome sequences
116 2019 ncov complete genome sequences
117 genome sequences of sars cov 2
118 genome sequence of sars cov 2
119 genome sequence of covid 19
120 genome sequences of covid 19
121 genome sequence of 2019 ncov
122 genome sequences of 2019 ncov
123 covid 19 image data collection
124 rsna international covid 19 open radiology database ricord
125 rsna international covid 19 open radiology database
126 rsna international covid open radiology database
127 cas covid 19 antiviral candidate compounds dataset
128 cas covid 19 antiviral candidate compounds data set
129 cas covid 19 antiviral candidate compounds data
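The Group K-fold idea from this issue can be sketched as follows. This is a dependency-free illustration under stated assumptions: the `category_of` mapping is a hypothetical hand-made categorization of a few of the labels listed above, and `group_kfold` is a simple greedy balancer. In practice, scikit-learn's `sklearn.model_selection.GroupKFold` would be the more usual choice.

```python
from collections import Counter


def group_kfold(groups, n_splits=5):
    """Assign a fold index to each row so that no group spans two folds.

    Groups are placed largest-first into the currently smallest fold,
    which keeps fold sizes roughly balanced.
    """
    counts = Counter(groups)
    fold_sizes = [0] * n_splits
    group_to_fold = {}
    for group, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        fold = fold_sizes.index(min(fold_sizes))
        group_to_fold[group] = fold
        fold_sizes[fold] += count
    return [group_to_fold[g] for g in groups]


# Hypothetical categorization: name variants of one dataset share a category.
category_of = {
    "adni": "adni",
    "alzheimer s disease neuroimaging initiative adni": "adni",
    "world ocean database": "wod",
    "noaa world ocean database": "wod",
    "ibtracs": "ibtracs",
}

if __name__ == "__main__":
    cleaned_labels = list(category_of)           # stand-in for the train.csv column
    groups = [category_of[label] for label in cleaned_labels]
    print(group_kfold(groups, n_splits=2))
```

Grouping by category rather than by raw cleaned_label keeps variants of the same dataset name out of both the training and validation side of a fold, so the validation score is less inflated by memorized titles.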
[Submission] How to choose the final submissions
- Two submissions:
  - One is simply the one with the highest Public LB score
  - The other is one that we believe is not overfitting
[Solutions] 1st
[Discussion]
1st place solution: Metric learning and GPT
1st solution: Matching the : Context Similarity via Deep Metric Learning and Beyond
[Code]
[GitHub]
[In a nutshell]
The following two approaches were prepared separately:
- GPT (QA task) + beam search (private LB = 0.565 (0.594 w/o labels obtained by scispaCy's abbreviation detector)) by Khoi Nguyen
- metric learning with usual MLM backbones (private LB = 0.576 (0.588 w/o labels obtained by scispaCy's abbreviation detector)) by Nguyen Quan Anh Minh
[Related knowledge]
The QA approach seems to be failing (NER appears to be working better)
My score is quite bad as well. I suspect the Question Answering approach is not really adapted for the task. Time to move to NER :)
https://www.kaggle.com/theoviel/bert-for-question-answering-baseline-training/data#1273611