Here, the corrupted tokens are produced in generator as fake data. I can understand wh

Deal with the duplicated positions in generator (electra issue, closed)

zheyuye commented on June 22, 2024

Deal with the duplicated positions in generator

Comments (2)

clarkkev commented on June 22, 2024

scatter_nd sums up values with the same index, but we want to just pick a single value per index. If the values for an index are the same (e.g., we are just scattering a mask tensor of 1s), then dividing the summed values by the number of occurrences at that index fixes the issue. That's what is implemented in the code you linked to.
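To make the summing behaviour concrete, here is a minimal sketch in plain Python (no TensorFlow, so it stays self-contained). The helpers `scatter_sum` and `scatter_mean` are hypothetical names for illustration, not functions from the ELECTRA code, and the floor division is an assumption about how the averaging is rounded. It also shows what happens when the duplicated values differ.

```python
def scatter_sum(indices, updates, size):
    """Emulate how scatter_nd accumulates updates into a 1-D output."""
    out = [0] * size
    for i, u in zip(indices, updates):
        out[i] += u  # duplicate indices add up instead of overwriting
    return out


def scatter_mean(indices, updates, size):
    """Divide each summed slot by the number of updates that landed on it."""
    sums = scatter_sum(indices, updates, size)
    counts = scatter_sum(indices, [1] * len(indices), size)
    # Floor division is an assumption made here to keep integer ids.
    return [s // c if c else 0 for s, c in zip(sums, counts)]


positions = [1, 3, 3]  # position 3 was sampled twice

# Scattering a mask of 1s: the duplicate produces a 2 ...
mask = scatter_sum(positions, [1, 1, 1], 5)
# ... and dividing by the occurrence count restores a clean 0/1 mask.
fixed_mask = scatter_mean(positions, [1, 1, 1], 5)

# Scattering different token ids (10 and 20) to the duplicated position:
# the result is their average, an id that was never actually sampled.
averaged_ids = scatter_mean(positions, [7, 10, 20], 5)
```

With identical values the division recovers the intended result, but with different values the slot ends up holding an unrelated id.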

However, this does NOT fully fix duplicate mask positions. It does prevent errors caused by values overflowing above the vocab size, but if (1) the same position is sampled twice and (2) the generator samples different tokens for that position, then the replaced token will be a "random" token obtained by averaging the two sampled token ids. I didn't bother fixing this issue when developing ELECTRA because it occurs for a pretty small fraction of masked positions. But there actually is an easy fix: replace the sampling step (masked_lm_positions = tf.random.categorical... in pretrain_helpers.mask) with

masked_lm_positions = tf.math.top_k(
      sample_logits + sample_gumbel(
          modeling.get_shape_list(sample_logits)), N)[1]

This will do sampling without replacement when picking mask positions.
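For anyone curious why this works: adding independent Gumbel noise to the logits and keeping the top N indices is the Gumbel top-k trick, which draws N distinct indices. Below is a rough plain-Python sketch of the idea, not the ELECTRA implementation. `sample_without_replacement` is a hypothetical helper name, `sample_gumbel` here takes a count and a random generator rather than a shape list, and the clamping constant is an assumption to avoid log(0).

```python
import math
import random


def sample_gumbel(n, rng):
    """Draw n independent Gumbel(0, 1) samples as -log(-log(U))."""
    samples = []
    for _ in range(n):
        # Clamp U away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        samples.append(-math.log(-math.log(u)))
    return samples


def sample_without_replacement(logits, k, rng):
    """Perturb the logits with Gumbel noise and keep the k largest indices."""
    noise = sample_gumbel(len(logits), rng)
    perturbed = [logit + g for logit, g in zip(logits, noise)]
    order = sorted(range(len(logits)), key=lambda i: perturbed[i], reverse=True)
    return order[:k]  # indices are distinct by construction


rng = random.Random(0)
sample_logits = [0.0] * 8  # uniform over 8 candidate positions
masked_lm_positions = sample_without_replacement(sample_logits, 3, rng)
```

Because each index appears only once in the sorted order, the same position can never be picked twice, which removes the averaging problem at its source.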

zheyuye commented on June 22, 2024

That is clear. Thank you for the explanation.
