Review
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Why mask tokens?
Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly “see itself”, and the model could trivially predict the target word in a multi-layered context (Devlin et al., 2018).
Static and dynamic masking
I only noticed this when I was reading the RoBERTa paper.
- Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
The original BERT implementation performed masking once during data preprocessing, resulting in a single static mask. To avoid using the same mask for each training instance in every epoch, training data was duplicated 10 times so that each sequence is masked in 10 different ways over the 40 epochs of training. Thus, each training sequence was seen with the same mask four times during training.
We compare this strategy with dynamic masking where we generate the masking pattern every time we feed a sequence to the model. This becomes crucial when pretraining for more steps or with larger datasets.
Quite a few “explainer” articles claim that dynamic masking means duplicating the data 10 times and masking each copy differently, but according to the original paper and the BERT source code, that is actually what BERT (i.e., static masking) does. RoBERTa’s “dynamic” means the mask is generated only when a sequence is about to be fed to the model (i.e., on the fly).
See create_pretraining_data.py here. The snippet below has been there since BERT's initial release three years ago.
flags.DEFINE_integer(
    "dupe_factor", 10,
    "Number of times to duplicate the input data (with different masks).")

for _ in range(dupe_factor):
  for document_index in range(len(all_documents)):
    instances.extend(
        create_instances_from_document(
            all_documents, document_index, max_seq_length, short_seq_prob,
            masked_lm_prob, max_predictions_per_seq, vocab_words, rng))
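For contrast, RoBERTa-style dynamic masking simply moves this step into the data pipeline, so a fresh mask is sampled every time a batch is built. A minimal sketch (the helper name and arguments are mine; the 15% / 80-10-10 proportions follow the BERT recipe, and handling of special tokens like [CLS]/[SEP]/[PAD] is omitted for brevity):

import torch

def dynamic_mask(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    # call this in the collate function so every batch gets a newly sampled mask
    input_ids = input_ids.clone()
    labels = input_ids.clone()
    chosen = torch.rand(input_ids.shape) < mlm_prob                     # ~15% of positions
    labels[~chosen] = -100                                              # ignored by the MLM loss
    replace = chosen & (torch.rand(input_ids.shape) < 0.8)              # 80% of chosen -> [MASK]
    input_ids[replace] = mask_token_id
    rand_tok = chosen & ~replace & (torch.rand(input_ids.shape) < 0.5)  # 10% -> random token
    input_ids[rand_tok] = torch.randint(vocab_size, (int(rand_tok.sum()),))
    return input_ids, labels                                            # the remaining 10% stay unchanged

Hugging Face's DataCollatorForLanguageModeling does essentially the same thing inside the collator, which is why the same sequence gets a different mask in every epoch.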
- 苏剑林. (2021, Sep 10). 曾被嫌弃的预训练任务 NSP, 做出了优秀的 Zero Shot 效果.
Finetuning
There are two existing strategies for applying pre-trained language representations to downstream tasks (Devlin et al., 2018):
- feature-based: fixed features are extracted from the pretrained model (a minimal sketch follows this list).
- fine-tuning: the whole model is trained on the downstream task by simply fine-tuning all pretrained parameters.
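A sketch of the feature-based route, assuming a Hugging Face BertModel and using the [CLS] vector as the fixed feature (the helper name extract_features is mine):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()  # the encoder stays frozen; no gradients are needed
def extract_features(sentences):
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    hidden = bert(**batch).last_hidden_state   # (batch, seq_len, hidden_size)
    return hidden[:, 0]                        # [CLS] vectors as fixed features

features = extract_features(["BERT is a pretrained encoder.", "Masking is cheap supervision."])
# train any lightweight head (logistic regression, a small MLP, ...) on `features`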
Freezing layers?
Note that if you are used to freezing the body of your pretrained model (like in computer vision) the above may seem a bit strange, as we are directly fine-tuning the whole model without taking any precaution. It actually works better this way for Transformers model (so this is not an oversight on our side). If you’re not familiar with what “freezing the body” of the model means, forget you read this paragraph. (From Hugging Face, Fine-tuning a pretrained model.)
# quick example: freezing the embeddings and the first 4 encoder layers
# (`bert` is a Hugging Face BertModel; `rel_extractor` is the downstream model that wraps it)
import torch

for module in [bert.embeddings, bert.encoder.layer[:4]]:
    for param in module.parameters():
        param.requires_grad = False

# pass only the still-trainable parameters to the optimizer
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, rel_extractor.parameters()))
- Why do we train whole BERT model for fine tuning and not freeze it?
- Pytorch 如何精确的冻结预训练模型的某一层
- A simple experiment: How many layers of my BERT model should I freeze?
Sentence embedding
- 为什么 Bert 中的 CLS 在未 fine tune 时作为 sentence embedding 性能非常糟糕?
- 苏剑林. (2021, Jan 11). 你可能不需要 BERT-flow:一个线性变换媲美 BERT-flow.
- 苏剑林. (2021, Oct 19). 关于 WhiteningBERT 原创性的疑问和沟通.
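The whitening trick discussed in the two posts above amounts to a linear map that makes the covariance of the sentence vectors the identity (optionally keeping only the top-k directions). A minimal sketch with my own variable names and random stand-in data:

import numpy as np

def fit_whitening(embeddings, k=None):
    # embeddings: (N, d) matrix of sentence vectors, e.g. mean-pooled BERT outputs
    mu = embeddings.mean(axis=0, keepdims=True)
    cov = np.cov((embeddings - mu).T)                  # (d, d) covariance
    u, s, _ = np.linalg.svd(cov)
    W = u @ np.diag(1.0 / np.sqrt(s))                  # whitening matrix
    return mu, (W[:, :k] if k is not None else W)      # optionally keep the top-k directions

sentence_vecs = np.random.randn(1000, 768)             # stand-in for real sentence vectors
mu, W = fit_whitening(sentence_vecs, k=256)
whitened = (sentence_vecs - mu) @ W
whitened /= np.linalg.norm(whitened, axis=1, keepdims=True)   # ready for cosine similarity

This is the “linear transformation” that the first post argues can match BERT-flow.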
ALBERT
- Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). ALBERT: A Lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
- Chaudhary, A. (2020, Feb). Visual Paper Summary: ALBERT (A Lite BERT).
- 苏剑林. (2020, Oct 29). 用 ALBERT 和 ELECTRA 之前, 请确认你真的了解它们.
Misc
- Nozhihu. (2021, Feb 18). BERT 你关注不到的点.
- Sun, C., Qiu, X., Xu, Y., & Huang, X. (2019). How to fine-tune BERT for text classification? China National Conference on Chinese Computational Linguistics, 194–206.
- Alan Lee. (2020). BERT 是如何分词的
Todo
- XLNet
- prompt