BERT Review

Review

  • Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv Preprint arXiv:1810.04805.

Why mask tokens?

Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly “see itself”, and the model could trivially predict the target word in a multi-layered context (Devlin, et al., 2018).
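
Concretely, BERT's masked LM objective selects 15% of the input positions for prediction; of those, 80% are replaced with [MASK], 10% with a random token, and 10% are kept unchanged. Below is a toy sketch of that rule, under simplifying assumptions: it samples per token with a Bernoulli draw on string tokens, whereas the real create_pretraining_data.py caps the number of predictions per sequence via max_predictions_per_seq. The mask_tokens helper and its arguments are illustrative, not BERT's actual code.

import random

def mask_tokens(tokens, vocab, mask_prob=0.15, mask_token="[MASK]", rng=random):
    """Toy version of BERT's 80/10/10 masking rule over a list of string tokens."""
    output, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                       # token not selected for prediction
        labels[i] = tok                    # the model must recover this original token
        r = rng.random()
        if r < 0.8:
            output[i] = mask_token         # 80%: replace with [MASK]
        elif r < 0.9:
            output[i] = rng.choice(vocab)  # 10%: replace with a random vocab token
        # remaining 10%: keep the original token unchanged
    return output, labels

masked, labels = mask_tokens("the cat sat on the mat".split(), vocab=["dog", "ran", "tree"])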

Static and dynamic masking

I only noticed this when I read the RoBERTa paper.

  • Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv Preprint arXiv:1907.11692.

The original BERT implementation performed masking once during data preprocessing, resulting in a single static mask. To avoid using the same mask for each training instance in every epoch, training data was duplicated 10 times so that each sequence is masked in 10 different ways over the 40 epochs of training. Thus, each training sequence was seen with the same mask four times during training.

We compare this strategy with dynamic masking where we generate the masking pattern every time we feed a sequence to the model. This becomes crucial when pretraining for more steps or with larger datasets.

Quite a few "explainer" articles claim that dynamic masking means duplicating the data 10 times with different maskings, but according to the original paper and the BERT source code, that is actually what BERT already does. RoBERTa's "dynamic" means the mask is generated only when a sequence is about to be fed to the model (on the fly).

The snippet below, from create_pretraining_data.py, has been there since BERT's initial release three years ago.

flags.DEFINE_integer(
    "dupe_factor", 10,
    "Number of times to duplicate the input data (with different masks).")
for _ in range(dupe_factor):
  for document_index in range(len(all_documents)):
    instances.extend(
      create_instances_from_document(
        all_documents, document_index, max_seq_length, short_seq_prob,
        masked_lm_prob, max_predictions_per_seq, vocab_words, rng))
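
Dynamic masking, by contrast, needs no duplication: the masking step runs at batch-creation time, so the same sequence gets a fresh pattern on every epoch. Here is a hedged sketch using a PyTorch DataLoader collate function and the toy mask_tokens helper from the earlier sketch; the vocab and tokenized_sequences placeholders stand in for the real tokenized corpus and are not RoBERTa's actual code.

from torch.utils.data import DataLoader

# toy stand-ins; in practice these come from the tokenized training corpus
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
tokenized_sequences = [["the", "cat", "sat"], ["the", "dog", "ran"]]

def dynamic_masking_collate(batch):
    # masks are generated here, at batch-creation time, so every pass over
    # the data produces a different masking pattern for the same sequence
    return [mask_tokens(tokens, vocab) for tokens in batch]

loader = DataLoader(tokenized_sequences, batch_size=2, collate_fn=dynamic_masking_collate)

for epoch in range(2):
    for batch in loader:   # fresh masks every epoch
        pass

In practice, Hugging Face's DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15) plays this role and re-masks each batch on the fly.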

Finetuning

There are two existing strategies for applying pre-trained language representations to downstream tasks (Devlin, et al., 2018); a short sketch contrasting them follows the list:

  • feature-based: fixed features are extracted from the pretrained model.
  • fine-tuning: the model is trained on the downstream task by simply fine-tuning all pretrained parameters.
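
A hedged sketch of the contrast in PyTorch / Hugging Face terms; the two-class Linear head and the example sentence are placeholders, not anything from the paper.

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("an example sentence", return_tensors="pt")
classifier = torch.nn.Linear(bert.config.hidden_size, 2)   # placeholder task head

# feature-based: BERT only produces fixed features; no gradients flow into it,
# and only the task head (here, `classifier`) is trained
with torch.no_grad():
    cls_vector = bert(**inputs).last_hidden_state[:, 0]    # [CLS] vector
logits = classifier(cls_vector)

# fine-tuning: all pretrained parameters stay trainable and are updated
# together with the task head by the downstream loss
logits = classifier(bert(**inputs).last_hidden_state[:, 0])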

Freezing layers?

Note that if you are used to freezing the body of your pretrained model (like in computer vision) the above may seem a bit strange, as we are directly fine-tuning the whole model without taking any precaution. It actually works better this way for Transformers model (so this is not an oversight on our side). If you’re not familiar with what “freezing the body” of the model means, forget you read this paragraph. (From the Hugging Face tutorial Fine-tuning a pretrained model.)

# quick example: freezing the embeddings and the first 4 encoder layers
import torch
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-uncased")
for module in [bert.embeddings, bert.encoder.layer[:4]]:
    for param in module.parameters():
        param.requires_grad = False
# pass only the parameters that still require gradients to the optimizer
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, bert.parameters()))

Sentence embedding

Albert

Misc.

  • Nozhihu. (2021, Feb 18). BERT 你关注不到的点.
  • Sun, C., Qiu, X., Xu, Y., & Huang, X. (2019). How to fine-tune bert for text classification?. China National Conference on Chinese Computational Linguistics, 194–206.
  • Alan Lee. (2020). BERT 是如何分词的.

Todo

  • xlnet
  • prompt