Model
To apply a Transformer or self-attention to language modeling, the central problem is how to train the Transformer to effectively encode an arbitrarily long context into a fixed-size representation.
Vanilla Transformer Language Models
Approach:
Split the entire corpus into shorter segments of manageable size, and train the model only within each segment, ignoring all contextual information from previous segments (see the sketch below).
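A minimal sketch of this training-time chunking, not the paper's code: the token stream is cut into fixed-length, non-overlapping segments that are trained on independently. The names `corpus_ids` and `segment_len` are hypothetical.

```python
from typing import List

def make_training_segments(corpus_ids: List[int], segment_len: int) -> List[List[int]]:
    """Split a flat list of token ids into non-overlapping segments.

    Tokens in one segment never attend to tokens in another, so any
    dependency longer than `segment_len` is simply lost.
    """
    segments = []
    for start in range(0, len(corpus_ids) - segment_len + 1, segment_len):
        segments.append(corpus_ids[start:start + segment_len])
    return segments

# Example: a 10-token "corpus" with segment_len = 4 yields two segments;
# the trailing 2 tokens are dropped (or would need padding).
print(make_training_segments(list(range(10)), 4))
# [[0, 1, 2, 3], [4, 5, 6, 7]]
```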
Problems:
- The largest possible dependency length is upper bounded by the segment length, so the vanilla model is not able to fully exploit the Transformer's optimization advantage in capturing long-range dependencies.
- Simply chunking a sequence into fixed-length segments leads to the context fragmentation problem: segments are cut without respecting sentence or semantic boundaries, so the model lacks the context it needs to predict the first tokens of each segment.
- During evaluation, the model predicts only one token at a time, then shifts the segment right by one position. This ensures that each prediction uses the longest possible context exposed during training, and it also relieves the context fragmentation problem encountered in training, but it is extremely expensive (sketched below).
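A rough sketch of this sliding-window evaluation under an assumed model interface: `predict_next` stands in for a real forward pass that returns the next token id given a fixed-size context window.

```python
from typing import Callable, List

def sliding_window_eval(token_ids: List[int],
                        segment_len: int,
                        predict_next: Callable[[List[int]], int]) -> List[int]:
    """Predict each token using the longest context seen during training."""
    predictions = []
    for pos in range(segment_len, len(token_ids)):
        context = token_ids[pos - segment_len:pos]  # window of fixed length
        predictions.append(predict_next(context))   # one token per forward pass
    return predictions

# Toy usage with a dummy "model" that just echoes the last context token.
preds = sliding_window_eval(list(range(20)), segment_len=4,
                            predict_next=lambda ctx: ctx[-1])
```

Because every step re-processes a full segment from scratch just to emit a single token, the cost per predicted token is roughly `segment_len` times that of training, which is why this evaluation scheme is accurate but very slow.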