A language model (LM) is effectively a huge guessing machine: given a sequence of words, e.g. ‘Barack Obama was the president of…’, it uses probability and statistics to guess what comes next, i.e. ‘the United States’. As before, words are assigned vectors, but with language models these representations can also be contextual: the same word can have different vectors depending on the context in which it appears.
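To make the "guessing machine" idea concrete, here is a minimal sketch of a bigram model: it counts which word follows which in a tiny made-up corpus and turns those counts into probabilities. The corpus and the function names are assumptions for illustration only. Models like BERT and GPT-3 use deep neural networks and far more context, not simple counts.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, for illustration only.
corpus = [
    "barack obama was the president of the united states",
    "george washington was the president of the united states",
    "angela merkel was the chancellor of germany",
]

# Count how often each word follows each preceding word.
follow_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follow_counts[prev][nxt] += 1


def next_word_probabilities(prev_word):
    """Return {next_word: probability} estimated from the counts."""
    counts = follow_counts.get(prev_word, Counter())
    total = sum(counts.values())
    if total == 0:
        return {}  # word never seen with a successor
    return {word: count / total for word, count in counts.items()}


def guess_next(prev_word):
    """Return the most probable next word, or None for an unseen word."""
    probs = next_word_probabilities(prev_word)
    if not probs:
        return None
    return max(probs, key=probs.get)


if __name__ == "__main__":
    print(next_word_probabilities("of"))
    print(guess_next("of"))
```

In this toy corpus "of" is followed by "the" twice and by "germany" once, so the model guesses "the". A real language model makes the same kind of guess, but conditions on the whole preceding text rather than a single word.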
The more recent and popular language models, like BERT and GPT-3, use deep learning under the hood. Much of their wide appeal comes from transfer learning: a single pre-trained model can be used for a variety of downstream tasks. Recall that transfer learning involves two stages: pre-training a model on a large collection of data, and then fine-tuning the model for a particular task with annotated data.
Language models are typically pre-trained on a huge corpus of text, which makes pre-training very expensive. To give a sense of scale, the largest GPT-3 model has 175 billion parameters, and its training data was drawn from datasets totalling roughly 499 billion tokens, at a reported cost of $4.6 million on the cheapest cloud hardware. A token is a linguistic unit like a word or a sub-word.
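As a rough illustration of what "word or sub-word" tokens look like, the sketch below splits words into the longest pieces found in a small vocabulary, working left to right. The vocabulary, the `[UNK]` marker and the function names are hypothetical. Real tokenizers learn vocabularies of tens of thousands of entries from data and differ in detail, so this is not the tokenizer used by BERT or GPT-3.

```python
# Hypothetical toy vocabulary; real models learn theirs from data.
VOCAB = {"un", "law", "ful", "ly", "court", "s", "the"}


def tokenize_word(word, vocab=VOCAB):
    """Split one word into the longest vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        # Shrink the candidate piece until it is in the vocabulary.
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece matches: treat the whole word as unknown.
            return ["[UNK]"]
        tokens.append(word[start:end])
        start = end
    return tokens


def tokenize(text, vocab=VOCAB):
    """Lowercase, split on whitespace, then split each word into pieces."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(tokenize_word(word, vocab))
    return tokens


if __name__ == "__main__":
    print(tokenize("The courts"))
    print(tokenize("unlawfully"))
```

Here "courts" becomes two tokens ("court" and "s") and "unlawfully" becomes four, which is why token counts are usually higher than word counts.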
During pre-training, the model learns general patterns and features of language, such as syntax, morphology and idioms. Once pre-trained, the model can be fine-tuned with additional annotated data for a number of different tasks, like NER, classification and question answering.
The revolutionary change with transfer learning is that the volume of annotated data required to carry out these tasks is significantly reduced. Training a NER model from scratch typically requires thousands of annotated examples for good performance; with a pre-trained language model, tens to hundreds of annotated examples can be enough to reach a similar level of performance.
Language models have thrown the door wide open for practical applications of Legal NLP.
Resources