The Bag of Words (BoW) approach focuses on building a vocabulary of words and simply counting how many times words in the vocabulary appear in a piece of text.
It’s a way of representing a passage of text as a list of numbers that track how often each word in the vocabulary is mentioned. If a word like ‘classroom’ appears 15 times and the word ‘beach’ appears only once in a passage of text, it’s a good indication that the text has something to do with school rather than travelling.
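As a minimal sketch of this counting step, the Python below turns a passage into a list of counts, one per vocabulary word. The function names `tokenize` and `bag_of_words`, the three-word vocabulary and the sample passage are all illustrative assumptions, and the regex tokenizer is a deliberately crude stand-in for a real one.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude, assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokenize(text))
    # Counter returns 0 for words that never appear in the text.
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary and passage, chosen only for illustration.
vocabulary = ["classroom", "beach", "teacher"]
passage = "The teacher walked into the classroom. The classroom was quiet."

vector = bag_of_words(passage, vocabulary)
print(vector)  # [2, 0, 1]
```

The position of each number matters: the first entry always refers to ‘classroom’, the second to ‘beach’ and so on, which is what lets two passages be compared entry by entry.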
Rather than just tracking the usage of individual words, the BoW approach can also be used to track pairs of words, triplets of words, quadruplets of words or more generally n-grams of words. This is useful since it allows phrases or nouns spanning multiple words like ‘civil rights’ or ‘Great Britain’ to be tracked.
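To make the n-gram idea concrete, here is a small sketch that counts pairs of consecutive words (bigrams). The helper names `tokenize` and `ngrams` and the sample sentence are assumptions made for illustration; lowercasing means ‘Great Britain’ is tracked as the pair ('great', 'britain').

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude, assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sentence, chosen so that one phrase repeats.
text = "The civil rights movement changed civil rights law in Great Britain."

bigram_counts = Counter(ngrams(tokenize(text), 2))
print(bigram_counts[("civil", "rights")])   # 2
print(bigram_counts[("great", "britain")])  # 1
```

Setting `n` to 3 or 4 gives triplets or quadruplets in the same way, at the cost of a vocabulary that grows quickly as `n` increases.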
The appeal of traditional methods like the BoW approach is that they are simple, intuitive and easy to understand. The BoW representation allows a passage of text to be quantified, so it opens the door to using statistics to compare different passages of text based on the words they contain and how often those words are used.
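One way such a comparison might look is sketched below, using cosine similarity between word-count vectors. Cosine similarity is just one common choice and is an assumption here, not something the BoW approach prescribes; the function names and the three sample sentences are also made up for illustration.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Count how often each lowercase word token appears in the text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors (0.0 if either is empty)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical passages: two about school, one about travel.
school_1 = word_counts("The teacher asked the class to open the book")
school_2 = word_counts("The class read the book with the teacher")
travel = word_counts("We swam at the beach and slept at the hotel")

# The two school passages score higher with each other than with the travel one.
print(cosine_similarity(school_1, school_2) > cosine_similarity(school_1, travel))  # True
```

Note that a very common word like ‘the’ contributes to every score here, which is one reason practical systems often remove or down-weight such words before comparing passages.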
The BoW approach does have drawbacks. It doesn’t take word order into account, so it ends up throwing away useful information encoded in the way a sentence is structured.
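The loss of word order is easy to demonstrate. In the sketch below (the helper `word_counts` and both sentences are illustrative assumptions), two sentences with opposite meanings end up with exactly the same word counts.

```python
import re
from collections import Counter

def word_counts(text):
    """Count how often each lowercase word token appears in the text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

# Same words, different order, opposite meaning.
sentence_a = "The dog bit the man"
sentence_b = "The man bit the dog"

same_bag = word_counts(sentence_a) == word_counts(sentence_b)
print(same_bag)  # True
```

Counting n-grams recovers some of this local ordering, since ('dog', 'bit') and ('man', 'bit') are different bigrams, but anything beyond the n-gram window is still lost.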