LLM101n self-learning notes

  • Chapter 01 Bigram Language Model (language modeling)

What is a Language Model?

A language model is a probabilistic model used to predict the likelihood of a sequence of words in a language. Language models are crucial in various natural language processing (NLP) tasks such as speech recognition, machine translation, text generation, and more. They help machines understand and generate human language in a way that is coherent and contextually appropriate.

Why Are Language Models Necessary?

Language models are needed to help machines make sense of language, which is inherently complex and context-dependent. For instance, given a sequence of words, a language model can predict the next word or assess the plausibility of a sentence. This ability is fundamental to tasks like autocomplete in search engines, where the system suggests the next word or phrase based on what has been typed so far.

Role of Language Models as Word-Level Probability Models

At the core of a language model is the idea of predicting the probability of a word given the preceding words. This can be formalized as follows:

  • Unigram Model: In the simplest form, the model considers each word independently. The probability of a sentence is the product of the probabilities of each word occurring independently of others. However, this model ignores any context or dependency between words, which is unrealistic for natural language.
  • Bigram Model: The bigram model, which we will focus on, improves upon the unigram model by considering the probability of each word given the previous word. This adds a layer of context and allows for more accurate predictions.
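Formally, a unigram model scores a sentence $w_1, w_2, \ldots, w_n$ as $\prod_{i=1}^{n} P(w_i)$, whereas the bigram model conditions each word on its predecessor. This is the standard factorization, written here in LaTeX:

$$P(w_1, w_2, \ldots, w_n) \approx P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1})$$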

Bigram Model

What is a Bigram?

A bigram is a pair of consecutive words in a sequence. For example, in the sentence “The cat sat on the mat,” the bigrams are:

  • “The cat”
  • “cat sat”
  • “sat on”
  • “on the”
  • “the mat”

Each bigram represents a transition from one word to the next, and the bigram model uses these transitions to predict the likelihood of a word following another word.
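As a tiny aside (separate from the fuller implementation later in this chapter), these pairs can be produced in Python by zipping a token list with itself shifted by one position:

sentence = "The cat sat on the mat".split()
bigrams = list(zip(sentence, sentence[1:]))
print(bigrams)
# [('The', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]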

Mathematical Representation of a Bigram Model

The bigram model estimates the probability of a word $w_i$ occurring given the previous word $w_{i-1}$. This can be represented mathematically as:

$$P(w_i \mid w_{i-1}) = \frac{\text{Count}(w_{i-1}, w_i)}{\text{Count}(w_{i-1})}$$

Where:

  • $P(w_i \mid w_{i-1})$ is the probability of word $w_i$ given word $w_{i-1}$.
  • $\text{Count}(w_{i-1}, w_i)$ is the number of times the bigram $(w_{i-1}, w_i)$ occurs in the corpus.
  • $\text{Count}(w_{i-1})$ is the number of times the word $w_{i-1}$ occurs in the corpus.
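As a quick worked example: in the sentence “the cat sat on the mat”, the word “the” occurs twice and the bigram (“the”, “cat”) occurs once, so the model estimates

$$P(\text{cat} \mid \text{the}) = \frac{\text{Count}(\text{the}, \text{cat})}{\text{Count}(\text{the})} = \frac{1}{2} = 0.5$$

Likewise $P(\text{mat} \mid \text{the}) = 0.5$, while $P(\text{sat} \mid \text{cat}) = 1$ because “cat” is always followed by “sat” in this tiny corpus.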

Learning a Bigram Model

To create a bigram model, the following steps are generally followed:

  1. Counting Frequencies: First, a large corpus of text is used to count how often each bigram occurs. This gives a frequency distribution of bigrams in the language.
  2. Normalizing to Get Probabilities: The counts are then normalized to obtain probabilities. For each word $w_{i-1}$, the probability distribution over the next word $w_i$ is calculated.
  3. Handling Unknown Bigrams: In any realistic corpus, there will be bigrams that do not occur or are rare. Techniques like Laplace Smoothing or Good-Turing Smoothing are used to assign non-zero probabilities to unseen bigrams.
  • Laplace Smoothing: Adds a small positive value (often 1) to each count to ensure that no bigram has a probability of zero (see the sketch after this list).
  • Good-Turing Smoothing: Adjusts the estimated probabilities by accounting for the frequency of bigrams that occur only once.
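A minimal sketch of add-one (Laplace) smoothing in Python, assuming bigram and unigram count dictionaries like the ones built in the implementation section below; the function name and the vocab_size parameter are illustrative, not taken from any particular library:

# Add-one (Laplace) smoothing: pretend every possible bigram was seen one extra time.
# P(w2 | w1) = (Count(w1, w2) + 1) / (Count(w1) + V), where V is the vocabulary size.
def laplace_bigram_prob(w1, w2, bigram_freq, unigram_freq, vocab_size):
    return (bigram_freq.get((w1, w2), 0) + 1) / (unigram_freq.get(w1, 0) + vocab_size)

With the toy corpus used below, an unseen bigram such as (“cat”, “mat”) then receives a small non-zero probability instead of zero.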

Implementation and Application

Implementing a Bigram Model in Python

Here’s a simple example of implementing a bigram model in Python:


from collections import defaultdict

# Sample corpus
corpus = "the cat sat on the mat".split()

# Bigram and unigram frequency dictionaries
bigram_freq = defaultdict(int)
unigram_freq = defaultdict(int)

# Count the frequencies
for i in range(len(corpus) - 1):
    bigram_freq[(corpus[i], corpus[i + 1])] += 1
    unigram_freq[corpus[i]] += 1
unigram_freq[corpus[-1]] += 1  # Count the last word as well

# Calculate conditional probabilities: P(w2 | w1) = Count(w1, w2) / Count(w1)
bigram_prob = {}
for (w1, w2), freq in bigram_freq.items():
    bigram_prob[(w1, w2)] = freq / unigram_freq[w1]

# Example prediction: return the most probable word following `word`
def predict_next_word(word):
    candidates = {w2: prob for (w1, w2), prob in bigram_prob.items() if w1 == word}
    if not candidates:
        return None  # `word` never starts a bigram in the corpus
    return max(candidates, key=candidates.get)

# Predict the next word after "the"
print(predict_next_word("the"))
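On this toy corpus, “cat” and “mat” each follow “the” with probability 0.5; because Python dictionaries preserve insertion order and max keeps the first of the tied candidates, the script prints cat.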

Example of Generated Text

Using the model, you can generate text by predicting the next word based on the previous one:

  1. Start with a word, say “the”.
  2. Predict the next word using the bigram probabilities.
  3. Use the new word as the input to predict the next word, and so on.

Generated Text Example:

  • Start: “the”
  • Next word: “cat” (predicted based on highest bigram probability)
  • Continue: “cat sat on the mat”

This process can be repeated to generate a sequence of words.
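One way to put steps 1 to 3 into code is a short sampling loop over the bigram_prob dictionary built in the implementation above. Sampling the next word in proportion to its probability (rather than always taking the single most likely word) is a choice made here for illustration, so repeated runs can produce different sequences:

import random

def generate_text(start_word, num_words=5):
    # Generate a sequence by sampling each next word from the bigram distribution.
    words = [start_word]
    for _ in range(num_words):
        candidates = {w2: p for (w1, w2), p in bigram_prob.items() if w1 == words[-1]}
        if not candidates:
            break  # No known continuation for the current word
        next_word = random.choices(list(candidates), weights=list(candidates.values()))[0]
        words.append(next_word)
    return " ".join(words)

print(generate_text("the"))  # e.g. "the cat sat on the mat" (output varies between runs)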

Evaluation of the Generated Text

The quality of the generated text can be evaluated based on factors like:

  • Coherence: How well the text flows and makes sense as a whole.
  • Grammaticality: Whether the generated text adheres to the rules of grammar.

Limitations and Challenges

Limitations of the Bigram Model

  • Context Ignorance: The bigram model only considers one preceding word, which can lead to poor predictions in contexts requiring longer dependencies. Example: In the sentence “I saw a man with a telescope,” a bigram model might incorrectly associate “man” with “telescope,” missing the fact that “with a telescope” modifies “saw.”
  • Sparse Data Problem: Even in large corpora, many possible bigrams might not appear, leading to challenges in predicting new or rare word combinations.
  • Assumption of Markov Property: The model assumes that the current word only depends on the previous word, which is often not true in natural language where a word can depend on words much earlier in the sentence.

Towards More Advanced Models

  • Trigram and N-gram Models: To capture more context, trigrams (three-word sequences) or N-grams (sequences of N words) can be used, but they also increase the complexity and sparsity of the data (a small counting sketch follows this list).
  • Neural Network-Based Models: Modern approaches use neural networks, such as Recurrent Neural Networks (RNNs) or Transformer models, which can learn dependencies over much longer sequences and handle the limitations of traditional N-gram models.
  • Pre-trained Language Models: Models like GPT (Generative Pre-trained Transformer) leverage large-scale pre-training on diverse datasets to understand and generate text with nuanced context and meaning.
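As a rough sketch of how the bigram counting code above extends to trigrams, again using the toy corpus (the variable names are illustrative):

from collections import defaultdict

corpus = "the cat sat on the mat".split()

# Trigram counts and the two-word "context" counts used as denominators:
# P(w3 | w1, w2) = Count(w1, w2, w3) / Count(w1, w2)
trigram_freq = defaultdict(int)
context_freq = defaultdict(int)

for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    trigram_freq[(w1, w2, w3)] += 1
    context_freq[(w1, w2)] += 1

trigram_prob = {
    (w1, w2, w3): freq / context_freq[(w1, w2)]
    for (w1, w2, w3), freq in trigram_freq.items()
}

print(trigram_prob[("the", "cat", "sat")])  # 1.0 on this tiny corpus

The number of possible contexts grows with each extra word of history, which is exactly the sparsity problem mentioned above.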

Conclusion

The bigram model is a foundational concept in language modeling that helps in understanding the transition probabilities between words. While it introduces context sensitivity beyond unigram models, it is still limited by its simplistic assumption of word dependencies. Advanced models like trigrams, neural networks, and transformer-based architectures address these limitations by capturing more extensive context and handling the complexity of natural language more effectively.