NLP for LLMs
Converting raw text data into a format suitable for training a Large Language Model (LLM) is a
complex, multi-stage process, often called the data ingestion (or preprocessing) pipeline.
The goal is to transform messy, real-world data into a standardized, numerical sequence that the
Transformer architecture (the backbone of modern LLMs) can process efficiently.
- Stage 1: Data Collection and Cleaning
The quality of the training data directly determines the intelligence and coherence of
the resulting LLM.
- Collection - massive datasets, e.g., the internet, digitized books, code repositories,
databases.
- Cleaning and Filtering
- Deduplication - remove duplicate or near-duplicate text so the model does not overfit to
(memorize) repeated passages (a hashing sketch follows this list).
- Content filtering - remove harmful, biased, or explicit content using classifiers
(smaller, pre-trained models).
- Quality filtering - remove low-quality content, e.g., boilerplate, template text,
machine-translated text, documents with poor grammar or excessive errors.
- Formatting normalization - strip HTML/XML tags, fix inconsistent encodings,
unify quotation marks and dashes.
- Metadata tagging - attach labels (e.g., source domain, language, code type) to
enable fine-tuning or analysis later.
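A minimal sketch of the exact-deduplication step described above, assuming a simple
normalize-and-hash approach; the function names and normalization rules are illustrative, and
production pipelines usually add fuzzy (near-duplicate) detection such as MinHash on top.

import hashlib
import re

def normalize(text):
    # Lowercase and collapse whitespace so trivially different copies hash identically.
    return re.sub(r"\s+", " ", text.strip().lower())

def deduplicate(documents):
    # Exact-match deduplication: keep only the first occurrence of each normalized document.
    seen = set()
    unique_docs = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique_docs.append(doc)
    return unique_docs

docs = ["The model is unfriendly.", "The  MODEL is unfriendly.", "A different document."]
print(deduplicate(docs))  # the second, near-identical copy is dropped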
- Stage 2: Tokenization
Tokenization is the process of breaking down raw text into smaller, meaningful units called
tokens.
- Subword tokenization (the LLM standard), most commonly Byte Pair Encoding or BPE
- Handles rare words - avoids a massive vocabulary by breaking words into known
subwords from a pre-built "tokenizer vocabulary", e.g., "unfriendly"
might become ['un', 'friend', 'ly'].
- Manages vocabulary size - keeps the vocabulary size manageable (typically 50,000 to
100,000 unique tokens) while retaining the ability to represent virtually any word.
- Efficiency - more computationally efficient than character-level tokenization for
long texts.
- The output - a sequence of tokens (a toy splitting sketch follows this list), e.g.,
Input: "The model is unfriendly."
Output: 'The', 'model', 'is', 'un', 'friend', 'ly', '.'
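A toy sketch of subword splitting that reproduces the example above; the hard-coded vocabulary
and the greedy longest-match rule are simplifications for illustration, since a real BPE
tokenizer learns its vocabulary and merge rules from the training corpus.

VOCAB = {"The", "model", "is", "un", "friend", "ly", "."}

def split_word(word):
    # Greedily take the longest vocabulary entry that prefixes the remaining characters.
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No known subword matches: fall back to a single character (a real tokenizer
            # would emit a byte-level or <unk> token here).
            pieces.append(word[i])
            i += 1
    return pieces

def tokenize(text):
    # Split on whitespace, peel off trailing punctuation, then split each word into subwords.
    tokens = []
    for word in text.split():
        if len(word) > 1 and word[-1] in ".,!?":
            tokens += split_word(word[:-1]) + [word[-1]]
        else:
            tokens += split_word(word)
    return tokens

print(tokenize("The model is unfriendly."))
# ['The', 'model', 'is', 'un', 'friend', 'ly', '.']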
- Stage 3: Numericalization and Encoding
Tokens are converted into numerical tensors (multi-dimensional arrays) that can be processed
by a GPU.
- Each token is assigned an ID, e.g.,
Input: 'The', 'model', 'is', 'un', 'friend', 'ly', '.'
Output: 12, 573, 23, 891, 229, 48, 5
- Positional encoding - transformers need information about word order
- Positional encodings are numerical vectors added to the token embeddings (the learned
vectors that the token IDs are mapped to), not to the raw IDs themselves.
- These vectors encode the position of each token in the sequence (see the sketch after
this list).
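A minimal sketch of numericalization plus positional encoding, assuming a made-up token-to-ID
table, a random placeholder embedding matrix, and the sinusoidal encoding scheme from the
original Transformer paper; real models learn the embedding table during training.

import numpy as np

# Made-up token-to-ID table; a real tokenizer ships one with ~50,000-100,000 entries.
TOKEN_TO_ID = {"The": 12, "model": 573, "is": 23, "un": 891, "friend": 229, "ly": 48, ".": 5}

tokens = ["The", "model", "is", "un", "friend", "ly", "."]
ids = np.array([TOKEN_TO_ID[t] for t in tokens])    # numericalization
print(ids)                                          # [ 12 573  23 891 229  48   5]

d_model = 16                                        # tiny embedding width, for illustration
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(1000, d_model))  # placeholder for a learned embedding matrix
x = embedding_table[ids]                            # (seq_len, d_model) token embeddings

# Sinusoidal positional encodings (Vaswani et al., 2017): one vector per position,
# added to the token embeddings, not to the raw IDs.
pos = np.arange(len(ids))[:, None]                  # positions 0 .. seq_len-1
dim = np.arange(0, d_model, 2)[None, :]
angle = pos / (10000 ** (dim / d_model))
pe = np.zeros((len(ids), d_model))
pe[:, 0::2] = np.sin(angle)
pe[:, 1::2] = np.cos(angle)

model_input = x + pe                                # what the Transformer layers actually see
print(model_input.shape)                            # (7, 16)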
- Stage 4: Batching and Padding
Prepares the numerical data for efficient processing on GPUs.
- Chunking - the stream of token IDs is broken into fixed-length segments, e.g., 2048 or 4096
tokens, representing the maximum context window of the model.
- Padding - Not all sequences will be the exact maximum length.
Shorter sequences are extended with special padding tokens, e.g., ID 0, until they all
have the same length.
- Batch creation
- The fixed-length sequences are then grouped together into batches, e.g., 64 or 128
sequences per batch.
- The LLM processes every sequence within a batch simultaneously and calculates the
loss (error) for the entire batch.
- This parallel processing is a key reason modern LLM training runs so fast on powerful GPU
hardware (a toy end-to-end sketch of chunking, padding, and batching follows this list).
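A toy end-to-end sketch of chunking, padding, and batch creation; the context length, padding
ID, and batch size below are illustrative values, not fixed standards.

CONTEXT_LEN = 8   # real models use e.g. 2048 or 4096
PAD_ID = 0
BATCH_SIZE = 2

def chunk(ids, size):
    # Break the continuous ID stream into fixed-length segments.
    return [ids[i:i + size] for i in range(0, len(ids), size)]

def pad(seq, size):
    # Extend short sequences with the padding token so every row has the same length.
    return seq + [PAD_ID] * (size - len(seq))

def batches(seqs, batch_size):
    # Group fixed-length sequences so the GPU can process a whole batch in parallel.
    return [seqs[i:i + batch_size] for i in range(0, len(seqs), batch_size)]

stream = [12, 573, 23, 891, 229, 48, 5, 7, 99, 101, 42, 17, 3]
padded = [pad(c, CONTEXT_LEN) for c in chunk(stream, CONTEXT_LEN)]
for batch in batches(padded, BATCH_SIZE):
    print(batch)
# [[12, 573, 23, 891, 229, 48, 5, 7], [99, 101, 42, 17, 3, 0, 0, 0]]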
Exam Style Questions