NLP for LLMs
Converting raw text data into a format suitable for training a Large Language Model (LLM) is a
complex, multi-stage process, often called the data ingestion (or preprocessing) pipeline.
The goal is to transform messy, real-world data into a standardized, numerical sequence that the
Transformer architecture (the backbone of modern LLMs) can process efficiently.
- Stage 1: Data Collection and Cleaning
The quality of the training data directly determines the intelligence and coherence of
the resulting LLM.
- Collection - massive datasets, e.g., the internet, digitized books, code repositories,
databases.
- Cleaning and Filtering
- Deduplication - remove duplicate or near-duplicate text so the model does not overfit to
(memorize) repeated passages (a hashing sketch follows this list).
- Content filtering - remove harmful, biased, or explicit content using classifiers
(smaller, pre-trained models).
- Quality filtering - remove low-quality content, e.g., boilerplate, template text,
machine-translated text, documents with poor grammar or excessive errors.
- Formatting normalization - strip HTML/XML tags, fix inconsistent encodings,
unify quotation marks and dashes.
- Metadata tagging - attach labels (e.g., source domain, language, code type) to
enable fine-tuning or analysis later.
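A minimal sketch of the exact-deduplication step described above, assuming a simple
normalize-and-hash approach; the function names and normalization rules are illustrative, and
production pipelines usually add fuzzy (near-duplicate) detection such as MinHash on top.

import hashlib
import re

def normalize(text):
    # Lowercase and collapse whitespace so trivially different copies hash identically.
    return re.sub(r"\s+", " ", text.strip().lower())

def deduplicate(documents):
    # Exact-match deduplication: keep only the first occurrence of each normalized document.
    seen = set()
    unique_docs = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique_docs.append(doc)
    return unique_docs

docs = ["The model is unfriendly.", "The  MODEL is unfriendly.", "A different document."]
print(deduplicate(docs))  # the second, near-identical copy is dropped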
- Stage 2: Tokenization
Tokenization is the process of breaking down raw text into smaller, meaningful units called
tokens.
- Subword tokenization (the LLM standard), most commonly Byte Pair Encoding or BPE
- Handles rare words - avoids a massive vocabulary by breaking words into known
subwords from a pre-built "tokenizer vocabulary", e.g., "unfriendly"
might become ['un', 'friend', 'ly'].
- Manages vocabulary size - keeps the vocabulary size manageable (typically 50,000 to
100,000 unique tokens) while retaining the ability to represent virtually any word.
- Efficiency - more computationally efficient than character-level tokenization for
long texts.
- The output - a sequence of tokens (a toy splitting sketch follows this list), e.g.,
Input: "The model is unfriendly."
Output: 'The', 'model', 'is', 'un', 'friend', 'ly', '.'
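A toy sketch of subword splitting that reproduces the example above; the hard-coded vocabulary
and the greedy longest-match rule are simplifications for illustration, since a real BPE
tokenizer learns its vocabulary and merge rules from the training corpus.

VOCAB = {"The", "model", "is", "un", "friend", "ly", "."}

def split_word(word):
    # Greedily take the longest vocabulary entry that prefixes the remaining characters.
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No known subword matches: fall back to a single character (a real tokenizer
            # would emit a byte-level or <unk> token here).
            pieces.append(word[i])
            i += 1
    return pieces

def tokenize(text):
    # Split on whitespace, peel off trailing punctuation, then split each word into subwords.
    tokens = []
    for word in text.split():
        if len(word) > 1 and word[-1] in ".,!?":
            tokens += split_word(word[:-1]) + [word[-1]]
        else:
            tokens += split_word(word)
    return tokens

print(tokenize("The model is unfriendly."))
# ['The', 'model', 'is', 'un', 'friend', 'ly', '.']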
- Stage 3: Numericalization and Encoding
Tokens are converted into numerical tensors (multi-dimensional arrays) that can be processed
by a GPU.
- Each token is assigned an ID, e.g.,
Input: 'The', 'model', 'is', 'un', 'friend', 'ly', '.'
Output: 12, 573, 23, 891, 229, 48, 5
- Positional encoding - transformers need information about word order
- Positional encodings are numerical vectors added to the token embeddings (the learned
vectors that the token IDs are mapped to), not to the raw IDs themselves.
- These vectors encode the position of each token in the sequence (see the sketch after
this list).
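A minimal sketch of numericalization plus positional encoding, assuming a made-up token-to-ID
table, a random placeholder embedding matrix, and the sinusoidal encoding scheme from the
original Transformer paper; real models learn the embedding table during training.

import numpy as np

# Made-up token-to-ID table; a real tokenizer ships one with ~50,000-100,000 entries.
TOKEN_TO_ID = {"The": 12, "model": 573, "is": 23, "un": 891, "friend": 229, "ly": 48, ".": 5}

tokens = ["The", "model", "is", "un", "friend", "ly", "."]
ids = np.array([TOKEN_TO_ID[t] for t in tokens])    # numericalization
print(ids)                                          # [ 12 573  23 891 229  48   5]

d_model = 16                                        # tiny embedding width, for illustration
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(1000, d_model))  # placeholder for a learned embedding matrix
x = embedding_table[ids]                            # (seq_len, d_model) token embeddings

# Sinusoidal positional encodings (Vaswani et al., 2017): one vector per position,
# added to the token embeddings, not to the raw IDs.
pos = np.arange(len(ids))[:, None]                  # positions 0 .. seq_len-1
dim = np.arange(0, d_model, 2)[None, :]
angle = pos / (10000 ** (dim / d_model))
pe = np.zeros((len(ids), d_model))
pe[:, 0::2] = np.sin(angle)
pe[:, 1::2] = np.cos(angle)

model_input = x + pe                                # what the Transformer layers actually see
print(model_input.shape)                            # (7, 16)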
- Stage 4: Batching and Padding
Prepares the numerical data for efficient processing on GPUs.
- Chunking - the stream of token IDs is broken into fixed-length segments, e.g., 2048 or 4096
tokens, representing the maximum context window of the model.
- Padding - Not all sequences will be the exact maximum length.
Shorter sequences are extended with special padding tokens, e.g., ID 0, until they all
have the same length.
- Batch creation
- The fixed-length sequences are then grouped together into batches, e.g., 64 or 128
sequences per batch.
- The LLM processes every sequence within a batch simultaneously and calculates the
loss (error) for the entire batch.
- This parallel processing is a key reason modern LLM training runs so fast on powerful GPU
hardware (a toy end-to-end sketch of chunking, padding, and batching follows this list).
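A toy end-to-end sketch of chunking, padding, and batch creation; the context length, padding
ID, and batch size below are illustrative values, not fixed standards.

CONTEXT_LEN = 8   # real models use e.g. 2048 or 4096
PAD_ID = 0
BATCH_SIZE = 2

def chunk(ids, size):
    # Break the continuous ID stream into fixed-length segments.
    return [ids[i:i + size] for i in range(0, len(ids), size)]

def pad(seq, size):
    # Extend short sequences with the padding token so every row has the same length.
    return seq + [PAD_ID] * (size - len(seq))

def batches(seqs, batch_size):
    # Group fixed-length sequences so the GPU can process a whole batch in parallel.
    return [seqs[i:i + batch_size] for i in range(0, len(seqs), batch_size)]

stream = [12, 573, 23, 891, 229, 48, 5, 7, 99, 101, 42, 17, 3]
padded = [pad(c, CONTEXT_LEN) for c in chunk(stream, CONTEXT_LEN)]
for batch in batches(padded, BATCH_SIZE):
    print(batch)
# [[12, 573, 23, 891, 229, 48, 5, 7], [99, 101, 42, 17, 3, 0, 0, 0]]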
Exam Style Questions