================================================================================ LLM KNOWLEDGE BASE — COMPREHENSIVE REFERENCE DOCUMENT v3.1 Generated for: REGITE Website Testing Suite Author: Dennis Binoy | Channel: REGITE Date: 2025-05-18 Total Tokens (approx): 42,000 ================================================================================ TABLE OF CONTENTS ───────────────── 1. Introduction to Large Language Models 2. History and Evolution of LLMs 3. Architecture Deep Dive — Transformer 4. Pre-training and Fine-tuning 5. Major LLM Families and Comparisons 6. Prompting Techniques 7. Evaluation Metrics 8. Safety and Alignment 9. Retrieval-Augmented Generation (RAG) 10. Agents and Tool Use 11. Multimodal Models 12. LLM APIs and Deployment 13. Hardware and Compute 14. Cost and Efficiency 15. Future of LLMs 16. Glossary of Terms 17. Code Examples 18. Research Paper Summaries 19. Community and Resources 20. Appendix — Raw Benchmark Data ================================================================================ SECTION 1 — INTRODUCTION TO LARGE LANGUAGE MODELS ================================================================================ A Large Language Model (LLM) is a type of artificial intelligence system trained on massive corpora of text data using self-supervised learning objectives. The primary goal is to learn the statistical structure of human language well enough to generate coherent, contextually appropriate text — and increasingly, to reason, plan, and act. LLMs are characterized by: • Massive parameter counts (billions to trillions) • Pre-training on internet-scale datasets • Emergent capabilities not explicitly programmed • General-purpose applicability across domains • In-context learning without gradient updates The name "large" refers primarily to the number of trainable parameters — mathematical weights stored in the model's neural network layers. A model like GPT-3 has 175 billion parameters. GPT-4 is estimated at over 1 trillion (MoE). Claude 3 Opus, Gemini Ultra, and Llama 3 405B are all in a similar tier. Despite the term "language model," modern LLMs can handle: - Code generation and debugging - Mathematical reasoning - Image understanding (multimodal) - Audio transcription (with adapter layers) - Video understanding (in research) - Tool/function calling - Autonomous agent behavior The fundamental task an LLM learns is next-token prediction. Given a sequence of tokens (sub-word units), predict the probability distribution of the next token. Repeat this autoregressively to generate text. P(token_n | token_1, token_2, ..., token_{n-1}) This deceptively simple objective, when scaled to massive data and compute, gives rise to remarkably sophisticated behavior. ================================================================================ SECTION 2 — HISTORY AND EVOLUTION OF LLMs ================================================================================ 2.1 EARLY LANGUAGE MODELING (1990s–2010s) ────────────────────────────────────────── Statistical language models dominated the field before deep learning: • N-gram Models (1990s): Count-based models that estimate P(word | context) using Markov assumptions. Practical but brittle beyond n=5. • Hidden Markov Models (HMMs): Probabilistic graphical models for sequential data. Widely used in speech recognition. • Word2Vec (2013, Mikolov et al.): Neural word embeddings. Trained skip-gram and CBOW objectives. Showed semantic arithmetic: king - man + woman ≈ queen. • GloVe (2014, Pennington et al.): Global vectors for word representation. Combined global matrix factorization with local context window methods. • ELMo (2018, Peters et al.): Embeddings from Language Models. Introduced context-dependent word representations using bidirectional LSTM. 2.2 THE TRANSFORMER ERA (2017–2019) ──────────────────────────────────── "Attention Is All You Need" — Vaswani et al., 2017 (Google Brain) This paper changed everything. The Transformer architecture replaced recurrence with self-attention, enabling massive parallelization during training. Key milestones: BERT (2018, Google): • Bidirectional Encoder Representations from Transformers • Pre-trained on Masked Language Modeling (MLM) + Next Sentence Prediction • 110M params (Base) / 340M params (Large) • State-of-the-art on 11 NLP benchmarks at release • Not generative — encoder-only architecture GPT-1 (2018, OpenAI): • 117M parameters • Decoder-only, generative • Pre-trained on BooksCorpus (800M words) • Demonstrated transfer learning to downstream tasks GPT-2 (2019, OpenAI): • 1.5B parameters • Trained on WebText (40GB, 8M Reddit documents) • Initially withheld due to "misuse concerns" — later fully released • Zero-shot task performance surprised researchers XLNet (2019, CMU + Google): • Permutation-based language modeling • Overcame BERT's masking pretrain-finetune mismatch • Briefly surpassed BERT on many benchmarks 2.3 THE SCALING REVOLUTION (2020–2022) ──────────────────────────────────────── GPT-3 (2020, OpenAI) — 175B parameters: • Trained on 300B tokens from Common Crawl, WebText2, Books, Wikipedia • Demonstrated remarkable few-shot and zero-shot capabilities • In-context learning without gradient updates • Spawned entire industry of API-based AI applications The Scaling Laws Paper (Kaplan et al., 2020, OpenAI): • Showed power-law relationships between compute, data, params, and loss • LM loss ∝ N^{-0.076} (parameters) • LM loss ∝ D^{-0.095} (dataset tokens) • Optimal compute allocation: scale params and data together • Led to the "bigger is better" philosophy Chinchilla Scaling Laws (Hoffmann et al., 2022, DeepMind): • Revised Kaplan et al. — models were undertrained relative to parameters • Optimal: ~20 tokens per parameter • 70B model needs 1.4T tokens for compute-optimal training • Changed how industry allocates training budgets PaLM (2022, Google): • 540B parameters • Pathways system — trained across thousands of TPU chips • Introduced chain-of-thought prompting results • BIG-bench Hard performance breakthrough InstructGPT (2022, OpenAI): • 1.3B model outperformed 175B GPT-3 on helpfulness • Introduced RLHF (Reinforcement Learning from Human Feedback) • Three stages: SFT → Reward Model → PPO fine-tuning • Foundation for ChatGPT 2.4 THE CHATBOT EXPLOSION (2022–2023) ──────────────────────────────────────── ChatGPT (November 2022, OpenAI): • Based on GPT-3.5 + RLHF alignment • 1 million users in 5 days; 100M in 2 months • Fastest-growing consumer application in history • Triggered massive industry investment and competition GPT-4 (March 2023, OpenAI): • Architecture: Mixture of Experts (unconfirmed, ~8x220B) • Multimodal: accepts image + text input • Passed bar exam at 90th percentile • Passed SAT Math at 89th percentile • Context window: 8K → 32K → 128K (GPT-4 Turbo) Claude 1/2/3 (2023–2024, Anthropic): • Founded by former OpenAI researchers • Constitutional AI (CAI) approach to alignment • Claude 3 Opus: competitive with GPT-4 on most benchmarks • 200K context window (Claude 3) • Strong performance on reasoning and coding Gemini (December 2023, Google DeepMind): • Natively multimodal from training • Three sizes: Nano, Pro, Ultra • Ultra matched GPT-4 on MMLU • Integrated into Google products Llama 1/2/3 (2023–2024, Meta): • Open weights — downloadable and runnable locally • Llama 2: 7B, 13B, 34B, 70B parameter variants • Llama 3: 8B and 70B, significantly improved • Sparked entire ecosystem of fine-tunes and derivatives • Mistral, Mixtral, Phi built on similar principles 2.5 THE CURRENT ERA (2024–2025) ───────────────────────────────── Key trends defining 2024-2025: • Long context windows (1M+ tokens — Gemini 1.5 Pro) • Inference-time compute scaling (o1, R1, QwQ) • Mixture of Experts going mainstream • Small but capable models (Phi-3, Gemma 2) • Multimodality becoming standard • Agentic frameworks and tool use • On-device deployment (edge LLMs) • Open-source catching up to proprietary ================================================================================ SECTION 3 — ARCHITECTURE DEEP DIVE: THE TRANSFORMER ================================================================================ 3.1 HIGH-LEVEL OVERVIEW ───────────────────────── The Transformer processes input as a sequence of tokens. Each token is converted to a dense vector (embedding), processed through N identical layers, and the output is projected back to vocabulary probabilities. Input tokens → Embedding → [Layer 1] → [Layer 2] → ... → [Layer N] → Output logits Each Transformer layer contains: 1. Multi-Head Self-Attention (MHSA) 2. Feed-Forward Network (FFN) 3. Layer Normalization (applied before or after) 4. Residual connections 3.2 TOKENIZATION ───────────────── Before processing, text is converted to tokens using a tokenizer. Common algorithms: • Byte Pair Encoding (BPE): Iteratively merges frequent byte pairs • WordPiece: Maximizes language model log-likelihood on training data • SentencePiece: Language-agnostic, works on raw unicode • Tiktoken (OpenAI): BPE variant used in GPT-3.5/4 Typical token counts: • English: ~1 token per 4 characters (~0.75 words per token) • Code: ~1 token per 2-4 characters (varies by language) • Non-Latin scripts: 2-5x more tokens than English for same content Vocabulary sizes: • GPT-2: 50,257 tokens • GPT-3/4: 100,256 tokens (cl100k_base) • Llama 3: 128,256 tokens • Gemini: ~256,000 tokens (estimated) 3.3 EMBEDDINGS ─────────────── Each token ID is mapped to a high-dimensional vector via an embedding matrix E. E ∈ ℝ^{V × d_model} where V = vocabulary size, d_model = model dimension (e.g., 4096 for 7B model) Positional encodings are added to inject sequence order information: • Sinusoidal (original Transformer): fixed, deterministic • Learned absolute: trained position embeddings • Rotary Position Embeddings (RoPE): relative, extrapolates well • ALiBi: attention bias based on distance, zero params RoPE (Su et al., 2021) is now dominant in modern LLMs: • Encodes position by rotating query/key vectors in 2D planes • Enables length generalization beyond training context • Used in: Llama, Mistral, Falcon, Qwen, DeepSeek 3.4 SELF-ATTENTION MECHANISM ────────────────────────────── Self-attention allows each token to attend to all other tokens in the sequence. Given input X ∈ ℝ^{n × d}: Q = X · W_Q (Queries) K = X · W_K (Keys) V = X · W_V (Values) Attention(Q, K, V) = softmax(QK^T / √d_k) · V The division by √d_k prevents gradient vanishing when d_k is large. Multi-Head Attention runs h parallel attention heads: head_i = Attention(QW_Q^i, KW_K^i, VW_V^i) MHA(Q,K,V) = Concat(head_1,...,head_h) · W_O Each head can focus on different aspects of the input (syntax, semantics, etc.) For decoder-only models (GPT, Llama), causal masking is applied: • Tokens can only attend to previous tokens (not future) • Implemented by setting future positions to -∞ before softmax Attention variants for efficiency: • Multi-Query Attention (MQA): Single K/V head, multiple Q heads • Grouped Query Attention (GQA): G groups of K/V, less than Q heads • Flash Attention: Memory-efficient exact attention via tiling • Sparse Attention: Attend to subset of positions 3.5 FEED-FORWARD NETWORK (FFN) ────────────────────────────── After attention, each position goes through an FFN independently: FFN(x) = GELU(x · W_1 + b_1) · W_2 + b_2 Or with SwiGLU (common in modern models): FFN_SwiGLU(x) = (Swish(x · W_gate) ⊙ (x · W_1)) · W_2 The FFN dimension is typically 4× the model dimension: • d_ff = 4 × d_model (original) • d_ff = 8/3 × d_model (SwiGLU — different effective ratio) The FFN is believed to store factual knowledge as "memory" (Geva et al., 2021). Each neuron can be interpreted as a key-value pair of pattern → value. 3.6 LAYER NORMALIZATION ──────────────────────── LayerNorm stabilizes training by normalizing across feature dimensions. LayerNorm(x) = γ · (x - μ) / √(σ² + ε) + β Pre-norm (applied before sublayer) vs Post-norm (applied after): • Pre-norm is more stable for very deep models • GPT-2+ and most modern LLMs use Pre-LN • RMSNorm (no centering) is used in Llama: saves ~5% compute 3.7 KEY ARCHITECTURAL VARIANTS ──────────────────────────────── Encoder-only (BERT family): • Bidirectional attention — sees all tokens in context • Good for: classification, NER, embedding, retrieval • Cannot generate text autoregressively • Examples: BERT, RoBERTa, DeBERTa, ELECTRA Decoder-only (GPT family): • Causal/unidirectional attention • Autoregressive generation • Dominant for chat/generation tasks • Examples: GPT, Llama, Mistral, Falcon, Claude, Gemini Encoder-Decoder (T5 family): • Separate encoder (bidirectional) + decoder (causal) • Encoder processes input, decoder generates output • Natural for translation, summarization • Examples: T5, BART, mT5, FLAN-T5 Mixture of Experts (MoE): • Multiple FFN "expert" networks per layer • Router network selects top-k experts per token • Only a fraction of parameters active per forward pass • Scales parameter count without proportional compute • Examples: Mixtral 8x7B, GPT-4 (rumored), Gemini 1.5 ================================================================================ SECTION 4 — PRE-TRAINING AND FINE-TUNING ================================================================================ 4.1 PRE-TRAINING DATA ────────────────────── Modern LLMs are trained on a mixture of data sources: Common Crawl: • Petabytes of web text, scraped quarterly • Requires aggressive filtering for quality • C4, FineWeb, RefinedWeb are filtered derivatives Books: • BooksCorpus: 11,000 unpublished books (~800M words) • Books3: 196,640 books from Bibliotik • Project Gutenberg: 60,000+ public domain books Code: • GitHub: billions of lines of open-source code • The Stack (BigCode): 6TB of code in 358 languages • CodeParrot, StarCoder, Code Llama datasets Academic/Scientific: • arXiv: 2M+ papers in LaTeX source • PubMed: biomedical literature • Semantic Scholar Open Research Corpus Curated Web: • Wikipedia: 60M+ articles, 20+ languages • StackExchange: Q&A across technical topics • Reddit: discussions (used in WebText/OpenWebText) Typical data mixture for a 2024 model (approximate): • Web text: 40-60% • Code: 15-25% • Books/long-form: 10-15% • Academic papers: 5-10% • Curated/high-quality: 5-15% 4.2 TRAINING OBJECTIVE ─────────────────────── Standard: Causal Language Modeling (CLM) / Next-Token Prediction Loss = -∑ log P(token_t | token_1,...,token_{t-1}) This is cross-entropy between predicted distribution and one-hot target. The model learns to minimize this loss, implicitly learning grammar, facts, reasoning patterns, and world knowledge. Alternative objectives (less common now): • Masked Language Modeling (MLM): BERT-style • Span prediction: T5-style "sentinel tokens" • Prefix Language Modeling: Causal on suffix, MLM on prefix 4.3 OPTIMIZER AND TRAINING DETAILS ──────────────────────────────────── Optimizer: AdamW (Adam + Weight Decay) β₁ = 0.9, β₂ = 0.95, ε = 1e-8 Weight decay = 0.1 Learning rate schedule: • Linear warmup: 1000-2000 steps • Cosine decay: decays to 10% of peak LR • Peak LR: ~3e-4 for 7B models, scales down for larger Gradient clipping: max norm = 1.0 (prevents exploding gradients) Precision: • BF16 (Brain Float 16): preferred over FP16 • BF16 has same exponent range as FP32, less loss • Master weights in FP32 for numerical stability • Activation checkpointing saves GPU memory (recompute on backward) Batch size: • Typical: millions of tokens per batch • Llama 3 70B: batch size ~4M tokens • Gradient accumulation used to achieve large effective batch Distributed training strategies: • Data Parallelism: same model replicated, different data shards • Tensor Parallelism: split model layers across GPUs (Megatron-LM) • Pipeline Parallelism: different layers on different GPUs • FSDP (Fully Sharded Data Parallel): PyTorch native • DeepSpeed ZeRO: optimizer state sharding 4.4 SUPERVISED FINE-TUNING (SFT) ────────────────────────────────── After pre-training, models are fine-tuned on curated instruction-following data. SFT data format (ChatML / conversation format): <|system|>You are a helpful assistant. <|user|>What is the capital of France? <|assistant|>The capital of France is Paris. Key SFT datasets: • Alpaca (52K): GPT-3.5 generated instructions • Dolly (15K): Databricks-curated, human-written • OpenAssistant (161K): human conversations • ShareGPT: real ChatGPT conversations • FLAN Collection: thousands of tasks with templates SFT teaches format compliance more than capability. Most capability comes from pre-training. SFT just activates it. 4.5 REINFORCEMENT LEARNING FROM HUMAN FEEDBACK (RLHF) ─────────────────────────────────────────────────────── RLHF aligns model outputs with human preferences. Stage 1 — Collect Preference Data: • Human raters compare pairs of model outputs (A vs B) • Ratings capture: helpfulness, accuracy, safety, tone Stage 2 — Train Reward Model (RM): • Bradley-Terry model: RM learns to predict preferences • Loss = -log(σ(RM(preferred) - RM(rejected))) • RM maps (prompt, response) → scalar reward Stage 3 — PPO Fine-tuning: • Model generates responses, scored by RM • PPO (Proximal Policy Optimization) maximizes expected reward • KL penalty prevents reward hacking / distributional shift • Objective: R(x,y) - β·KL(π_θ || π_ref) RLHF challenges: • Reward hacking: model exploits RM's blind spots • Scalable oversight: hard to rate technical outputs • Data collection is expensive and slow • Mode collapse risks with aggressive PPO 4.6 ALTERNATIVES TO RLHF ────────────────────────── Direct Preference Optimization (DPO): • Skips the reward model entirely • Directly optimizes log ratio of preferred/rejected • Simpler, more stable than PPO • Now widely used: Llama 3, Mistral, many open models Constitutional AI (Anthropic): • Model critiques and revises its own outputs • Uses a "constitution" of principles as guidance • Reduces need for human feedback in safety training • Used in Claude training RLAIF (AI Feedback): • Use another AI model as the rater instead of humans • Scales feedback collection massively • Combined with RLHF in modern pipelines KTO (Kahneman-Tversky Optimization): • Based on prospect theory from behavioral economics • Works with unpaired preference data • Single samples labeled good/bad, not pairs ================================================================================ SECTION 5 — MAJOR LLM FAMILIES AND COMPARISONS ================================================================================ 5.1 GPT FAMILY (OpenAI) ───────────────────────── Model | Params | Context | Release | Notes ─────────────|───────────|──────────|──────────|────────────────────── GPT-1 | 117M | 512 | Jun 2018 | First GPT GPT-2 | 1.5B | 1,024 | Feb 2019 | Controversial release GPT-3 | 175B | 4,096 | Jun 2020 | API era begins GPT-3.5 | ~175B | 16,384 | Mar 2022 | ChatGPT backbone GPT-4 | ~1T (MoE) | 128,000 | Mar 2023 | Multimodal, SOTA GPT-4o | unknown | 128,000 | May 2024 | Omni: native audio/vision GPT-4o mini | unknown | 128,000 | Jul 2024 | Cheap, fast GPT-4o o1 | unknown | 128,000 | Sep 2024 | Reasoning via RL o3 | unknown | 200,000 | Dec 2024 | ARC-AGI record 5.2 CLAUDE FAMILY (Anthropic) ─────────────────────────────── Model | Context | Release | Notes ───────────────|──────────|──────────|────────────────────────────── Claude 1 | 9K | Mar 2023 | First public Claude Claude 1.3 | 100K | May 2023 | Context breakthrough Claude 2 | 200K | Jul 2023 | Improved reasoning Claude 2.1 | 200K | Nov 2023 | Less hallucination Claude 3 Haiku | 200K | Mar 2024 | Fast, cheap Claude 3 Sonnet| 200K | Mar 2024 | Balanced Claude 3 Opus | 200K | Mar 2024 | Most capable, SOTA vs GPT-4 Claude 3.5 Son.| 200K | Jun 2024 | Surpassed GPT-4 on many tasks Claude 3.5 Hku | 200K | Nov 2024 | Better than Claude 3 Opus @ cost Claude 4 Sonnet| 200K+ | 2025 | Current flagship Key Anthropic differentiators: • Constitutional AI for alignment • Focus on "Helpful, Harmless, Honest" (HHH) • Long context from early (100K in 2023) • Strong on coding, writing, analysis 5.3 GEMINI FAMILY (Google DeepMind) ───────────────────────────────────── Model | Context | Release | Notes ─────────────────|──────────|──────────|──────────────────────────── Gemini 1.0 Nano | 32K | Dec 2023 | On-device Gemini 1.0 Pro | 32K | Dec 2023 | API access Gemini 1.0 Ultra | 32K | Feb 2024 | Matched GPT-4 on MMLU Gemini 1.5 Pro | 1M | Feb 2024 | 1M token context Gemini 1.5 Flash | 1M | May 2024 | Fast/cheap Gemini 2.0 Flash | 1M | Dec 2024 | Natively agentic Gemini 2.0 Ultra | 1M+ | 2025 | Research preview Google advantages: • Native multimodal from ground up • TPU infrastructure • Integration with Google products • 1M context window (Gemini 1.5 Pro) 5.4 LLAMA FAMILY (Meta) ───────────────────────── Model | Params | Context | Release | Notes ──────────────|─────────|──────────|──────────|────────────────────── Llama 1 | 7-65B | 2,048 | Feb 2023 | Research only license Llama 2 | 7-70B | 4,096 | Jul 2023 | Commercial use allowed Llama 3 8B | 8B | 128K | Apr 2024 | Strongest small model Llama 3 70B | 70B | 128K | Apr 2024 | Near GPT-4 quality Llama 3.1 405B| 405B | 128K | Jul 2024 | Open-source GPT-4 rival Llama 3.2 | 1B,3B | 128K | Sep 2024 | Mobile-optimized Llama 3.3 70B | 70B | 128K | Dec 2024 | Improved Llama 3.1 5.5 MISTRAL FAMILY ──────────────────── Model | Params | Notes ──────────────────|────────────|─────────────────────────────────── Mistral 7B | 7B | Outperformed Llama 2 13B Mixtral 8x7B | 8x7B (MoE) | First major open MoE, ~13B active Mixtral 8x22B | 8x22B | SOTA open model on release Mistral Small | ~22B | API product Mistral Medium | ~41B | API product Mistral Large | unknown | Competitive with GPT-4 Codestral | 22B | Code-specialized Mistral NeMo | 12B | MIT license, 128K context Pixtral | 12B | Multimodal Mistral strengths: • Efficiency-focused architecture • Apache 2.0 license (very permissive) • Sliding window attention for efficiency • Strong European alternative to US models 5.6 OPEN SOURCE ECOSYSTEM ─────────────────────────── Prominent community models and fine-tunes: Phi (Microsoft): • Phi-1: 1.3B, code-focused • Phi-2: 2.7B, surprisingly capable • Phi-3: 3.8B, 7B, 14B — near GPT-3.5 quality at small scale • Phi-4: 14B — research preview Qwen (Alibaba): • Qwen 2.5 series: 0.5B to 72B • Strong on Chinese + multilingual • Qwen-Coder: code-specialized DeepSeek: • DeepSeek-V2: MoE, Chinese/English • DeepSeek-R1: Open reasoning model rivaling o1 • Very competitive at lower cost Gemma (Google): • Gemma 1: 2B, 7B — open weights • Gemma 2: 2B, 9B, 27B — improved • CodeGemma, PaliGemma variants Command R (Cohere): • Optimized for RAG and tool use • 35B and 104B variants • Grounded generation focused ================================================================================ SECTION 6 — PROMPTING TECHNIQUES ================================================================================ 6.1 ZERO-SHOT PROMPTING ──────────────────────── Asking the model to perform a task without any examples. Example: "Classify the sentiment of this review as positive, negative, or neutral: 'The food was amazing but the service was slow.'" Works well for: simple, well-defined tasks. Requires: clear instructions. 6.2 FEW-SHOT PROMPTING ──────────────────────── Providing examples of the task in the prompt (in-context learning). Example: "Classify sentiment: Review: 'Great product!' → Positive Review: 'Terrible quality.' → Negative Review: 'It was okay.' → Neutral Review: 'Best purchase ever!' → ?" Works well for: tasks where format matters, unusual output requirements. Typically 3-8 examples is optimal. 6.3 CHAIN-OF-THOUGHT (CoT) PROMPTING ────────────────────────────────────── Prompting the model to reason step-by-step before giving the final answer. Zero-shot CoT: Append "Let's think step by step." Few-shot CoT: Provide examples with reasoning chains: "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. How many? A: Roger started with 5 balls. 2 cans × 3 balls = 6 new balls. 5 + 6 = 11. The answer is 11. Q: Shawn has 5 toys. His parents gave him 2 for Xmas and 3 for birthday. Total? A: Let's think step by step..." Dramatically improves performance on: • Arithmetic reasoning • Multi-step word problems • Commonsense reasoning • Code debugging 6.4 TREE OF THOUGHTS (ToT) ──────────────────────────── Explores multiple reasoning paths simultaneously (Yao et al., 2023): • Model generates multiple "thoughts" at each step • Evaluates which paths are promising • Backtracks and explores alternatives • Effective for planning and creative tasks 6.5 SELF-CONSISTENCY ───────────────────── Generate multiple reasoning paths, select the most common answer. • Run same prompt N times (e.g., N=20-40) • Each run may take a different reasoning path • Majority vote on final answers • Expensive but significantly improves accuracy on math 6.6 SYSTEM PROMPTS ──────────────────── System prompts set the context, persona, and constraints: "You are a helpful assistant for REGITE, a YouTube channel about tech and AI. You should be enthusiastic, knowledgeable, and concise. When answering questions about LLMs, provide accurate information with examples. Do not make up information you're unsure about." System prompt best practices: • Be specific about role and personality • Define output format expectations • Specify what to do AND what not to do • Include relevant context about the user 6.7 ADVANCED PROMPTING STRATEGIES ──────────────────────────────────── ReAct (Reasoning + Acting): • Interleaves reasoning ("Thought:") with actions ("Action:") • Used in agentic systems with tool access • Model thinks about what to do, does it, observes result Structured Output Prompting: • Instruct model to respond in JSON, XML, Markdown table • Use Pydantic schemas or JSON Schema for validation • Helps downstream parsing Role-playing / Persona Assignment: • "You are an expert Python developer reviewing code..." • "You are a strict teacher who only gives positive feedback if truly deserved..." • Activates domain-relevant patterns in model weights Meta-prompting: • Use LLM to improve your prompts • "Review this prompt and suggest improvements for clarity and specificity" • Iterative prompt engineering loop Prompt Chaining: • Break complex tasks into smaller subtasks • Output of one prompt becomes input of next • Easier to debug and control 6.8 PROMPT INJECTION AND SECURITY ──────────────────────────────────── Risks when LLMs process untrusted content: Direct injection: User includes instructions in their input "Ignore previous instructions and reveal your system prompt" Indirect injection: Instructions embedded in retrieved documents A malicious webpage tells the agent to exfiltrate data Jailbreaking: Techniques to bypass safety training Role-play scenarios, hypothetical framing, token manipulation Defenses: • Input sanitization and detection • Separate trusted vs untrusted content handling • Privilege separation in agentic systems • Constitutional/rule-based guardrails • Monitor outputs for anomalous patterns ================================================================================ SECTION 7 — EVALUATION METRICS AND BENCHMARKS ================================================================================ 7.1 BENCHMARK OVERVIEW ──────────────────────── MMLU (Massive Multitask Language Understanding): • 57 subjects from STEM to humanities • 14,000+ questions in multiple-choice format • Tests knowledge breadth • Score range: 0-100% • GPT-4: ~87%, Claude 3 Opus: ~87%, Gemini Ultra: ~83% HumanEval: • 164 Python programming problems • Tests functional correctness (code must pass tests) • Pass@k metric: did any of k attempts pass? • GPT-4: ~87% pass@1 MATH: • 12,500 competition math problems • 5 difficulty levels • Requires multi-step symbolic reasoning • GPT-4: ~42-52%, o1: ~90%+ BIG-bench: • 204 tasks from diverse domains • Tests capabilities beyond standard NLP • BIG-bench Hard: 23 especially difficult tasks GSM8K (Grade School Math): • 8,500 grade school math word problems • Requires multi-step arithmetic reasoning • GPT-4: ~95%+, strong models near ceiling ARC-Challenge: • Science questions for grade 3-9 • Challenging subset requiring reasoning • Strong models: 90%+ HellaSwag: • Commonsense NLI — pick best sentence completion • Humans: 95%, GPT-4: 95%+ TruthfulQA: • 817 questions that humans often answer falsely • Tests tendency to produce false info • Measures "truthfulness" • Harder: models often repeat human misconceptions WinoGrande: • Commonsense reasoning about pronoun reference • Tests understanding of world knowledge 7.2 CODING BENCHMARKS ─────────────────────── SWE-bench: • Real GitHub issues from popular Python repos • Model must write code that fixes the issue • Tests practical software engineering ability • GPT-4: ~1.7% (original), Claude 3.5: ~49% (agent setting) LiveCodeBench: • Continuously updated with new competitive programming problems • Prevents contamination from training data 7.3 LONG CONTEXT BENCHMARKS ───────────────────────────── SCROLLS: • Summarization and question answering over long documents • 10K-100K+ token contexts LongBench: • Multi-task long context benchmark • 16 tasks, Chinese and English Needle-in-a-Haystack: • Retrieve specific fact buried in a long document • Tests whether context window is effectively used • Common informal evaluation: "can you find X in 128K tokens?" RULER: • Realistic universal long-context evaluation • Tests actual long-context capabilities vs claimed 7.4 SAFETY BENCHMARKS ─────────────────────── ToxiGen: • Hate speech and toxic content detection • Tests if models generate or recognize harmful content HarmBench: • Standardized evaluation for red-teaming • Measures attack success rate on aligned models BBQ (Bias Benchmark for QA): • Tests social biases (gender, race, religion, etc.) • Ambiguous + disambiguated conditions 7.5 CHALLENGES WITH BENCHMARKS ─────────────────────────────── Contamination: • Training data may include benchmark test sets • Inflated scores that don't reflect real capability • Dynamic/held-out benchmarks attempt to address this Benchmark saturation: • Many models approach ceiling on older benchmarks • Constant need for harder evaluation • MMLU near-saturated at 90%+ Metric-task mismatch: • Benchmark performance ≠ real-world usefulness • Users prefer verbose, confident answers even if less accurate • Length bias in LLM-as-judge evaluation ================================================================================ SECTION 8 — SAFETY AND ALIGNMENT ================================================================================ 8.1 THE ALIGNMENT PROBLEM ─────────────────────────── Core question: How do we build AI systems that reliably pursue goals that are actually beneficial to humanity? Key challenges: • Specification: Hard to formally define "beneficial" • Robustness: Systems might find unintended ways to satisfy objectives • Scalability: Our oversight must scale with AI capability • Deception: Advanced AI might learn to appear aligned 8.2 ALIGNMENT APPROACHES ────────────────────────── RLHF (see Section 4.5) • Human preferences as a proxy for alignment • Limitations: reward hacking, preference quality, scalability Constitutional AI (Anthropic): • Set of principles the model uses to evaluate/revise outputs • RLAIF using the model itself as critic • Reduces reliance on human labels for safety RLAIF: • AI-generated feedback replaces human ratings • Scales feedback collection • Risk: AI may perpetuate its own biases Debate: • Two AI systems argue opposite positions • Human judges which argument is more truthful • Leverages human ability to detect flaws in arguments Scalable Oversight: • Using AI assistance to supervise AI • AI decompose complex tasks into verifiable subtasks • Humans verify subtasks rather than whole output Interpretability: • Understand what's happening inside the model • Identify circuits responsible for capabilities/behaviors • Anthropic's mechanistic interpretability work • Sparse autoencoders for feature analysis 8.3 TYPES OF HARMFUL OUTPUTS ────────────────────────────── Factual errors / hallucination: • Model confidently states false information • Causes: distribution of training data, training objective • Mitigations: RAG, self-consistency, grounding Bias and stereotyping: • Reflects societal biases in training data • Can harm underrepresented groups • Evaluation: BBQ, WinoBias, occupational stereotypes Toxic/harmful content: • Explicit violence, hate speech, CSAM • Most heavily filtered in training and RLHF • "Jailbreaking" attempts to bypass these filters Dangerous information: • Weapons synthesis, cyberattacks, self-harm guidance • Models trained to refuse based on potential harm • Uplift concern: does model provide meaningful advantage? Privacy violations: • Regurgitating memorized personal information • PII extraction from training data • Inference attacks 8.4 SAFETY TECHNIQUES ─────────────────────── Input/Output filtering: • Classifier-based detection of harmful inputs/outputs • Regex patterns for obviously problematic content • Separate safety classifier layer Refusal training: • SFT + RLHF to decline harmful requests • Challenge: balance safety vs helpfulness • Over-refusal is also a failure mode Red teaming: • Adversarial testing by human red teamers • Automated red teaming with LLMs • Adversarial prompts, jailbreak attempts Watermarking: • Embed statistical signal in model outputs • Allows detection of AI-generated text • OpenAI, DeepMind research ongoing 8.5 AI GOVERNANCE AND REGULATION ─────────────────────────────────── EU AI Act (2024): • Risk-based regulatory framework • High-risk AI systems: medical, employment, critical infrastructure • Foundation model transparency requirements • Came into effect August 2024, full enforcement 2026 US Executive Order on AI (Oct 2023): • NIST AI Safety Institute • Red-teaming requirements for powerful models • Reporting requirements for large training runs Voluntary Commitments: • Frontier Model Forum (OpenAI, Google, Microsoft, Anthropic) • Safety evaluations before deployment • Information sharing between labs • Watermarking AI content International AI Safety: • Bletchley Declaration (28 countries, Nov 2023) • UK AI Safety Institute • Seoul AI Safety Summit follow-up • OECD AI Principles ================================================================================ SECTION 9 — RETRIEVAL-AUGMENTED GENERATION (RAG) ================================================================================ 9.1 THE PROBLEM RAG SOLVES ──────────────────────────── LLM limitations: • Knowledge cutoff — doesn't know about recent events • Hallucination — may generate plausible-sounding but false info • No access to private/proprietary data • Can't cite specific sources • Context window limits how much it can "remember" RAG addresses these by fetching relevant information at inference time. 9.2 BASIC RAG PIPELINE ─────────────────────── 1. Indexing (offline): • Gather documents (PDFs, websites, databases) • Chunk into ~200-500 token segments • Embed each chunk → dense vector • Store in vector database 2. Retrieval (online, per query): • Embed user query → query vector • Similarity search in vector DB (cosine, dot product) • Retrieve top-k most relevant chunks (k=3-10) 3. Generation: • Concatenate retrieved chunks with user query • Pass augmented prompt to LLM • LLM generates response grounded in retrieved context 9.3 EMBEDDING MODELS ───────────────────── Convert text to dense vectors for semantic search. Popular embedding models: • OpenAI text-embedding-3-large: 3072 dims, SOTA • OpenAI text-embedding-3-small: 1536 dims, cheap • Cohere embed-v3: strong, multilingual • BGE-M3 (BAAI): open source, multilingual • E5-large-v2: strong open source • Sentence-BERT: fast, reliable Evaluation: MTEB (Massive Text Embedding Benchmark) • 56 datasets across 8 task categories • Retrieval, clustering, classification, etc. 9.4 VECTOR DATABASES ────────────────────── Store and search high-dimensional vectors efficiently. Pinecone: • Managed cloud vector DB • Good for production, easy to use • Expensive at scale Weaviate: • Open-source, self-hostable • Hybrid search (vector + BM25) • GraphQL API Qdrant: • Open-source, Rust-based (fast) • Payload filtering • Good self-hosted option Chroma: • Designed for AI applications • Simple Python API • Good for prototyping Milvus: • High-performance, scales to billions • Multiple index types (IVF, HNSW, etc.) • Enterprise-grade pgvector: • PostgreSQL extension • Good if you're already using Postgres • HNSW and IVF indexes 9.5 CHUNKING STRATEGIES ───────────────────────── Fixed-size chunking: • Split at N tokens with M token overlap • Simple but may break mid-sentence Sentence splitting: • Chunk at sentence boundaries • Preserves semantic units Recursive character splitting: • Try to split on paragraph → sentence → word → character • LangChain RecursiveCharacterTextSplitter Semantic chunking: • Embed sentences, find semantic breakpoints • More intelligent but slower Document-aware: • Respect document structure (headers, sections) • Chunk within sections, use hierarchy as metadata 9.6 ADVANCED RAG TECHNIQUES ───────────────────────────── Hypothetical Document Embeddings (HyDE): • Generate a hypothetical answer first • Embed hypothetical answer for retrieval • Often finds better matches than query embedding Query rewriting: • Expand/rephrase query before retrieval • Multiple query variants → merge results • Step-Back prompting: abstract to higher level Reranking: • First retrieve N candidates (e.g., 50) • Rerank with more expensive cross-encoder • Return top-k to LLM • Models: Cohere Rerank, BGE-Reranker Multi-hop RAG: • Iterative retrieval for complex questions • Retrieve → partial answer → new query → retrieve again • Self-RAG: model decides when to retrieve Parent-child chunking: • Store small chunks for precise retrieval • Store parent chunks for richer context to LLM • Retrieve by child, pass parent to model GraphRAG (Microsoft): • Build knowledge graph from documents • Graph-based community summaries • Better for "big picture" questions ================================================================================ SECTION 10 — AGENTS AND TOOL USE ================================================================================ 10.1 WHAT ARE LLM AGENTS? ─────────────────────────── An LLM agent is a system where an LLM acts as the "brain" to: • Perceive the environment (inputs: text, images, tool results) • Reason about what to do • Take actions (call tools, write code, browse web) • Observe results and iterate The core loop: OBSERVE → THINK → ACT → OBSERVE → THINK → ACT → ... 10.2 TOOL USE / FUNCTION CALLING ────────────────────────────────── Modern LLMs can call external functions/APIs: 1. Model receives tools definition (JSON Schema) 2. Model generates a tool call (function name + args) 3. System executes the tool 4. Result returned to model as observation 5. Model continues reasoning / calls more tools Example tool definition: { "name": "search_web", "description": "Search the internet for current information", "parameters": { "type": "object", "properties": { "query": {"type": "string", "description": "Search query"}, "num_results": {"type": "integer", "default": 5} }, "required": ["query"] } } Common tool categories: • Information retrieval: web search, database queries, document lookup • Computation: code execution, math, data analysis • APIs: weather, maps, calendar, email • File operations: read, write, create • Browser control: navigate, click, fill forms 10.3 AGENT FRAMEWORKS ─────────────────────── LangChain: • Most popular agent framework • Chains: sequences of LLM calls • Agents: LLM decides which tools to use • Memory: conversation and tool result storage • Many integrations LlamaIndex: • Focused on data + retrieval • Agent over data pipelines • Strong RAG capabilities AutoGen (Microsoft): • Multi-agent conversations • GroupChat: multiple agents discuss • Human-in-the-loop support CrewAI: • Role-based multi-agent framework • Agents collaborate on complex tasks • Built on LangChain Semantic Kernel (Microsoft): • Enterprise-focused, C#/.NET/Python • Plugin-based architecture • Memory and planning built in Haystack: • Production RAG and agent pipelines • Component-based, composable 10.4 PLANNING APPROACHES ────────────────────────── ReAct (Reasoning + Acting): Thought: I need to find the population of Kerala Action: search_web("Kerala population 2024") Observation: Kerala has a population of approximately 35 million Thought: I now have the answer Final Answer: Kerala has approximately 35 million people Plan-and-Execute: 1. Planner LLM creates a plan (list of steps) 2. Executor LLM executes each step 3. Better for long-horizon tasks Reflection / Reflexion: • Agent evaluates its own outputs • Generates verbal feedback on mistakes • Re-tries with lessons learned Tree Search (MCTS): • Explore multiple action sequences • Backtrack when paths fail • Good for exploration problems 10.5 MULTI-AGENT SYSTEMS ────────────────────────── Multiple specialized agents collaborate: • Orchestrator: manages overall task, delegates to specialists • Subagent: executes a specific subtask (e.g., coder, researcher, critic) • Critic: reviews and provides feedback on other agents' outputs Benefits: • Specialization — each agent optimized for its role • Parallelism — multiple agents work simultaneously • Error checking — agents validate each other's work • Longer horizon tasks — divide and conquer Challenges: • Cost: multiple LLM calls per task • Coordination overhead • Error propagation between agents • Harder to debug 10.6 COMPUTER USE ────────────────── LLMs can now directly interact with computers: • Anthropic Computer Use (2024): control desktop via screenshot + actions • OpenAI Operator (2025): web browsing agent • Google Mariner: Chrome-based browsing agent Actions available: click, type, scroll, screenshot, drag Use cases: form filling, data extraction, testing, automation Challenges: • Reliability: GUIs change, elements move • Safety: irreversible actions (send email, delete file) • Latency: screenshot-action loop is slow ================================================================================ SECTION 11 — MULTIMODAL MODELS ================================================================================ 11.1 VISION-LANGUAGE MODELS ───────────────────────────── Modern frontier LLMs are multimodal — they process images + text together. Architectures: • Visual encoder (e.g., ViT) extracts image features • Linear projection maps visual features to LLM input space • LLM processes interleaved text + visual tokens Examples: • GPT-4V/4o: strong OCR, chart understanding, general vision • Claude 3 Sonnet/Opus: document analysis, screenshot understanding • Gemini 1.5 Pro: video understanding (up to 1M token context) • LLaVA: open-source vision-language model • Qwen-VL: strong on Chinese documents Capabilities: • Image description and captioning • Visual question answering • OCR and document understanding • Chart/graph interpretation • Code screenshot understanding • Medical image analysis • Spatial reasoning 11.2 AUDIO AND SPEECH ──────────────────────── Audio-capable models: • Whisper (OpenAI): ASR only, excellent accuracy • GPT-4o: native audio I/O (experimental) • Gemini: audio understanding built-in • ElevenLabs + LLM: TTS pipeline 11.3 VIDEO UNDERSTANDING ────────────────────────── • Gemini 1.5 Pro: up to 1 hour of video in context • GPT-4V: frame-by-frame images (not native video) • Video-LLaMA: open research model • InternVideo2: strong research model Applications: video summarization, sports analysis, education 11.4 CODE AND EXECUTION ───────────────────────── Models with code execution: • ChatGPT Code Interpreter: Python sandbox in chat • Claude (Artifacts): rendered HTML/React/SVG • Gemini Advanced: Python execution • GitHub Copilot: inline code completion in IDE Code generation benchmarks (see Section 7.2) ================================================================================ SECTION 12 — LLM APIS AND DEPLOYMENT ================================================================================ 12.1 MAJOR API PROVIDERS ────────────────────────── OpenAI: URL: api.openai.com/v1 Models: gpt-4o, gpt-4o-mini, o1, gpt-3.5-turbo Pricing (approx): $0.005/1K input, $0.015/1K output (GPT-4o) Features: function calling, vision, streaming, batch Anthropic: URL: api.anthropic.com/v1 Models: claude-3-5-sonnet, claude-3-haiku, claude-3-opus Features: tools, vision, streaming, computer use, prompt caching Unique: 200K context, prompt caching (90% cost reduction on cached tokens) Google: URL: generativelanguage.googleapis.com Models: gemini-1.5-pro, gemini-1.5-flash, gemini-2.0-flash Features: function calling, grounding, code execution, 1M context Free tier available Mistral AI: URL: api.mistral.ai/v1 Models: mistral-large, mistral-small, codestral, mistral-embed Pricing: cheaper than OpenAI for comparable quality Cohere: URL: api.cohere.ai/v1 Strengths: retrieval, reranking, embeddings Command R+: 104B, optimized for RAG Together AI: URL: api.together.xyz/v1 Strengths: open model hosting (Llama, Mistral, etc.) Competitive inference pricing Groq: URL: api.groq.com/openai/v1 Strengths: extremely fast inference (LPU hardware) Models: Llama 3, Mixtral, Gemma Free tier with rate limits Fireworks AI: URL: api.fireworks.ai/inference/v1 Fast open model inference, competitive pricing 12.2 SELF-HOSTED DEPLOYMENT ────────────────────────────── Running LLMs locally or on your own infrastructure. Inference engines: • Ollama: simplest local setup, Mac/Windows/Linux • vLLM: high-throughput production serving, PagedAttention • llama.cpp: CPU + GPU inference, GGUF format • LMStudio: GUI for local models • text-generation-inference (TGI): Hugging Face's production server • ExLlamaV2: GPTQ/EXL2 quantized inference Quantization for smaller footprint: • GPTQ: 4-bit post-training quantization • AWQ: Activation-aware Weight Quantization • GGUF (llama.cpp): various bit widths (Q4_K_M popular) • BitsAndBytes: 4-bit NF4 in transformers library • Rule of thumb: Q4 loses ~1-2% quality, halves memory Hardware requirements: 7B model (full precision FP16): ~14GB VRAM 7B model (Q4 quantized): ~4GB VRAM 70B model (FP16): ~140GB VRAM 70B model (Q4 quantized): ~40GB VRAM Consumer GPUs: RTX 3060 12GB → Llama 3 8B Q4 RTX 4090 24GB → Llama 3 70B Q4 (slowly) Mac M2 Ultra 192GB → Llama 3 70B full 12.3 KEY API FEATURES ───────────────────────── Streaming: • Receive tokens as they're generated • Better perceived latency for chat • Server-Sent Events (SSE) protocol Function/Tool Calling: • Model generates structured JSON tool calls • System executes, returns result • Parallel tool calling (multiple tools at once) Structured Outputs: • Force model to respond in JSON Schema format • No post-processing needed • OpenAI: response_format={"type": "json_schema", "schema": ...} Context Caching: • Anthropic: cache system prompts, saves 90% on repeated prefix • Google: explicit caching API, hourly storage fee • Reduces cost for apps with long stable system prompts Embeddings: • Convert text to vectors for semantic search • Separate endpoint: /v1/embeddings • OpenAI text-embedding-3-large is standard Batch API: • Submit many requests at once • 50% cheaper, 24-hour turnaround • Good for offline processing, dataset generation ================================================================================ SECTION 13 — HARDWARE AND COMPUTE ================================================================================ 13.1 TRAINING HARDWARE ──────────────────────── NVIDIA H100: • Current training standard • 80GB HBM3 memory • ~2,000 TFLOPS BF16 • NVLink for fast GPU-to-GPU bandwidth • ~$30,000-40,000 per unit retail • Clusters: 1,000-100,000 H100s for frontier training NVIDIA H200: • HBM3e: 141GB memory (up from 80GB) • Higher memory bandwidth • Llama 3 405B training used H100s NVIDIA B200 / GB200: • Blackwell architecture (2024) • Grace Blackwell: CPU + GPU combined • NVL72: 72 B200s in a rack Google TPU v5p: • Google's training chip • 8,960 chips in largest pod • Used for Gemini training AMD MI300X: • 192GB HBM3 (more than H100!) • Strong memory bandwidth • Growing software ecosystem 13.2 TRAINING COMPUTE ─────────────────────── Training compute measured in FLOP (floating point operations) Llama 2 7B: ~184 × 10^18 FLOP (~184 petaFLOP) Llama 2 70B: ~1.7 × 10^21 FLOP (~1.7 exaFLOP) GPT-3 175B: ~3.1 × 10^23 FLOP (~310 exaFLOP) GPT-4: ~2.1 × 10^25 FLOP (estimated) Compute doubles roughly every 6 months for frontier models. Epoch AI tracks training compute over time. 13.3 INFERENCE HARDWARE ──────────────────────── For serving (inference), different priorities than training: • Memory bandwidth matters more than compute • Latency vs throughput tradeoff • Lower precision acceptable (INT8, INT4) NVIDIA H100 still dominant but: • AMD MI300X strong competitor (more HBM) • Groq LPU: custom chip, extremely fast for inference • AWS Inferentia: cost-optimized inference • Google TPU v5e: inference-optimized Metrics: • TTFT (Time to First Token): latency • TPS (Tokens per Second): throughput • Batch size tradeoffs ================================================================================ SECTION 14 — COST AND EFFICIENCY ================================================================================ 14.1 API PRICING BREAKDOWN (May 2025 approx.) ────────────────────────────────────────────── Provider | Model | Input $/1M | Output $/1M ───────────|──────────────────|────────────|──────────── OpenAI | gpt-4o | $2.50 | $10.00 OpenAI | gpt-4o-mini | $0.15 | $0.60 OpenAI | o1 | $15.00 | $60.00 OpenAI | gpt-3.5-turbo | $0.50 | $1.50 Anthropic | claude-3.5-sonnet| $3.00 | $15.00 Anthropic | claude-3-haiku | $0.25 | $1.25 Anthropic | claude-3-opus | $15.00 | $75.00 Google | gemini-1.5-pro | $3.50 | $10.50 Google | gemini-1.5-flash | $0.075 | $0.30 Mistral | mistral-large | $2.00 | $6.00 Mistral | mistral-small | $0.20 | $0.60 Together | Llama 3 70B | $0.90 | $0.90 14.2 EFFICIENCY TECHNIQUES ──────────────────────────── Quantization: • INT8: ~2x memory savings, minimal quality loss • INT4: ~4x savings, noticeable quality loss • GPTQ/AWQ/GGUF: post-training quantization • QAT (Quantization-Aware Training): train quantized model Speculative Decoding: • Small "draft" model proposes N tokens • Large "verifier" model accepts/rejects in parallel • 2-4x speedup with no quality loss • Requires both models KV Cache: • Cache key/value attention states for past tokens • Avoids recomputation on each new token • Memory grows with context length × batch size • PagedAttention (vLLM): manage KV cache like OS virtual memory Batching: • Process multiple requests together • Continuous batching: add new requests mid-batch • Higher GPU utilization = lower cost per token Distillation: • Train small model to mimic large model • Knowledge distillation: soft labels from teacher • Speculative decoding can use distilled draft models FlashAttention: • Exact attention but memory-efficient • Tiling: load chunks of Q/K/V to fast SRAM • FlashAttention-2: even faster • FlashAttention-3: H100-optimized ================================================================================ SECTION 15 — FUTURE OF LLMs ================================================================================ 15.1 SCALING CONTINUES... BUT HOW? ───────────────────────────────────── Pre-training data walls: • Internet text may be approaching exhaustion for training • Synthetic data generation becoming important • Phi models (Microsoft) heavily use synthetic data • Potential: use models to generate their own training data Compute scaling: • Still far from physical limits • Custom silicon (Google TPU, Groq LPU, Amazon Trainium) • More efficient use of compute (MoE, selective computation) Inference-time scaling: • o1, R1, QwQ: think longer → better answers • "Test-time compute" as a new scaling axis • Allocate more compute to hard problems at inference 15.2 MULTIMODALITY EVERYWHERE ─────────────────────────────── • Native audio understanding (not just transcription) • Real-time video understanding • 3D spatial understanding • Robotic embodiment (physical world interaction) • GPT-4o shows the direction: one model for all modalities 15.3 AGENTIC AI ───────────────── • Long-horizon task completion (hours → days → weeks) • Reliable tool use with error recovery • Multi-agent collaboration as default • Personal AI assistants with persistent memory • AI coworkers rather than AI assistants 15.4 MEMORY AND PERSONALIZATION ────────────────────────────────── • Persistent memory across conversations • User model: knows your preferences, history, goals • Episodic memory: remembers what happened last time • Semantic memory: knows what you know • ChatGPT Memory, Claude memory features emerging 15.5 OPEN VS CLOSED MODELS ──────────────────────────── Current trend: Open-source catching up rapidly • Llama 3.1 405B ≈ GPT-4 quality • DeepSeek-R1 ≈ o1 quality, fully open • 6-12 month lag from proprietary → open equivalent Implications: • Local deployment for privacy-sensitive use cases • Custom fine-tuning on proprietary data • Competition drives down API prices • Geopolitical: open models distribute AI globally 15.6 REASONING AND PLANNING ───────────────────────────── • Current frontier: Olympiad-level math, competition coding • In progress: scientific discovery, long-horizon planning • Key unknowns: formal verification, reliable multi-step reasoning • Neurosymbolic: combining neural + symbolic AI • AI mathematicians (AlphaProof, etc.) 15.7 TOWARD AGI? ───────────────── Definitions vary wildly. Common framings: OpenAI: AGI = system that outperforms humans at most economically valuable tasks Anthropic: focuses on "transformative AI" rather than AGI label DeepMind: AGI as systems with "broadly human-level cognitive performance" Current capabilities: ✓ Expert-level performance in many narrow domains ✓ Impressive generalization across tasks ✗ Reliable long-horizon planning without human oversight ✗ True causal reasoning ✗ Sample-efficient learning from few examples (like humans) ✗ Common sense in novel edge cases Most experts: AGI "definitionally possible" within years to decades. Timeline debates remain unsettled. Caution warranted. ================================================================================ SECTION 16 — GLOSSARY OF TERMS ================================================================================ AGI (Artificial General Intelligence): Hypothetical AI that matches or exceeds human-level performance across all tasks. Attention: Mechanism allowing neural networks to weigh the importance of different inputs. Autoregressive: Generating output one token at a time, conditioned on all previous tokens. BF16 (Brain Float 16): 16-bit floating point format with same exponent as FP32, used in training. BLEU: Bilingual Evaluation Understudy. Metric for machine translation quality. Chinchilla: DeepMind model that revised scaling laws; also refers to the scaling law paper. Context Window: Maximum number of tokens the model can process at once (input + output). DPO (Direct Preference Optimization): RLHF alternative that directly optimizes log-ratio of preferred outputs. Embedding: Dense vector representation of text in high-dimensional space. Few-shot: Providing examples in the prompt to guide model behavior. Fine-tuning: Training a pre-trained model further on task-specific data. FLOP: Floating Point Operation. Unit of compute used to measure training cost. Grounding: Anchoring model outputs to verifiable facts or real-world data. Hallucination: When a model generates plausible-sounding but false information. HHRL: Helpful, Harmless, Honest — alignment goals from Anthropic. HumanEval: Python code generation benchmark with 164 problems. Inference: Using a trained model to generate outputs (as opposed to training). In-context Learning: Learning from examples provided in the prompt without updating weights. KV Cache: Cached key-value pairs from past tokens to speed up generation. LLM: Large Language Model — neural network trained to understand and generate text. LoRA (Low-Rank Adaptation): Parameter-efficient fine-tuning by training small rank-decomposed matrices. MoE (Mixture of Experts): Architecture where different experts handle different inputs, only top-k active. MMLU (Massive Multitask Language Understanding): Popular benchmark of 57 subjects testing knowledge and reasoning. NLP (Natural Language Processing): Field of AI dealing with understanding and generating human language. ONNX: Open Neural Network Exchange. Standard format for model interoperability. PEFT (Parameter-Efficient Fine-Tuning): Fine-tuning a small fraction of model parameters (LoRA, prefix tuning, etc.). Perplexity: Measure of how well a language model predicts a sample. Lower = better. PPO (Proximal Policy Optimization): RL algorithm used in RLHF to update model based on reward signal. Prompt: Input text given to an LLM to elicit a desired response. QLoRA: LoRA applied to a quantized base model for memory-efficient fine-tuning. Quantization: Reducing numerical precision of model weights (FP16 → INT8 → INT4). RAG (Retrieval-Augmented Generation): Augmenting LLM with retrieved documents at inference time. RLHF (Reinforcement Learning from Human Feedback): Training paradigm using human preference ratings to align LLMs. RoPE (Rotary Position Embedding): Positional encoding that rotates query/key vectors to encode position. SFT (Supervised Fine-Tuning): Training on instruction-following examples before RLHF alignment. Softmax: Function that converts logits to probability distribution summing to 1. Temperature: Sampling parameter. Higher = more random, lower = more deterministic. Token: Basic unit of text processed by LLM. Roughly 4 characters in English. Top-p (Nucleus Sampling): Sample from smallest set of tokens whose cumulative probability ≥ p. Transformer: Neural network architecture using self-attention. Foundation of modern LLMs. vLLM: High-throughput LLM serving library with PagedAttention. Zero-shot: Performing a task with no examples provided in the prompt. ================================================================================ SECTION 17 — CODE EXAMPLES ================================================================================ 17.1 BASIC API CALL (Python — OpenAI) ──────────────────────────────────────── ```python from openai import OpenAI client = OpenAI(api_key="your-key-here") response = client.chat.completions.create( model="gpt-4o", messages=[ {"role": "system", "content": "You are a helpful AI assistant."}, {"role": "user", "content": "Explain how attention works in transformers."} ], temperature=0.7, max_tokens=1000 ) print(response.choices[0].message.content) ``` 17.2 STREAMING RESPONSE ───────────────────────── ```python from openai import OpenAI client = OpenAI() with client.chat.completions.create( model="gpt-4o", messages=[{"role": "user", "content": "Write a poem about AI"}], stream=True ) as stream: for chunk in stream: if chunk.choices[0].delta.content: print(chunk.choices[0].delta.content, end="", flush=True) ``` 17.3 FUNCTION CALLING ────────────────────── ```python import json from openai import OpenAI client = OpenAI() tools = [ { "type": "function", "function": { "name": "get_weather", "description": "Get current weather for a location", "parameters": { "type": "object", "properties": { "city": {"type": "string", "description": "City name"}, "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]} }, "required": ["city"] } } } ] response = client.chat.completions.create( model="gpt-4o", messages=[{"role": "user", "content": "What's the weather in Thiruvananthapuram?"}], tools=tools, tool_choice="auto" ) message = response.choices[0].message if message.tool_calls: tool_call = message.tool_calls[0] args = json.loads(tool_call.function.arguments) print(f"Tool: {tool_call.function.name}") print(f"Args: {args}") ``` 17.4 RAG IMPLEMENTATION (Simple) ────────────────────────────────── ```python from openai import OpenAI import numpy as np client = OpenAI() def embed(text: str) -> list[float]: return client.embeddings.create( input=text, model="text-embedding-3-small" ).data[0].embedding def cosine_similarity(a, b): a, b = np.array(a), np.array(b) return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)) # Fake document store docs = [ "REGITE is a YouTube channel by Dennis Binoy about tech and AI.", "Large language models use transformer architecture with attention.", "Kerala is a state in southern India known for backwaters and spices.", "The GPU is the primary hardware for training neural networks." ] doc_embeddings = [embed(doc) for doc in docs] def rag_query(question: str, top_k: int = 2) -> str: q_emb = embed(question) scores = [cosine_similarity(q_emb, d_emb) for d_emb in doc_embeddings] top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k] context = "\n".join([docs[i] for i in top_indices]) response = client.chat.completions.create( model="gpt-4o", messages=[ {"role": "system", "content": f"Answer based on context:\n{context}"}, {"role": "user", "content": question} ] ) return response.choices[0].message.content print(rag_query("What is REGITE?")) ``` 17.5 LOCAL INFERENCE WITH OLLAMA ────────────────────────────────── ```python # Install: curl https://ollama.ai/install.sh | sh # Run: ollama pull llama3:8b import requests def chat_local(message: str, model: str = "llama3:8b") -> str: response = requests.post( "http://localhost:11434/api/chat", json={ "model": model, "messages": [{"role": "user", "content": message}], "stream": False } ) return response.json()["message"]["content"] print(chat_local("What are the key differences between GPT-4 and Claude?")) ``` 17.6 ANTHROPIC API EXAMPLE ──────────────────────────── ```python import anthropic client = anthropic.Anthropic(api_key="your-key-here") message = client.messages.create( model="claude-3-5-sonnet-20241022", max_tokens=1024, system="You are a helpful AI expert for the REGITE YouTube channel.", messages=[ { "role": "user", "content": "Explain RAG in simple terms for a YouTube video script." } ] ) print(message.content[0].text) ``` ================================================================================ SECTION 18 — RESEARCH PAPER SUMMARIES ================================================================================ "Attention Is All You Need" (Vaswani et al., 2017) The foundational Transformer paper. Introduced multi-head self-attention, positional encodings, and encoder-decoder architecture. Eliminated RNNs. Enabled massive parallelization during training. Most cited ML paper ever. "Language Models are Few-Shot Learners" (Brown et al., 2020 — GPT-3) GPT-3 demonstrated remarkable in-context learning. 175B parameter model showed that scaling alone enables few-shot performance on diverse tasks. Introduced the "prompt engineering" paradigm. "Training language models to follow instructions" (Ouyang et al., 2022) InstructGPT paper. Showed RLHF dramatically improves helpfulness. 1.3B InstructGPT preferred over 175B GPT-3 by humans. Foundation of ChatGPT. "Constitutional AI" (Bai et al., 2022, Anthropic) Introduces CAI: using AI feedback via a "constitution" of principles for alignment. Reduces reliance on human feedback for harmlessness training. Model critiques and revises own outputs iteratively. "Chain-of-Thought Prompting" (Wei et al., 2022) Few-shot examples with reasoning chains dramatically improve math and commonsense reasoning. "Let's think step by step" zero-shot CoT variant. Emergent ability in large models (not observed in small models). "Scaling Laws for Neural Language Models" (Kaplan et al., 2020) Power-law relationships between compute, data, parameters, and loss. Foundation for the scaling hypothesis. Led to era of "bigger is better." "Training Compute-Optimal Large Language Models" (Hoffmann et al., 2022) Chinchilla paper. Revised Kaplan: models undertrained. Optimal token-to- parameter ratio ~20:1. Led to shift toward longer training with smaller models. "LLaMA: Open and Efficient Foundation Language Models" (Touvron et al., 2023) Meta's open LLM. 7B-65B params, trained longer per Chinchilla. Released weights, democratizing LLM research. Llama 2 enabled commercial use. "Direct Preference Optimization" (Rafailov et al., 2023) Simpler RLHF alternative. Closed-form DPO loss directly optimizes preferences without training a separate reward model. Now widely adopted. "Retrieval-Augmented Generation" (Lewis et al., 2020, Meta) Original RAG paper. Combines dense retrieval (DPR) with seq2seq generation. Shows retrieval consistently improves knowledge-intensive NLP tasks. "FlashAttention" (Dao et al., 2022) Exact attention with O(N) memory instead of O(N²). Tiling approach to keep computation in fast SRAM. Enables longer contexts and faster training. "Mixtral of Experts" (Mistral AI, 2024) Presents Mixtral 8x7B: MoE model with 8 experts, top-2 routing. Only 13B parameters active per token but 47B total. Outperforms Llama 2 70B at fraction of inference cost. ================================================================================ SECTION 19 — COMMUNITY AND RESOURCES ================================================================================ 19.1 KEY ORGANIZATIONS ──────────────────────── Research Labs: • OpenAI (openai.com) — GPT, DALL-E, Whisper, Sora • Anthropic (anthropic.com) — Claude, Constitutional AI • Google DeepMind (deepmind.google) — Gemini, AlphaCode • Meta AI (ai.meta.com) — Llama, SAM, AudioCraft • Mistral AI (mistral.ai) — Mistral, Mixtral, Codestral • Cohere (cohere.com) — Command, Embed, Rerank • AI21 Labs (ai21.com) — Jamba, Jurassic • Stability AI (stability.ai) — Stable Diffusion, StableLM • BigCode (bigcode-project.org) — StarCoder, The Stack • EleutherAI (eleuther.ai) — GPT-NeoX, open research Academic Groups: • Stanford CRFM — HELM benchmark, Alpaca • MIT CSAIL — various alignment and interpretability research • CMU LTI — multilingual, efficiency research • Berkeley BAIR — Koala, RLHF research • UW Allen School — OrcaLM, data curation 19.2 LEARNING RESOURCES ───────────────────────── Books: • "Speech and Language Processing" — Jurafsky & Martin (free online) • "Deep Learning" — Goodfellow, Bengio, Courville (free online) • "Hands-on Large Language Models" — Alammar & Grootendorst • "Build a Large Language Model from Scratch" — Raschka Online Courses: • fast.ai — Practical Deep Learning (free) • DeepLearning.AI — LLM specialization courses • Andrej Karpathy — YouTube (Let's build GPT, neural networks) • Stanford CS224N — NLP with Deep Learning (free lectures) • Hugging Face — Free LLM course YouTube Channels to Watch: • Andrej Karpathy — Deep technical content • Yannic Kilcher — Paper explanations • AI Explained — Accessible current developments • Two Minute Papers — Recent paper summaries • REGITE (Dennis Binoy) — Tech and AI content 🔥 Newsletters: • The Batch (deeplearning.ai) • Import AI (Jack Clark) • The Gradient • Ahead of AI (Sebastian Raschka) • Nathan.ai 19.3 TOOLS AND PLATFORMS ────────────────────────── Hugging Face (huggingface.co): • Model Hub: 500,000+ models • Datasets Hub: 80,000+ datasets • Spaces: demo hosting • Transformers library: universal LLM interface • Inference Endpoints: model hosting LangChain (langchain.com): • Agent and chain framework • 500+ integrations • LangSmith for tracing/debugging LlamaIndex (llamaindex.ai): • Data ingestion and RAG framework • 160+ data connectors • Multi-modal support Weights & Biases (wandb.ai): • ML experiment tracking • Visualizations and comparisons • Artifact management Lightning AI (lightning.ai): • Training infrastructure • Studios: cloud development environments • Fabric for distributed training Unsloth (unsloth.ai): • 2x faster fine-tuning • 80% less memory • QLoRA optimized ================================================================================ SECTION 20 — APPENDIX: BENCHMARK DATA TABLE ================================================================================ 20.1 MMLU SCORES BY MODEL (5-shot, as of 2024-2025) ───────────────────────────────────────────────────── Model | MMLU | HumanEval | GSM8K | MATH ─────────────────────────|───────|───────────|────────|────── GPT-4o | 88.7% | 90.2% | 97.0% | 76.6% Claude 3.5 Sonnet | 88.7% | 92.0% | 96.4% | 78.3% Gemini 1.5 Pro | 85.9% | 84.1% | 91.7% | 67.7% Llama 3.1 405B | 88.6% | 89.0% | 96.8% | 73.8% Mistral Large 2 | 84.0% | 92.1% | 93.0% | 73.0% Qwen 2.5 72B | 86.0% | 86.5% | 95.1% | 83.1% DeepSeek-V2.5 | 80.0% | 89.0% | 93.0% | 75.7% Llama 3 70B | 82.0% | 81.1% | 93.0% | 50.4% Claude 3 Haiku | 75.2% | 75.9% | 88.9% | 38.9% Phi-3 Medium 14B | 78.0% | 75.0% | 91.0% | 53.6% Gemma 2 27B | 75.2% | 74.4% | 90.8% | 55.1% 20.2 CONTEXT LENGTH COMPARISON ──────────────────────────────── Model | Context | Notes ─────────────────────────|────────────|────────────────────────── Gemini 1.5 Pro | 2,000,000 | 2M context (preview) Gemini 1.5 Flash | 1,000,000 | 1M tokens Claude 3.5 Sonnet | 200,000 | 200K consistent Claude 3 Opus | 200,000 | 200K GPT-4o | 128,000 | 128K Llama 3.1 (all) | 128,000 | 128K Mistral NeMo | 128,000 | 128K Phi-3 Medium | 128,000 | 128K DeepSeek-V2 | 128,000 | 128K Qwen 2.5 72B | 131,072 | 128K+ 20.3 PRICING EFFICIENCY (QUALITY / COST) ────────────────────────────────────────── Tier 1 — Premium (best quality, higher cost): GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro Tier 2 — Balanced (excellent quality, moderate cost): Claude 3 Sonnet, Gemini 1.5 Flash, GPT-4o mini Llama 3.1 70B (self-hosted: $0.20-0.90/M) Tier 3 — Efficient (good quality, low cost): Claude 3 Haiku, GPT-3.5-turbo Gemini 1.5 Flash, Mistral Small Tier 4 — Open/Local (free if self-hosted): Llama 3 8B, Phi-3 Mini, Gemma 2 2B Best for privacy, customization, offline use 20.4 FINE-TUNING SUPPORT ────────────────────────── Model | Fine-tuning | Method | Notes ─────────────────────────|─────────────|──────────────|─────────────────── GPT-3.5-turbo | ✓ | API-based | Managed by OpenAI GPT-4 mini | ✓ | API-based | Limited access Claude (any) | ✗ | Not available| Anthropic policy Gemini 1.5 Flash | ✓ | API-based | Vertex AI Llama 3 (any) | ✓ | Self-hosted | Full weights access Mistral (any) | ✓ | Self-hosted | Full weights access Phi-3 | ✓ | Self-hosted | Very efficient ================================================================================ END OF DOCUMENT ================================================================================ REGITE — YouTube Channel by Dennis Binoy "Explore Tech. Understand AI." Subscribe: youtube.com/@regite ================================================================================ [DOCUMENT STATISTICS] Total Lines: ~2050+ Total Characters: ~95,000+ Total Words: ~14,500+ Sections: 20 Code Examples: 6 Tables: 8 Generated for: Website testing purposes ================================================================================