The Referently Glossary of AI Terms: Definitions for the Current Era
A working reference for the vocabulary of modern AI — organized by conceptual layer, not alphabetically. Each definition is written for practitioners and informed generalists: precise enough to cite, plain enough to share.
Foundation Layer
Large Language Model (LLM)
A neural network trained on vast quantities of text to predict and generate language. LLMs learn statistical patterns across billions of documents, enabling them to answer questions, write code, summarize text, and engage in dialogue. The term “large” refers to parameter count, which now ranges from billions to trillions.
Token
The atomic unit of text that a language model processes. Tokens are not necessarily words: they are often subword fragments determined by a tokenizer. The word “tokenization” might be split into two or three tokens, while a single Chinese character is often one token but can take more, depending on the tokenizer. Token limits define how much text a model can process in a single request.
Context Window
The total number of tokens a model can hold in active memory during a single interaction — both the input provided and the output generated. A model with a 128,000-token context window can process roughly 100,000 words at once. Content outside the context window is invisible to the model.
Inference
The process of running a trained model to generate output. Inference is distinct from training: training builds the model, inference uses it. Most AI costs at scale are inference costs. Inference speed is measured in tokens per second.
Training
The computational process by which a model learns from data. During training, a neural network adjusts billions of numerical parameters to minimize prediction error across a large dataset. Training a frontier model requires months of compute time and costs tens to hundreds of millions of dollars.
Parameter
A numerical weight inside a neural network that is adjusted during training. Parameters encode what the model has learned. A model described as having “70 billion parameters” contains 70 billion such values. More parameters generally means more capacity, though diminishing returns apply at scale.
Pre-training
The initial, large-scale training phase in which a model learns general language patterns from a massive corpus. Pre-training produces a base model with broad capabilities but no task-specific optimization. It is the most expensive phase of model development.
Foundation Model
A large model trained on broad data at scale, intended to be adapted for a wide range of downstream tasks. The term was coined at Stanford in 2021. GPT-4, Claude, Gemini, and Llama are all foundation models. They serve as the base layer on which more specialized applications are built.
Corpus
The dataset used to train or fine-tune a model. Pre-training corpora for frontier models include web crawls, books, code repositories, and curated datasets — often totaling trillions of tokens. The composition of a corpus substantially determines a model’s capabilities and blind spots.
Benchmark
A standardized test used to measure model performance on specific tasks. Common benchmarks include MMLU (academic knowledge), HumanEval (code generation), and HellaSwag (commonsense reasoning). Benchmark scores are widely used in model comparisons but can be gamed through selective training on benchmark-adjacent data.
Architecture
Transformer
The neural network architecture that underlies virtually all modern LLMs. Introduced by Google researchers in the 2017 paper “Attention Is All You Need,” the transformer replaced recurrent architectures with a parallel attention mechanism, enabling far more efficient training on long sequences. The architecture has remained dominant for nearly a decade.
Attention Mechanism
The core mechanism of the transformer. Attention predates the transformer, but the transformer was the first widely adopted architecture to rely on it entirely. Attention allows a model to weigh the relevance of every token in a sequence against every other token in parallel, rather than processing tokens sequentially. This enables the model to capture long-range dependencies in text, for example understanding that a pronoun refers to a subject mentioned paragraphs earlier.
Self-Attention
A variant of the attention mechanism in which a sequence attends to itself — each token weighs its relationship to every other token in the same sequence. Self-attention is the mechanism by which transformers build internal representations of meaning and structure.
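As a rough illustration of the definition above, the following sketch computes scaled dot-product self-attention over a three-token sequence in plain Python. The function names and the toy vectors are assumptions made for this example; real transformers first pass each token through learned query, key, and value projections, which are omitted here.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends to every key."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this token's query to every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


# Toy token vectors (hypothetical). The same vectors serve as queries, keys,
# and values because the learned projections are left out of this sketch.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens, tokens, tokens)
```

Each row of `attended` is a blend of all three token vectors, weighted by how strongly that token attends to the others.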
Embedding
A numerical vector representing a word, token, sentence, or document in a high-dimensional space. Embeddings encode semantic meaning: similar concepts cluster near each other geometrically. They are the interface between discrete text and the continuous mathematics of neural networks.
Vector
A list of numbers representing a point in multidimensional space. In AI contexts, vectors encode the meaning of text (embeddings), store retrieved knowledge (in vector databases), and represent internal model states. The distance between vectors measures semantic similarity.
Neural Network
A computational system loosely inspired by biological neurons. Neural networks consist of layers of interconnected nodes that transform input data through learned weights. Deep neural networks — with many layers — are the foundation of modern AI.
Weights
The numerical values in a neural network that determine how inputs are transformed at each layer. Weights are learned during training and constitute the model’s “knowledge.” A model’s weights are what is stored, distributed, and deployed.
Fine-tuning
A training process that adapts a pre-trained model to a specific task or style by training it further on a smaller, curated dataset. Fine-tuning is cheaper than pre-training and produces models with specialized capabilities. Instruction fine-tuning teaches models to follow user directions; RLHF fine-tunes for helpfulness and safety.
RLHF (Reinforcement Learning from Human Feedback)
A training technique that uses human preference judgments to teach a model to produce outputs humans rate as better. Human raters compare pairs of model outputs; a reward model is trained on their preferences; the language model, usually one that has already been instruction-tuned, is then optimized to maximize reward. RLHF is one of the primary methods for aligning LLMs to human intentions.
Quantization
A compression technique that reduces the precision of model weights (from 32-bit floats to 8-bit integers, for example) to decrease memory requirements and inference cost. Quantized models typically run faster and cheaper with modest accuracy loss. Quantization is critical for deploying large models on consumer hardware.
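The idea can be sketched with symmetric 8-bit quantization of a small weight list. This is one simple scheme among several, chosen here for illustration; the function names and sample weights are assumptions, and production systems usually quantize per channel or per block rather than per tensor.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 0.9, -0.07]  # hypothetical float weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Rounding error per weight is at most half of one quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Only the integers and one scale factor need to be stored, which is where the memory saving comes from.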
Mixture of Experts (MoE)
An architecture in which a model is divided into multiple specialized sub-networks (“experts”), with a routing mechanism that activates only a subset for any given input. MoE models can achieve high parameter counts while keeping inference costs manageable, since most parameters are idle at any moment.
Multimodal Model
A model capable of processing and generating multiple types of data — text, images, audio, video, or code. Multimodal models can answer questions about images, describe videos, or generate images from text descriptions. GPT-4o and Gemini Ultra are examples.
Tokenizer
The component that converts raw text into tokens before it enters the model. Different models use different tokenizers, which is why the same text may consume different numbers of tokens across systems. Tokenizer design affects multilingual performance, code handling, and cost.
Deployment and Operations
Hallucination
The generation of plausible-sounding but factually incorrect information by a language model. Hallucinations arise because models optimize for linguistic plausibility, not factual accuracy. They are not bugs in the conventional sense — they are a structural property of generative models. Mitigation strategies include grounding, retrieval augmentation, and factuality fine-tuning.
Grounding
The practice of anchoring a model’s responses to specific, verifiable source material — retrieved documents, database records, or real-time data — rather than relying on parametric knowledge alone. Grounding reduces hallucination and increases the verifiability of outputs.
RAG (Retrieval-Augmented Generation)
An architecture that combines a retrieval system with a generative model. When a query arrives, relevant documents are fetched from an external knowledge base and inserted into the model’s context alongside the query. The model generates a response grounded in the retrieved material. RAG is the dominant approach for building knowledge-intensive AI applications.
Latency
The time between sending a request to a model and receiving a response. In streaming applications it is usually reported as time to first token (TTFT), the delay before the first token of the response arrives. Latency is a critical performance dimension in user-facing applications. It is distinct from throughput and is influenced by model size, hardware, batching strategy, and network distance.
Throughput
The number of tokens a system can generate per unit of time across all active requests. High throughput matters for batch processing and high-volume APIs. There is an inherent tension between latency and throughput: batching requests improves throughput but increases individual response latency.
Prompt
The input text provided to a language model to elicit a response. A prompt can range from a single question to a multi-thousand-word structured instruction. The design of prompts significantly affects the quality, format, and accuracy of model outputs.
System Prompt
A privileged instruction set provided to a model before the user’s input, typically invisible to the end user. System prompts define the model’s persona, behavioral constraints, output format, and task context. They are the primary mechanism by which application developers customize model behavior.
Temperature
A parameter that controls the randomness of model outputs. At temperature 0, the model selects the most probable next token at every step (greedy decoding), producing conservative and, in principle, repeatable responses. Higher temperatures flatten the probability distribution, yielding more varied and creative outputs. Many APIs accept values from 0 to 2, though the allowed range varies by provider.
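A minimal sketch of how temperature reshapes a next-token distribution, assuming the common formulation in which logits are divided by the temperature before the softmax. The function names and logit values are illustrative, not taken from any particular API.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize into probabilities."""
    if temperature <= 0:
        raise ValueError("temperature 0 is handled as greedy decoding instead")
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_choice(logits):
    """Temperature 0 in practice: pick the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.5)  # concentrates mass on the top token
flat = softmax_with_temperature(logits, 2.0)   # spreads mass more evenly
```

Comparing `sharp[0]` with `flat[0]` shows the effect: the lower temperature gives the leading token a larger share of the probability.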
Top-p (Nucleus Sampling)
A sampling strategy that restricts token selection to the smallest set of tokens whose cumulative probability exceeds a threshold p. At p=0.9, the model samples only from tokens that together account for 90% of the probability mass. Top-p sampling balances coherence and diversity better than temperature alone.
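The selection step described above might look like the following sketch. The function name and the probabilities are hypothetical, and this version stops once the cumulative probability reaches p; implementations differ slightly on whether the boundary is inclusive.

```python
def nucleus(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p.

    Returns a dict mapping token index to its renormalized probability.
    """
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, prob in ranked:
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {index: prob / total for index, prob in kept}


probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical next-token probabilities
candidates = nucleus(probs, 0.9)  # the least likely token is excluded
```

Sampling then proceeds over `candidates` only, so low-probability tail tokens can never be chosen.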
Inference Cost
The compute expense incurred each time a model generates output. Inference cost is denominated in tokens (input and output) and scales with model size, context length, and request volume. For most production AI applications, inference dominates total AI spend.
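As a worked example of token-denominated pricing, the sketch below estimates the cost of one request and of a monthly workload. The prices and volumes are invented for illustration and do not reflect any vendor's actual rates.

```python
def request_cost(input_tokens, output_tokens, input_price_per_million, output_price_per_million):
    """Cost in dollars of one request, with separate input and output prices."""
    return (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000


# Hypothetical prices in dollars per million tokens.
cost = request_cost(2_000, 500, input_price_per_million=3.0, output_price_per_million=15.0)
monthly = cost * 100_000  # assumed volume of 100,000 similar requests per month
```

With these assumed numbers a single request costs a little over one cent, and the monthly total is the per-request cost multiplied by volume.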
API (Application Programming Interface)
In AI contexts, the programmatic interface through which external applications access a model’s capabilities. AI APIs accept structured requests (prompts, parameters, attachments) and return model outputs. They are the commercial delivery mechanism for frontier model capabilities.
Rate Limit
A restriction on the number of API requests or tokens that can be processed within a given time window. Rate limits protect infrastructure stability and enforce usage tiers. Exceeding them returns throttling errors, typically HTTP 429 (Too Many Requests), which clients are expected to handle by waiting and retrying.
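One common client-side response to throttling is to retry with exponential backoff and jitter. The sketch below assumes a hypothetical `RateLimitError` exception and a caller-supplied `send` function; real client libraries expose their own error types and often ship built-in retry logic.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical error a client library might raise on a throttled request."""


def call_with_backoff(send, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call send(), retrying with exponential backoff plus jitter when throttled."""
    for attempt in range(max_retries + 1):
        try:
            return send()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retry budget exhausted, surface the error
            # Wait 1x, 2x, 4x ... the base delay, plus random jitter.
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))


# Demo with a fake request that is throttled twice before succeeding.
attempts = {"count": 0}


def fake_send():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise RateLimitError("throttled")
    return "ok"


delays = []
# Record the delays instead of actually sleeping, to keep the demo instant.
result = call_with_backoff(fake_send, sleep=delays.append)
```

The jitter spreads retries from many clients over time so they do not all hit the API again at the same instant.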
Batch Processing
The execution of many inference requests simultaneously rather than sequentially. Batch processing improves hardware utilization and reduces per-token cost, at the expense of increased latency for individual requests. It is critical for large-scale document processing workloads.
Streaming
The delivery of model output token-by-token as it is generated, rather than waiting for the complete response. Streaming dramatically reduces perceived latency in user-facing interfaces and is now the default delivery mode for conversational AI.
Guardrails
System-level controls that filter, constrain, or redirect model outputs to prevent harmful, off-topic, or policy-violating responses. Guardrails can be implemented as fine-tuning, output classifiers, or prompt injection defenses. They are the operational layer of AI safety.
Agent Layer
AI Agent
A system in which a language model is given the ability to take actions — executing code, browsing the web, calling APIs, reading and writing files — in pursuit of a goal, often across multiple steps. Agents differ from chatbots in that they operate autonomously in an environment, not just in a conversation.
Agentic Loop
The repeated cycle of observation, reasoning, and action that characterizes agent behavior. In each iteration, the agent perceives its current state, generates a plan or decision, executes an action, and observes the result. The loop continues until a stopping condition is met.
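The cycle can be sketched as a short loop. Everything here is a stand-in: `decide` plays the role of the model call that reasons over the current observation, `act` plays the role of the environment or tool, and the counter task is a toy chosen so the example runs without any external service.

```python
def run_agent(decide, act, initial_observation, max_steps=10):
    """Observe, decide, act, and repeat until decide() returns None or steps run out."""
    observation = initial_observation
    history = []
    for _ in range(max_steps):
        action = decide(observation, history)  # reasoning step
        if action is None:                     # stopping condition met
            break
        observation = act(action)              # action, then new observation
        history.append((action, observation))
    return observation, history


# Toy environment: the goal is to raise a counter to 3.
state = {"counter": 0}


def decide(observation, history):
    return "increment" if observation < 3 else None


def act(action):
    state["counter"] += 1
    return state["counter"]


final_observation, history = run_agent(decide, act, initial_observation=0)
```

The `max_steps` cap is a second stopping condition, which guards against an agent that never decides it is finished.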
Tool Use
The capability of a model to invoke external functions, APIs, or services during inference. Tool use allows models to retrieve real-time data, perform calculations, execute code, and interact with external systems. It is the primary mechanism for extending LLMs beyond pure language generation.
Function Calling
A specific implementation of tool use in which a model outputs structured function invocations — name and arguments — that are then executed by the surrounding application. Function calling is the standard interface for integrating LLMs with external code.
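On the application side, handling a function call amounts to parsing the model's structured output and dispatching to real code. The sketch below assumes a simple JSON shape with `name` and `arguments` keys and a hypothetical `get_weather` tool; actual provider formats differ in their details.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool; a real one would call a weather service."""
    return f"Weather lookup for {city} is not implemented in this sketch."


# Registry of functions the application is willing to execute.
TOOLS = {"get_weather": get_weather}


def dispatch(model_output: str):
    """Parse a structured function call and run the matching tool."""
    call = json.loads(model_output)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**call["arguments"])


# Example of what a model might emit (assumed format).
model_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
result = dispatch(model_output)
```

The model never executes anything itself: the application decides which names are allowed and runs the function, then typically feeds the result back into the conversation.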
MCP (Model Context Protocol)
An open protocol developed by Anthropic that standardizes how AI applications connect models to external tools, data sources, and services. MCP defines a common client-server interface so that any MCP-compatible server can be used by any MCP-compatible client application, reducing the need for custom one-off integrations. It has been adopted quickly across AI tool ecosystems.
Orchestration
The coordination of multiple AI models, agents, or tools working in sequence or in parallel to complete a complex task. An orchestration layer routes inputs, manages state, handles errors, and assembles final outputs. LangChain, LlamaIndex, and custom pipelines are common orchestration approaches.
Memory
The mechanisms by which an agent or model retains information across turns or sessions. Memory can be implemented as in-context storage (within the active context window), external storage (retrieved via RAG), or parametric storage (encoded in model weights through fine-tuning). Effective memory architecture is a central challenge in agentic system design.
Planning
The capacity of an agent to decompose a high-level goal into a sequence of actionable steps before executing them. Planning distinguishes sophisticated agents from reactive systems. Techniques include chain-of-thought reasoning, tree-of-thought search, and hierarchical task decomposition.
ReAct
A prompting pattern — Reasoning + Acting — in which a model interleaves explicit reasoning steps (“Thought:”) with action outputs (“Action:”) and environment observations (“Observation:”). ReAct improves agent reliability by making the reasoning chain explicit and auditable.
Multi-Agent System
An architecture in which multiple AI agents operate in coordination, each with distinct roles or specializations. One agent may act as an orchestrator, delegating subtasks to specialized workers. Multi-agent systems enable parallelism and modular task decomposition.
Autonomous Agent
An agent designed to complete tasks with minimal human intervention across extended task horizons. True autonomy requires robust planning, error recovery, and goal interpretation. Current systems operate on a spectrum from human-in-the-loop to near-fully-autonomous.
Human-in-the-Loop
A design pattern in which a human approves, corrects, or redirects an agent at defined checkpoints rather than allowing fully autonomous operation. Human-in-the-loop architectures trade efficiency for oversight and are standard practice for high-stakes agentic workflows.
Tool Calling
The act of an AI model invoking an available tool during inference. Distinct from the capability definition (tool use), tool calling refers to the runtime event of a specific invocation. Logging tool calls is essential for auditing agent behavior.
Context Management
The strategies used to keep relevant information within a model’s active context window as tasks grow longer than the window allows. Techniques include summarization, selective retrieval, and memory offloading. Poor context management leads to agents “forgetting” earlier task state.
Retrieval and Knowledge
Vector Database
A database optimized for storing and querying vector embeddings. Vector databases support similarity search — finding the stored vectors closest to a query vector — at scale. They are the storage backbone of RAG systems. Pinecone, Weaviate, Qdrant, and pgvector are leading implementations.
Semantic Search
Search that retrieves results based on meaning rather than keyword matching. A semantic search for “heart attack” will surface results about “myocardial infarction” even if the words “heart attack” never appear in those documents. Semantic search is powered by embedding similarity.
Similarity Search
The operation of finding items in a dataset whose embeddings are closest to a query embedding, typically measured by cosine similarity or Euclidean distance. It is the core query operation of vector databases.
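A brute-force version of this operation fits in a few lines. The three-dimensional vectors and item names below are made up for illustration; real embeddings have hundreds or thousands of dimensions, and vector databases use approximate indexes rather than scanning every item.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, items, k=2):
    """Return the keys of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vector), key) for key, vector in items.items()]
    scored.sort(reverse=True)
    return [key for _, key in scored[:k]]


# Hypothetical toy embeddings.
store = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}
nearest = top_k([1.0, 0.0, 0.0], store, k=2)
```

Here the query vector points in nearly the same direction as "cat" and "kitten", so those two are returned ahead of "invoice".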
Chunking
The process of splitting long documents into smaller segments before embedding and storing them in a retrieval system. Chunk size affects retrieval quality: too large and chunks contain irrelevant noise; too small and they lack context. Chunking strategy is a significant engineering variable in RAG system quality.
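One simple strategy is fixed-size chunks with overlap, so that text near a boundary appears in both neighboring chunks. The sketch below splits on whitespace and counts words; the function name and default sizes are assumptions, and production systems often count tokens or split on sentence and section boundaries instead.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into chunks of chunk_size words, each sharing overlap words with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


# Tiny demonstration with ten placeholder words.
document = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(document, chunk_size=4, overlap=1)
```

With these settings each chunk repeats the final word of the previous one, which is the overlap doing its job.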
Reranking
A second-stage retrieval step in which an initial set of retrieved documents is reordered by a more sophisticated relevance model before being passed to the LLM. Reranking improves the precision of what enters the context window.
Knowledge Graph
A structured representation of entities and their relationships, stored as nodes and edges. Knowledge graphs provide precise, structured knowledge that LLMs can query or use as grounding. They complement unstructured retrieval systems in enterprise AI architectures.
Hybrid Search
A retrieval approach that combines dense vector search (semantic similarity) with sparse keyword search (BM25 or TF-IDF). Hybrid search often outperforms either method alone on real-world retrieval tasks because it captures both semantic and lexical relevance signals.
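The definition above does not say how the two result lists are merged. One widely used option is reciprocal rank fusion (RRF), sketched here as an assumption rather than as the only approach; the document IDs are hypothetical, and k=60 is a commonly used default for the smoothing constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each document scores 1 / (k + rank) per list it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from the two retrievers, best match first.
dense_results = ["doc_b", "doc_a", "doc_c"]   # vector similarity
sparse_results = ["doc_a", "doc_d", "doc_b"]  # keyword match
fused = reciprocal_rank_fusion([dense_results, sparse_results])
```

Documents that rank well in both lists rise to the top, while documents found by only one retriever still make the merged list.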
Prompting and Interaction Design
Prompt Engineering
The practice of designing and optimizing input text to elicit desired outputs from a language model. Prompt engineering encompasses instruction phrasing, few-shot example selection, output format specification, and persona assignment. It is a significant determinant of model performance on specific tasks.
Few-Shot Prompting
A prompting technique that includes several examples of the desired input-output pattern before the actual query. Few-shot examples calibrate the model’s format, tone, and reasoning style without requiring fine-tuning. The number of examples is typically 2–10.
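A few-shot prompt is ultimately just assembled text. The sketch below uses an assumed "Input:/Output:" layout and a made-up sentiment task; the exact labels and layout are a design choice, not a standard.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and the new query into one prompt."""
    parts = [instruction, ""]
    for source, target in examples:
        parts += [f"Input: {source}", f"Output: {target}", ""]
    # End with an open "Output:" so the model continues the pattern.
    parts += [f"Input: {query}", "Output:"]
    return "\n".join(parts)


# Hypothetical examples for a sentiment-labeling task.
examples = [
    ("The battery lasts all day.", "positive"),
    ("The screen cracked within a week.", "negative"),
]
prompt = build_few_shot_prompt(
    "Label the sentiment of each review as positive or negative.",
    examples,
    "Setup was quick and painless.",
)
```

Keeping the examples in the same format as the final query is what lets the model infer the expected output shape.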
Zero-Shot Prompting
Prompting a model to perform a task without providing any examples, relying entirely on the model’s pre-trained capabilities. Zero-shot performance is a measure of a model’s generalization ability.
Chain-of-Thought (CoT)
A prompting technique that instructs or encourages the model to reason step-by-step before producing a final answer. Chain-of-thought prompting significantly improves performance on multi-step reasoning, math, and logic tasks by externalizing the reasoning process.
Tree of Thought
An extension of chain-of-thought in which the model explores multiple reasoning branches simultaneously, evaluates them, and selects the most promising path. Tree-of-thought approaches improve performance on problems that require search or backtracking.
In-Context Learning
The ability of a model to perform new tasks based solely on examples or instructions provided in the prompt, without any weight updates. In-context learning is one of the most distinctive and surprising properties of large language models.
Instruction Following
A model’s ability to execute complex, multi-part directions accurately. Instruction following is a trained capability, not an inherent property of base models. Instruction fine-tuning and RLHF are the primary techniques for developing it.
Persona
A defined character, role, or identity assigned to a model via the system prompt. Personas shape tone, vocabulary, and response style. They are widely used in customer-facing AI deployments to maintain brand consistency.
Output Format
The structural specification of how a model should present its response — JSON, markdown, bullet points, prose paragraphs, numbered steps, etc. Explicit format instructions in prompts or system prompts significantly improve output consistency.
Prompt Injection
An attack in which malicious instructions are embedded in content the model is asked to process — a document, a webpage, an email — intending to override the original system prompt. Prompt injection is a primary security concern in agentic systems that process untrusted content.
Governance and Risk
Alignment
The challenge of ensuring that AI systems behave in accordance with human values, intentions, and interests — especially as systems become more capable. Alignment research addresses both technical problems (how to specify and train for desired behavior) and philosophical ones (whose values, and how defined).
Safety
In AI, the property of a system that minimizes harmful outputs and behaviors. Safety encompasses refusal of dangerous requests, resistance to adversarial manipulation, robustness to distribution shift, and long-term alignment. Safety research and capabilities research are both active at frontier labs.
Bias
Systematic skew in model outputs reflecting imbalances in training data or optimization objectives. Bias can manifest as demographic disparities in generated content, differential performance across languages, or the encoding of historical prejudices. Detecting and mitigating bias is a central concern in responsible AI deployment.
Fairness
The property of a model treating comparable inputs comparably across protected attributes such as demographic groups, languages, and dialects. Fairness is formally defined in multiple ways (demographic parity, equalized odds, individual fairness) that generally cannot all be satisfied at once except in special cases, creating genuine trade-off decisions.
Transparency
The property of AI systems being understandable and legible — in their design, training data, decision processes, and limitations. Transparency is a prerequisite for accountability. It is distinct from explainability, which refers to post-hoc interpretation of model outputs.
Explainability
The degree to which a model’s reasoning or outputs can be explained in human-understandable terms. Neural networks are inherently opaque; explainability techniques (attention visualization, SHAP values, feature attribution) attempt to reconstruct after-the-fact rationales. True mechanistic explainability for frontier models remains an open research problem.
Red-Teaming
The practice of systematically attempting to elicit harmful, dangerous, or policy-violating outputs from a model, in order to identify and fix vulnerabilities before deployment. Red-teaming borrows from military and cybersecurity traditions and is standard practice at frontier AI labs.
Jailbreak
An adversarial technique that attempts to bypass a model’s safety measures — typically through crafted prompts, role-playing scenarios, or encoded instructions — to elicit outputs the model would otherwise refuse. Jailbreaks are an ongoing arms race between adversarial users and safety researchers.
Watermarking
A technique for embedding a detectable but imperceptible signal in AI-generated content, enabling later identification of AI authorship. Watermarking can be applied to text (statistical patterns in token sampling), images (pixel-level perturbations), or audio. It is a contested approach to AI content provenance.
CSAM / Harmful Content Filtering
Technical and policy mechanisms that prevent models from generating illegal or severely harmful content. These include training-time filtering, output classifiers, and hard-coded refusals. Filtering robustness is a core safety metric for any publicly deployed model.
Model Card
A standardized documentation format for AI models that describes intended use cases, training data, evaluation results, known limitations, and ethical considerations. Model cards are a transparency practice introduced by Google and now widely adopted.
Responsible AI
An umbrella framework covering the ethical design, development, deployment, and governance of AI systems. Responsible AI encompasses fairness, transparency, accountability, privacy, safety, and inclusivity — often formalized as organizational policy or regulatory compliance frameworks.
EU AI Act
The European Union’s comprehensive AI regulation, which classifies AI applications by risk level and imposes corresponding requirements. High-risk applications (hiring, credit scoring, medical devices) face strict conformity assessments. Providers of general-purpose AI models, the Act’s term covering foundation models, face transparency obligations, with additional safety obligations for models deemed to pose systemic risk.
AI Governance
The policies, standards, processes, and institutions that guide AI development and deployment — at the organizational, national, and international levels. Governance encompasses technical standards, regulatory frameworks, voluntary commitments, and international coordination mechanisms.
Infrastructure and Compute
GPU (Graphics Processing Unit)
The primary hardware for training and running AI models. GPUs perform the parallel matrix operations at the core of neural network computation far more efficiently than CPUs. NVIDIA’s H100 and B200 are the dominant training accelerators. GPU availability is the primary constraint on AI development capacity.
TPU (Tensor Processing Unit)
Google’s custom AI accelerator chip, designed specifically for tensor operations. TPUs power Google’s internal model training and are available through Google Cloud. They offer competitive performance with NVIDIA GPUs for certain workloads.
FLOP (Floating Point Operation)
The standard unit of computational work in neural network training. Model training is quantified in FLOPs (or petaFLOPs, exaFLOPs); hardware is rated by FLOP/s capacity. FLOPs-based scaling analysis underpins empirical scaling laws.
Scaling Laws
Empirical relationships describing how model performance improves with increases in model size, training data, and compute. The Chinchilla scaling laws, published by DeepMind in 2022, showed that many frontier models were undertrained relative to their size and defined optimal token-to-parameter ratios.
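Two rules of thumb make these relationships concrete. Both are approximations assumed for this sketch: training compute is often estimated as about 6 FLOPs per parameter per training token, and the Chinchilla results are commonly summarized as roughly 20 training tokens per parameter.

```python
def training_flops(params, tokens):
    """Approximate training compute: about 6 FLOPs per parameter per token."""
    return 6 * params * tokens


def compute_optimal_tokens(params, tokens_per_param=20):
    """Rule-of-thumb token budget, assuming roughly 20 tokens per parameter."""
    return params * tokens_per_param


params = 70e9                             # a 70-billion-parameter model
tokens = compute_optimal_tokens(params)   # about 1.4 trillion tokens
flops = training_flops(params, tokens)    # about 5.9e23 FLOPs
```

The same two functions can be run in reverse for budgeting: given a fixed compute budget, they suggest how to split it between model size and data.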
Data Center
The physical infrastructure housing the compute, storage, and networking required for AI training and inference. Modern AI data centers are defined by their GPU cluster density, power capacity (now in gigawatts), and cooling architecture. The buildout of AI data centers is a major capital allocation trend globally.
Edge Inference
Running AI models on local devices — phones, laptops, embedded systems — rather than on remote servers. Edge inference reduces latency, eliminates API costs, and preserves data privacy, at the cost of using smaller models. Apple Intelligence and on-device LLMs are leading deployments.
Cloud AI
AI model capabilities delivered over the internet via APIs and cloud services, without requiring local hardware. AWS Bedrock, Google Vertex AI, Azure AI, and direct API access to frontier labs are the primary cloud AI delivery mechanisms.
VRAM (Video RAM)
The memory on a GPU, distinct from system RAM. VRAM is the primary constraint on what model sizes can be loaded for inference on consumer hardware. A 70-billion-parameter model in 16-bit precision requires approximately 140 GB of VRAM for the weights alone (2 bytes per parameter), before counting activations and the key-value cache, which far exceeds the capacity of a single consumer GPU.
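The arithmetic behind the 70-billion-parameter figure is a one-line calculation. The helper below is an illustrative sketch that counts weights only and uses decimal gigabytes; it ignores activations, the key-value cache, and framework overhead.

```python
def weight_memory_gb(params, bits_per_param):
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(70e9, 16)  # 16-bit weights
int8_gb = weight_memory_gb(70e9, 8)   # 8-bit quantized weights
int4_gb = weight_memory_gb(70e9, 4)   # 4-bit quantized weights
```

Halving the bits per parameter halves the weight footprint, which is why quantization matters so much for running models on consumer GPUs.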
Model Serving
The infrastructure for deploying a trained model to handle live inference requests at scale. Model serving involves load balancing, batching, caching, autoscaling, and hardware management. It is a distinct engineering discipline from model training.
Distillation
A technique for training a smaller “student” model to mimic the behavior of a larger “teacher” model. Distillation transfers capability from an expensive large model into a cheaper small one. DeepSeek, for example, released smaller Qwen- and Llama-based models distilled from DeepSeek-R1, and many open-weight models are trained in part on outputs from larger models.
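In its classic form, "mimicking" means matching the teacher's softened output distribution. The sketch below shows that soft-target loss as a KL divergence scaled by the squared temperature. It is one formulation among several; many recent LLM distillations instead fine-tune the student on text generated by the teacher. The logits are invented for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from softened teacher to softened student, scaled by T squared."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(teacher, student))
    return kl * temperature ** 2


teacher_logits = [4.0, 1.0, 0.5]  # hypothetical teacher scores
loss_far = distillation_loss([0.5, 1.0, 4.0], teacher_logits)   # student disagrees
loss_same = distillation_loss([4.0, 1.0, 0.5], teacher_logits)  # student matches
```

Training the student to minimize this loss pulls its output distribution toward the teacher's, including the relative probabilities the teacher assigns to wrong answers.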
Open Source and Ecosystem
Open-Weight Model
A model whose trained weights are publicly released, allowing anyone to download, run, fine-tune, and modify it. Open-weight models include Meta’s Llama series, Mistral, and Falcon. “Open-weight” is distinct from “open-source” — the training data and code may not be public.
Open Source AI
AI development in which model weights, training code, and training data are all publicly available. True open-source AI in this strict sense remains rare among frontier models. The Open Source Initiative has published a formal definition.
Hugging Face
The dominant platform for sharing, discovering, and running open-weight AI models and datasets. Hugging Face hosts hundreds of thousands of models and has become the de facto package registry of the open AI ecosystem.
Llama
Meta’s family of open-weight large language models, widely regarded as among the most influential open-weight releases to date. Llama models have been fine-tuned into hundreds of derivatives and established that competitive LLM performance could be achieved at accessible parameter counts.
Mistral
A French AI company and its family of efficient open-weight models, notable for high performance at relatively small sizes. Mistral’s releases demonstrated that architectural efficiency could partly substitute for raw scale.
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning technique that trains small adapter matrices added to frozen model weights, rather than updating all parameters. LoRA reduces the number of trainable parameters by orders of magnitude and substantially lowers fine-tuning memory requirements, making customization accessible without full training infrastructure.
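The mechanics can be sketched with tiny matrices. The frozen weight W is left untouched, and the update is the product of two small matrices B and A scaled by alpha divided by the rank. The helper names and numbers are assumptions for this example; in practice this is handled by a fine-tuning library on GPU tensors.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(W, A, B, alpha, rank):
    """Effective weight W + (alpha / rank) * B @ A, with W kept frozen."""
    delta = matmul(B, A)
    scale = alpha / rank
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


def trainable_params(d_out, d_in, rank):
    """Adapter parameters: B is d_out x rank and A is rank x d_in."""
    return rank * (d_in + d_out)


W = [[1.0, 2.0], [3.0, 4.0]]   # frozen weights, 2 x 2
A = [[0.5, -0.5]]              # rank x d_in = 1 x 2
B_init = [[0.0], [0.0]]        # d_out x rank, zero at the start of training
unchanged = lora_weight(W, A, B_init, alpha=1, rank=1)

B_trained = [[1.0], [2.0]]     # hypothetical values after some training
adapted = lora_weight(W, A, B_trained, alpha=1, rank=1)

full = 4096 * 4096                         # parameters in one 4096 x 4096 layer
adapter = trainable_params(4096, 4096, 8)  # rank-8 adapter for the same layer
```

Because B starts at zero, the adapted model initially behaves exactly like the base model, and for the 4096 x 4096 layer the rank-8 adapter trains 256 times fewer parameters than full fine-tuning would.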
Business and Strategy
AI-Native
Describing a company or product designed from the ground up around AI capabilities, rather than one that retrofits AI into an existing structure. AI-native products use models as a core architectural component, not as a feature layer.
Commoditization
The process by which AI model capabilities that were once differentiating become widely available and interchangeable. As frontier model performance converges and open-weight models close the gap with proprietary ones, the value of raw model capability commoditizes, shifting competitive advantage to data, distribution, and workflow integration.
AI Infrastructure
The full stack of hardware, software, and services required to train, deploy, and operate AI systems. AI infrastructure investment — in chips, data centers, cloud services, and MLOps tooling — has been the largest capital deployment story in technology since cloud itself.
Inference Cost per Token
The unit economics of AI deployment: how much it costs to generate each output token. Inference cost per token has fallen dramatically with model efficiency improvements and hardware competition. It is the primary financial metric for evaluating AI product economics.
AI Wrapper
A product built primarily by adding a user interface or workflow around a foundation model’s API, with minimal proprietary technical development. Wrappers can create real value through UX and distribution but are vulnerable to disintermediation as models gain native interfaces.
Vertical AI
AI products targeting a specific industry or workflow — legal AI, medical AI, financial AI — rather than general-purpose capabilities. Vertical AI products compete on domain-specific data, compliance, and workflow integration rather than raw model capability.
Enterprise AI
The deployment of AI systems within large organizations, distinguished by requirements for security, compliance, auditability, integration with existing systems, and governance. Enterprise AI procurement moves more slowly than consumer AI but at substantially larger contract values.
AI Product-Market Fit
The alignment between an AI product’s capabilities and a customer segment’s genuine, recurring need. AI product-market fit (PMF) is harder to establish than conventional PMF because model capabilities change rapidly, making durable differentiation difficult.
Moat
In AI business strategy, a durable competitive advantage that is difficult for competitors to replicate. Proposed AI moats include proprietary data, embedded workflows, regulatory approval, distribution scale, and inference cost efficiency. The durability of most proposed AI moats is actively contested.
Total Cost of Ownership (TCO)
The full cost of deploying an AI system, including API or compute costs, human oversight, integration development, fine-tuning, monitoring, and failure remediation. TCO analysis frequently reveals that the model API cost is a minority of total deployment cost.
Evaluation and Quality
Eval
Short for evaluation — any systematic method for measuring model performance on a defined task or set of tasks. Rigorous evals are the primary scientific instrument of AI research and engineering. Poor evals produce misleading capability claims.
Human Evaluation
Assessment of model outputs by human raters, typically on dimensions such as accuracy, coherence, helpfulness, and harmlessness. Human evaluation is expensive and slow but remains the gold standard for assessing qualities that automated metrics cannot capture.
Automated Evaluation (LLM-as-Judge)
Using a language model to evaluate the outputs of another language model. LLM-as-judge evaluation is cheaper and faster than human evaluation and correlates reasonably well with human judgments on many tasks. It introduces systematic biases — models tend to prefer verbose, confident outputs and their own stylistic patterns.
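One simple way to check how well a judge model tracks human raters is to compare their verdicts on the same set of pairwise comparisons. The sketch below computes a plain agreement rate; the function name and the label lists are hypothetical illustration data, not results from any real evaluation.

```python
from typing import Sequence


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of items on which the judge model and the human rater agree."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


# Hypothetical verdicts on five comparisons ("A" or "B" is the preferred response).
judge = ["A", "B", "A", "A", "B"]
human = ["A", "B", "B", "A", "B"]
rate = agreement_rate(judge, human)
print(rate)  # 0.8
```

Raw agreement does not reveal why the judge disagrees; inspecting the mismatched items is how biases such as a preference for longer answers usually surface.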
MMLU (Massive Multitask Language Understanding)
A benchmark covering 57 academic subjects from elementary mathematics to professional law and medicine. MMLU scores are widely cited in model comparisons as a proxy for general knowledge and reasoning. High MMLU scores are necessary but not sufficient indicators of practical capability.
HumanEval
A code generation benchmark consisting of 164 hand-written Python programming problems, each with unit tests. HumanEval pass@k measures the probability that at least one of k generated solutions passes all tests. It remains a widely cited baseline for coding capability comparisons, although frontier models now score close to its ceiling.
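In practice pass@k is usually estimated from a larger batch of samples rather than from exactly k. The sketch below uses the commonly cited unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples were generated for a problem and c of them passed. The function name and the example counts are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n generated samples of which c passed all tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n samples contains no passing solution.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative counts: 10 samples for one problem, 3 of which pass.
print(round(pass_at_k(10, 3, 1), 4))  # 0.3
print(round(pass_at_k(10, 3, 5), 4))  # 0.9167
```

A benchmark-level score is the mean of this per-problem estimate over all problems.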
LMSYS Chatbot Arena
A crowdsourced evaluation platform in which human users vote on which of two anonymous model responses is better. Arena Elo ratings, derived from these pairwise comparisons, are widely regarded as among the most reliable real-world capability rankings available.
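To show how a rating can be derived from pairwise votes, the sketch below applies the textbook Elo update to a short list of votes. It is an illustration of the general idea only: the K-factor, starting ratings, model names, and votes are assumptions, and the platform's own rating procedure may differ from this online update.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update_elo(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return new (r_a, r_b); score_a is 1.0 for an A win, 0.5 tie, 0.0 loss."""
    delta = k * (score_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


# Hypothetical votes: (first model, second model, score for the first model).
ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [("model_a", "model_b", 1.0), ("model_a", "model_b", 0.5)]
for a, b, score in votes:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score)

print(update_elo(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)
```

Because each update moves the two ratings by equal and opposite amounts, the total rating across models is conserved.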
Emerging and Frontier Concepts
Reasoning Model
A model trained to engage in extended, explicit reasoning before producing a final answer — often via reinforcement learning on verifiable tasks. Reasoning models trade inference speed for accuracy on complex problems. OpenAI o1, o3, and DeepSeek-R1 are canonical examples.
Test-Time Compute
The use of additional compute during inference — rather than during training — to improve output quality. Reasoning models exemplify test-time compute scaling: generating longer chains of thought, backtracking, and self-correcting before answering. Test-time compute is an emerging alternative to ever-larger pre-training budgets.
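One of the simplest ways to spend extra compute at inference is to sample several independent answers and keep the most frequent one, often called self-consistency or majority voting. This is a different technique from the long single-chain reasoning described above, shown here only because it is easy to sketch. The `sample_answer` callable is a hypothetical stand-in for a model call, stubbed with fixed outputs.

```python
from collections import Counter
from itertools import cycle
from typing import Callable


def self_consistent_answer(sample_answer: Callable[[str], str],
                           prompt: str, n: int) -> str:
    """Sample n answers and return the most common one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    answers = [sample_answer(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]


# Stub sampler (hypothetical): a real system would call a model here.
_fake_outputs = cycle(["42", "41", "42", "42", "40"])


def sample_answer(prompt: str) -> str:
    return next(_fake_outputs)


answer = self_consistent_answer(sample_answer, "What is 6 * 7?", n=5)
print(answer)  # 42
```

Cost grows linearly with n, which is the basic trade described above: more inference compute in exchange for a more reliable answer.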
Constitutional AI
Anthropic’s approach to AI alignment in which a model is trained to critique and revise its own outputs according to a set of principles (“the constitution”) rather than relying solely on human feedback. Constitutional AI reduces the human labor required for safety fine-tuning.
Mechanistic Interpretability
A research program aimed at understanding the specific computational mechanisms inside neural networks — identifying which circuits, features, and algorithms implement particular behaviors. Mechanistic interpretability seeks a causal, not merely correlational, understanding of model internals.
Superposition
A phenomenon in which a neural network represents more features than it has neurons by encoding them as overlapping directions in activation space. One consequence is that a single neuron can respond to multiple unrelated features (polysemanticity). Superposition is a major obstacle to mechanistic interpretability and suggests that concepts inside neural networks are far more densely packed than the number of neurons would suggest.
Sparse Autoencoder (SAE)
A tool used in mechanistic interpretability research to decompose superposed neural network activations into a larger set of interpretable features. SAEs are enabling researchers to identify specific concepts encoded inside large models.
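A minimal sketch of the forward pass may make the idea concrete: an activation vector is encoded into a wider, mostly zero feature vector, decoded back, and scored by reconstruction error plus a sparsity penalty. The weights below are hand-picked for illustration rather than trained, and all names, sizes, and the L1 coefficient are assumptions; real SAEs are trained on model activations at far larger scale.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec, l1_coeff):
    """Return (features, reconstruction, loss) for one activation vector."""
    pre = [a + b for a, b in zip(matvec(w_enc, x), b_enc)]
    features = [max(0.0, p) for p in pre]  # ReLU keeps features non-negative
    x_hat = [a + b for a, b in zip(matvec(w_dec, features), b_dec)]
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    l1 = sum(features)  # equals the L1 norm because features are non-negative
    return features, x_hat, mse + l1_coeff * l1


# Toy setup: 2-dimensional activations expanded into 3 candidate features.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]]
b_dec = [0.0, 0.0]

features, x_hat, loss = sae_forward([1.0, 0.0], w_enc, b_enc, w_dec, b_dec, 0.01)
print(features)  # [1.0, 0.0, 0.0]
```

Only one of the three features fires for this input, which is the sparsity the penalty term is meant to encourage.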
Emergent Capabilities
Abilities that appear in large models but are absent in smaller ones and were not explicitly trained for, such as multi-step arithmetic, translation between unseen language pairs, or analogical reasoning. The predictability of emergence is an open research question.
Capability Elicitation
The process of finding prompting or scaffolding techniques that reveal capabilities a model possesses but does not demonstrate by default. A model may be capable of a task without reliably performing it under standard prompting; elicitation research closes this gap.
Synthetic Data
Training or fine-tuning data generated by AI models rather than collected from human activity. Synthetic data is used to augment scarce real data, create privacy-preserving training sets, and generate verifiable reasoning traces. Its quality relative to human-generated data is an active research question.
Inference Scaling
The practice of devoting more compute to inference — more tokens of reasoning, more candidate responses, more verification steps — to improve output quality without retraining the model. Inference scaling complements training-time scaling and is increasingly the frontier of capability improvement.
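The "more candidate responses, more verification steps" pattern can be sketched as best-of-n selection: generate several candidates, score each with a verifier, and return the highest-scoring one. Both `generate` and `score` are hypothetical stand-ins (a model call and a checker), stubbed here so the example runs on its own.

```python
from typing import Callable


def best_of_n(generate: Callable[[str], str],
              score: Callable[[str, str], float],
              prompt: str, n: int) -> str:
    """Generate n candidates and return the one the verifier scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))


# Stub generator (hypothetical): yields fixed drafts instead of calling a model.
_drafts = iter(["100", "102", "112"])


def generate(prompt: str) -> str:
    return next(_drafts)


def score(prompt: str, candidate: str) -> float:
    """Toy verifier: full marks only for the correct product."""
    return 1.0 if candidate == str(17 * 6) else 0.0


best = best_of_n(generate, score, "What is 17 * 6?", n=3)
print(best)  # 102
```

The approach is only as good as the verifier: with a weak scoring function, extra candidates add cost without adding reliability.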
Last updated May 2026. Definitions reflect the current state of the field and will evolve as the technology does. Referently maintains this glossary as a living reference.