Essential AI Glossary for Product Leaders

A comprehensive list of AI jargon that product leader should know.

Sep 10, 2025

AI is a vast topic, and there are plenty of terminologies/jargon that can be overwhelming. Every time a new LLM is released, there is new terminology. This terminology can be introduced in the form of a new benchmark or as a new feature in the architecture.

All of them are essential but not useful. But to AI engineers, everything will be useful, while to product leaders, some will be useful.

As such, I have narrowed down some of the terminologies that will be useful for product leaders. The terminologies are grouped so that you can know where each of them belongs.

Try Adaline

Model Landscape

The model landscape holds the core information. You can think of it as a building block to AI.

Artificial Intelligence (AI)

I define AI as the broad field where machines perform tasks that typically require human intelligence. It excels in automation and decision-making systems.

For instance, Claude Sonnet 4, released in May 2025 by Anthropic, and OpenAI's GPT-5, launched in August 2025, power some of the most advanced services and workflow.

CRED, a fintech platform serving over 15 million users across India, uses Claude to transform their software development lifecycle and deliver features faster without compromising quality. — Anthropic

Machine Learning (ML)

Machine Learning is a subset of AI where models learn patterns, representations, and underlying data distributions from examples. It improves via optimization rather than hard-coded rules, which require human involvement.

Common uses: recommendation engines; fraud and risk scoring.
Current examples: In March 2025, Netflix outlined a foundation-model approach to personalization, which is now applied across a subscriber base exceeding 300 million.

Deep Learning (DL)

Deep Learning advances ML by training multi-layer neural networks that learn rich representations from data. It powers today’s image recognition, speech processing, and natural-language understanding in production systems.

Current examples: Apple Siri Siri’s “Hey Siri” wake word rely on deep neural networks, and “Apple Intelligence” adds on-device and server foundation models to power richer language features.

Artificial General Intelligence (AGI)

Artificial General Intelligence refers to AI matching human cognition across diverse or multiple domains. That’s what the term general refers to. It means being able to generalize, transfer knowledge, and solve novel problems beyond narrow or task-specific domains.

Though theoretical, it promises versatile applications. OpenAI and Google DeepMind's 2025 research pushes toward AGI, with surveys predicting a 50% chance by 2040.

Artificial Super Intelligence (ASI)

Artificial Super Intelligence is an intelligence that surpasses top human performance across virtually all cognitive domains. It is supposed to be faster, more accurate, more strategic.

Superintelligence is a system that is “much smarter than the best human brains in practically every field, including scientific creativity, general wisdom, and social skills.” — Nick Bostrom

It promises breakthroughs on challenging global problems (e.g., climate change, disease), but it has alignment issues that need to be solved.

AGI vs. ASI:

AGI: human-level general intelligence across many tasks.
ASI: beyond the best humans across most tasks, often assumed to improve itself.

Foundation Model

Foundation models are large models pre-trained on broad data [usually via self-supervision] that can be adapted via prompting or fine-tuning to many downstream applications—not just language. They span text, vision, speech, and multimodal use cases.

GPT-5 is a current example: it underpins ChatGPT and offers multimodal understanding; teams adapt it via prompts, tools, and (where available) fine-tuning APIs.

Frontier Model

Frontier models push AI performance limits for research.

Examples include GPT-5, Claude Sonnet 4, Claude Opus 4.1, and Google's Gemini 2.5 Pro, Grok-4, etc.

Multimodal LLM (MLLM), VLM

These models integrate text, images, audio, and video. They handle document analysis and visual Q&A.

Today, almost every frontier model is multimodal.

Reasoning Model

Reasoning models are built to think, not just autocomplete. They break problems into steps, plan, call tools, and verify answers. They tackle multi-step math, code debugging, scientific Q&A, and complex, cite-back research.

In short, they don’t just answer; they reason, compare options, and justify the final result.

Examples include GPT-5 Thinking and pro, Claude Sonnet 4 thinking mode, Claude Opus 4.1 thinking mode, and Google's Gemini 2.5 Pro, Grok-4 thinking and expert, etc.

Mixture of Experts (MoE), Sparse MoE

Mixture-of-Experts (MoE) can be thought as “many small specialists inside one huge model.”

A router sends each token to just a few expert subnetworks, so you only activate or wake ~5–10% of the model per step.

In 2025, DeepSeek-R1 builds on DeepSeek-V3’s MoE base with 671B total params and ~37B activated per token; V3 was trained on ~14.8T tokens, then R1 adds reasoning-centric training on top.

Open-weight vs. Closed-weight

Open-weight models share weights publicly for transparency; closed-weight models keep them proprietary. Meta's Llama 4 is open-weight, fostering research, while OpenAI's GPT-5 remains closed for commercial edge.

But you also have gpt-oss, which gives you open-weight models you can run and customize locally. Here is an additional list of open-weight models.

Share Adaline Labs

Prompting & Orchestration

Effective prompt engineering is crucial for getting reliable outputs from AI models.

System Prompt, User Prompt

I set system prompts as foundational instructions that shape an AI's overall behavior and constraints. These prompts don’t change and mostly remain constant over the span of a project. They can also act as a memory.

User prompts deliver specific queries or tasks within that project/framework. These control chatbot personalities and handle targeted requests effectively.

Example:

system: "You are Acme’s product assistant—personality: pragmatic, friendly; stakeholders=PM/Eng; rules: be concise, cite docs, no PII."
User: “Draft a 3-bullet launch checklist for Atlas v1.2 based on our sprint notes.”

Prompt Template

Prompt templates are reusable structures with placeholders for variables to maintain consistency. They standardize responses across apps, like content generation.

Adaline streamlines workflows for product teams, reducing iteration time from months to weeks. LangChain's templates integrate with Gemini 2.0 for scalable API calls.

Zero-shot, One-shot, Few-shot

I apply zero-shot for tasks without examples, one-shot with a single demo, and few-shot using multiple instances to guide learning.

These minimize data needs for novel problems.

Chain-of-Thought (CoT) and Self-consistency

CoT uses elicited step-by-step reasoning and self-consistency, samples multiple solution paths, and “votes” on the final answer. Both of them measure boost reliability on math and logic tasks.

ReAct, Tree-of-Thought (ToT), Graph-of-Thought (GoT), Program-of-Thought (PoT), Self-Refine

ReAct (Reason + Act): A prompt pattern that interleaves natural-language reasoning steps with actions (tool/API calls). The model thinks, acts (e.g., search, code, retrieve), observes results, then continues reasoning.

Tree-of-Thought (ToT): Structured exploration where the model expands multiple reasoning branches, scores/prunes them, and backtracks as needed to reach a higher-quality final answer.

Graph-of-Thought (GoT): Generalizes ToT to a graph (DAG) of subproblems. Intermediate results can merge, fork, and be reused across paths. it is useful for complex, interdependent tasks.

Program-of-Thought (PoT): Guides the model to express reasoning as code. It is often generates executable steps (e.g., Python snippets), enabling precise calculation, verification, and deterministic subroutines.

Self-Refine: An iterative loop where the model critiques its own draft, generates targeted revisions, and repeats until criteria are met (e.g., correctness, style, constraints).

Temperature, Top-p, Top-k

Temperature (creativity dial): Controls randomness in next-token picks.
Low (0–0.3) → focused, repeatable. High (0.8–1.2) → exploratory, surprising.
Top-p / Nucleus (quality bubble): Sample only from the smallest set of tokens whose cumulative probability ≥ p.
Example: top_p=0.9 lets the model pick from the top ~90% mass. It keeps outputs sensible while still varied.
Top-k (shortlist picker): Limit choices to the top k most-likely tokens before sampling.
Example: top_k=1 is greedy (most deterministic). top_k=50 adds room for nuance in creative writing without going off the rails.

Function Calling / Tool Calling

Function calling lets an LLM, via prompting, invoke external tools/APIs with structured arguments and use the results in its answer. It links models to databases and real-time services (search, weather, payments, web-browser) and supports multi-tool workflows.

Modern systems like ChatGPT (GPT-5) and Claude Sonnet 4 expose function schemas (name + JSON parameters) so agents can plan, call tools, and verify outputs.

Retrieval & Memory

AI systems need access to current information and memory capabilities for effective performance.

Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation combines a pre-trained model with a retriever that pulls external documents at query time, so answers are grounded in up-to-date sources (often with citations). This powers chatbots that tap the web, wikis, or databases for real-time context.

Embeddings, Vector Database, ANN Index (HNSW)

I represent text, images, or records as dense vectors (“embeddings”) so semantically similar items sit near each other in vector space.

A vector database indexes these embeddings and retrieves neighbors via approximate nearest neighbor (ANN) search. HNSW is a popular ANN algorithm that builds a layered small-world graph for fast, high-recall lookups.

Platforms like Pinecone, Weaviate, and ChromaDB power RAG retrieval, recommendations, and deduplication at scale, and support hybrid search that blends dense vectors with sparse signals (e.g., BM25) to boost relevance and reduce latency.

Long-context / Context window

The context window is how many tokens a model can consider at once; long-context models can ingest large docs, codebases, or long chats without losing coherence. Current exemplars: Claude Sonnet 4 supports up to 1M tokens; Gemini 2.5 Pro ships with 1M today.

KV cache / Context caching / Semantic caching

KV cache: Reuses attention key/value tensors across decoding steps to speed generation and cut compute—especially on long outputs.
Context (prompt) caching: Stores shared prompt prefixes so repeated calls are cheaper/faster (e.g., OpenAI offers up to 50% input-cost discount; Anthropic reports up to 90% cost and 85% latency reduction for long prompts). (OpenAI, Anthropic)
Semantic caching: Saves responses for semantically similar requests (not just exact matches) to avoid recomputation in FAQs, RAG, and agent loops.

Session Memory, Semantic/Episodic Memory

Session memory refers to a persistent context that lets an AI recall user preferences and prior work across chats. Teams often split it into semantic memory (facts about the user or project) and episodic memory (events or past conversations). It improves continuity and personalization via prompting plus product features that store and retrieve memories.

Examples today:

ChatGPT Memory stores saved memories and can reference past chat history for more personalized replies; improvements began rolling out to free users on June 3, 2025.
Claude added a memory feature in August 2025 that lets it recall prior conversations to continue projects with better continuity.

Context Rot

The reliability of an LLM declines as prompts or conversations grow long. Relevance dilutes, distractors accumulate, and models start to miss “needles,” over- or under-generate, or abstain.

In July 2025, Chroma evaluated 18 leading models and found consistent but non-uniform degradation with longer inputs; focused ~300-token prompts outperformed full long contexts, with drops amplified by lower question–needle similarity, more distractors, and even haystack structure.

Agents & Tool Use

AI agents represent the evolution from simple chatbots to autonomous task executors.

AI Agent (Planner / Executor / Critic)

AI agents are autonomous systems that decompose goals into tasks, execute actions, and refine outputs. They achieve it via planning, tool use, and self-critique.

They can automate workflows from customer-support triage to repository-wide coding, research, and edit the entire repository or codebase.

Multi-agent Orchestration, Delegation, Task Decomposition

Multi-agent systems break complex tasks into subtasks, delegating them, and orchestrating results. This powers enterprise automation and solves intricate problems collaboratively.

Grok-4 Heavy is one of the best examples. For complex tasks it can spin multiple agents and get the job done.

Model Context Protocol (MCP)

Model Context Protocol is an open standard from Anthropic that formalizes how LLMs/agents discover, authorize, and call external tools and data sources.

MCP defines consistent schemas for tool capabilities (names, parameters, auth), resource access, and connection lifecycle, enabling secure, auditable, least-privilege integrations across apps.

By decoupling models from vendor-specific plugins, MCP lets teams plug the same agent into CRMs, databases, and SaaS reliably. It supports multi-tool, multi-model orchestration with clearer security boundaries and easier observability.

Evaluation

Evals

Evals are standardized test frameworks that run AI models across curated tasks with fixed datasets, prompts, and metrics.

They automate comparisons, detect regressions, and produce reproducible scores for production reliability.

Typical outputs include accuracy, F1, and pass@1 under zero-shot or few-shot settings. Common tools include Adaline, EleutherAI’s lm-evaluation-harness, OpenAI Evals.

Golden Set, Regression Tests

The golden sets act as reference input-output pairs for regression tests to catch drifts post-updates. This prevents quality drops in deployments.

LLM-as-a-Judge, Pairwise Preference, Win Rate

LLM as a Judge is a method where an LLM evaluates system outputs against a rubric. It often uses pairwise comparisons to compute win rates between candidates.

It scales assessments of coherence, helpfulness, style, hallucination, and even factuality when references are provided. Good practice includes blinded comparisons, multiple judges or seeds, calibration to human ratings, randomized prompt order, and audits to reduce bias or leakage.

Benchmarks: MMLU, GSM8K, HumanEval, SWE-bench, GPQA, BIG-bench, ARC

Capability benchmarks are standardized test suites used to measure model performance across domains with fixed datasets and metrics. Examples include MMLU for general knowledge, GSM8K for math, HumanEval for code generation, SWE-bench for software engineering, GPQA for graduate-level expertise, BIG-bench for broad reasoning, and ARC for commonsense and abstraction.

Humanity's Last Exam Benchmark

Nilesh Barla

May 12

Read full story

Factuality, Faithfulness, Hallucination Rate, Sycophancy

Factuality checks if outputs match real-world facts.

Faithfulness checks if outputs stay true to the provided source or context.

Hallucination rate is the share of unsupported claims.

Sycophancy is a model’s tendency to agree with a user even when wrong.

Safety

AI safety and security are non-negotiable requirements for production systems. Safety ensures that the model/AI is safe. Meaning they produce safe outputs that does hurt an individual, race, gender, society, etc.

Prompt Injection

Prompt Injection refers to an attacker hiding instructions in inputs or external data so the model treats them as commands, which can cause data exfiltration or unintended actions.

Jailbreaks

Jailbreaks are attempts to trick a model into ignoring its rules so it produces restricted outputs. I treat them as alignment stress tests and counter them with layered defenses: tight system prompts, policy-tuned refusals, input sanitization, tool whitelists, safety classifiers, rate limits, and continuous red-team evals. When a bypass shows up, it becomes training data to harden the next release.

This was a comprehensive list of AI terminologies that product leaders might need to know in their daily conversations or when building and deciding on a feature to add. I know that there are a lot of terminologies still missing, but I promise to add in the next iteration of this blog.

Thanks for reading Adaline Labs! This post is public so feel free to share it.

Adaline Labs

Humanity's Last Exam Benchmark

Discussion about this post

Ready for more?