AI Engineering for the Middle Ground: Beyond Basics, Before Mastery
The following is a quick primer on AI Engineering based on the book from Chip Huyen and summary from Marina Wyss. Props to both.
Check Books on Amazon
AI engineering is a rapidly expanding field focused on building applications on top of existing Foundation Models. Unlike traditional Machine Learning (ML) engineers who develop models from scratch, AI engineers concentrate on adapting and leveraging these pre-trained models. This shift has occurred because AI models have significantly improved in solving real-world problems, and the entry barrier for building with them has significantly lowered.
Here are the key concepts in AI engineering:
Foundation Models (FMs):
These are massive AI systems, such as Large Language Models (LLMs) and Large Multimodal Models, trained by companies like OpenAI or Google.
They learn through self-supervision, predicting parts of their input data, which circumvents the data labelling bottleneck that previously hindered AI development.
Most FMs use Transformer architectures, which revolutionised sequence-to-sequence tasks by employing an attention mechanism. This allows models to weigh the importance of different input tokens when generating output, enabling parallel processing and faster inference compared to older recurrent neural networks (RNNs).
Context window length is a computational expense for Transformers, as more tokens require more key and value vector computations and storage.
Larger models generally have greater learning capacity and better performance, though sparse models can be more efficient than dense ones with fewer parameters.
Key bottlenecks for scaling models include the availability of high-quality training data (risk of training on AI-generated content leading to degradation) and electricity consumption by data centres.
Post-Training Foundation Models:
Pre-trained FMs are typically optimised for text completion and can produce factually incorrect or ethically problematic outputs.
Supervised Fine-Tuning (SFT) addresses this by optimising the model for conversations using high-quality instruction data.
Preference Fine-Tuning (e.g., Reinforcement Learning from Human Feedback, RLHF, or Direct Preference Optimisation, DPO) aligns the model with human values by training a reward model based on human preferences.
Probabilistic Nature & Sampling:
FMs generate probabilities for possible outputs, not single definitive answers.
Sampling techniques like
temperature(controls creativity vs. determinism),top-K(restricts choices to K most likely tokens), andtop-P(selects smallest set of tokens exceeding a cumulative probability threshold) are used to introduce variability.This probabilistic nature explains inconsistencies with minor input changes and hallucinations (confidently stating incorrect information).
Evaluation:
Evaluating AI systems is significantly harder than traditional ML models due to task complexity, open-ended nature, and the black-box characteristic of FMs.
Key metrics include cross-entropy and perplexity (for training, measuring next-token prediction).
Functional correctness (e.g., did a chatbot make the correct reservation?) is the ultimate metric for applications.
Evaluation methods include exact match, lexical similarity (e.g., edit distance, n-gram overlap), and semantic similarity (comparing text embeddings).
AI judges are powerful tools for evaluation, being fast, cheap, and capable of strong correlation with human evaluators. However, they have limitations like probabilistic nature, biases (self-bias, position bias, verbosity bias), and non-standardised metrics.
Evaluation results should be tied to business metrics (e.g., how factual consistency impacts customer support automation).
Model Selection:
The challenge is selecting the right model from the increasing number of available FMs.
Criteria include domain-specific capabilities, general capabilities, instruction-following abilities, and cost/latency.
Models are filtered based on hard attributes (license, training data, size, privacy, control – difficult to change) and then refined using soft attributes (accuracy, toxicity, factual consistency – can be improved with adaptation).
The decision between commercial API models and self-hosting open-source models depends on data privacy, data lineage/copyright, performance needs, control requirements, and cost.
Prompt Engineering:
This is the easiest and most common model adaptation technique, involving crafting instructions to guide a model's output without changing its weights.
Prompts typically include a task description, examples (few-shot, zero-shot, one-shot), and the concrete task.
System prompts (defining the model's role and constraints) and user prompts (specific query) are combined.
Key strategies include clear and explicit instructions, asking the model to adopt a persona, providing examples, specifying output format (e.g., JSON), breaking complex tasks, and giving the model time to think (e.g., Chain of Thought prompting).
Prompt attacks (extraction, jailbreaking, information extraction) require defence strategies like explicit safety instructions and guardrails.
Retrieval Augmented Generation (RAG):
RAG enhances a model's generation capabilities by retrieving relevant information from external memory sources (databases, web, chat history).
It consists of a retriever (fetches information) and a generator (the FM that produces the response).
Indexing (processing data for quick retrieval) and querying (sending a search query) are key retriever functions.
Term-based retrieval (keyword matching) and embedding-based retrieval (semantic similarity using vector databases) are common methods.
Documents are often split into smaller chunks to manage context window limitations, and these chunks can be augmented with metadata or summaries for better retrieval.
Query rewriting can expand queries with necessary context.
RAG can work with multimodal and tabular data.
Agentic Pattern:
A more active approach where an AI model perceives its environment, makes decisions, takes actions, and learns from outcomes.
Agents are powerful due to the tools they can access, such as knowledge augmentation tools (e.g., web search, SQL executors), capability extension tools (e.g., calculators, code interpreters), and write action tools (e.g., sending emails).
Planning is crucial for complex tasks, often decoupled from execution and validated through heuristics or AI judges.
Agents require robust evaluation methods to detect planning and tool failures.
Memory systems (internal knowledge, context window, external data sources like RAG) are vital for agents to retain information over time.
Fine-Tuning:
Adapts a model to a specific task by further training and adjusting its weights, offering deeper customisation than prompting but requiring more resources.
It improves domain-specific capabilities (e.g., coding) and instruction-following abilities.
Model distillation (fine-tuning a small model to imitate a larger one) is a common application.
Often, a combination of RAG and fine-tuning yields the best performance.
Memory requirements are a major bottleneck for fine-tuning due to backpropagation and the number of parameters.
Parameter Efficient Fine-Tuning (PEFT) techniques, like LoRA (Low-Rank Adaptation), significantly reduce memory requirements by adding new, small trainable parameters that can be merged back into the original model.
Different strategies exist for fine-tuning for multiple tasks, including simultaneous, sequential, or model merging (combining separately fine-tuned models).
Key hyperparameters impacting fine-tuning include learning rate, batch size, number of epochs, and prompt loss weight.
Dataset Engineering:
There's a shift from model-centric to data-centric AI, where enhancing data quality and processing techniques is prioritised for competitive advantage.
The type of data needed depends on the adaptation task (e.g., self-supervised, instruction, preference).
High-quality data is crucial and defined by factors like relevance, alignment, consistency, correct formatting, uniqueness, compliance, and coverage.
The amount of data needed varies based on fine-tuning technique (full vs. PEFT), task complexity, and base model performance. Even a small, well-crafted dataset (50-100 examples) can show improvements.
Data can be obtained through data flywheels (leveraging user interactions), available datasets, human annotation with clear guidelines, data augmentation (creating new examples from existing data), and data synthesis (generating artificial data).
Rigorous data processing (filtering, exploratory data analysis, de-duplication, cleaning, removing non-compliant content) is critical for quality.
Inference Optimisation:
Focuses on reducing cost and latency when a model computes outputs for given inputs in a production environment.
Common bottlenecks are compute-bound (limited by computational power) and memory bandwidth-bound (limited by data movement speed).
Online APIs optimise for low latency, while batch APIs optimise for cost by processing multiple requests together.
Key performance metrics include latency (time to first token, time per output token), throughput (output tokens per second), and utilisation metrics.
Hardware used includes GPUs (optimised for parallel processing) and specialised AI chips.
Model-level optimisations include model compression (quantisation, pruning, distillation) and addressing the autoregressive nature of LLMs (speculative decoding, inference with reference copies, parallel decoding).
Service-level optimisations include batching (static, dynamic, continuous), decoupled prefill and decode, prompt caching, and parallelism (replica, model).
AI Application Architecture:
Evolves from simple input-output to complex systems incorporating various components.
Key architectural enhancements include better context construction (RAG, agents, document upload), guardrails (input and output) for protection, model routing and gateways (for managing multiple models, access control, cost), caching (inference, prompt) for performance and cost, and complex logic with write actions (e.g., sending emails).
Monitoring and observability are critical for tracking performance, detecting issues (MTTD, MTTR, CFR), and diagnosing problems.
AI orchestrator tools (e.g., LangChain) help manage complex pipelines.
User Feedback:
Provides proprietary data, offering a significant competitive advantage.
Comes in explicit forms (ratings, comments) and implicit forms (user behaviour like early termination, error corrections, sentiment).
Should be collected strategically to gather valuable insights without disrupting the user experience.
The field of AI engineering is dynamic, with continuous advancements, requiring engineers to maintain flexibility in their architectural designs.
Join the newsletter https://elsoai.substack.com


