Agentic AI Design Patterns
Agentic AI ERA
Agentic AI represents a shift from predictive models (ML/DL) and scripted automation toward systems that can interpret context, make decisions, and act toward goals. Instead of responding passively, the system behaves as an active digital actor capable of adapting to changing situations.
An agent is a software entity designed to pursue objectives autonomously by perceiving its environment, reasoning about context, and taking actions through available tools. It functions as the decision-making core within an agentic system.
LLM model: a reasoning engine to interpret inputs and a planning capability to sequence actions
Memory management: maintains context across interactions
Tool interface: reaches external systems
These components operate together in a continuous decision loop.
Consider the required memory type for the context (short-term vs. long-term)
Consider the volume of data
Consider retrieval speed
Vector search or embedding support: Vector DBs like Pinecone or Redis can be selected based on scale and latency needs.
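Pinecone and Redis expose vector search as a managed service; the core operation they perform is similarity lookup over stored embeddings. A minimal in-process sketch, using a toy bag-of-words embedding as a stand-in for a real embedding model (vocabulary and stored texts are illustrative):

```python
import math
import re

class VectorMemory:
    """Toy long-term memory: stores (text, embedding) pairs and retrieves
    the most similar entries by cosine similarity. Real systems would use
    a vector DB (e.g. Pinecone, Redis) and a learned embedding model."""

    def __init__(self):
        self.entries = []  # list of (text, vector)

    def _embed(self, text):
        # Hypothetical stand-in for a real embedding model:
        # a bag-of-words count vector over a fixed vocabulary.
        vocab = ["invoice", "refund", "shipping", "login", "error"]
        words = re.findall(r"\w+", text.lower())
        return [words.count(w) for w in vocab]

    def add(self, text):
        self.entries.append((text, self._embed(text)))

    def search(self, query, k=1):
        q = self._embed(query)
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb) if na and nb else 0.0
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

memory = VectorMemory()
memory.add("customer asked about a refund for invoice 42")
memory.add("user reported a login error on mobile")
print(memory.search("how do I get a refund?"))
```

Scale and latency drive the choice of backend: the same `add`/`search` interface is what a hosted vector DB provides, with approximate nearest-neighbour indexes replacing the linear scan above.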
The agent observes the environment and understands the context using prompts and memory,
interprets data using its reasoning engine within the chosen LLM model, and
evaluates goals and constraints, plans, and selects the next action or tool invocation.
Choose an LLM based on task type, domain, latency, and token limits. Examples per domain:
General purpose: GPT-4, Claude 3
Enterprise knowledge/internal docs: Llama 3 fine-tuned, Falcon 180B + vector DB embeddings
Scientific / Medical: BioGPT, PubMedGPT
Code / DevOps: Codex, StarCoder, Code Llama
Finance / Legal: BloombergGPT, LegalBERT, FinBERT
Low-latency / edge: GPT-3.5-turbo, LLaMA 2 7B quantized
Choose based on:
Required external capabilities
Latency, authentication
Standardization (MCP)
Compliance/security.
The lifecycle includes:
Observe: The agent continuously monitors its environment and gathers data from tools, sensors, or user input to understand current state. This forms the basis for reasoning.
Interpret: Incoming data is processed, contextualized, and compared against goals, constraints, and prior knowledge to identify relevant information.
Plan: The agent evaluates possible actions, sequences them into a coherent strategy, and prioritizes based on predicted impact and resources.
Act: Executes the selected action using internal tools or external APIs while adhering to security and operational constraints.
Evaluate: Assesses the outcome of actions against intended goals, identifying successes, errors, or deviations. This informs adaptive decision-making.
Update Memory: Updates short-term and long-term memory with observations, results, and learned patterns, enabling improved performance in future cycles.
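The lifecycle above can be sketched as a loop. This is a minimal sketch, not a production agent: the LLM is a hypothetical stub that always plans one "lookup" action, and the tool set is illustrative.

```python
# Minimal sketch of the observe -> interpret -> plan -> act -> evaluate ->
# update-memory loop. The planner is a hypothetical stand-in for a model API.

def fake_llm(prompt):
    # Hypothetical reasoning engine: always plans a single "lookup" action.
    return ["lookup"]

class Agent:
    def __init__(self, tools):
        self.tools = tools          # tool interface
        self.memory = []            # short-term memory

    def run(self, observation, max_cycles=3):
        results = []
        for _ in range(max_cycles):
            # Observe + interpret: fold the observation into context.
            context = {"observation": observation, "memory": list(self.memory)}
            # Plan: ask the reasoning engine for the next actions.
            plan = fake_llm(f"Given {context}, what next?")
            # Act: execute each planned tool call.
            results = [self.tools[action](observation) for action in plan]
            # Update memory with observations and results.
            self.memory.append({"plan": plan, "results": results})
            # Evaluate: stop when a result satisfies the goal.
            if any(r["done"] for r in results):
                return results
            observation = results[-1]["value"]
        return results

agent = Agent(tools={"lookup": lambda obs: {"done": True, "value": f"answer to {obs}"}})
print(agent.run("order status for #1001"))
```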
Intro:
The new era of architecture is about neither building solution components nor AI models from scratch; it is about capitalizing on available agents, tools, and automation to perform most or all of your business functions.
An architecture that adopts this approach is called Agentic AI.
In this architecture, the lifecycle is layered as follows:
Orchestration Layer
The brain of the system, which holds the context and task queues
Applies event-driven architecture
Perception and Parsing layer
Perception layer for inputs: text, voice, images
Includes implementation components and APIs
Reasoning and Planning Layer (Agentic Core)
Actual AI models live here
Components: LLM Cluster, Prompt Templates, Planner Module, Reflection Engine.
Key Function: ReAct (Reasoning + Acting) logic.
Tooling and execution layer
This connects the AI to the outside world. It contains a registry of functions the agent can call (APIs, database queries, code interpreters).
Components: Tool Registry, Function Calling Interface, Connectors (REST, gRPC), Code Executor (sandboxed).
Examples: SQL executor, Weather API, Calculator, CRM connector.
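The tool registry and function-calling interface described above can be sketched as follows. The `calculator` tool and its guard are illustrative; production registries also carry JSON schemas, auth scopes, and timeouts.

```python
# Sketch of a tool registry with a function-calling interface.

class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, name, description):
        def decorator(fn):
            self._tools[name] = {"fn": fn, "description": description}
            return fn
        return decorator

    def describe(self):
        # What the LLM sees when choosing a tool.
        return {name: meta["description"] for name, meta in self._tools.items()}

    def call(self, name, **kwargs):
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name]["fn"](**kwargs)

registry = ToolRegistry()

@registry.register("calculator", "Evaluate a simple arithmetic expression")
def calculator(expression):
    # Toy sandboxing: only digits and basic operators are allowed.
    allowed = set("0123456789+-*/(). ")
    if not set(expression) <= allowed:
        raise ValueError("expression contains disallowed characters")
    return eval(expression)  # acceptable only for this restricted grammar

print(registry.describe())
print(registry.call("calculator", expression="6 * 7"))
```

A real code executor would run in an actual sandbox (container, subprocess with limits) rather than filtering characters, but the registry/describe/call shape is the same.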
Memory and Observation layer
Crucial for production. It stores short-term (conversation history) and long-term memory (vector embeddings for RAG). It also logs every action for audit and debugging.
Components: Vector Database (for semantic memory), Key-Value Store (for state), Logging Stack (for traceability).
Pattern: Event Sourcing (storing every decision as an immutable event).
Entry Point:
The frontend communicates with the backend
The request hits the API Gateway, carrying the required task
Orchestrator (Parent Agent)
Initializes a session
Communicates with the LLM using a suitable prompt
Receives a response containing the plan (the sequence of actions)
Child agent(s): called according to the orchestrator's plan
The Execution Engine calls ToolA (e.g., a database query).
The result goes back to the agents
Memory is updated with the result.
Loop continues until the plan is complete.
Response is returned to the user.
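The request flow above can be condensed into a short sketch. The planner is a hypothetical LLM stub returning a fixed plan, and the tool names are illustrative.

```python
# End-to-end sketch of the request flow: an orchestrator (parent agent)
# gets a plan from a stubbed LLM, dispatches each step to a child
# executor, and accumulates results in session memory until the plan is done.

def stub_planner(task):
    # Hypothetical LLM call returning an ordered sequence of actions.
    return [("query_db", task), ("summarise", task)]

TOOLS = {
    "query_db": lambda task: f"rows for {task}",
    "summarise": lambda task: f"summary of {task}",
}

def orchestrate(task):
    session = {"task": task, "memory": []}             # initialise a session
    plan = stub_planner(task)                          # LLM returns the plan
    for tool_name, arg in plan:                        # child agents execute steps
        result = TOOLS[tool_name](arg)                 # execution engine calls the tool
        session["memory"].append((tool_name, result))  # memory is updated
    return session["memory"][-1][1]                    # response returned to the user

print(orchestrate("Q3 revenue"))
```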
What is RAG?
A technique that connects the LLM to external data sources instead of relying only on the model's built-in knowledge base.
So, instead of relying only on model weights, the system:
Retrieves relevant documents from a knowledge source
Injects them into the prompt context
Generates grounded output
RAG Lifecycle:
1. Knowledge engineering
source selection
freshness policy
metadata schema
chunk strategy
2. Index pipeline
parsing
chunking
embeddings
vector store
3. Retrieval design
query rewriting
hybrid search (BM25 + vector)
reranking
4. Prompt orchestration
context window packing
citation format
system guardrails
5. Generation policy
grounded only
abstain if low confidence
structured output
6. Evaluation loop
golden questions
hallucination tests
answer correctness
7. Ops & monitoring
drift detection
index refresh
cost metrics
latency SLO
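Stages 2-4 of the lifecycle above (index pipeline, retrieval design, prompt orchestration) can be sketched compactly. Scoring here is plain token overlap, a stand-in for real hybrid BM25 + vector search; the document text is illustrative.

```python
import re

def chunk(text, size=8):
    # Index pipeline: split the document into fixed-size word chunks.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def tokens(text):
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, chunks, k=2):
    # Retrieval design: rank chunks by token overlap with the query
    # (stand-in for hybrid BM25 + vector search with reranking).
    scored = sorted(chunks, key=lambda c: len(tokens(c) & tokens(query)), reverse=True)
    return scored[:k]

def pack_prompt(query, passages, budget_chars=300):
    # Prompt orchestration: context window packing with a citation format
    # and a grounding guardrail in the instructions.
    context = ""
    for i, p in enumerate(passages):
        cited = f"[{i + 1}] {p}\n"
        if len(context) + len(cited) > budget_chars:
            break
        context += cited
    return (f"Answer using ONLY the sources below; cite them as [n].\n"
            f"{context}Question: {query}")

doc = ("The refund policy allows returns within 30 days. "
       "Shipping is free for orders above 50 euros. "
       "Support is available on weekdays from 9 to 17.")
chunks = chunk(doc)
prompt = pack_prompt("When are returns allowed?",
                     retrieve("When are returns allowed?", chunks))
print(prompt)
```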
Enhanced productivity
Seamless task execution
Real-time decision making
Proactive problem solving
Improved accuracy
Seamless scalability
Deep engineering reuse
To put it simply, an AI agent is a worker that does one job.
Agentic AI is an architectural approach applied across an ecosystem of multiple agents that collaborate to achieve large goals and deliver the benefits of modernization.
Summary: Passive goal creator analyzes users’ articulated goals through the dialogue interface.
Context: When querying agents to address certain issues, users provide related context and explain the goals in prompts.
Problem: Users may lack expertise in interacting with agents, and the provided information can be ambiguous for goal achievement.
Forces:
Underspecification. Users may not be able to provide thorough context information and specify precise goals to agents.
Efficiency. Users expect quick responses from agents.
Solution: A foundation model-based agent provides a dialogue interface where users can directly specify the context and problems, which are transferred to passive goal creator for goal determination.
Benefits:
Interactivity. Users or other agents can interact with an agent via a dialogue interface or related APIs.
Goal-seeking. The agent can analyse user-provided context and retrieve related information from memory, to identify and determine the objectives and create corresponding strategies.
Efficiency. Users can directly send prompts to the agent through the dialogue interface, which is intuitive and easy to use.
Drawbacks:
Reasoning uncertainty. Users may have assorted backgrounds and experiences. Unclear or ambiguous context information may intensify the reasoning uncertainties, especially considering there are no standard prompt requirements.
Known uses:
Liu et al. [1] designed an agent that can communicate with users and help refine research questions via a dialogue interface.
Kannan et al. [2] proposed an agent for users to decompose and allocate tasks to robots through a dialogue interface.
HuggingGPT. HuggingGPT can generate responses to address user requests via a chatbot. Users’ requests including complex intents can be interpreted as their intended goals.
Summary: Proactive goal creator anticipates users’ goals by understanding human interactions and capturing the context via relevant tools.
Context: Users explain the goals that the agent is expected to achieve in the prompt.
Problem: The context information collected via solely a dialogue interface may be limited, and result in inaccurate responses to users’ goals.
Forces:
Underspecification. i) Users may not be able to provide thorough context information and specify precise goals to agents. ii) Agents can only retrieve limited information from the memory.
Accessibility. Users with certain disabilities may not be able to interact directly with the agent via the passive goal creator.
Solution: In addition to the prompts received from the dialogue interface and the relevant context retrieved from memory, the proactive goal creator can anticipate users’ goals by sending requirements to detectors, which capture and return the user’s surroundings for further analysis and comprehension to generate the goals.
Benefits:
Interactivity. An agent can interact with users or other agents by anticipating their decisions proactively with captured multimodal context information.
Goal-seeking. The multimodal input can provide more detailed information for the agent to understand users’ goals, and increase the accuracy and completeness of goal achievement.
Accessibility. Additional tools can help capture the sentiments and other context information from disabled users, ensuring accessibility and broadening the human values of foundation model-based agents.
Drawbacks:
Overhead. i) Proactive goal creator is enabled by the multimodal context information captured by relevant tools, which may increase the cost of the agent. ii) Limited context information may increase the communication overhead between users and agents.
Known uses:
GestureGPT [3]. GestureGPT can decipher users’ hand gesture descriptions and hence comprehend users’ intents.
Zhao et al. [4] proposed a programming screencast analysis tool that can extract the coding steps and code snippets.
ProAgent [5]. ProAgent can observe the behaviours of other teammate agents, deduce their intentions, and adjust the planning accordingly.
Summary: Prompt/response optimiser refines the prompts/responses according to the desired input or output content and format.
Context: Users may struggle with writing effective prompts, especially considering the injection of comprehensive context. Similarly, it may be difficult for users to understand the agent’s outputs in certain cases.
Problem: How to generate effective prompts and standardised responses that are aligned with users’ goals or objectives?
Solution: A user may input initial prompts to the agent; however, such prompts may be ineffective due to the lack of relevant context, unintentional injection attacks, redundancy, etc. In this regard, the prompt/response optimiser can construct refined prompts and responses adhering to predefined constraints and specifications. These constraints and specifications outline the desired content and format for the inputs and outputs, ensuring alignment with the ultimate goal. A prompt or response template is often used in the prompt/response optimiser as a factory for creating specific instances of prompts or responses [1, 2]. This template offers a structured approach to standardise the queries and responses, improving the accuracy of the responses and facilitating their interoperations with external tools or agents. For instance, a prompt template can contain the instructions to an agent, some examples for few-shot learning, and the question/goal for the agent to work on.
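A template-as-factory, as described above, might look like the following sketch. The template text, examples, and the crude injection filter are all illustrative.

```python
# A prompt template as a "factory" for concrete prompts: fixed
# instructions, few-shot examples, and a slot for the user's question.

PROMPT_TEMPLATE = """You are a support agent. Answer in one short sentence.

Examples:
Q: Where is my order? -> A: Check the tracking link in your confirmation email.
Q: Can I pay later? -> A: Yes, invoices are due within 30 days.

Q: {question} -> A:"""

def build_prompt(question):
    # Refinement step: strip whitespace and an obvious injection phrase
    # before instantiating the template (a real optimiser would do far more).
    cleaned = question.strip().replace("ignore previous instructions", "")
    return PROMPT_TEMPLATE.format(question=cleaned)

print(build_prompt("  How do I reset my password? "))
```

Frameworks such as LangChain provide this pattern directly as prompt-template classes; the value is that every query the agent emits has a consistent, auditable shape.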
Benefits:
Standardisation. Prompt/response optimiser can create standardised prompts and responses regarding the requirements specified in the template.
Goal alignment. The optimised prompts and responses adhere to user-defined conditions, hence they can achieve higher accuracy and relevance to the goals.
Interoperability. Interoperability between agent and external tools is facilitated by prompt/response optimiser, which can provide consistent and well-defined prompts and responses for task execution.
Adaptability. Prompt/response optimiser can accommodate different constraints, specifications, or domain-specific requirements by refining the template with a knowledge base.
Drawbacks:
Underspecification. In certain cases, it may be difficult for prompt/response optimiser to capture and incorporate all relevant contextual information effectively, especially considering the ambiguity of users’ input, and dependency on context engineering. Consequently, the optimiser may struggle to generate appropriate prompts or responses.
Maintenance overhead. Updating and maintaining prompt or response templates may cause significant overhead. Changes in requirements may necessitate modifying multiple templates, which is time-consuming and error-prone.
Known uses:
LangChain. LangChain provides prompt templates for practitioners to develop custom foundation model-based agents.
Amazon Bedrock. Users can configure prompt templates in Amazon Bedrock, defining how the agent should evaluate and use the prompts.
Dialogflow. Dialogflow allows users to create generators to specify agent behaviours and responses at runtime.
Summary: Retrieval augmented generation techniques enhance the knowledge updatability of agents for goal achievement, and maintain data privacy of on-premise foundation model-based agents/systems implementations.
Context: Foundation model-based agents are not equipped with knowledge of specific domains, especially highly confidential and privacy-sensitive local data, unless they are fine-tuned or pre-trained using domain data.
Forces:
Lack of knowledge. The reasoning process may be unreliable when the agent is required to accomplish domain-specific tasks for which it has no knowledge reserve.
Overhead. Fine-tuning a large foundation model using local data, or training a large foundation model locally, incurs high computation and resource costs.
Data privacy. Local data are too confidential to be used to train or fine-tune the models.
Problem: Given a task, how can agents conduct reasoning with data/knowledge that are not learned by the foundation models through model training?
Solution: RAG is a technique for enhancing the accuracy and reliability of agents with facts retrieved from other sources (internal or online data). The knowledge gaps in the agent’s memory are filled using the parameterized knowledge stored in vector databases. For instance, during plan generation, specific steps may require information that is not within the original agent memory. The agent can retrieve information from the parameterized knowledge and use it for planning, and the augmented response (i.e. the plan) is returned to the user. The retrieval process requires zero pretraining or fine-tuning of the model served by the agent, which preserves the privacy of local data, reduces training and computation costs, and provides up-to-date and more precise information. The retrieved local data can be sent back to the agent via prompts (the context window size needs to be considered), and the agent can then process the information and generate plans via in-context learning. There is currently a cluster of RAG techniques focusing on various enhancement aspects, data sources and applications.
Benefits:
Knowledge retrieval. Agents can search and retrieve knowledge related to the given tasks, which ensures the reliability of reasoning steps.
Updatability. The prompts/responses generated using RAG by the agent on internal or online data are updatable via the complementary parameterized knowledge.
Data privacy. The agent can retrieve additional knowledge from local datastores, which ensures data privacy and security.
Cost-efficiency. Under the data privacy constraint, RAG can provide essential knowledge to the agent without training a new foundation model entirely. This reduces training costs.
Drawbacks:
Maintenance overhead. Maintenance and update of the parameterized knowledge in the vector store requires additional computation and storage costs.
Data limitation. The agent still mainly relies on the data it has been trained on to generate responses. This can impact the quality and accuracy of the generated content in those specific domains.
Known uses:
LinkedIn. LinkedIn applies RAG to construct the pipeline of foundation model-based agents, which can search appropriate case studies to respond to users.
Yan et al. [4] devise a retrieval evaluator which can output a confidence degree after assessing the quality of retrieved data. The solution can improve the robustness and overall performance of RAG for agents.
Levonian et al. [5] apply RAG with GPT-3.5, developing an agent that can retrieve the contents of a high-quality open-source math textbook to generate responses to students.
Summary: The foundation model is accessed in a single instance to generate all necessary steps for the plan.
Context: When users interact with the agent for specific goals, the included foundation model is queried for plan generation.
Problem: How can the agent generate the steps for a plan efficiently?
Solution: After a user specifies goals and constraints in one prompt, the agent will query the incorporated foundation model to generate a corresponding response. The foundation model does not require multiple interactions to comprehend the context and requirements. In this manner, the agent can devise a multi-step plan to achieve a broad goal and provide a holistic explanation for this plan without delving into detailed reasoning steps.
Benefits:
Efficiency. The agent can generate a plan to achieve users’ goals by querying the underlying foundation model only once, which saves time.
Cost-efficiency. Users’ expenses can be reduced since the foundation model is queried only once.
Simplicity. One-shot model querying can satisfy the tasks that do not require complex action plans.
Known uses: One-shot model querying can be considered the default configuration when a user is leveraging a foundation model; CoT and Zero-shot-CoT both exemplify this pattern.
Summary: Incremental model querying involves accessing the foundation model at each step of the plan generation process.
Context: When users interact with the agent for specific goals, the included foundation model is queried for plan generation.
Problem: The foundation model may struggle to generate the correct response at the first attempt. How can the agent conduct an accurate reasoning process?
Solution: The agent could engage in a step-by-step reasoning process to develop the plan for goal achievement with multiple queries to the foundation model. Meanwhile, human feedback can be provided at any time to both the reasoning process and the generated plan, and adjustments can be made accordingly during model querying. Please note that incremental model querying can rely on a reusable template, which guides the process through context injection or an explicit workflow/plan repository and management system.
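The step-by-step querying loop described above can be sketched as follows. The model is a hypothetical stub that proposes one plan step per query; the plan content is illustrative.

```python
# Sketch of incremental model querying: the agent queries the model once
# per step, feeding back the evolving plan, until the model signals it is
# done or a step budget runs out.

def stub_model(history):
    # Hypothetical LLM: proposes steps one at a time, then stops.
    steps = ["gather requirements", "draft schema", "review with user"]
    return steps[len(history)] if len(history) < len(steps) else "DONE"

def incremental_plan(goal, max_steps=10):
    plan = []
    for _ in range(max_steps):
        nxt = stub_model(plan)   # one model query per reasoning step
        if nxt == "DONE":
            break
        # Human feedback could be injected here to amend `nxt`
        # before committing it to the plan.
        plan.append(nxt)
    return plan

print(incremental_plan("design a billing database"))
```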
Benefits:
Supplementary context. Incremental model querying allows users to split the context in multiple prompts to address the issue of a limited context window.
Reasoning certainty. Foundation models will iteratively refine the reasoning steps by self-checking or feedback from users.
Explainability. Users can query the foundation model to provide detailed reasoning steps through incremental model querying.
Drawbacks:
Overhead. i) Incremental model querying requires multiple interactions with the foundation model, which may increase the time consumption for planning determination. ii) The high volume of user queries may be cost-intensive when utilising commercial foundation models.
Known uses:
HuggingGPT. The underlying foundation model of HuggingGPT is queried multiple times to decompose users’ requests into fine-grained tasks, and then determine the dependencies and execution orders of tasks [1].
EcoAssistant [2]. EcoAssistant applies a code executor interacting with the foundation model to iteratively refine code.
ReWOO [3]. ReWOO queries the foundation model to i) generate a list of interdependent plans, and; ii) combine the observation evidence fetched from tools with the corresponding task.
Summary: Single-path plan generator orchestrates the generation of intermediate steps leading to the achievement of the user’s goal.
Context: An agent is considered a “black box” to users, while users may care about how the agent achieves their goals.
Problem: How can an agent efficiently formulate the strategies to achieve users’ goals?
Solution: After receiving and comprehending users’ goals, the single-path plan generator can coordinate the creation of intermediate steps for other agents or tools and prioritise the tasks, to progressively lead towards goal accomplishment. Specifically, each step in this process is designed to have only a single subsequent step, such as Chain-of-Thought (CoT) [1]. Self-consistency is employed to confirm with the foundation model several times and select the most consistent answer as the final decision [2]. Please note that the generated plan may have different granularity depending on the given goal; a complex plan may incorporate multiple workflows, processes, tasks and fine-grained steps.
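The self-consistency step mentioned above can be sketched with a majority vote over repeated samples. The model is a hypothetical stochastic stub whose divergent sample is illustrative.

```python
# Self-consistency: query the (stubbed) model several times along a
# single reasoning path and keep the most frequent answer.

from collections import Counter

def sampled_model(prompt, seed):
    # Hypothetical stochastic LLM: most samples agree, one path diverges.
    return "42" if seed % 3 else "41"

def self_consistent_answer(prompt, samples=5):
    answers = [sampled_model(prompt, seed) for seed in range(samples)]
    # Majority vote selects the most consistent answer.
    return Counter(answers).most_common(1)[0][0]

print(self_consistent_answer("What is 6 * 7? Think step by step."))
```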
Benefits:
Reasoning certainty. Single-path plan generator generates a multi-step plan, which can reflect the reasoning process and mitigate the uncertainty or ambiguity for achieving users’ goals.
Coherence. The interacting users, agents and tools are provided a clear and coherent path towards the ultimate goals.
Efficiency. Single-path plan generator can increase efficiency in agents via pruning unnecessary steps or distractions.
Drawbacks:
Flexibility. A single-path plan may result in limited flexibility to accommodate diverse user preferences or application scenarios, hence users cannot customise their solutions.
Oversimplification. The agent may oversimplify the generated plan which requires multi-faceted approaches.
Known uses:
LlamaIndex. LlamaIndex fine-tunes a ReAct Agent to achieve better performance with single-path plan generator via CoT.
ThinkGPT. ThinkGPT provides a toolkit to facilitate the implementation of single-path plan generator pattern.
Zhang et al. [3] promote the implementation by elucidating the basic mechanisms and paradigm shift of CoT.
Summary: Multi-path plan generator allows for creating multiple choices at each intermediate step leading to achieving users’ goals.
Context: An agent is considered a “black box” to users, while users may care about how the agent achieves their goals.
Problem: How can an agent generate a high-quality, coherent, and efficient solution considering inclusiveness and diversity when presented with a complex task or problem?
Solution: Based on single-path plan generator, multi-path plan generator can create multiple choices at each step towards the achievement of goals. Users’ preferences may influence the subsequent intermediate steps, leading to different eventual plans. The employment of involved agents and tools will be adjusted accordingly. Tree-of-Thoughts [1] exemplifies this design pattern.
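Multi-path planning can be sketched as branch proposal plus preference-driven selection. The branch options are illustrative, and the preference function stands in for the user confirming each intermediate step.

```python
# Toy sketch of multi-path plan generation: at each step the (stubbed)
# model proposes several candidate next steps, and a preference function
# -- a stand-in for the user's choice -- picks one branch to expand.

def propose_branches(path):
    # Hypothetical LLM: offers alternatives for the next step.
    options = {
        0: ["fly", "take the train"],
        1: ["book hotel", "book apartment"],
    }
    return options.get(len(path), [])

def plan_with_choices(prefer):
    path = []
    while True:
        branches = propose_branches(path)
        if not branches:        # no more choices: the plan is complete
            return path
        path.append(prefer(branches))  # user preference selects the branch

# A user who always prefers the second option:
print(plan_with_choices(lambda opts: opts[-1]))
```

Tree-of-Thoughts generalises this by scoring and expanding several branches in parallel rather than committing to one per step.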
Benefits:
Reasoning certainty. Multi-path plan generator can generate a plan with multiple choices of intermediate steps to resolve the uncertainty or ambiguity within reasoning process.
Coherence. The interacting users, agents and tools are provided a clear and coherent path towards the ultimate goals.
Alignment to human preference. Users can confirm each intermediate step to finalise the planning, hence human preferences are absorbed in the generated customised strategy.
Inclusiveness. The agent can specify multiple directions in the reasoning process for complex tasks.
Drawbacks:
Overhead. Task decomposition and multi-plan generation may increase the communication overhead between the user and agent.
Known uses:
AutoGPT. AutoGPT can make informed decisions by incorporating Tree-of-Thoughts as the multi-path plan generator.
Gemini. For a task, Gemini can generate multiple choices for users to decide. Upon receiving users’ responses, Gemini will provide multiple choices for the next step.
OpenAI. GPT-4 was leveraged to implement a multi-path plan generator based on Tree-of-Thoughts.
Summary: Self-reflection enables the agent to generate feedback on the plan and reasoning process and provide refinement guidance from themselves.
Context: Given users’ goals and requirements, the agent will generate a plan to decompose the goals into a set of tasks for achieving the goals.
Problem: A generated plan may be affected by hallucinations of the foundation model; how can the agent review the plan and reasoning steps and incorporate feedback efficiently?
Solution: In particular, reflection is an optimization process formalised to iteratively review and refine the agent-generated response. The user prompts specific goals to the agent, which then generates a plan to accomplish users’ requirements. Subsequently, the user can instruct the agent to reflect on the plan and the corresponding reasoning process. The agent will review the response to identify and pinpoint the errors, then generate a refined plan and adjust its reasoning process accordingly. The finalised plan will be carried out step by step. Self-consistency [1] exemplifies this pattern.
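The generate-reflect-refine loop described above can be sketched as follows. Both the drafting model and the critique pass are hypothetical stubs; the plan content and the critique rule are illustrative.

```python
# Sketch of the self-reflection loop: generate, critique with the same
# (stubbed) model, and regenerate until the critique passes.

def draft_model(goal, feedback=None):
    # Hypothetical LLM: its first draft omits a required step,
    # which the refined draft adds once feedback is supplied.
    plan = ["collect data", "train model"]
    if feedback:
        plan.append("evaluate on held-out set")
    return plan

def critique_model(plan):
    # Hypothetical reflection pass by the same model.
    if "evaluate on held-out set" not in plan:
        return "missing an evaluation step"
    return None  # no issues found

def reflect_and_refine(goal, max_rounds=3):
    feedback = None
    for _ in range(max_rounds):
        plan = draft_model(goal, feedback)   # generate (or regenerate)
        feedback = critique_model(plan)      # reflect on the plan
        if feedback is None:
            return plan                      # finalised plan
    return plan

print(reflect_and_refine("build a churn predictor"))
```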
Benefits:
Reasoning certainty. Agents can evaluate their own responses and reasoning procedure to check whether there are any errors or inappropriate outputs, and make refinement accordingly.
Explainability. Self-reflection allows the agent to review and explain its reasoning process to users, facilitating better comprehension of the agent’s decision-making process.
Continuous improvement. The agent can continuously update the memory or knowledge base and the manner of formalising the prompts and knowledge, to provide more reliable and coherent output to users without or with fewer reflection steps.
Efficiency. On one hand, it is time-saving for the agent to self-evaluate its response, as no additional communication overhead is incurred compared to other reflection patterns. On the other hand, given the continuous improvement, the agent can provide more accurate responses in the future, reducing the overall reasoning time.
Drawbacks:
Reasoning uncertainty. The evaluation result is dependent on the complexity of self-reflection and the agent’s competence in assessing its generated responses.
Overhead. i) Self-reflection can increase the complexity of an agent, which may affect the overall performance. ii) Refining and maintaining agents with self-reflection capabilities requires specialised expertise and development process.
Known uses:
Reflexion [2]. Reflexion employs a self-reflection model which can generate nuanced and concrete feedback based on the success status, current trajectory, and persistent memory.
Bidder agent [3]. A replanning module in this agent utilises self-reflection to create new textual plans based on the auction’s status and new context information.
Generative agents [4]. Agents perform reflection two or three times a day, by first determining the objective of reflection according to the recent activities, then generating a reflection which will be stored in the memory stream.
Summary: Cross-reflection uses different agents or foundation models to provide feedback and refine the generated plan and corresponding reasoning procedure.
Context: The agent generates a plan to achieve users’ goals, while the quality of this devised plan should be assessed.
Problem: When an agent has limited capability and cannot conduct reflection with satisfying performance, how to evaluate the output and reasoning steps of this agent?
Forces:
Reasoning uncertainty. The inconsistencies and errors in the agent’s reasoning process may reduce response accuracy and affect the overall trustworthiness.
Lack of explainability. The trustworthiness of the agent can be disturbed by the issue of transparency and explainability of how the plan is generated.
Limited capability. An agent may not be able to perform reflection well due to its limited capability and the complexity of self-reflection.
Solution: Fig. 1 includes a high-level graphical representation of cross-reflection. If an agent cannot generate accurate results or precise planning steps via reflecting its outputs, users can prompt the agent to query another agent which is specialised in reflection. The latter agent can review and evaluate the outputs and relevant reasoning steps of the original agent, and provide refinement suggestions. This process can be iterative until the reflective agent confirms the plan. In addition, multiple agents can be queried for reflection to generate comprehensive responses.
Benefits:
Reasoning certainty. The agent’s outputs and respective methodology are assessed and refined by other agents to ensure the reasoning certainty and response accuracy.
Explainability. Multiple agents can be employed to review the reasoning process of the original agent, providing thorough explanations to the user.
Inclusiveness. The reflective feedback includes different reasoning outputs when multiple agents are queried, which can help formalise a comprehensive refinement suggestion.
Scalability. Cross-reflection supports scalable agent-based systems as the reflective agents can be flexibly updated without disrupting the system operation.
Drawbacks:
Reasoning uncertainty. The overall response quality and reliability are dependent on the performance of other reflective agents.
Fairness preservation. When various agents participate in the reflection process, a critical issue would be how to preserve fairness among all the provided feedback.
Complex accountability. If the cross-reflection feedback causes serious or harmful results, the accountability process may be complex when multiple agents are employed.
Overhead. i) There will be communication overhead for the interactions between agents. ii) Users may need to pay for utilising the reflective agents.
Known uses:
XAgent. In XAgent, the tool agent can send feedback and reflection to the plan agent to indicate whether a task is completed, or pinpoint the refinements.
Yao et al. [1] explore agents’ capability of learning through communicating with each other. A thinker agent can provide suggestions to an actor agent, who is responsible for decision-making.
Qian et al. [2] develop a virtual software development company based on agents, where the tester agents can detect bugs and report to programmer agents.
Talebirad and Nadiri [3] analyse inter-agent feedback, which involves agents criticising each other and can help agents adapt their strategies.
Summary: The agent collects feedback from humans to refine the plan, to effectively align with the human preference.
Context: Agents create plans and strategies that decompose users’ goals and requirements into a pool of tasks. The tasks will be completed by other tools and agents.
Problem: How to ensure human preference is fully and correctly captured and integrated into the reasoning process and generated plans?
Forces:
Alignment to human preference. Agents are expected to achieve users’ goals ultimately, consequently, it is critical for agents to comprehend users’ preferences.
Contestability. If the agent’s outputs do not satisfy users’ requirements or will cause negative impacts, there should be a timely process for users to contest the agent’s responses.
Solution: Fig. 1 presents a high-level graphical representation of human-reflection. When a user prompts his/her goals and specified constraints, the agent first creates a plan consisting of a series of intermediate steps. The constructed plan and its reasoning process can be presented to the user for review, or sent to other human experts to validate the feasibility and usefulness. The user or expert can provide comments or suggestions to indicate which steps can be updated or replaced. The plan will be iteratively assessed and improved until it is approved by the user/expert.
Benefits:
Alignment to human preference. The agent can directly receive feedback from users or additional human experts to understand human preferences, and improve the outcomes or procedural fairness, diversity in the results, etc.
Contestability. Users or human experts can challenge the agent’s outcomes immediately if abnormal behaviours or responses are found.
Effectiveness. Human-reflection allows agents to include users’ perspectives for plan refinement, which can help formalise responses tailored to users’ specific needs and level of understanding. This can ensure the usability of strategies, and improve the effectiveness for achieving users’ goals.
Drawbacks:
Fairness preservation. The agent may be affected by users who provide skewed information about the real world.
Limited capability. Agents may still have limited capability to understand human emotions and experiences.
Underspecification. Users may provide limited or ambiguous reflective feedback to agents.
Overhead. Users may need to pay for the multiple rounds of communication with the agent.
Known uses:
Inner Monologue [1]. Inner Monologue is implemented in a robotic system, which can decompose users’ instructions into actionable steps, and leverage human feedback for object recognition.
Ma et al. [2] explore the deliberation between users and agents for decision-making. Users and agents both need to provide related evidence and arguments for their conflicting opinions.
Wang et al. [3] incorporate human feedback for agents to capture the dynamic evolution of user interests and consequently provide more accurate recommendations.
Summary: Agents can freely provide their opinions and reach consensus through voting-based cooperation.
Context: Multiple agents can be leveraged within a compound AI system. Agents need to collaborate on the same task while having their own perspectives.
Problem: How to finalise the agents’ decisions properly to ensure fairness among different agents?
Forces:
Diversity. The employed agents can have diverse opinions of how a plan is constructed or how a task should be completed.
Fairness. Decision-making among agents should take their rights and responsibilities into consideration to preserve fairness.
Accountability. The behaviours of agents should be recorded to enable future auditing if any violation is found in the collaboration outcomes.
Solution: Fig. 1 illustrates how agents can cooperate to finalise a decision via votes. Specifically, an agent can first generate a candidate response to the user’s prompts, then hold a vote in which different reflective suggestions are presented as choices. Additional agents are requested to submit their votes to select the most appropriate feedback according to their capabilities and experiences. In this circumstance, agents communicate in a centralised manner in which the original agent acts as a coordinator. The voting result is formalised and sent back to the original agent, which can refine the response accordingly before answering the user.
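A coordinator tallying votes over candidate refinements can be sketched as below. The voter agents are stubbed as callables and the optional weights illustrate role-based decision power; everything here is illustrative rather than a reference implementation.

```python
from collections import Counter

# Sketch of voting-based cooperation: a coordinator agent presents
# candidate refinements as choices and tallies the votes of peer agents.

def hold_vote(choices, voters, weights=None):
    """Tally votes over `choices`. With weights=None the vote counts heads
    (equal rights); otherwise each voter's ballot carries its role weight."""
    weights = weights or [1] * len(voters)
    tally = Counter()
    for voter, weight in zip(voters, weights):
        tally[voter(choices)] += weight
    winner, _ = tally.most_common(1)[0]
    # Return the full tally too, so stakeholders can audit the vote.
    return winner, dict(tally)

choices = ["add citations", "shorten answer", "keep as is"]
voters = [
    lambda c: c[0],   # agent favouring more evidence
    lambda c: c[0],   # another evidence-minded agent
    lambda c: c[1],   # agent favouring brevity
]
decision, record = hold_vote(choices, voters)
```

Returning the tally alongside the winner supports the accountability force: the record shows which option each weight of votes went to.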
Benefits:
Fairness. Votes can be held in multiple ways to preserve fairness. For instance, counting heads to ensure agents’ rights are equal, or weights can be distributed considering the roles of agents, etc.
Accountability. The overall procedure and final results are recorded in the respective voting system. Stakeholders can trace back to identify the accountable agents selecting certain options.
Collective intelligence. The finalised decisions after votes can leverage the strengths of multiple agents (e.g. comprehensive knowledge base), hence they are regarded as more accurate and reliable than the ones generated by a single agent.
Drawbacks:
Centralisation. Specific agents may gain the majority of decision rights and hence have the ability to compromise the voting process.
Overhead. Hosting a vote may increase the communication overhead for agents to examine and vote for the choices.
Known uses:
Hamilton [1] utilises nine agents to simulate a court, where the agents vote on the received cases. Each case is decided by the dominant voting result.
ChatEval [2]. Agents can reach consensus on users’ prompts via voting, while the voting results can be totalled by calculating either the majority vote or the average score.
Yang et al. [3] explore the alignment between agent voters (based on GPT-4 and LLaMA-2) and human voters on 24 urban projects. The results indicate that agent voters tend to make uniform choices while human voters have diverse preferences.
Li et al. [4] incrementally query a foundation model to generate N samples, and leverage multiple agents to select a final response via majority voting.
Summary: Agents are assigned assorted roles and decisions are finalised in accordance with their roles.
Context: Multiple agents can be leveraged within a compound AI system. Agents need to collaborate on the same task while having their own perspectives.
Problem: How can agents cooperate on certain tasks considering their specialties?
Forces:
Diversity. The employed agents can have diverse opinions of how a plan is constructed or how a task should be completed.
Division of labor. As agents can be trained with different corpus for various purposes, their strengths and expertise should be taken into consideration for task completion.
Fault tolerance. Agents may be unavailable during cooperation, which will affect the eventual task result.
Solution: Fig. 1 illustrates a high-level graphical representation of role-based cooperation, where agents coordinate in a hierarchical scheme. In particular, an agent-as-a-planner can generate a multi-step plan by decomposing the user’s goal into a chain of tasks. Subsequently, the agent-as-an-assigner can orchestrate task assignment, i.e., some tasks can be completed by the assigner itself, while other tasks can be delegated to a specific agent-as-a-worker based on its capabilities and expertise. In addition, if there is no available agent, an agent-as-a-creator can be invoked to create a new agent with a specific role, providing the necessary resources, clear objectives, and initial guidance to ensure a seamless transition of tasks and responsibilities. Please note that more elaborate roles can be defined and assigned to the agents.
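The planner/assigner/creator hierarchy above can be sketched as follows. Role names, the (role, task) plan format, and the worker stubs are assumptions for illustration only.

```python
# Minimal sketch of role-based cooperation: a planner decomposes the goal,
# an assigner dispatches tasks by role, and a creator spawns a new worker
# when no existing agent covers a required role.

def planner(goal):
    # Decompose the user's goal into a chain of (role, task) pairs.
    return [("coder", f"implement {goal}"), ("tester", f"test {goal}")]

def make_worker(role):
    # agent-as-a-creator: instantiate a worker for an uncovered role.
    return lambda task: f"{role} did: {task}"

def assigner(tasks, workers):
    results = []
    for role, task in tasks:
        if role not in workers:          # no available agent for this role
            workers[role] = make_worker(role)
        results.append(workers[role](task))
    return results

workers = {"coder": lambda task: f"coder did: {task}"}
results = assigner(planner("login page"), workers)
```

Keeping the worker pool as a mutable mapping is what makes the pattern fault-tolerant and scalable: an inactive worker can be replaced, and new roles can be added at any time.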
Benefits:
Division of labor. Agents can simulate the division of labor in the real world according to their roles, which enables the observation of social phenomena.
Fault tolerance. Since multiple agents are leveraged, the system can continue operation by replacing inactive agents with other agents of the same role.
Scalability. Agents of new roles can be employed or created anytime to refine the task workflow and extend the capability of the whole system.
Accountability. Accountability is facilitated as the responsibilities of agents are attributed clearly regarding their expected roles.
Drawbacks:
Overhead. Cooperation between agents will increase communication overhead, while agent services with different roles may have different prices.
Known uses:
XAgent. XAgent consists of three main parts: planner agent for task generation, dispatcher agent for task assignment, and tool agent for task completion.
MetaGPT [1]. MetaGPT utilises various agents acting as different roles (e.g., architect, project manager, engineer) in standardized operating procedures.
MedAgents [2]. Agents are assigned roles as various domain experts (e.g. cardiology, surgery, gastroenterology) to provide specialised analysis and collaboratively work on healthcare issues.
Wang et al. [3] propose Mixture-of-Agents where proposer agents provide useful reference responses to aggregator agents, and the aggregator agents are composed in layers to synthesise and refine the responses.
Summary: An agent receives feedback from other agents, and adjusts the thoughts and behaviours during the debate with other agents until a consensus is reached.
Context: A compound AI system can integrate multiple agents to provide more comprehensive services. The included agents need to collaborate on the same task while having their own perspectives.
Problem: How to leverage multiple agents to create refined responses, while facilitating the evolution of agents?
Forces:
Diversity. Different agents may have various opinions to help refine the generated responses to users.
Lack of adaptability. An agent may exhibit limited creativity in reasoning and response generation when given new context or tasks.
Lack of explainability. The interaction process of agents should be interpreted for auditing if violations are detected.
Solution: Fig. 1 depicts a high-level graphical representation of debate-based cooperation. A user can send queries to an agent, which will then share the questions with other agents. Given the shared question, each agent generates its own initial response, and subsequently a round of debate starts between the agents. Agents propagate their initial responses to each other in a decentralised manner for verification, while also providing instructions and potential planning directions to construct a more comprehensive response based on inclusive and collective outcomes. In addition, agents may utilise a shared memory in certain circumstances, or allow each other to access their respective memories to facilitate the debate. This debate process can be iterated to enhance the performance of all participating agents. Debate-based cooperation can end after a predefined number of debate rounds, or continue until a consensus answer is obtained.
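Both termination conditions (round limit and consensus) can be sketched together. The agents here are stubbed as functions that see their peers' current answers each round; the stubborn/persuadable behaviours are invented for the sketch.

```python
# Sketch of debate-based cooperation: each agent drafts an answer, then in
# every round reviews its peers' answers and may revise its own; the debate
# stops at consensus or after a fixed number of rounds.

def debate(agents, question, max_rounds=3):
    answers = [agent(question, []) for agent in agents]   # initial drafts
    for _ in range(max_rounds):
        if len(set(answers)) == 1:
            break                                         # consensus reached
        # Each agent revises after reviewing the other agents' answers.
        answers = [
            agent(question, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
    return answers

def stubborn(answer):
    # An agent that never changes its mind.
    return lambda question, peers: answer

def persuadable(answer):
    # An agent that adopts the peer answer once its peers all agree.
    def respond(question, peers):
        return peers[0] if peers and len(set(peers)) == 1 else answer
    return respond

agents = [stubborn("42"), stubborn("42"), persuadable("41")]
final = debate(agents, "What is 6 * 7?")
```

The round limit directly addresses the overhead drawback: without it, a debate among stubborn agents would never terminate.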
Benefits:
Adaptability. Agents can adapt to other agents during the debate procedure, achieving continuous learning and evolution.
Explainability. Debate-based cooperation is structured with agents’ arguments and presented evidence, preserving transparency and explainability of the whole procedure.
Critical thinking. Arguing with other agents can help an agent develop the ability of critical thinking for future reasoning process.
Drawbacks:
Limited capability. The effectiveness of debate-based cooperation relies on agents’ capabilities of reasoning, argumentation, and evaluation of other agents’ statements.
Data privacy. Agents may need to withhold certain sensitive information, which can affect the debate process.
Overhead. The complexity of debate may increase the communication and computation overhead.
Scalability preservation. The system scalability may be affected as the number of participating agents increases. The coordination of agents and processing of their arguments may become complex.
Known uses:
crewAI. crewAI provides a multi-agent orchestration framework where multiple agents can be grouped for discussion on a given topic.
Liang et al. [1] leverage multi-agent debate to address the issue of “Degeneration-of-Thought”. Within the debate, an agent needs to persuade another and correct the mistakes.
Du et al. [2] employ multiple agents to discuss the given user input, and the experiment results indicate that the agents can converge on a consensus answer after multiple rounds.
Chen et al. [3] explore the negotiation process in a multi-agent system, where each agent can perceive the outcomes of other agents, and adjust its own strategies.
Li et al. [4] propose a framework including peer rank and discussion between agents, to mitigate the biases in automated evaluation process.
Summary: Multimodal guardrails can control the inputs and outputs of foundation models to meet specific requirements such as user requirements, ethical standards, and laws.
Context: An agent consists of foundation model and other components. When users prompt specific goals to the agent, the underlying foundation model is queried for goal achievement.
Problem: How to prevent the foundation model from being influenced by adversarial inputs, or from generating harmful or undesirable outputs for users and other components?
Forces:
Robustness. Adversarial information may be sent to the foundation model, which will affect the model’s memory and all subsequent reasoning processes and results.
Safety. Foundation models may generate inappropriate responses due to hallucinations, which can be offensive to users, and disturb the operation of other components (e.g., other agents, external tools).
Standard alignment. Agents and the underlying foundation models should align with the specific standards and requirements in industries and organisations.
Solution: Fig. 1 presents a simplified graphical representation of multimodal guardrails. Guardrails can be applied as an intermediate layer between the foundation model and all other components in a compound AI system. When users send prompts or other components (e.g. memory) transfer any message to the foundation model, guardrails can first verify whether the information meets specific predefined requirements; only valid information is delivered to the foundation model. For instance, personally identifiable information should be treated with care or removed to protect privacy. Guardrails can evaluate the contents either relying on predefined examples, or in a “reference-free” manner. Equivalently, when the foundation model creates results, the guardrails need to ensure that the responses do not expose users to biased or irrelevant information, and that they fulfil the particular requirements of other system components. Please note that a set of guardrails can be implemented where each of them is responsible for specialised interactions, e.g., information retrieval from the datastore, validation of users’ input, external API invocation, etc. Meanwhile, guardrails are capable of processing multimodal data such as text, audio, and video to provide comprehensive monitoring and control.
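A minimal (text-only) guardrail layer can be sketched as below. The injection phrase, PII pattern, blocklist, and echo model are all assumptions for illustration; production rule sets would be far richer and would cover other modalities.

```python
import re

# Sketch of a guardrail layer between callers and the foundation model:
# inputs are screened for injected instructions and PII is redacted;
# outputs are screened against a simple blocklist.

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
BLOCKED_OUTPUT = ("offensive-term",)

def input_guardrail(prompt):
    if "ignore previous instructions" in prompt.lower():
        raise ValueError("adversarial input rejected")
    # Strip PII before the model (or its memory) ever sees it.
    return EMAIL.sub("[REDACTED]", prompt)

def output_guardrail(response):
    if any(term in response for term in BLOCKED_OUTPUT):
        return "Response withheld by guardrail."
    return response

def guarded_query(model, prompt):
    # The guardrail wraps every interaction with the foundation model.
    return output_guardrail(model(input_guardrail(prompt)))

# Stub model that simply echoes its input.
echo = lambda p: f"echo: {p}"
safe = guarded_query(echo, "Summarise mail from alice@example.com")
```

Because the guardrail wraps the model call itself, every component (memory, tools, users) passes through the same checkpoint, matching the intermediate-layer placement described above.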
Benefits:
Robustness. Guardrails preserve the robustness of foundation models by filtering the inappropriate context information.
Safety. Guardrails serve as validators of foundation model outcomes, ensuring the generated responses do not harm agent users.
Standard alignment. Guardrails can be configured referring to organisational policies and strategies, ethical standards, and legal requirements to regulate the behaviours of foundation models.
Adaptability. Guardrails can be implemented across various foundation models and agents, and deployed with customised requirements.
Drawbacks:
Overhead. i) Collecting diverse and high-quality corpora to develop multimodal guardrails may be resource-intensive. ii) Processing multimodal data in real time can increase computational requirements and costs.
Lack of explainability. The complexity of multimodal guardrails makes it difficult to explain the finalised outputs.
Known uses:
NeMo guardrails [1]. NVIDIA released NeMo guardrails, which are specifically designed to ensure the coherency of dialogue between users and AI systems, and prevent negative impact of misinformation and sensitive topics.
Llama guard [2]. Meta published Llama guard, a foundation model based safeguard model fine-tuned via a safety risk taxonomy. Llama guard can identify the potentially risky or violating content in users’ prompts and model outputs.
Guardrails AI. Guardrails AI provides a hub, listing various validators for handling different risks in the inputs and outputs of foundation models.
Summary: The tool/agent registry maintains a unified and convenient source to select diverse agents and tools.
Context: Within an agent, the task executor may cooperate with other agents or leverage external tools for expanded capabilities.
Problem: Given the diversity of available agents and tools, how can the agent efficiently select appropriate external agents and tools?
Forces:
Discoverability. It may be difficult for users and agents to discover the available agents and tools considering the diversity.
Efficiency. Users/agents need to finalise agent and tool selection within a certain time period.
Tool appropriateness. Particular tasks may have specific requirements of agents/tools (e.g. certain capabilities).
Solution: Fig. 1 depicts how an agent can search for external agents and tools via a tool/agent registry. A user prompts goals to an agent, which then decomposes the goals into fine-grained tasks. The agent can query the tool/agent registry, which is the main entry point for collecting and categorising various tools and agents regarding a series of metrics (e.g., capability, price, context window). Based on the returned information, the agent can employ and assign the tasks to the respective tools and agents. Please note that a registry can be implemented in different manners, for instance, as a coordinator agent with a specific knowledge base, a blockchain-based smart contract, etc., and a registry can be extended into a marketplace for tool/agent service trading.
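A registry lookup over the metrics mentioned above (capability, price) can be sketched as follows. The entries, field names, and selection policy (cheapest match) are invented for the sketch.

```python
# Sketch of a tool/agent registry: entries record a capability plus
# metadata, and an agent queries by required capability.

registry = [
    {"name": "web-search-tool", "capability": "search",   "price": 0.01},
    {"name": "sql-agent",       "capability": "database", "price": 0.05},
    {"name": "cheap-searcher",  "capability": "search",   "price": 0.002},
]

def lookup(capability):
    """Return registry entries providing the capability, cheapest first."""
    matches = [e for e in registry if e["capability"] == capability]
    return sorted(matches, key=lambda e: e["price"])

def select_tool(capability):
    candidates = lookup(capability)
    if not candidates:
        raise LookupError(f"no tool registered for {capability!r}")
    # Simple policy: take the cheapest candidate; a real agent would
    # weigh further attributes such as latency or context window.
    return candidates[0]["name"]

chosen = select_tool("search")
```

Storing only such lightweight metadata is what keeps the registry scalable, as noted under Benefits.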
Benefits:
Discoverability. The registry provides a catalogue for users and agents to discover tools and agents with different capabilities.
Efficiency. The registry offers an intuitive inventory listing the attributes (e.g., performance, price) of tools and agents, which saves time for comparison.
Tool appropriateness. Given the task requirements and conditions, users and agents can select the most appropriate tools/agents according to the provided attributes.
Scalability. The registry only stores certain metadata about tools and agents, hence the data structure is simple and lightweight, which ensures the scalability of the registry.
Drawbacks:
Centralisation. The registry may become a vendor lock-in solution and a single point of failure. It may be manipulated or compromised if it is maintained by external entities.
Overhead. Implementing and maintaining a tool/agent registry can introduce additional complexity and overhead.
Known uses:
GPTStore. GPTStore provides a catalogue for searching ChatGPT-based agents.
TPTU [1]. TPTU incorporates a toolset to broaden the capabilities of AI Agents.
VOYAGER [2]. VOYAGER can store action programs and hence incrementally establish a skill library for reusability.
OpenAgents [3]. An agent is specifically developed to manage the API invocation of plugins.
Summary: An agent adapter provides interface to connect the agent and external tools for task completion.
Context: An agent may leverage external tools to complete certain tasks for expanded capabilities.
Problem: The agent needs to deal with the different interfaces of diverse tools, while certain interfaces might be incompatible or inefficient for the agent to interact with. How can the agent assign tasks to external tools and process the results?
Forces:
Interoperability. Certain tasks require external tools to complete, and the tools may need agents to process particular information during intermediate steps.
Adaptability. Agents may employ new tools considering task complexity, tool capability, cost, etc.
Overhead. Manual development of compatible interfaces for agents and external tools can be intensive and inefficient.
Solution: Fig. 1 demonstrates a simplified graphical representation of the agent adapter. Given the user’s instructions, the agent generates a plan consisting of a set of tasks to achieve the user’s goals. In particular, the agent may employ diverse external tools to complete different tasks. However, tools have their respective interfaces, which can sit at different abstraction levels for the agent to deal with, or have specific format requirements, etc. The agent adapter can help invoke and manage these interfaces by converting agent messages into the required format or content, and vice versa. In particular, the adapter can retrieve a tool manual or tutorial from the datastore to learn the available interfaces. It then transforms the agent outputs based on the interface requirements, and invokes the service [68]. Please note that a fine-grained interface description can help the agent understand the tool and hence improve performance. The adapter also receives execution results from tools, which are sent to the underlying foundation model for further analysis (e.g. task assignment to other tools, self-reflection for tool employment). For instance, the adapter can translate tasks into system messages when interacting with the local file system, or capture and operate a graphical user interface when playing a video game.
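The format-conversion role of the adapter can be sketched as below. The two tool interfaces (a plain-string shell tool and a JSON-bodied REST-style tool), the task dict shape, and the dispatch table are invented examples.

```python
import json

# Sketch of an agent adapter: the agent emits a uniform task dict, and the
# adapter converts it into each tool's native call format, then normalises
# the heterogeneous results back into one shape for the foundation model.

def shell_tool(command):
    # Tool expecting a plain command string.
    return f"ran: {command}"

def rest_tool(payload):
    # Tool expecting a JSON-encoded request body.
    data = json.loads(payload)
    return {"status": "ok", "action": data["action"]}

ADAPTERS = {
    "shell": lambda task: shell_tool(task["action"]),
    "rest":  lambda task: rest_tool(json.dumps(task)),
}

def invoke(task):
    """Dispatch an agent task to the right tool via its adapter."""
    result = ADAPTERS[task["tool"]](task)
    # Normalise tool outputs so the model sees one consistent envelope.
    return {"tool": task["tool"], "output": result}

out = invoke({"tool": "rest", "action": "list-files"})
```

Adding a new tool then means writing one adapter entry rather than changing the agent, which is the interoperability benefit the pattern targets.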
Summary: Agent evaluator can perform testing to assess the agent regarding diverse requirements and metrics.
Context: Within an agent, the underlying foundation model and a series of components coordinate to conduct reasoning and generate the responses given users’ prompts.
Problem: How to assess the performance of agents to ensure they behave as intended?
Forces:
Functional suitability guarantee. Agent developers need to ensure that a deployed agent operates as intended, providing complete, correct, and appropriate services to users.
Adaptability improvement. Agent developers need to understand and analyse the usage of agents in specific scenarios, to perform suitable adaptations.
Solution: Fig. 1 presents a simplified graphical representation of the agent evaluator. Developers can deploy the evaluator to assess the agent’s responses and reasoning process at both design-time and runtime. Specifically, developers need to build up the evaluation pipeline, for instance, by defining specific scenario-based requirements, metrics, and expected outputs from agents. Given a particular context, the agent evaluator prepares context-specific test cases (either retrieved from external resources or generated by the evaluator itself), and evaluates the agent components respectively. The evaluation results provide valuable feedback such as boundary cases, near-misses, etc.; developers can further fine-tune the agent or employ corresponding risk mitigation solutions, and also upgrade the evaluator based on the results.
Figure 1. Agent evaluator.
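The pipeline just described can be sketched as follows. The test-case format, the exact-match metric, and the stub agent are assumptions for illustration, not a specific evaluation framework.

```python
# Sketch of an agent evaluator pipeline: scenario-based test cases pair a
# prompt with an expected output; the evaluator runs the agent on each
# case, scores it with a metric, and reports per-case and mean scores.

def exact_match(actual, expected):
    return 1.0 if actual == expected else 0.0

def evaluate(agent, cases, metric=exact_match):
    report = []
    for case in cases:
        actual = agent(case["prompt"])
        report.append({
            "prompt": case["prompt"],
            "score": metric(actual, case["expected"]),
        })
    # The aggregate score lets developers spot regressions; the per-case
    # report surfaces the boundary cases and near-misses mentioned above.
    mean = sum(r["score"] for r in report) / len(report)
    return mean, report

# Stub agent and two scenario-based test cases (one deliberate failure).
agent = lambda prompt: prompt.upper()
cases = [
    {"prompt": "ok", "expected": "OK"},
    {"prompt": "no", "expected": "nope"},
]
mean, report = evaluate(agent, cases)
```

Swapping in a custom metric (e.g. a semantic-similarity score) is how the flexibility benefit below is realised in practice.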
Benefits:
Functional suitability. Agent developers can learn the agent’s behavior, and compare the actual responses with expected ones through the evaluation results.
Adaptability. Agent developers can analyse the evaluation results regarding scenario-based requirements, and decide whether the agent should adapt to new requirements or test cases.
Flexibility. Agent developers can define customised metrics and the expected outputs to test a specific aspect of the agent.
Drawbacks:
Metric quantification. It is difficult to design quantified rubrics for the assessment of software quality attributes.
Quality of evaluation. The evaluation quality is dependent on the prepared test cases.
Known uses:
Inspect. The UK AI Safety Institute devised an evaluation framework for large language models that offers a series of built-in components, including prompt engineering, tool usage, etc.
DeepEval. DeepEval incorporates 14 evaluation metrics, and supports agent development frameworks such as LlamaIndex, Hugging Face, etc.
Promptfoo. Promptfoo can provide efficient evaluation services with caching, concurrency, and live reloading, and also enables automated scoring based on user-defined metrics.
Ragas. Ragas facilitates evaluation of RAG pipelines via test dataset generation and LLM-assisted evaluation metrics.