[00:00] Everyone's John. So, today we're going
[00:01] to go through probably the biggest
[00:03] buzzwords in AI agent system recently,
[00:05] agent harness, loop engineering, LLMOps,
[00:08] which stands for large language models
[00:09] operations, eval, which stands for
[00:12] evaluation system for AI agents. And
[00:15] these things become popular or become
[00:16] viral on the internet not because they
[00:18] are just some really complicated
[00:20] concepts. Instead, they're actually very
[00:22] simple. And I believe that simple
[00:23] building blocks will actually help us
[00:25] build the biggest architecture in the
[00:27] world that will function like an
[00:28] intelligent system. Let's walk through
[00:30] this step-by-step, and it does not
[00:31] matter if you're technical or not. We're
[00:34] going to go through this and we'll make
[00:35] sure that you're equipped with the right
[00:37] knowledge for prompting your way through
[00:39] building such a system in the future.
[00:41] Let's jump in and get started. For those
[00:43] of you who have watched my previous
[00:44] video on AI agent memories, you're
[00:46] probably already familiar with this
[00:47] chart. This is an AI agent run, which
[00:50] means that it takes an input from a user
[00:52] prompt. For example, you're asking
[00:54] ChatGPT or Deep Seek a question and say,
[00:57] "Hey, when was Sam Altman fired from
[00:58] OpenAI?" And then it's going to go
[00:59] through entire run, but the end goal is
[01:01] that you want to get a response. This is
[01:03] actually ephemeral, which means that
[01:05] there's no memory in this at all. We're
[01:06] sending that question, "When was Sam
[01:08] Altman fired?" and any chat history
[01:10] that's currently in the chat. For
[01:11] example, maybe we had some conversations
[01:13] before that, which for example could be
[01:15] you should talk to me like Elon Musk
[01:17] grilling on Sam Altman because they
[01:19] don't like each other. And then these
[01:20] things will be fed into this thing
[01:21] called a working memory or a context
[01:23] RAM. In this video, we probably won't
[01:25] dive too much in-depth into the memory
[01:27] system because there's a previous video
[01:29] talking about it already. I'll just
[01:30] quickly go through it and then we'll
[01:32] introduce the concept of what a harness
[01:34] means. When you have these kind of
[01:35] short-term working memory, there will be
[01:37] an LLM or a large language model which
[01:39] performs as a question and answer agent.
[01:41] And at the end, you're going to get a
[01:42] reply. But the problem with a simple
[01:45] agent run with simply just the question,
[01:47] current chat history, and system prompt
[01:49] is that the memory is very short-term.
[01:52] But when you run an AI agent system,
[01:54] sometimes we need extra memories. For
[01:56] example, how should the agent respond to
[01:58] the person? A procedural memory is
[02:00] exactly that. It basically tells the
[02:01] agent how to act and what are some of
[02:03] the instructions for the skill. We might
[02:05] also want the agent to know some durable
[02:07] facts about this context. For example, I
[02:09] might want to compare my own early-stage
[02:11] startup journey with Sam Altman's early
[02:13] startup journey. We need this agent to
[02:15] have a memory of who I am, which in this
[02:17] context would be a durable facts or a
[02:20] semantic memory. Who Sean is, what did
[02:22] he build in the past? These kind of
[02:23] things became a fact that you want your
[02:25] agent to know, but they're not publicly
[02:27] available if you're not famous because
[02:28] the AI model won't be trained on such
[02:30] information yet. But if you're famous
[02:32] already, you can skip this. They already
[02:33] know who you are. And another thing we
[02:35] need is called episodic memory, and they
[02:36] include things like the past events or
[02:38] past chat history that does not exist in
[02:41] this current conversation. For example,
[02:43] I might suddenly be wondering, "When was
[02:44] the last time I was preparing for a job
[02:46] application?" And can we retrieve that
[02:48] information and match, you know, if we
[02:49] can get a job in ChatGPT. So, these
[02:52] things will be retrieved from this thing
[02:54] called an episodic memory, which is
[02:55] basically a time series of the previous
[02:57] conversations or previous triggers that
[03:00] happened if you have a more complex
[03:01] system. So, for those of you who have
[03:03] watched my previous memory agent system
[03:04] design, you might be wondering, "Sean,
[03:06] why are you repeating all of these
[03:07] things?" And that is because if you
[03:09] think about the entire thing that we
[03:10] just covered in the past few minutes,
[03:13] we're really stating the one fact that a
[03:15] large language model can't do these
[03:17] things by itself. It's like a really
[03:18] powerful brain that knows everything
[03:22] about humanity, everything about
[03:23] science, anything that happened in human
[03:25] or biology history. But, it does not
[03:27] know you. With you or the software who's
[03:30] running this AI agent system, the large
[03:32] language model has no clue with how you
[03:35] want it to perform. This is why the
[03:37] concept called harness becomes really
[03:40] important in this. What harness means
[03:42] literally is that it's a set of harness
[03:44] tools that you use to control a horse
[03:47] when you're doing a horse riding.
[03:48] Imagine this large language model is a
[03:50] horse, right? This horse is very
[03:51] powerful, they can run around, but if
[03:53] you don't have a good set of tools to
[03:55] ride this horse, you could just get
[03:57] hurt, you might go anywhere, you might
[03:59] go somewhere random. If you're in a war,
[04:01] you don't want that to happen. And
[04:02] that's why we're doing all of these to
[04:04] make sure we have good control over this
[04:07] large language model and make sure we're
[04:08] utilizing it at its maximum potential.
[04:10] That's why in addition to just the
[04:12] question or use a prompt and getting the
[04:14] reply, and we're feeding them all in as
[04:16] a working memory which can be enhanced
[04:18] by these three memories we just talked
[04:20] about. And in order for these three
[04:21] memories to actually work, there's a bit
[04:23] more details and they're all included in
[04:26] Harness. Remember, Harness means we're
[04:27] building this agent framework to control
[04:30] this large language model so that it
[04:32] works the way we want. For those of you
[04:34] who study statistics or machine
[04:35] learning, you would understand that a
[04:37] large language model is actually
[04:38] predicting the probability of the next
[04:40] word that it should spit out. When
[04:42] everything comes with probability,
[04:43] there's randomness in it. But when we
[04:45] solve problems, we sometimes don't want
[04:47] too much randomness. So, that's why we
[04:49] need to have a good control over this
[04:51] technology. Now, let's continue to
[04:52] finish this Harness. There are lots of
[04:54] tools on the market that's already quite
[04:56] useful. For example, you could try tools
[04:58] like LangGraph, LangChain, or Pydantic,
[05:00] and there are many others. In this
[05:01] video, we won't dive too much in depth
[05:03] into that, and we're going to finish
[05:04] building up this Harness before we move
[05:06] on to the next topic. So again, for this
[05:07] agent to work properly, we need this
[05:09] memory system to work, but this memory
[05:11] system needs an update system cuz memory
[05:13] doesn't just exist or pop up from
[05:15] nowhere. You need to constantly update
[05:17] it. That's why we need a database to
[05:19] store all these memories so that when
[05:21] the agent is running in this agent run,
[05:23] it knows where to retrieve these
[05:25] memories. So, whenever you see an icon
[05:27] like this, this is a database, okay?
[05:29] Procedural memory is basically remember
[05:31] it's it's about instructions, right?
[05:32] It's about how the agent should be
[05:34] acting. It's like with a Harness on a
[05:35] horse, you want the horse to ride faster
[05:37] or slower. Normally, these are just
[05:39] files or text, and that's why you
[05:40] probably heard of this buzzword called
[05:42] skills. Skill is basically a piece of
[05:44] text in a markdown file that you feed
[05:46] into AI agent like Clockwork. But, if
[05:48] you want to harness the system well,
[05:50] just having files and text is not
[05:52] enough. And they're stored in databases
[05:54] say like AWS, Superbase, Google Cloud,
[05:57] you know, Azure, all these kind of
[05:58] places. Or you can set up your own
[06:00] server at home if you want, but that's
[06:01] just too expensive. You don't want to do
[06:03] that. And in order for this harness to
[06:04] work properly, you also need to figure
[06:06] out how to store the memories, okay? So,
[06:09] for example, the episodic memory is a
[06:11] time series of the events that happened
[06:13] or the previous chat history, again. The
[06:14] way we store it is actually very simple.
[06:16] You just track every single thing that
[06:17] happened, and then it's going to become
[06:19] like a very long list of things that
[06:21] happened in history with timestamps.
[06:23] Durable facts is different story. You
[06:24] can either input it yourself or you want
[06:26] the system to sort of automatically
[06:28] evolve over time. And the way for it to
[06:30] evolve is that you want to consolidate
[06:32] some of the conversations into the
[06:34] semantic memory. If I'm running a D2C
[06:36] brand e-commerce company, perhaps my
[06:37] customers have talked to my customer
[06:39] service agents for a million times about
[06:42] how do I get reimbursed if this product
[06:43] does not work? You want to consolidate
[06:45] these conversations and distill them as
[06:48] a fact into the semantic memory. And
[06:50] that's why here we have a little gate
[06:51] here. If a brand has a million people
[06:53] purchasing products from it, let's say
[06:54] you're Alibaba or Amazon, it just
[06:56] doesn't make sense, and it's very
[06:57] expensive. So, from a harness
[06:59] perspective, you want to be smart about
[07:00] this, and you want the system to be
[07:02] automatic. And then simple way is
[07:03] probably like maybe consolidate these
[07:05] time-ordered events after every say
[07:07] 2,000 conversations because you have a
[07:08] million customers. And then you can feed
[07:10] these things into a summarizer agent,
[07:12] which is another large language model
[07:14] harness. You can define the system
[07:15] prompt in this one. You can probably
[07:17] feed it with some memories, too. Uh you
[07:19] can configure different models. Maybe it
[07:21] could be a cheaper model because you're
[07:22] feeding too much text into it. So, the
[07:25] context window's too big, and probably
[07:27] these are very expensive. So, you can
[07:28] use cheaper open-source models if you
[07:30] want to. Having such a mechanism allows
[07:32] you to consistently update this memory
[07:36] system. The data should be coming from
[07:39] the previous large language model
[07:41] replies. Again, let's review how this
[07:43] harness works. A user sent a prompt
[07:45] where in one Asian runtime with the
[07:47] current chat history and how the agent
[07:49] should be performing, the system prompt,
[07:52] we're preparing a working memory for
[07:53] this AI agent to be able to answer a
[07:55] question. And after every single time it
[07:58] answered a question, it will send these
[08:00] messages to this database. And then this
[08:03] database is basically is feeding back to
[08:06] this working memory every single time
[08:08] when a question is checking for relevant
[08:10] context. And at the same time, because
[08:13] this database is too big, sometimes you
[08:15] want to consolidate them into some
[08:17] summarized information or distilled
[08:19] facts so that they're stored properly in
[08:22] a semantic memory so that the retrieval
[08:24] of such memories is just faster. I know
[08:26] we talked about retrieval a lot and
[08:28] that's just another buzzword called rag,
[08:30] which is retrieval augmented
[08:31] generations. I also have a few videos
[08:33] explaining what rags are. Feel free to
[08:35] watch them. There's a little bit of
[08:36] difference between how retrieve from
[08:38] semantic memory and episodic memory. For
[08:41] semantic memory, it's just rags because
[08:42] these are just facts and text or files,
[08:45] right? But then for episodic memory,
[08:47] remember this is a time series. Let's
[08:48] say we're still in this e-commerce
[08:49] store, right? The user question could be
[08:51] like, "What are the previous 10
[08:52] conversations that we had with this
[08:54] specific customer from the US?" And then
[08:56] you might just need a SQL query to query
[08:58] something that's pretty recent from this
[09:00] episodic memory. But if your question is
[09:02] like, "What were my previous 20
[09:03] conversations that have customer
[09:05] complaints on the quality of the
[09:07] products and our agent did not
[09:09] successfully resolve." With such a
[09:11] question, you not only need a SQL query,
[09:14] which is just capturing the dated events
[09:17] in a data table, you also want to do
[09:19] some semantic search. And that's why
[09:21] here rag is important because it's
[09:23] checking for relevant information for
[09:25] you. You don't want the entire 2,000
[09:27] messages. You want that 20 messages out
[09:29] of these 2,000 that are exactly relevant
[09:32] to what you want. And because these
[09:34] complaints are in text, we need to do
[09:36] some retrieval augmented generation to
[09:39] match the semantic meanings between text
[09:41] and the user prompt, so that you're
[09:43] fetching the right context for the
[09:44] working memory. By now, this is probably
[09:46] a fast walk-through of the memory system
[09:49] again, but we're just thinking about it
[09:51] from a hardness perspective in this
[09:53] video. And remember for harness, we're
[09:54] training this horse of LLM to run
[09:57] autonomously without having too much
[09:59] randomness, okay? There's another piece
[10:00] of it that's quite important, which is
[10:02] the agent might not only just read the
[10:04] memory. It might also do some tasks or
[10:07] call some tools. When an agent calls
[10:10] tools, it might not necessarily be just
[10:12] one-time call. It could be multiple
[10:14] times of calls. For instance, let's say
[10:17] this AI agent has a bunch of agentic
[10:19] tools such as help me schedule a
[10:22] meeting, help me read or write my
[10:23] customer relationship data from the CRM
[10:26] system, or help me fetch the payment
[10:28] information say from Stripe or Alipay.
[10:30] And here's something we should be
[10:31] careful about. If we give this horse or
[10:33] this LLM technology full power, it could
[10:36] just continuously do this forever,
[10:38] right? Or it might not even know what's
[10:40] the right time to stop or what is the
[10:42] right tool calls it should make, when is
[10:44] the end point to decide, okay, this
[10:46] response is good enough, let's move on
[10:48] to reply. That's why we have this
[10:50] mechanism called end loop guardrails.
[10:52] Yes, now we're talking about loop
[10:54] engineering, one of the biggest
[10:55] buzzwords in the recent few months. A
[10:56] loop is part of harness. Why? Because a
[11:00] loop is also helping us to control this
[11:04] technology to make sure it runs the way
[11:06] we want it to run. An example that could
[11:08] be helpful for you is that let's say the
[11:11] custom prompt is help me find out what
[11:13] customers are complaining about our
[11:16] products. What are some of the
[11:17] follow-ups we could do in order to win
[11:20] them back. And if they're asking for
[11:23] reimbursement, have we done the
[11:25] reimbursement or not? If not, can we do
[11:26] that? This is probably a series of
[11:29] questions, but sometimes we just dump
[11:30] all of these things into AI agent, okay?
[11:32] And after you have this prompt, this LLM
[11:35] agent needs to decide, okay, what are
[11:37] some of the tools that could be helpful
[11:39] for me to finish this task? Loop here is
[11:42] basically an architectural thinking of
[11:45] when is good enough so that we stop and
[11:48] give the user or the business owner a
[11:50] reply, okay? So, what might happen here
[11:53] is that the LLM agent is doing a bunch
[11:55] of tool calls, it's doing some thinking,
[11:57] it's saying, let me read from our
[11:59] customer relationship management tool
[12:01] like Salesforce, HubSpot, or Automanous.
[12:03] And then it's going to find out, okay,
[12:04] there were 30 customer complaints in the
[12:07] past 2 months. 12 of them have got
[12:09] reimbursement, the other eight have not
[12:11] got reimbursement. So, after the first
[12:13] initial fetch, it's probably responding
[12:15] to the AI agent, right? And the agent
[12:17] will be probably thinking, okay, the the
[12:18] task or the ask is that can we can we
[12:21] follow up with some of those who did not
[12:23] get
[12:24] the reimbursement, right? So, and then
[12:26] it probably like just make another tool
[12:27] calls and be like, hey, let's schedule a
[12:30] meeting with those customers who did not
[12:33] get a reimbursement, which are the eight
[12:34] of them. If we go a little bit more
[12:36] advanced, we even just use the
[12:37] reimbursement trigger on Stripe or
[12:39] Alipay to refund the customer. Can you
[12:41] see that this is a loop until we finish
[12:43] the task? And but of course, this is
[12:45] like a case-by-case situation. It really
[12:47] depends on what your task is, how you
[12:49] build the system. So, there's no
[12:50] one-solution-fits-all. Here, I'm just
[12:52] explaining what a loop is. The very,
[12:54] very essential part of this loop is that
[12:56] it needs to know when it should stop.
[12:58] That's why we need this end loop
[13:00] guardrails. The guardrails could just
[13:01] simply be the task is done. And perhaps
[13:04] when the agent was doing the planning,
[13:05] it should confirm with the user what is
[13:07] a good ending point. It might clarify
[13:08] with you, is this what you want,
[13:10] reimbursing the other eight people, or
[13:11] should I just tell you who they are and
[13:13] then you will follow up later? All
[13:14] right, these are two different decisions
[13:16] you can make. And after you make, you're
[13:17] basically telling the agent loop that
[13:19] there's an ending scenario. Another good
[13:21] example I saw today is that, you know,
[13:23] when you're doing coding and cloud code
[13:26] can just always pop up some windows and
[13:28] ask you for permissions, right? So, the
[13:30] good way to use a loop engineering here
[13:32] is that you can set up a loop or set up
[13:34] a hook in clock code and telling it that
[13:36] you should always send me a notification
[13:38] on my laptop if you are pending on some
[13:41] permissions from me. Otherwise, if I'm
[13:42] watching YouTube and then when I come
[13:44] back 30 minutes later, I realize that
[13:46] clock code is stuck in that one
[13:48] permission like 25 minutes ago. That
[13:50] would be a waste of my time, okay? So,
[13:52] you can set up a loop like this to make
[13:53] sure that there's a way to send you
[13:55] notification pop-ups so that you know
[13:57] the loop has ended or it needs your
[13:59] input again. Are you guys still with me?
[14:01] Good. So, by far we have covered AI
[14:04] agent run with a memory system, with a
[14:07] loop engineering around the large
[14:08] language model agent, which has a
[14:10] trigger to end the loop so that it sends
[14:12] a reply to the user, and basically this
[14:14] whole thing is an AI agent harness
[14:17] system. What's next is one of the other
[14:19] biggest buzzwords that Y Combinator
[14:22] always mentions, which is eval or LLM
[14:24] ops. Let's jump into it. But firstly, I
[14:27] want you to understand why do we need
[14:29] LLM ops here? Let's still look at the
[14:31] left-hand side with this harness system.
[14:33] The biggest problem here is that we
[14:35] don't know how well it's performing. And
[14:37] that's why we need a feedback loop to
[14:39] help us understand is this agent
[14:42] actually performing properly, right? For
[14:44] my business or for my use case. And can
[14:47] I continuously get feedback on how do we
[14:50] fix it and actually fix it ourselves,
[14:53] okay? And when we say fix it, a simple
[14:55] way to understand it is that
[14:57] can we have a better system prompt?
[15:00] Can we have a better large language
[15:01] model configurations?
[15:03] Is there something we should change for
[15:05] how we retrieve the AI agent memories?
[15:08] These are kind of things that we can
[15:09] continue to iterate. But in order to
[15:11] iterate to make sure this system runs
[15:13] properly, we need a way to evaluate it,
[15:17] diagnose problems, solve the problems
[15:19] until it's a healthy and well-performing
[15:22] system. And that is called large
[15:25] language model operations system, LLM
[15:28] ops. So again, in order to understand
[15:30] this properly, we need to come back to
[15:33] what an agent run is. So an agent run,
[15:35] you can simply understand it as a user
[15:37] question is sent to a large language
[15:39] model and then you get a reply. That is
[15:41] one agent run. But in this agent run,
[15:44] the agent tool calling could happen
[15:45] multiple times. That does not matter,
[15:47] right? We're just talking about from a
[15:49] user input to a response from agent
[15:52] perspective. That's one agent run. And
[15:55] then we're going to introduce this
[15:56] system called a tracing system. So every
[15:59] agent run, we should trace like a tree
[16:02] of events that happened. And there are
[16:04] lots of tools that could help you with
[16:05] that. It could be LangFuse, could be
[16:06] LangSmith, etc., etc. A tree of events
[16:08] could be like what did the person
[16:09] actually ask, what retrievals did the
[16:11] model actually retrieve, how many times
[16:14] did the large language model actually
[16:15] call the tools, and how was the tool
[16:17] usage, how was the response time, right?
[16:20] How long did it take for this entire
[16:22] system to run for checking latencies,
[16:24] and how many tokens have we used when we
[16:26] do these tool calls, agent run, you
[16:29] know, doing this retrieval augmented
[16:31] generation, these kind of things. So
[16:33] trace is helping us to track events,
[16:35] basically. And that's the first step.
[16:37] This is the step first to collect data.
[16:39] And these data will be used for the
[16:41] following two purposes. Was it a good
[16:44] system run? And was it healthy? Which
[16:46] corresponds to evaluation system.
[16:49] We can probably use large language model
[16:50] as a judge here to give us a score on
[16:54] how well it performed. For example, if
[16:56] the task has something to do with
[16:58] schedule meetings, did the meeting
[17:00] actually triggered? How long was the
[17:01] response for an agent to reply to a
[17:03] question? Was it 20 seconds or was it 2
[17:05] milliseconds? And also things like how
[17:07] many tokens have we used? These two are
[17:09] basically in the same system. You can
[17:11] write it as a deterministic code, you
[17:12] can use an AI agent to do it, but this
[17:14] is like part of the procedure, which is
[17:16] helping us to understand was this a
[17:17] healthy system?
[17:19] And was it a good system? And after
[17:21] that, we're going to diagnose, okay,
[17:23] where and why something was broken. For
[17:25] example, the meeting scheduling event
[17:28] was never triggered. Why was that?
[17:30] Right? We want to understand why was
[17:31] that. And we could probably feed that
[17:33] into coding agent in Claude to sort of
[17:35] deep dive into it. Or, you know, if the
[17:37] latency is 20 seconds instead of 2
[17:39] milliseconds, something's wrong. Maybe
[17:41] one of the tool call is taking too much
[17:43] time.
[17:44] Maybe the working memory is too large,
[17:48] uh so that the response time for a large
[17:49] language model to a memory retrieval is
[17:51] just taking too much time. Maybe not
[17:53] every single question requires a
[17:55] retrieval from all these gigantic memory
[17:58] system. Maybe you're just asking a
[18:00] simple question be like, when was my
[18:01] birthday? When was OpenAI started? And
[18:04] these kind of information you probably
[18:05] don't need to do a ton of retrieval. The
[18:07] model itself already knows. So, you
[18:08] basically want this system to provide a
[18:10] dashboard for you to understand the
[18:12] metrics. And then, with these metrics,
[18:14] you can diagnose what is going wrong.
[18:16] And then we're going to have a little
[18:16] gate here, which is if the evaluation
[18:19] system passed, well, you can define the
[18:21] rules, we can either ship some very
[18:23] simple fix, have a new version of the
[18:25] prompt, or update the model
[18:26] configuration, you know, some tool
[18:28] changes, or the parameters for
[18:30] retrievals. The LLM Ops will feed the
[18:33] improved system prompt and the
[18:35] configuration of the model back to this
[18:37] agent run system. When then one LLM Ops
[18:40] loop is finished. If, let's say,
[18:42] something is deeply broken, right? We
[18:44] cannot just simply ship the latest
[18:46] version of the prompt. Then we should go
[18:48] fix the bug, rerun the agent run, resend
[18:50] the question, and then retrace the
[18:52] events, and then redo this evaluation
[18:55] system in this LLM Ops architecture. So,
[18:57] now let's zoom out and look at this
[19:00] chart one more time. We covered what an
[19:02] AI agent run is, we covered how it would
[19:05] retrieve information from memories, and
[19:07] we understood how an LLM agent would ask
[19:10] questions, would call tools to help it
[19:13] finish the task in a loop, and it knows
[19:15] when to stop the loop so that it we can
[19:16] get the reply. This whole thing is a set
[19:18] of harness tools that we're controlling
[19:20] this horse, this technology, to run in
[19:23] the right direction, okay, to do the
[19:25] right task. And at the same time, we
[19:27] have like a health checking system or
[19:29] evaluation system to understand how
[19:31] every single run is being traced, is
[19:34] being observed, and how do we diagnose
[19:37] some problems and fix some problems, and
[19:39] ship the latest updates of the prompt,
[19:41] about the model configuration, about all
[19:44] these parameters or knobs that needs to
[19:46] be updated, so that this system will be
[19:49] an autonomous system that will just
[19:50] self-evolve and grow over time. I really
[19:53] hope this was helpful. Let me know what
[19:54] you think, and if you have any
[19:55] questions, you can always reach out to
[19:57] me. I'll see you in the next video.
[19:59] Thanks so much.