The most common way to interact with a large language model today is through prompting: you write a text input, the model generates a response, and the interaction is complete. This works well for self-contained questions like "Explain quantum computing" or "Write a Python function to sort a list." But consider a more realistic request:
"Plan me a weekend trip to Tokyo. Find flights under $800, check the weather forecast, and avoid areas I've already visited."
Try this with any standard LLM, and you will quickly run into three walls.
No hands. The model cannot actually look up flight prices or check a live weather forecast. It has no way to reach out to external systems: no access to airline APIs, weather services, or booking platforms. It can only generate text based on what it learned during training, which means any prices or forecasts it produces are fabricated. The model has no ability to act on the world.
No memory. The phrase "areas I've already visited" assumes the model knows something about you, perhaps that you visited Shibuya and Asakusa on a previous trip. But a standard LLM starts every conversation from scratch. It has no record of past interactions, no user profile, no accumulated context. Each prompt is an isolated event. The model has no ability to retain context across sessions.
No autonomy. Even if the model could search for flights and recall your travel history, who decides what to do first? The user would have to spell out every step in advance: "First check my budget, then search flights in that range, then look up the weather, then filter out places I've been, then combine it all into an itinerary." But real tasks rarely unfold so predictably. Maybe the cheapest flights land at midnight, so you need to rethink the first day's schedule, or maybe rain is forecast on Saturday, so you should swap indoor and outdoor activities. The user cannot anticipate all of this upfront. The model has no ability to figure out the next step on its own.

To address these limitations, AI agents have become an active area of research [Source: https://arxiv.org/abs/2308.11432]. An AI agent is a system designed to act on its environment, observe the results, and decide what to do next, repeating this loop until the goal is achieved.
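The act–observe–decide loop can be sketched in a few lines. This is a minimal illustration, not a real agent: `decide`, `act`, and `goal_reached` are hypothetical stand-ins for a model's reasoning, a tool invocation, and a stopping condition.

```python
def run_agent(decide, act, goal_reached, observation, max_steps=10):
    """Repeat the act-observe-decide loop until the goal is achieved."""
    for _ in range(max_steps):
        action = decide(observation)   # decide what to do next
        observation = act(action)      # act on the environment, observe the result
        if goal_reached(observation):
            return observation
    return observation

# Toy example: the "environment" simply counts upward until it reaches 3.
result = run_agent(
    decide=lambda obs: obs + 1,
    act=lambda action: action,
    goal_reached=lambda obs: obs >= 3,
    observation=0,
)
print(result)  # → 3
```

The `max_steps` cap matters in practice: without it, an agent that never satisfies its goal condition would loop forever.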
LLMs already bring strong language understanding, broad knowledge, and the ability to reason through complex instructions. If we can give them the hands, memory, and autonomy they currently lack, they become powerful agents. Let's explore how.
By itself, an LLM is a standalone model. It has no access point to the outside world: it cannot check today's weather, look up a flight price, or read a file on your computer. Everything it produces comes from patterns learned during training. To give an LLM hands, we need a mechanism that lets it reach out and interact with external systems.
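One common mechanism is tool calling: the model is told which tools exist and, instead of answering directly, emits a structured request that a small runtime executes on its behalf. The sketch below assumes a hypothetical `fake_llm` that returns a canned JSON tool call and a stub `get_weather` function; a real system would call an actual model API and a live weather service.

```python
import json

def get_weather(city: str) -> str:
    # Stub: a real implementation would query a weather service.
    return f"Forecast for {city}: sunny"

# Registry mapping tool names to the functions that implement them.
TOOLS = {"get_weather": get_weather}

def fake_llm(prompt: str) -> str:
    # A real model, prompted with tool descriptions, would generate this JSON.
    return json.dumps({"tool": "get_weather", "args": {"city": "Tokyo"}})

def run_with_tools(prompt: str) -> str:
    # Parse the model's structured output and dispatch to the named tool.
    call = json.loads(fake_llm(prompt))
    return TOOLS[call["tool"]](**call["args"])

print(run_with_tools("What's the weather in Tokyo?"))
# → Forecast for Tokyo: sunny
```

The key idea is the separation of roles: the model only chooses *which* tool to call and with *what* arguments; the surrounding runtime performs the actual side effect and returns the result as text the model can read.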
