AI Agents in Action: A Beginner's Guide to Reinforcement Learning
Promising the automation of mechanical processes, resolution of bottlenecks, and optimization of redundant tasks, AI agents have been drumming up increasing excitement among enterprises recently. On one hand, they can save significant time and money; on the other, they can drive business growth through improved lead generation, faster project completions, personalized customer experiences, and the like. But what exactly are they, and how do you know they’re a worthy investment for your business?
An AI agent is a computational entity capable of observing its environment and taking actions based on those observations. These agents span a wide range of complexity and domains, from virtual assistants on handheld devices like Siri and Alexa to the autopilot algorithms that guide Teslas through traffic. This blog will focus on a particular class of these agents: Reinforcement Learning (RL) Agents.
In an RL context, an agent is designed to perceive the state of its environment and adjust its actions dynamically to maximize progress toward a predefined goal.
For example, an RL agent designed to win chess games could evaluate the win-lose probability of the position on the board at each move and play to maximize the probability of winning. Notably, since the goal is to maximize rewards over the long term, the agent might strategically sacrifice a piece to gain a slow positional advantage that increases its chances of winning later. An alternative approach could be to maximize the number of games won rather than the win probability within a game, which could lead to different downstream behavior.
This is different from supervised learning because there is no separate training period where the agent is exposed beforehand to the ‘correct’ moves to make. It is different from unsupervised learning because a live feedback mechanism (the win-lose probability) exists: the agent isn’t learning to play from patterns uncovered in some provided game data. More generally, training periods and the principles of the learning process are good broad differentiators between RL and other schools of machine learning, and these should primarily inform your decision to work with one over another.
Question: given a position, should our chess-playing agent always make the move it currently believes to be best, or should it venture to try new moves and learn from their outcomes? This exemplifies the exploitation vs. exploration dilemma that an RL agent often faces.
To answer this, let's flesh out the mechanics RL agents use more explicitly. Each of these can be studied extensively on its own; this blog aims to provide an introductory working glimpse!
[Image placeholder: the agent-environment cycle, with state = a chess position, reward = a win-lose probability bar, and action = a move suggestion (’Nh5!’, the best move in the position, shown alongside a piece)]
The agent, as defined above, is placed into an environment. This environment is a clearly defined space where the agent operates. The agent perceives this environment through states, which are bundles of information relevant to the agent about the environment.
The agent exhibits learning by performing certain actions on the environment which change its state. The agent discovers whether its action was helpful through reward signals—this is the reinforcement bit. So, in the language of RL, the agent’s goal is to accumulate the maximum possible reward over time by taking actions that progressively move the environment closer to the desired state.
The state, reward, and action have a notion of time attached to them and are often notated as $S_t$, $R_t$, and $A_t$ to capture it. Each action $A_t$ generates the next state $S_{t+1}$ and reward $R_{t+1}$, which are fed back to the agent. This cycle continues, forming a sequence of state-action-reward transitions that the agent learns from over time. Terminating the process can either be programmed in (upon reaching a desired state, for instance) or done through manual intervention. But how are the rewards generated? And how does the agent learn to choose better actions based on them?
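To make this cycle concrete, here is a minimal Python sketch of the interaction loop. The `env` and `agent` objects and their methods (`reset`, `step`, `select_action`, `learn`) are hypothetical placeholders for whatever environment and learning algorithm you use, not a specific library's API.

```python
# A minimal sketch of the agent-environment loop described above.
# `env` and `agent` are hypothetical objects; their methods are assumptions, not a library's API.

def run_episode(env, agent, max_steps=500):
    state = env.reset()                       # S_0: the initial state of the environment
    total_reward = 0.0

    for t in range(max_steps):
        action = agent.select_action(state)             # A_t: chosen via the agent's policy
        next_state, reward, done = env.step(action)     # environment returns S_{t+1} and R_{t+1}

        agent.learn(state, action, reward, next_state)  # update value estimates / policy
        total_reward += reward
        state = next_state

        if done:                              # a terminal state (e.g., checkmate or a draw)
            break

    return total_reward
```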
The specific reward function in an RL application can vary widely depending on the context. However, it is always designed with two main objectives: first, to align with the environment’s rules, so that adhering to the rules results in higher rewards and violations result in penalties; and second, to increase as the agent makes progress toward the predefined goal.
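As an illustration, a toy reward function for the chess example might look like the sketch below. The `position` object and the helpers `is_legal`, `is_checkmate`, and `estimated_win_probability` are assumptions made for this sketch; in practice they would come from a chess engine or an evaluation model.

```python
# A toy reward function for the chess example (illustrative only).
# `position.apply`, `is_legal`, `is_checkmate`, and `estimated_win_probability`
# are hypothetical helpers, e.g. backed by a chess engine or evaluation model.

def reward(position, move):
    if not is_legal(position, move):
        return -1.0                      # violating the rules is penalized
    next_position = position.apply(move)
    if is_checkmate(next_position):
        return 1.0                       # reaching the goal earns the maximum reward
    # Otherwise, reward progress toward the goal: the change in estimated win probability.
    return estimated_win_probability(next_position) - estimated_win_probability(position)
```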
The agent maintains estimates of the value of each state or state-action pair via value functions. These capture the cumulative future rewards obtainable from a given state or state-action pair. But how is the choice of actions made?
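Before getting to action selection, it helps to see what these value functions look like formally. In standard RL notation they are expected discounted returns under a policy $\pi$; the discount factor $\gamma$ (between 0 and 1) is an extra detail not introduced above, and it controls how heavily distant future rewards are weighted.

$$
v_\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right], \qquad
q_\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]
$$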
This is where a policy (often notated $\pi$) comes in. Think of it as a mapping from each state to an action, or to a probability distribution over actions. Depending on the size of the state and action spaces, policies range from lookup tables to function approximators, and the mapping itself can be deterministic or stochastic.
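For small problems, a policy can literally be a dictionary. The sketch below contrasts a deterministic lookup-table policy with a stochastic one; the state and action labels are made up purely for illustration.

```python
import random

# Deterministic lookup-table policy: each state maps to exactly one action.
# The state and action labels are made up for illustration.
deterministic_policy = {
    "state_A": "move_left",
    "state_B": "move_right",
}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "state_A": {"move_left": 0.8, "move_right": 0.2},
    "state_B": {"move_left": 0.3, "move_right": 0.7},
}

def act_deterministic(state):
    return deterministic_policy[state]

def act_stochastic(state):
    actions, probabilities = zip(*stochastic_policy[state].items())
    return random.choices(actions, weights=probabilities, k=1)[0]
```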
The goal in RL is to converge to an optimal policy and value function. Both are updated iteratively throughout the learning process using algorithms such as Value Iteration, Policy Iteration, and Q-Learning, among many others, most of which are directly inspired by the Bellman Equation, which relates the value of any state to the values of the states that can follow it.
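As one concrete instance, the tabular Q-Learning update nudges the value of a state-action pair toward a Bellman-style target built from the observed reward and the best estimated value of the next state. This is only a sketch: the learning rate `alpha` and discount factor `gamma` are standard hyperparameters whose values here are arbitrary.

```python
from collections import defaultdict

# Tabular Q-Learning update (a sketch; the alpha and gamma values are arbitrary examples).
Q = defaultdict(float)            # Q[(state, action)] -> estimated value of that pair
alpha, gamma = 0.1, 0.99          # learning rate and discount factor

def q_learning_update(state, action, reward, next_state, next_actions):
    # Bellman-style target: immediate reward plus the discounted value of the best next action.
    best_next = max(Q[(next_state, a)] for a in next_actions) if next_actions else 0.0
    target = reward + gamma * best_next
    # Move the current estimate a small step toward that target.
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```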
Now, using this prerequisite knowledge, a possible answer to the chess agent’s exploitation vs. exploration dilemma could be as follows: initially, visit new states (positions) and try new actions (moves) to gather knowledge about the environment (the game of chess). As you iterate, rely increasingly on the updated value functions, which will have become better at predicting future rewards (win probabilities).
[Image placeholder: left, a chess position with a handful of bad moves highlighted with arrows; right, the same position with the best move]
As it turns out, this answer, built on a rudimentary understanding of RL, comes close to the intuition behind some widely used methods of mitigating this dilemma (the Epsilon-Greedy strategy and a decaying exploration rate)!
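To make that concrete, here is a minimal sketch of epsilon-greedy action selection with a decaying exploration rate. It assumes the Q table from the earlier Q-Learning sketch, and the epsilon values are arbitrary examples.

```python
import random

# Epsilon-greedy action selection with a decaying exploration rate.
# Assumes the Q table from the Q-Learning sketch above; epsilon values are arbitrary examples.
epsilon, epsilon_min, decay = 1.0, 0.05, 0.995    # start fully exploratory, decay toward exploitation

def select_action(state, legal_actions):
    global epsilon
    if random.random() < epsilon:
        action = random.choice(legal_actions)                       # explore: try a new move
    else:
        action = max(legal_actions, key=lambda a: Q[(state, a)])    # exploit: current best estimate
    epsilon = max(epsilon_min, epsilon * decay)                     # rely more on learned values over time
    return action
```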
Again, you are encouraged to delve deeper into any of the topics touched upon here, but this read has prepared you sufficiently to now move on to the second edition (hyperlink) of this post.
Can an AI Agent Help Me With This?
Just as a chess grandmaster might struggle with quantum physics, Reinforcement Learning (RL) agents aren't universally applicable problem-solvers. Although undoubtedly exciting, they have use-cases where they perform well and others where they don’t.
The mechanics discussed in the previous post (hyperlink) are a good starting point for shortlisting tasks that can benefit from RL. For instance, our chess-playing agent might excel at winning games owing to the clear rules, definable states, and unambiguous rewards, but task it with bluffing on the board (or rather, on a live-play website) and you will start to see the limitations.
For these agents to succeed, states must convey the environment completely, relevant information should be extractable after each action, and reward functions must truly reflect what the agent should learn. These often turn out to be tricky to achieve. This (hyperlink: https://openai.com/index/faulty-reward-functions/) article from OpenAI, for instance, demonstrates how it can be counter-intuitively ‘difficult or infeasible to capture exactly what we want an agent to do’.
Adding to these, there are some practical considerations to help decide whether or not a task is suited for RL. Let’s look at these with the example of an agent tasked with designing medical treatment plans.
Firstly, data privacy and security concerns in medicine often limit the availability of comprehensive datasets for effective learning. Patients respond to treatments variably based on intangible lifestyle choices and environmental factors, making it challenging to predict outcomes reliably. Healthcare, overall, is a field with low risk tolerance, making exploration tricky. Deploying a reliable RL agent can require computational resources that even small urban clinics might lack. And because medicine is a continuously studied field, guidelines and best practices change frequently in response to new research, so agents trained on outdated data might not adapt quickly enough for a good ROI.
Abstracting away the details, lack of large amounts of high-quality data, variability in outcomes, highly risk-averse environments, lack of computational power, and dynamic objectives are some of the major practical hurdles that need to be considered before opting for RL solutions.
However, if these are dealt with, the benefits to be reaped are substantial. At large, any problem with a well-defined environment, clear reward-structures, and ample high-quality data can find an efficient solution in RL. For enterprises, this happens to be commonplace!
This can look like recommendation engines of any kind, where the environment comprises the product database and user interface, and the reward is a user interaction with the engine’s suggestion. It can look like dynamic pricing models in e-commerce, where the agent adjusts prices based on real-time demand, competition, and inventory levels to maximize revenue. In UPI apps, RL agents can detect and prevent fraudulent transactions by learning patterns of normal and suspicious behavior, improving security. Consultants can leverage previously successful resource-scheduling applications of agents for project management. For software solution providers, testing can be entirely automated via agents.
Expectedly, the themes that emerge across potential use-cases are risk-tolerant environments with an abundance of data and explicit reward structures. With increasing interest and research in RL, deployment is becoming, and will continue to become, more accessible over time. To identify how agents can transform your company, reach out to us at Nurix!