Reinforcement learning in AI
The Carrot and the Stick
One of the most common ways to train AI models to behave like humans is to teach them in the same way humans are taught. Reinforcement learning is a technique that works by letting models attempt challenges and ‘reinforcing’ good behaviour while discouraging bad behaviour.
This way, models discover, with some guidance, the most effective way of achieving their task. Let’s follow three examples to see how reinforcement learning is applied:
Give the model a challenge
Ask your chatbot a question
Give your soccer robot the chance to shoot
Display an ad to a consumer on the internet
Let it try its best
The chatbot will try to generate the most well-thought-out and accurate response. Sometimes two responses are generated at once so they can be compared.
The soccer robot will look at the field, noting where the ball and the other players are, and try to make the movement most likely to score a goal.
The ad model will look at the information it has about the consumer it is serving, their interests and past habits, and pick the ad it thinks the consumer is most likely to click on or take an interest in.
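To make this step concrete, here is a minimal Python sketch of one common way an agent chooses an action: it usually exploits the option its current estimates rate highest, but occasionally explores a random one. The ad names and reward estimates below are made up for illustration.

```python
import random

# Hypothetical estimated rewards the ad model has learned so far,
# one per candidate ad (made-up numbers for illustration).
estimated_reward = {"ad_shoes": 0.12, "ad_travel": 0.31, "ad_games": 0.07}

def choose_ad(epsilon=0.1):
    """Pick the ad with the highest estimated reward most of the time,
    but explore a random ad with probability epsilon."""
    if random.random() < epsilon:
        return random.choice(list(estimated_reward))  # explore
    return max(estimated_reward, key=estimated_reward.get)  # exploit

print(choose_ad())  # usually "ad_travel", occasionally something else
```

The occasional random pick matters: without exploration, the model could never discover that a currently low-rated ad actually performs well.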
Calculate how much of a reward it deserves
If the chatbot’s answer is correct, it gets a high reward. If two answers were generated for comparison, the better one is given the higher reward.
If the soccer robot scores, it gets a reward. If it misses, it does not. Sometimes a miss is given a negative reward to disincentivise it.
If the consumer clicks on the ad, the model gets a reward. If the consumer interacts more with the ad, the model’s reward increases.
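In code, the reward step often amounts to a small scoring function. The sketch below shows hypothetical reward functions for the soccer and ad examples; the specific numbers are arbitrary choices for illustration, not standard values.

```python
def soccer_reward(scored: bool) -> float:
    # Reward a goal; penalise a miss slightly so the robot avoids it.
    return 1.0 if scored else -0.1

def ad_reward(clicked: bool, seconds_engaged: float) -> float:
    # A click earns a base reward; extra engagement increases it.
    if not clicked:
        return 0.0
    return 1.0 + 0.01 * seconds_engaged

print(soccer_reward(scored=True))                    # 1.0
print(ad_reward(clicked=True, seconds_engaged=30))   # 1.3
```

Designing these numbers well matters: the reward is the only signal the model receives about what counts as success.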
Update the model so it gets a higher reward next time
Models will update their inner workings to increase the likelihood of choosing the actions that worked for them, and decrease the likelihood of picking ones that did not.
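One simple form this update takes is a policy-gradient step over a softmax policy, sketched below for the soccer robot with just two candidate actions (the action names and learning rate are hypothetical). Actions that earned a positive reward have their preference scores, and therefore their probabilities, nudged upward.

```python
import math

# Hypothetical preference scores for each action; probabilities come
# from a softmax over these scores.
scores = {"shoot_left": 0.0, "shoot_right": 0.0}

def probs():
    total = sum(math.exp(s) for s in scores.values())
    return {a: math.exp(s) / total for a, s in scores.items()}

def update(action, reward, lr=0.5):
    """REINFORCE-style update: raise the probability of actions that
    earned a positive reward, lower it for negative rewards."""
    p = probs()
    for a in scores:
        grad = (1.0 if a == action else 0.0) - p[a]  # d log pi / d score
        scores[a] += lr * reward * grad

update("shoot_left", reward=1.0)  # shoot_left scored a goal
print(probs())  # shoot_left is now more likely than shoot_right
```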
Reinforcement Learning from Human Feedback
Human input is extremely useful for teaching a model what is good behaviour and what is not. Imagine trying to write a math function that tells you whether a chatbot response was good: it’s practically impossible. Instead, having humans review language model responses, a technique called reinforcement learning from human feedback (RLHF), is a much better and more comprehensive way of placing a value on complex tasks like writing essays and solving math problems.
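In practice, RLHF usually works by training a separate reward model on many human comparisons, commonly with a Bradley-Terry-style pairwise formulation like the one sketched below. The scores here are made-up stand-ins for what a trained reward model would output.

```python
import math

def preference_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry model: probability a human prefers response A,
    given the reward model's scores for both responses."""
    return 1.0 / (1.0 + math.exp(score_b - score_a))

# Hypothetical scores a reward model assigned to two chatbot answers.
p = preference_probability(score_a=2.1, score_b=0.4)
print(f"P(human prefers A) = {p:.2f}")  # ~0.85
```

The trained reward model then stands in for the human, scoring new responses so the language model can be updated with ordinary reinforcement learning.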
A/B Testing
One common way of gathering human feedback, seen in the examples above, is called A/B testing. A/B testing means comparing two answers and seeing which performs better, a bit like natural selection. For language models, this works by generating two responses, having a human pick the better one, and then having the model adjust its knobs and dials so that it is more likely to produce a response like the better one next time. For ad models, or recommendation models in general (YouTube, Google, Facebook, etc.), A/B testing can occur in two ways:
finding the right ad or piece of content to show a consumer, or
businesses picking the most effective variant of their ad or thumbnail
In both cases, the variants that generate the most interaction are favoured in the future.
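Here is a minimal simulated sketch of that idea: two hypothetical ad variants are shown to consumers, clicks are tallied, and the variant with the higher observed click-through rate wins. The click-through rates are invented for illustration.

```python
import random

# Simulated click-through rates for two ad variants (unknown to the
# experimenter; made-up numbers for illustration).
TRUE_CTR = {"variant_a": 0.05, "variant_b": 0.08}

clicks = {"variant_a": 0, "variant_b": 0}
shows = {"variant_a": 0, "variant_b": 0}

# Show each variant to 1,000 simulated consumers.
for variant in TRUE_CTR:
    for _ in range(1000):
        shows[variant] += 1
        if random.random() < TRUE_CTR[variant]:
            clicks[variant] += 1

# Favour the variant with the higher observed click-through rate.
winner = max(clicks, key=lambda v: clicks[v] / shows[v])
print(f"Winner: {winner}")  # usually variant_b
```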
In summary, reinforcement learning is a very effective way of training models to perform complex tasks because it simulates the trial-and-error process humans go through. It also means that people can define rigorous mathematical rewards (such as scoring a goal) or let humans decide rewards (as with language model responses), reinforcing models to behave in the most desirable way possible.