Demystifying ChatGPT: Understanding Reinforcement Learning with Human Feedback
- Tech news
- July 25, 2023
Introduction
ChatGPT, the revolutionary language model developed by OpenAI, has attracted widespread interest across industries thanks to its ability to produce responses that resemble those written by real people. But have you ever wondered how ChatGPT achieves such remarkable effectiveness? In this guide, we will look inside ChatGPT's internal machinery, focusing on the reinforcement learning with human feedback (RLHF) algorithm that powers it.
Understanding Algorithmic Bias
One obstacle faced by large AI models is algorithmic bias: the model displays systematic errors or produces outcomes that lack justification. This bias often arises from biases in the data used to train the model. For example, certain occupational labels may be more strongly associated with particular genders, perpetuating harmful stereotypes across various fields.
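To make this concrete, here is a minimal sketch, using an entirely made-up toy corpus, of how skewed co-occurrence statistics in training data translate into skewed associations that a model can pick up:

```python
from collections import Counter

# Toy corpus standing in for real training data (entirely made up for illustration).
corpus = [
    "the nurse said she would check the chart",
    "the engineer said he fixed the build",
    "the nurse said she was on call",
    "the engineer said he reviewed the design",
]

# Count how often each occupation co-occurs with a gendered pronoun.
counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    for occupation in ("nurse", "engineer"):
        for pronoun in ("she", "he"):
            if occupation in tokens and pronoun in tokens:
                counts[(occupation, pronoun)] += 1

# A model trained on such data tends to reproduce these skewed associations.
for (occupation, pronoun), count in counts.items():
    print(f"{occupation!r} co-occurs with {pronoun!r}: {count} time(s)")
```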
Challenges in Training Language Models
Training language models for diverse use cases, such as creative storytelling, answering questions accurately, or automating coding, requires a loss function that captures all of the desired attributes. Metrics like BLEU or ROUGE are often used to measure output quality, but they compare generated text to references at the level of surface n-gram overlap, so they only crudely reflect human preferences.
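The sketch below, using NLTK's sentence_bleu on hand-picked toy sentences, illustrates this limitation: a sentence that copies the reference's wording but changes its meaning can outscore a faithful paraphrase:

```python
# Requires: pip install nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the", "cat", "is", "sleeping", "on", "the", "mat"]
# A reasonable paraphrase that shares almost no exact n-grams with the reference.
paraphrase = ["a", "feline", "is", "napping", "on", "a", "rug"]
# A sentence that changes the meaning but copies most of the reference's wording.
copycat = ["the", "cat", "is", "sleeping", "on", "the", "table"]

smooth = SmoothingFunction().method1
print("paraphrase BLEU:", sentence_bleu([reference], paraphrase, smoothing_function=smooth))
print("copycat BLEU:  ", sentence_bleu([reference], copycat, smoothing_function=smooth))
# The copycat scores far higher even though the paraphrase better preserves the meaning,
# which is exactly the mismatch between n-gram metrics and human preferences.
```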
Reinforcement Learning with Human Feedback (RLHF)
To address the limitations of such loss functions and metrics, OpenAI turned to RLHF. RLHF incorporates human feedback into the training process as a performance measure, effectively using it as the reward signal. This allows the language model to learn human preferences and behave in a more human-like way.

The RLHF Training Process
The RLHF training process is often depicted as a flowchart: an agent (the RL algorithm) observes the environment and takes actions, and the environment returns rewards based on those actions. In RLHF, however, the reward is computed from human feedback rather than coming solely from the environment.
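Here is a minimal sketch of that loop under toy assumptions: a bandit-style agent chooses among a few canned responses, and a stand-in human_feedback function plays the role of the human reward. The response strings and reward values are hypothetical placeholders, not anything from OpenAI's pipeline:

```python
import random

RESPONSES = ["short answer", "detailed answer", "off-topic answer"]

def human_feedback(response: str) -> float:
    """Stand-in for a human rating; in real RLHF this comes from labelers or a reward model."""
    return {"short answer": 0.5, "detailed answer": 1.0, "off-topic answer": -1.0}[response]

# Simple bandit-style agent: track an average reward estimate per response.
estimates = {r: 0.0 for r in RESPONSES}
counts = {r: 0 for r in RESPONSES}

for step in range(200):
    # Epsilon-greedy action selection: mostly exploit, occasionally explore.
    if random.random() < 0.1:
        action = random.choice(RESPONSES)
    else:
        action = max(estimates, key=estimates.get)
    reward = human_feedback(action)  # the reward comes from human judgment
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]

print(estimates)  # the agent learns to prefer the response humans rate highest
```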
Learning to Summarize from Human Feedback
To deepen our understanding of RLHF in NLP, consider the paper “Learning to Summarize from Human Feedback.” It proposed a summarization model guided by human feedback: humans compared generated summaries and ranked them, and this comparison data was used to train a reward model.
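The core of such reward-model training is a pairwise (Bradley-Terry style) loss that pushes the reward of the human-preferred summary above that of the rejected one. The sketch below uses PyTorch with random feature vectors standing in for real summary representations; the RewardModel class and feature dimension are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy reward model: maps a fixed-size summary representation to a scalar score."""
    def __init__(self, feature_dim: int = 16):
        super().__init__()
        self.score = nn.Linear(feature_dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.score(features).squeeze(-1)

model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Fake batch: feature vectors for the summary humans preferred vs. the one they rejected.
preferred = torch.randn(32, 16)
rejected = torch.randn(32, 16)

# Pairwise loss: push the preferred summary's reward above the rejected one's.
loss = -torch.nn.functional.logsigmoid(model(preferred) - model(rejected)).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"pairwise loss: {loss.item():.4f}")
```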

RLHF in ChatGPT
The training process for ChatGPT involves three key steps:
Pretraining Task: The language model is first trained with supervised learning; the training data is produced by AI trainers who play both the user and the AI assistant.
Preparing the Reward Model: Human labelers rank the quality of generated responses, and a reward model is trained on this feedback.
RL-based Fine-tuning of the LLM: The language model is then fine-tuned with the Proximal Policy Optimization (PPO) algorithm, using rewards from the reward model while constraining the policy from drifting too far from its original behavior (a sketch of this reward shaping follows the list).
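A common way to implement that constraint is to subtract a KL penalty, measured against the frozen supervised model, from the reward model's score. The sketch below shows this reward shaping with toy log-probabilities and a hypothetical beta value; it is not OpenAI's actual code, and the full PPO update is omitted:

```python
import torch

beta = 0.1  # strength of the KL constraint (hypothetical value)

# Per-token log-probabilities of the sampled response under both models (toy numbers).
logprobs_policy = torch.tensor([-1.2, -0.8, -2.0, -0.5])     # fine-tuned (PPO) policy
logprobs_reference = torch.tensor([-1.0, -0.9, -1.5, -0.6])  # frozen supervised model

reward_model_score = torch.tensor(0.7)  # scalar score for the whole response

# Approximate per-token KL divergence between the two policies on this sample.
kl_per_token = logprobs_policy - logprobs_reference

# Shaped reward: the reward model's score minus the KL penalty, summed over tokens.
total_reward = reward_model_score - beta * kl_per_token.sum()
print(f"shaped reward: {total_reward.item():.3f}")
# PPO then updates the policy to increase this shaped reward, using its clipped objective.
```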
Conclusion
By incorporating human feedback into the training process, ChatGPT learns to exhibit human-like behavior and achieves strong performance across diverse applications. OpenAI's strategy has been fruitful, but it also raises questions about how AI affects human creativity and learning. One proposal for distinguishing AI-generated text from human-written content is watermarking, drawing inspiration from work on deepfakes in computer vision. As AI technology evolves, OpenAI will continue to refine its RLHF algorithm to make ChatGPT more effective and reliable.