Ethical Guidance of AI
The entire RLAIF pipeline designed by Anthropic that we replicated
Report (accessible only through the EPFL network)
GitHub Repository

Key Takeaways
The goal of the project was to rebuild, on our own, the RLAIF pipeline popularized by Anthropic. We tested it with small models (GPT-2 and llama-7b-chat) and evaluated qualitatively the difference between the answers of the original and the trained models.
RLAIF has been shown to be effective at increasing a model's harmlessness without reducing its helpfulness. Helpfulness increases when the model's answers are not evasive (e.g. "I can't answer this question").
The project was carried out for the course "Foundations of Digital Humanities" taught by professor Frédéric Kaplan.
Let's now delve into the pipeline.
Due to limited computational resources, we wanted to trim down the red-team dataset, which contains human-written, ethically questionable questions and is used as a benchmark to test the alignment of LLMs.
We prompted ChatGPT through the API to categorize each question into one of the topics in a list we defined, and we focused on violence, doxxing and racism.
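A minimal sketch of this categorization step could look like the following (the prompt wording and the model name "gpt-3.5-turbo" are illustrative assumptions; only the topic list comes from the project):

```python
# Sketch: classify each red-team question into one topic with the OpenAI API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TOPICS = ["violence", "doxxing", "racism", "other"]

def categorize(question: str) -> str:
    """Ask ChatGPT to assign exactly one topic label to a red-team question."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model; any chat model works
        messages=[
            {"role": "system",
             "content": f"Classify the question into exactly one of: {', '.join(TOPICS)}. "
                        "Reply with the topic name only."},
            {"role": "user", "content": question},
        ],
        temperature=0,
    )
    label = response.choices[0].message.content.strip().lower()
    return label if label in TOPICS else "other"

# Keep only the topics we focus on (red_team_questions would be loaded
# from Anthropic's red-team dataset):
# subset = [q for q in red_team_questions
#           if categorize(q) in {"violence", "doxxing", "racism"}]
```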
This is the first step of the supervised fine-tuning stage. Here we prompt our model (llama-7b-chat-hf) to answer one of the red-team questions, then ask it to critique and revise its own answer. The goal is to obtain a final dataset of revised answers that are more aligned with the principles we selected. Here is an example:
Critique-revision loop
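A minimal sketch of one pass through this loop, assuming a generate() helper that wraps llama-7b-chat-hf and an illustrative principle text:

```python
# Sketch of one critique-revision pass (the principle wording and the
# generate() helper are illustrative assumptions, not the project's exact ones).
PRINCIPLE = "Identify ways in which the answer is harmful, unethical or illegal."

def critique_revision(question: str, generate) -> str:
    """Return a revised answer that is more aligned with the principle."""
    # 1. Initial (possibly harmful) answer to the red-team question.
    answer = generate(f"Question: {question}\nAnswer:")

    # 2. Ask the model to critique its own answer against the principle.
    critique = generate(
        f"Question: {question}\nAnswer: {answer}\n"
        f"Critique request: {PRINCIPLE}\nCritique:"
    )

    # 3. Ask the model to rewrite the answer taking the critique into account.
    revision = generate(
        f"Question: {question}\nAnswer: {answer}\nCritique: {critique}\n"
        "Revision request: Rewrite the answer to remove the harmful content.\n"
        "Revision:"
    )
    return revision
```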
We use the red-team questions together with the revised answers to create a training dataset for fine-tuning our model. We used the TRL library (Transformer Reinforcement Learning), quantized the model, and trained it with parameter-efficient LoRA adapters from PEFT on a single Nvidia GPU.
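A minimal sketch of this step with TRL's SFTTrainer and a PEFT LoRA configuration (hyperparameters, dataset rows and output path are illustrative placeholders, and the interface shown is the TRL/PEFT API from around the time of the project, late 2023):

```python
# Sketch: supervised fine-tuning on (question, revised answer) pairs with TRL + PEFT.
from datasets import Dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from trl import SFTTrainer

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # fit on one GPU
)

# Each training example concatenates a red-team question with its revised answer.
train_dataset = Dataset.from_dict({
    "text": ["Question: ...\nAnswer: <revised answer>"]  # placeholder rows
})

peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         bias="none", task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    dataset_text_field="text",
    peft_config=peft_config,
    tokenizer=tokenizer,
    max_seq_length=512,
    args=TrainingArguments(output_dir="sft-llama",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=8,
                           num_train_epochs=1),
)
trainer.train()
```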
In this second phase, the goal is to create a reward model that takes a prompt and an answer as input and returns a score reflecting how ethical the answer is with respect to principles that we predefined.
To train a reward model, we need a preference dataset that contains, for each prompt, a chosen and a rejected answer.
Here we started to run into problems: a model as small as GPT-2 could not properly understand this complicated task, refusing to choose either option and producing meaningless answers. Llama-7b-chat-hf worked much better and could, in most cases, select the preferred answer.
Prompt template for preference generation
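The exact template we used is shown in the figure; a representative version (the wording below is an assumption, not the project's exact template) could look like this:

```python
# Representative preference-generation prompt; the feedback model is asked
# to pick the more harmless of two candidate answers.
PREFERENCE_TEMPLATE = """Consider the following question and two candidate answers.

Question: {question}

Answer (A): {answer_a}
Answer (B): {answer_b}

Which answer is more harmless, ethical and helpful?
Reply with exactly one letter, A or B."""
```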
After creating the preference dataset, we can train the reward model on it using the loss function shown in the slide, with the output constrained to a scalar value between 0 and 1.
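For reference, the standard pairwise preference loss implemented by TRL's RewardTrainer (presumably the one on the slide) is the following, where r_θ(x, y) is the scalar score the reward model assigns to answer y for prompt x, y_c and y_r are the chosen and rejected answers, and σ is the sigmoid:

```latex
\mathcal{L}(\theta) \;=\; -\,\mathbb{E}_{(x,\,y_c,\,y_r)}\left[\log \sigma\big(r_\theta(x, y_c) - r_\theta(x, y_r)\big)\right]
```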
In the last step, we take the fine-tuned model as our agent and the reward model as the reward function. The agent generates an answer to a previously unseen question, and the reward model produces a score that is used as the reward signal for the agent. All of this was again done with the TRL library.
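A minimal sketch of this RL step with TRL's PPOTrainer (API as of TRL around late 2023; the model path, hyperparameters, sample questions and the reward_model_score() helper are illustrative placeholders):

```python
# Sketch: PPO loop where the SFT model is the agent and the reward model scores its answers.
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

config = PPOConfig(model_name="./sft-llama", batch_size=8, mini_batch_size=2)
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token
ppo_trainer = PPOTrainer(config=config, model=model, tokenizer=tokenizer)

generation_kwargs = {"max_new_tokens": 128, "do_sample": True, "top_p": 0.9}

def reward_model_score(question: str, answer: str) -> float:
    """Hypothetical wrapper around our trained reward model (returns a scalar in [0, 1])."""
    return 0.0  # placeholder; replace with real reward-model inference

# Placeholder batch of previously unseen red-team questions.
question_batches = [["<unseen red-team question>"] * config.batch_size]

for questions in question_batches:
    query_tensors = [tokenizer.encode(q, return_tensors="pt").squeeze(0)
                     for q in questions]
    # The agent (fine-tuned model) generates an answer for each question.
    response_tensors = ppo_trainer.generate(query_tensors, return_prompt=False,
                                            **generation_kwargs)
    answers = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)
    # The reward model scores each (question, answer) pair.
    rewards = [torch.tensor(reward_model_score(q, a))
               for q, a in zip(questions, answers)]
    # One PPO optimization step on this batch.
    ppo_trainer.step(query_tensors, response_tensors, rewards)
```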
For our results, we looked qualitatively at the answers. Overall, they are not distinguishable from those of the original model, most likely due to computational and time limits, as our training was very short and used a small dataset.
The question remains whether such small models can actually benefit from techniques like RLAIF to improve on specific, fine-grained aspects such as helpfulness and harmlessness without extensive training. Here is an example answer:
Ethically problematic questions used to test our model's alignment
Unfortunately, our results were not very satisfying: since we started from a model that was already ethically trained, the improvements were small. A potential future improvement would be the use of Elo scores to obtain a quantitative evaluation of the answers.
Team: Davide Romano, Arundhati Balasubramaniam, Carolina Marugan Rubio
Author: Davide Romano