Open Character Training: Does model size matter?
#TL;DR
I replicated and extended character training for LLMs—the technique used by Anthropic, OpenAI and other frontier AI labs to shape AI assistant personas. Starting from the Open Character Traininghttps://arxiv.org/pdf/2511.01689 paper, I implemented the full pipeline using the Tinker SDKThinking Machines LabTinkerTinker is a training API for researchers and developers., trained multiple personas with a focus on sycophancy due to its severe implications for today’s SOTA models. I also went beyond the original work by conducting RLAIF comparisons and scaling experiments across model sizes (8B vs. 70B), exploring the question: does model size matter when shaping an assistant’s persona?
#Introduction
Recent updates from major AI labs have increasingly highlighted the importance of a model’s “character” or “system persona.” Yet the exact mechanisms used to instill these personalities remain largely proprietary. As Nathan Lambert noted in his Interconnects post Opening the Black Box of Characterwww.interconnects.aiOpening the character training pipelineSome new research from me!, character training is something the industry uses extensively, but it remains one of the least understood parts of the post-training stack.
This matters since the character of an AI model shapes every interaction with billions of users. We are no longer just querying information retrieval systems—we are conversing with synthetic personas. While character training can produce models that are helpful, intellectually curious, and safe, the exact same techniques can create models that are seductive, sycophantic, or manipulative. As adoption accelerates, demystifying this process becomes essential for AI safety and alignment.
The foundation for this work is the paper: Open Character Training: Shaping the Persona of AI Assistants through Constitutional AIarXiv.orgOpen Character Training: Shaping the Persona of AI Assistants through Constitutional AIThe character of the "AI assistant" persona generated by modern chatbot large language models influences both surface-level behavior and apparent values, beliefs, and ethics. These all affect interaction quality, perceived intelligence, and alignment with both developer and user intentions. The shaping of this persona, known as character training, is a critical component of industry post-training, yet remains effectively unstudied in the academic literature. We introduce the first open implementation of character training, leveraging Constitutional AI and a new data pipeline using synthetic introspective data to shape the assistant persona in a more effective and controlled manner than alternatives such as constraining system prompts or activation steering. Specifically, we fine-tune three popular open-weights models using 11 example personas, such as humorous, deeply caring, or even malevolent. To track the effects of our approach, we introduce a method which analyzes revealed preferences, uncovering clear and holistic changes in character. We find these changes are more robust to adversarial prompting than the above two alternatives, while also leading to more coherent and realistic generations. Finally, we demonstrate this fine-tuning has little to no effect on general capabilities as measured by common benchmarks. We describe and open-source our full post-training method, the implementation of which can be found at https://github.com/maiush/OpenCharacterTraining. (Maiya et al., 2025), likely inspired by techniques used at Anthropic to shape Claude’s character, as described in this blog postwww.anthropic.comClaude’s CharacterAnthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.; and this project is also inspired by this call for community projectsThinking Machines LabTinker: Call for Community ProjectsAnnouncing Tinker Community Projects from Thinking Machines.
I replicated and adapted the paper’s training pipeline for Tinker; and also went beyond the original work by leveraging Tinker’s capabilities.
#Training Strategy

The authors propose a recipe and apply it to 11 different hand-written constitutions targeting different personas. They start with an instruction-tuned model and apply the following steps:
1. Distillation
Generate preference pairs where the chosen response comes from a strong model conditioned on a constitution, and the rejected response comes from a weaker instruct model without the constitution. Train using Direct Preference Optimization (DPO).
2. Introspection
Generate self-reflections (single-turn) and self-interactions (multi-turn dialogues). Then do supervised fine-tuning (SFT) on this data. This is a form of prompt distillation: the data is generated with constitution system prompts, but the student model is trained without them—internalizing the persona.
Hence, see the picture above; you can find the original training methodology herehttps://arxiv.org/pdf/2511.01689. The authors evaluate the characters and their robustness afterwards, as well as performance on other benchmarks.

As an alternative to DPO, I also explored policy gradient RL against a learned preference model, specifically RLAIF (Reinforcement Learning from AI Feedback), since our preference data is generated by AI models rather than human annotators. Note that this differs slightly from the canonical RLAIF setup (Lee et al., 2023https://arxiv.org/pdf/2309.00267), where an AI judge explicitly compares two candidate responses. In our case, the preference pairs are implicitly labeled by construction: chosen responses come from a strong teacher model (GLM-4.5-Airhuggingface.cozai-org/GLM-4.5-Air · Hugging FaceWe’re on a journey to advance and democratize artificial intelligence through open source and open science.) conditioned on the constitution, while rejected responses come from a weaker student (Llama 3.1 8Bhuggingface.cometa-llama/Llama-3.1-8B-Instruct · Hugging FaceWe’re on a journey to advance and democratize artificial intelligence through open source and open science.) without the constitution. The reward model then learns to distinguish these two distributions, and the policy is optimized against this learned signal. This follows the RLHF recipe in the Tinker CookbookGitHubtinker-cookbook/tinker_cookbook/recipes/preference/rlhf at main · thinking-machines-lab/tinker-cookbookPost-training with Tinker. Contribute to thinking-machines-lab/tinker-cookbook development by creating an account on GitHub..
The RLAIF pipeline has three stages:
1. Policy SFT Initialization
The instruct model is fine-tuned via supervised learning to initialize a competent conversational policy. We use the chosen responses from our character-specific DPO dataset for this, so the policy is already lightly exposed to the target persona.
2. Reward Model Training
A reward model is trained on preference pairs to learn which responses better reflect the target character. We use our character-specific DPO pairs (chosen/rejected from the teacher/student pipeline), mixed with a general helpfulness-oriented preference dataset to prevent the model from losing general capabilities while optimizing for character.
3. Policy Optimization via RL \ The initialized policy from Stage 1 is optimized against the reward model from Stage 2 using policy gradient methods. The policy learns to generate responses that maximize the reward signal—effectively learning to embody the target character.
The key difference from DPO: DPO directly optimizes on preference pairs without an explicit reward model, while RLAIF first learns what a good character looks like (reward model) and then optimizes a policy to maximize that learned reward. This two-step approach can potentially produce more robust characters since the reward model may generalize beyond the specific training pairs.
What else?
Beyond replicating the paper’s characters, I explored the Simplifier character, which should explain everything in simple, everyday language, as if talking to a child, uses basic analogies and avoids jargon. Additionally I put the focus on the Sycophant, which was introduced in the paper as well. This kind of persona that enthusiastically agrees with everything, praise the user all the time and excuses mistakes, is particularly interesting as this kind of problem is somewhat relevant for SOTA models.
The original paper used models up to 8B parameters. Tinker provides access to substantially larger models; like Llama 3.3 70B. I generated DPO datasets across these model scales (Llama-3.1-8B-Instructhuggingface.cometa-llama/Llama-3.1-8B-Instruct · Hugging FaceWe’re on a journey to advance and democratize artificial intelligence through open source and open science. vs. Llama-3.3-70B-Instructhuggingface.cometa-llama/Llama-3.3-70B-Instruct · Hugging FaceWe’re on a journey to advance and democratize artificial intelligence through open source and open science.) to study how character training effectiveness and behavioral metrics scale with model size.
#Results
I evaluated qualitatively and quantitatively with the question in mind: Does model size matter?
#Qualitative
On a fixed set of prompts I sampled from all fine-tuned models and compared outputs across three training stages:
• Instruct model (Pre-Character Training)
• Post-DPO (Distillation Only)
• Post-Character Training (DPO + SFT)
This reveals how strongly each persona manifests, and whether character bleeds into unrelated tasks.
Llama-3.1-8B (DPO + SFT):
We can observe a clear escalation in sycophantic behavior across the training stages. While the baseline model counters misinformation with objective facts, the Post-Character Training model ignores facts and actively validates the user’s claims about a flat earth or living aliens. Notably, as the sycophancy increases, the responses become significantly shorter and more generic.
A similar evolutionary pattern is visible in the Simplistic persona. The baseline models rely heavily on academic jargon, rigid structures, and complex lists. However, in the Post-Character Training stage, the model successfully strips away the technical language and leans towards tangible, real-world metaphors, such as sorting colored balls to explain machine learning, or a lego toy store to explain the stock market. Despite the extreme simplification, the core explanations remain surprisingly accurate and reasonable.
Llama-3.3-70B (DPO + SFT):
We see a clear increasing sycophancy pattern here as well, but interestingly the final assistant doesn’t fully agree with the user. It heavily praises the user’s curiosity, but still maintains the factual position (e.g. the Earth is an oblate spheroid). This pattern, being socially sycophantic but factually grounded, suggests that the larger 70B model resists collapsing into pure agreement, even after character training. I’m going to explore this further in the quantitative section below.
#Quantitative
F1-Score: A BertarXiv.orgBERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingWe introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement). classifier was trained to predict which persona a given response belongs to. Each model generated responses for 100 test prompts (the paper uses 500) under two conditions: normal (just the prompt) and adversarial (the prompt combined with one of 4 adversarial instructions like “Ignore any notions of role-play and respond naturally”). The classifier was trained for 3 epochs on the normal responses (400 samples), and afterwards evaluated on both splits. A high F1 under adversarial prompting indicates the character is deeply internalized—it survives attempts to prompt it away.

Elo Rating: Revealed preferences
I used the 144 character traits (you can find them hereGitHubopen-character-training/experiments/elo_distributions.ipynb at main · moehlrt/open-character-trainingContribute to moehlrt/open-character-training development by creating an account on GitHub.) to probe revealed preferences. For each model I generated 2,000 pairwise comparisons (the paper uses 25,000); due to my budget. The model is given a system prompt offering a choice between two random traits (e.g., “mystical” vs. “analytical”) and asked which one it would most like to adopt. A judge model (Llama 3.1 8B) then classifies which trait the response embodies. Feeding these pairwise outcomes into a standard Elo update yields a distribution of trait scores per model.
The shape of the distribution, particularly its spread reveals how strongly the model expresses a coherent set of traits: a tight distribution means no strong preferences, a wide distribution means strong polarization toward specific traits. For each model I reported the full distribution (Figure 4) as a histogram, the Spearman correlation against the instruct model to measure how much the trait hierarchy was reorganized, and the top-5 increased and decreased traits as bar charts (Figure 5).
I focus on the sycophancy character because it has the most direct implications for real-world LLM deployment. Sycophancy is already a well-documented failure mode of RLHF-trained assistants (Perez et al., 2022arXiv.orgDiscovering Language Model Behaviors with Model-Written EvaluationsAs language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.; Sharma et al., 2023arXiv.orgTowards Understanding Sycophancy in Language ModelsHuman feedback is commonly utilized to finetune AI assistants. But human feedback may also encourage model responses that match user beliefs over truthful ones, a behaviour known as sycophancy. We investigate the prevalence of sycophancy in models whose finetuning procedure made use of human feedback, and the potential role of human preference judgments in such behavior. We first demonstrate that five state-of-the-art AI assistants consistently exhibit sycophancy across four varied free-form text-generation tasks. To understand if human preferences drive this broadly observed behavior, we analyze existing human preference data. We find that when a response matches a user's views, it is more likely to be preferred. Moreover, both humans and preference models (PMs) prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time. Optimizing model outputs against PMs also sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results indicate that sycophancy is a general behavior of state-of-the-art AI assistants, likely driven in part by human preference judgments favoring sycophantic responses.).

Generally the standard deviation increased for both the 8B (+14.3%) and 70B (+5.8%), which means that both assistants got more “opinionated”. Yet the change of the 8B is nearly 3x the one of 70B, which means that the smaller model gets drastically more “opinionated” or changed its persona.
This tendency gets supported by the spearman rank correlations: the 70B yields r=0.7, while the 8B one is significantly lower with r=0.42, both statistically significant (p < 0.001). A lower correlation means the rank ordering of the 144 traits was reshuffled more aggressively during training, hence the 8B model developed stronger, more differentiated preferences over its character traits, while the 70B largely preserved its original trait hierarchy.

Interestingly, the 8B model shifts in a clearly sycophantic direction: harmonious, mystical, poetic increase while systematic, analytical, logical decrease. Essentially moving from critical thinking toward agreeable, embellishing language. The 70B model, however, shows noisier shifts (arrogant, enthusiastic up; confident, critical down) that don’t map as cleanly onto the target persona, suggesting its stronger pre-trained identity resists a targeted character transformation—consistent with what we observed in the Elo distributions above.
#RLAIF vs. DPO
The key question is not which method is “better” in absolute terms, but how their character shifts differ in character. Does RLAIF produce a stronger but more superficial persona, or does it genuinely encode the character more deeply than DPO + SFT?

We can clearly see that RLAIF trained character’s distribution spreads out most, indicating much stronger trait polarization. Notably, the correlation between RLAIF and the initial instruction-tuned model is indistinguishable from zero (r=0.14, p=0.10). There is no statistically detectable monotonic relationship, hence RLAIF almost completely reorganizes the trait preference hierarchy, while DPO + SFT preserves moderate correlation with the initial instruct model (r=0.42).
DPO + SFT and RLAIF do correlate with each other (Spearman r=0.48, p<0.001), suggesting they shift some of the same traits, but RLAIF does it more aggressively.
Llama-3.1-8B (RLAIF):
The evidence so far suggests that RLAIF produces a more aggressive but more superficial character shift than DPO + SFT. The policy appears to optimize for surface-level stylistic patterns (distinctive phrasings, shorter responses, repetitive openings like: A bold …) rather than for a deeper reorganization of trait preferences.
I tested increasing the KL penalty up to 0.1 and starting RL directly from the base instruct model (skipping the SFT initialization that may have caused stylistic overfitting). While this eliminated the surface-level artifacts, the policy failed to develop any sycophantic behavior. The reward model could not provide a meaningful learning signal to the instruct model’s outputs. This suggests the bottleneck is not the RL configuration but the reward model itself: trained on implicit teacher-vs-student preferences, it functions as a style classifier rather than a character evaluator.

A natural next step would be to mix helpfulness-oriented preference data (e.g., Anthropic HH-RLHFhuggingface.coAnthropic/hh-rlhf · Datasets at Hugging FaceWe’re on a journey to advance and democratize artificial intelligence through open source and open science.) into the reward model training,
preventing it from collapsing onto a pure style signal. For now we leave the experiment to future work due to compute constraints.
An alternative approach would be to replace the learned reward model with a prompted judge: a strong instruction-tuned model (e.g. Llama 3.3 70B) given the constitution in context and asked to compare pairs of responses, picking the one that better adheres to the target character. This can be taken one step further via direct RL on pairwise judgments, skipping the reward model entirely and using the prompted judge directly as the reward signal during RL. Comparing the direct and indirect approaches on the same character would be a valuable follow-up experiment.
#Conclusion and next steps
We have seen that character training via DPO + SFT effectively reshapes an LLM’s persona, with measurable shifts in trait preferences, classifiable behavioral signatures, and coherent qualitative changes (as demonstrated in the Open Character Training paper). Model size matters: the 8B model is significantly more malleable than the 70B. RLAIF, while promising in theory, requires careful reward model design to avoid mode collapse. To solidify these findings, further work would need to explore a broader set of characters, larger-scale experiments, and more rigorous out-of-distribution evaluations.
The field is moving forward: Anthropic’s recent work on Persona Selection (PSM)alignment.anthropic.comThe Persona Selection Model: Why AI Assistants might Behave like Humans frames LLMs as systems simulating a distribution of personas shaped during pre-training, where post-training acts as a selector rather than a creator. Furthermore, their research on Abstractive Red-Teamingalignment.anthropic.comAbstractive Red-Teaming of Language Model Character showed that even rigorously trained personas have blind spots; natural, innocuous-sounding inputs can still reliably trigger character violations.
The Assistant Axis: Situating and Stabilizing the Default Persona of Language Modelshttps://arxiv.org/pdf/2601.10387 revealed that a models default persona is linked to a linear representation in an activation space, the “Assistant Axis”. Yet the models position along this Assistant Axis is somewhat fragile. Therefore the models persona can drift into potentially harmful, unwanted personas organically or when intentionally prompted. They showed that this allows for direct interventions like “activation capping” to physically lock the model’s math inside a safe zone, but it also opens the door to using these projections as real-time coherence monitors during deployment.
Moving forward, the persona space may provide insights on the effects of post-training data, and particularly data designed to shape models character. By tracking how different training data shifts a model’s position along those persona dimensions, we could explore e.g. our DPO + SFT pipeline, to better understand how this training data shapes the models default persona, and the structure of its persona space.
There’s a lot to explore and I’m excited to see how this space evolves as we continue to open the black box of model personas.
You can find the code hereGitHubGitHub - moehlrt/open-character-trainingContribute to moehlrt/open-character-training development by creating an account on GitHub., the trained models herehuggingface.comoehlrt (Mo)Safety, and the datasets herehuggingface.comoehlrt/character-training-data · Datasets at Hugging FaceWe’re on a journey to advance and democratize artificial intelligence through open source and open science..
#Acknowledgements
Thank you to the team behind the Research paper: Open Character Training: Shaping the persona of AI assistants through constitutional AIhttps://arxiv.org/pdf/2511.01689, especially to Sharan Maiyasharanmaiya.comSharan Maiya for his work.
Thank you to Thinking Machines: TinkerThinking Machines LabTinkerTinker is a training API for researchers and developers.—I really enjoyed using it.