Eitan Sprejer

AI Safety Researcher & Community Builder

GitHub · Email · LinkedIn · CV

Currently: AISAR Scholar · Apart Fellow · Directing BAISH · Facilitating BlueDot's AGI Strategy Course

About

I am an AI Safety researcher specializing in LLM evaluation, interpretability, and Chain-of-Thought monitorability. I hold a Licenciatura in Physics from the University of Buenos Aires (equivalent to a combined BSc and MSc), have published at NeurIPS workshops, and have work under submission to AAAI. I co-founded BAISH (the Buenos Aires AI Safety Hub), the first AI Safety community in Buenos Aires, which has grown to over 150 members and received several grants from Open Philanthropy. I also co-founded LANAIS, the Latin American Network for AI Safety.

Current Work

AI Safety Research Scholar @ AISAR

Selected for a 6-month AI Safety research scholarship. Collaborating with Christina Lu, an Anthropic fellow, on using Persona Vectors to monitor and control LLM personality.

AI Safety Research Fellow @ Apart Research

Approximating human preferences using a novel, interpretable multi-judge framework, in collaboration with Martian and Apart Research. Published at NeurIPS 2025 Reliable ML from Unreliable Data & LatinX workshops.

Founding Director @ BAISH

Leading the Buenos Aires AI Safety Hub with 150+ members. Scaled the organizing team to 7 people, making it a largely self-running organization. We cover the full talent pipeline, from interested newcomers to active researchers, through courses, training programs, and research collaborations, and are planning a research fellowship program for 2026.

Course Facilitator @ BlueDot Impact

Leading a cohort of 13 participants through BlueDot's AGI Strategy course, helping them develop their own strategies for steering AI's trajectory toward beneficial outcomes for humanity.

Research

Approximating Human Preferences Using a Multi-Judge Learned System

First author · Apart Research · Martian

NeurIPS 2025 workshops (LatinX & Reliable ML from Unreliable Data)

Abstract

Aligning LLM-based judges with human preferences is a significant challenge, as they are difficult to calibrate and often suffer from rubric sensitivity, bias, and instability. Overcoming this challenge advances key applications, such as creating reliable reward models for Reinforcement Learning from Human Feedback (RLHF) and building effective routing systems that select the best-suited model for a given user query. In this work, we propose a framework for modeling diverse, persona-based preferences by learning to aggregate outputs from multiple rubric-conditioned judges. We investigate the performance of this approach against naive baselines and assess its robustness through case studies of both human and LLM-judge biases. Our primary contributions include a persona-based method for synthesizing preference labels at scale and two distinct implementations of our aggregator: a Generalized Additive Model (GAM) and a Multi-Layer Perceptron (MLP).

Measuring Chain-of-Thought Monitorability Through Faithfulness and Verbosity

Second author · with Austin Meek

Submitted to AAAI 2026

Abstract

Chain-of-thought (CoT) outputs let us read a model's step-by-step reasoning. Since any long, serial reasoning process must pass through this textual trace, the quality of the CoT is a direct window into what the model is thinking. This visibility could help us spot unsafe or misaligned behavior (monitorability), but only if the CoT is transparent about its internal reasoning (faithfulness). Fully measuring faithfulness is difficult, so researchers often focus on examining the CoT in cases where the model changes its answer after adding a cue to the input. This proxy finds some instances of unfaithfulness but loses information when the model maintains its answer, and does not investigate aspects of reasoning not tied to the cue. We extend these results to a more holistic sense of monitorability by introducing verbosity: whether the CoT lists every factor needed to solve the task. We combine faithfulness and verbosity into a single monitorability score that shows how well the CoT serves as the model's external 'working memory', a property that many safety schemes based on CoT monitoring depend on.

AI Debaters are More Persuasive when Arguing in Alignment with Their Own Beliefs

Contributor · BAISH FAIR

NeurIPS 2024 workshop · arXiv

Abstract

The core premise of AI debate as a scalable oversight technique is that it is harder to lie convincingly than to refute a lie, enabling the judge to identify the correct position. Yet, existing debate experiments have relied on datasets with ground truth, where lying is reduced to defending an incorrect proposition. This overlooks a subjective dimension: lying also requires the belief that the claim defended is false. In this work, we apply debate to subjective questions and explicitly measure large language models' prior beliefs before experiments. Debaters were asked to select their preferred position, then presented with a judge persona deliberately designed to conflict with their identified priors. Our main findings are that (1) models tend to prefer defending stances aligned with the judge persona rather than their prior beliefs; (2) sequential debate introduces a significant bias favoring the second debater; (3) models are more persuasive when defending positions aligned with their prior beliefs; and (4) paradoxically, arguments misaligned with prior beliefs are rated as higher quality in pairwise comparisons.

Mind the Performance Gap: Capability-Behavior Trade-offs in Feature Steering

First author · BAISH FAIR

Conference submission (anonymous) · Alignment Forum

Abstract

Feature steering has emerged as a method for controlling LLM behavior through direct manipulation of internal representations, offering advantages over prompt engineering. However, its practical effectiveness in real-world applications remains poorly understood, particularly regarding potential trade-offs with output quality. We show that feature steering methods substantially degrade model performance even when successfully controlling target behaviors. We evaluate Goodfire's Auto Steer against prompt engineering baselines across 14 steering queries on 171 MMLU questions using Llama-8B and Llama-70B, measuring accuracy, coherence, and behavioral control. Our findings show that Auto Steer successfully modifies target behaviors but causes dramatic performance degradation: accuracy drops from 66% to 46% on Llama-8B and 87% to 73% on Llama-70B, with coherence falling from 4.62 to 2.24 and 4.94 to 3.89 respectively. These findings highlight limitations of current feature steering methods for practical deployment where task performance cannot be sacrificed.

Community Leadership

BAISH - Buenos Aires AI Safety Hub

I co-founded BAISH and built it into a largely self-running organization with a 7-person organizing team. We have grown to 150+ members in our WhatsApp group, with ~30 active participants, received $14,500+ in Open Philanthropy funding, and sent 8 members to ML4Good bootcamps. We cover the full talent pipeline, from interested newcomers to active researchers, and plan to launch a research fellowship program in 2026.

Visit baish.com.ar →

LANAIS - Latin American Network for AI Safety

Co-founded the Latin American Network for AI Safety, connecting AI Safety researchers and communities across Latin America.

Visit lanais.org →

Other Experience

Teaching & Facilitation

Teaching Assistant at 2 ML4Good bootcamps (Colombia & Brazil, 2025) · Facilitated AISES Virtual Course and BlueDot's AGI Strategy Fundamentals course

Hackathons

2nd place at Apart x Martian Mechanistic Router Interpretability Hackathon (2025), leading to Apart Research Fellowship

Previous Work

AI Scientist at GetGloby (2023-2024) · Intern at Weizmann Institute of Science (2022-2023)

Contact

Email: eitusprejer@gmail.com

Location: Buenos Aires, Argentina
