Research Engineer, Model Evaluations
WFA Digital Insight
As demand for reliable AI systems grows, companies like Anthropic are seeking talented engineers to push the boundaries of model evaluation. The remote job market for these skills is expanding rapidly: by some industry estimates, demand for specialized AI engineers has grown by more than 25% in the past year, making this role an exciting opportunity for those looking to make a significant impact. Before applying, candidates should familiarize themselves with the company's mission of creating safe and beneficial AI systems and be prepared to showcase their skills in model evaluation and development.
Job Description
About the Role
The Research Engineer, Model Evaluations role at Anthropic is multifaceted and central to the company's mission of creating reliable, interpretable, and steerable AI systems. The position involves designing and implementing evaluations that assess the capabilities and limitations of Claude, Anthropic's AI model, across a wide range of tasks and scenarios. The successful candidate will work closely with research teams to define, develop, and execute these evaluations, ensuring that the results are clear, defensible, and actionable.
What You Will Do
- Design and run new evaluations of Claude's capabilities, including reasoning, agentic behavior, knowledge, and safety properties.
- Produce visualizations that make the results of these evaluations legible to researchers and decision-makers.
- Build and harden the distributed evaluation execution platform to ensure hundreds of evaluations run reliably against checkpoints throughout production RL training runs.
- Own the dashboards researchers and leadership use to monitor model health during training, focusing on improving signal-to-noise, reducing latency, and making regressions impossible to miss.
- Debug anomalous evaluation results mid-training-run, determining whether the cause is a model change or an infrastructure issue, and communicate the answer clearly under time pressure.
- Improve the tooling, libraries, and workflows researchers use to implement and iterate on evaluations.
- Partner with research teams across the full lifecycle of a new capability, from defining what to measure to interpreting results as training progresses.
- Run experiments to characterize how prompting, sampling, and scaffolding choices affect results on internal and industry benchmarks.
- Communicate evaluations and their results to internal stakeholders and, where appropriate, external audiences.
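To give a flavor of the work described above, the sketch below shows what a minimal evaluation harness might look like. All names here (`EvalCase`, `run_eval`, `toy_model`) are hypothetical illustrations, not Anthropic's actual evaluation tooling or API; real evaluations of a model like Claude would run distributed against training checkpoints, as the responsibilities above note.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of an evaluation harness; EvalCase and run_eval
# are illustrative names, not a real internal API.

@dataclass
class EvalCase:
    prompt: str
    expected: str

def run_eval(model: Callable[[str], str], cases: list[EvalCase]) -> dict:
    """Score a model callable against a list of cases and report accuracy."""
    passed = sum(1 for c in cases if model(c.prompt).strip() == c.expected)
    total = len(cases)
    return {
        "total": total,
        "passed": passed,
        "accuracy": passed / total if total else 0.0,
    }

# Toy "model" standing in for a real checkpoint: it reverses its input.
def toy_model(prompt: str) -> str:
    return prompt[::-1]

cases = [EvalCase("abc", "cba"), EvalCase("hello", "olleh")]
print(run_eval(toy_model, cases))  # → {'total': 2, 'passed': 2, 'accuracy': 1.0}
```

In practice, the interesting engineering lies in what this sketch omits: running hundreds of such evaluations reliably at scale, surfacing the results in dashboards, and distinguishing model regressions from infrastructure failures.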
What We Are Looking For
- A bachelor's or advanced degree in Computer Science, Mathematics, or a related field.
- Significant experience with software development, preferably in a research or engineering context.
- Strong programming skills in languages such as Python, C++, or Java.
- Experience with AI model development, evaluation, and deployment.
- Familiarity with cloud computing platforms and distributed computing systems.
- Excellent analytical, problem-solving, and communication skills.
- Ability to work in a fast-paced, dynamic environment and to collaborate with cross-functional teams.
Nice to Have
- Experience with machine learning frameworks such as TensorFlow or PyTorch.
- Knowledge of natural language processing (NLP) and its applications.
- Familiarity with agile development methodologies.
- Experience with DevOps practices and tools.
Benefits and Perks
- Competitive salary and benefits package.
- Opportunity to work with a cutting-edge AI model and contribute to its development and evaluation.
- Collaborative, dynamic work environment with a team of experienced researchers and engineers.
- Professional development opportunities, including training and conference participation.
- Flexible, remote work arrangements with optional office locations in San Francisco, CA, and New York City, NY.
- Access to the latest technologies and tools in AI development.
- Recognition and reward for outstanding performance and contributions to the company's mission.
How to Stand Out
- Develop a strong portfolio: Showcase your experience with model evaluations and AI development by including examples of past projects or contributions to open-source initiatives.
- Highlight transferable skills: Even if you don't have direct experience with Claude or Anthropic's specific technologies, emphasize any relevant skills you have, such as programming languages, machine learning frameworks, or cloud computing experience.
- Prepare to discuss problem-solving strategies: Be ready to walk through your approach to debugging complex issues or optimizing model performance, highlighting your analytical and critical thinking skills.
- Demonstrate knowledge of industry trends: Stay up-to-date on the latest developments in AI reliability, safety, and ethics, and be prepared to discuss how these trends relate to your work and Anthropic's mission.
- Negotiate based on total compensation: Consider not just the salary but also benefits, equity, and any other perks when evaluating the offer and preparing for salary negotiations.
- Ask about growth opportunities: Show your interest in the company's future and your potential role in it by inquiring about professional development opportunities and paths for advancement.
This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere. Browse more remote jobs across all categories.