About the Role
We are looking for an Evaluation Scientist who can work across both hands-on experimentation and automation infrastructure. The role begins with running manual evaluations (e.g., executing and monitoring individual experiments) and progresses toward building scripts, tools, and infrastructure that streamline and automate these processes, with the long-term goal of minimizing manual work. The ideal candidate will also bring expertise in coding agents and quality evaluation, enabling them to design robust experiments and improve workflows. While the role will receive high-level guidance, the candidate should be able to independently define and implement the lower-level details of experiment setup after ramping up. For example, given a high-level requirement for a new type of evaluation, they should be able to propose and execute an implementation plan with detailed steps, metrics, and automation in place.
Key Responsibilities
- Run and manage manual evaluation experiments across AI/ML systems.
- Develop and maintain automation infrastructure (scripts, pipelines, tools) to reduce manual evaluation work.
- Design and execute new types of evaluations, translating broad research questions into structured experiment setups.
- Work with coding agents and applied ML workflows to define and measure quality.
- Define metrics, benchmarks, and evaluation criteria to assess performance and identify gaps.
- Collaborate with research leads to align evaluation design with project goals while owning implementation details.
- Ensure reproducibility, consistency, and scalability of evaluation processes.
Qualifications
- Strong coding skills in Python (or equivalent) for scripting, automation, and experiment design.
- Experience with running and analyzing experiments, including quality evaluation methodologies.
- Knowledge of coding agents, ML models, or applied automation frameworks.
- Ability to work independently: take high-level requirements and define detailed steps for execution.
- 2–4 years of hands-on experience in evaluation, scripting, or applied data science/ML (academic or industry).
- Strong analytical skills with experience in data handling, reporting, and experiment analysis.
Preferred Skills
- Familiarity with evaluation frameworks and automation tools in AI/ML research.
- Experience in building scalable infrastructure for experiments or evaluations.
- Knowledge of experimental design, statistical testing, or quality benchmarking.