JAPAN AI株式会社：【JAPAN AI】AI Quality Scientist / English

800～1,600万円

東京都

会社名

JAPAN AI株式会社

会社概要

JAPAN AI株式会社は、2023年4月に設立されたAIスタートアップです。

グロース市場に上場している株式会社ジーニーのグループ会社として設立されました。
ジーニーは、プロダクト開発において積極的にAI技術を活用しており、自社プロダクトである「GENIEE SFA/CRM」や「GENIEE CHAT」において、ChatGPTを用いた議事録の自動要約やメールの自動作成など、お客様の業務効率化や生産性向上につながるAI関連機能を提供しています。

このような状況の中、ジーニーグループはAI技術に関わる導入コンサルティング、プロダクト提供、ならびに研究開発をさらに推進するために、2023年4月に戦略的子会社である「JAPAN AI株式会社」を設立いたしました。

当社は「AIで持続可能な未来の社会を創る」というPurposeを掲げ、日本企業の生産性向上や産業の活性化のための様々なAIプロダクトの開発と提供を行っています。高度なプロダクトを開発するために、ChatGPTをはじめとした各種大規模言語モデルやGenerative AIなどの分野の研究も進めています。

昨年2024年11月には、国内企業としても一早く「AIエージェント」をローンチし、多くの企業様に高評価を頂いており、国内市場を席捲しております。
当社はAI市場のトップランカーであると自負しております。

ポジション

【JAPAN AI】AI Quality Scientist / English

仕事内容

●Mission
"Science the quality of AI — prove agent reliability through evaluation research and development."
Quantitatively evaluate and improve LLM / AI agent output quality using methods from machine learning, statistics, and psychometrics. Establish "AI Evaluation Science" as a new research discipline within the company — from evaluation metric R&D to production deployment of automated evaluation pipelines — and scientifically guarantee the quality of products used in production by approximately 200 companies.

●Role&Expectations
As an AI Quality Scientist, you will lead both the research and implementation aspects of AI agent quality evaluation.

・Research and develop evaluation metrics — scientifically define "what constitutes quality" through LLM-as-Judge calibration, reward modeling, and benchmark design
・Design and build automated evaluation pipelines — integrate research outcomes into production CI/CD to deliver scalable quality gates
・Red teaming and safety verification — automate adversarial testing and build policy compliance verification frameworks
・Drive quality improvement through statistical experimental design — quantitatively verify the effectiveness of prompt strategies and model changes through A/B tests and significance testing
・Feed evaluation signals back to research and development teams — build a compound-interest loop for model improvement
・Ensure the quality of products used in production by ~200 companies through a "science of quality" approach

●Job Description
・Evaluation Metric Research & Development
　・Research and implement LLM-as-Judge calibration methods (rubric design, bias detection, proper scoring rules)
　・Design, build, and validate evaluation benchmarks (construct validity, contamination detection)
　・Research the application of reward modeling / preference learning to evaluation
　・Select and design evaluation metrics (win rate, task success, factuality, harm detection)
　・Design, build, and maintain evaluation sets (synthetic data + real logs)

・Automated Evaluation Pipeline Design & Development
　・Design and implement scalable automated evaluation pipelines
　・Integrate evaluation pipelines into CI/CD and build quality gates
　・Design agent evaluation harnesses (multi-turn, tool use, long-context support)
　・Ensure reproducibility and reliability of evaluation pipelines

・Safety & Quality Verification
　・Research and implement automated red teaming (automated adversarial testing)
　・Build safety and policy compliance verification frameworks
　・Research and implement hallucination detection and calibration methods
　・Design and execute prompt / tool regression tests

・Statistical Analysis & Experimental Design
　・Design and analyze statistical experiments (A/B tests, significance testing)
　・Visualize quality trends and automate regression detection
　・Create quality reports and improvement proposals
　・Feed evaluation signals back to research and development teams

求める経験・スキル

You May Be a Good Fit If You
Education & Experience
Master's degree or higher (or equivalent practical experience) in Computer Science, Machine Learning, Statistics, Mathematics, Physics, Psychometrics, or related fields
3+ years of practical experience as an ML Engineer, Data Scientist, Research Engineer, or in ML/AI evaluation-related roles
Technical Skills
Deep knowledge of LLM / generative AI evaluation methods (benchmark design, LLM-as-Judge, quantitative output quality measurement, hallucination detection, etc.)
Practical knowledge of statistics and experimental design (hypothesis testing, A/B testing, confidence intervals, effect sizes, etc.)
Experience building ML / evaluation pipelines in Python
Practical experience with machine learning frameworks (PyTorch, JAX, TensorFlow, etc.)
Experience designing and implementing evaluation metrics (task-specific metric design beyond precision/recall)
Language requirement (at least one of the following):
Japanese: Fluent — able to discuss product development without friction
English: Business level

Strong Candidates May Also Have
Publication experience at top ML/NLP conferences (NeurIPS, ICML, ICLR, ACL, EMNLP, etc.)
Research or implementation experience with reward modeling / preference learning (RLHF, DPO, etc.)
Experience with LLM-as-Judge calibration and rubric design
Knowledge or experience in AI safety, Responsible AI, and red teaming
Experience with benchmark design and validity verification (IRT, construct validity)
Experience evaluating multi-agent workflows, tool use, and long-context scenarios
Large-scale data processing experience (Spark / BigQuery, etc.)
Experience integrating ML / evaluation pipelines into CI/CD
Ability to read, comprehend, and reproduce research papers
Technical communication ability in English

Tech Stack
Languages : Python (evaluation pipelines & analysis) , TypeScript / React / Next.js (frontend) / NX
Evaluation/QA : pytest, LangSmith, Weights & Biases, custom eval frameworks
Data : BigQuery, Spark, Pandas
Infrastructure : GCP (containers / K8s) , Docker, Terraform
CI/CD : GitHub Actions
Tools : Slack, Confluence, Linear, Google Workspace, GitHub, Notion
AI Dev Support: Claude Code MAX Plan, Cursor, ChatGPT, Devin
Work environment : Mac (Apple Silicon) , dual monitors available

Key Results (KR/Metrics)
Evaluation coverage rate (test case coverage)
Regression detection rate (pre-release quality degradation detection ≥ 95%)
Evaluation pipeline execution time (completed within CI/CD)
LLM-as-Judge and human evaluation agreement rate
False positive / false negative rate
Safety incident rate (post-release)