掲載日 ・ 2026/04/28
JAPAN AI株式会社
JAPAN AI株式会社:【JAPAN AI】AI Quality Scientist / English
800~1,600万円
東京都
JAPAN AI
その他(IT・インターネット)
AI・MLエンジニア
800万~
会社名
JAPAN AI株式会社
会社概要
JAPAN AI株式会社は、2023年4月に設立されたAIスタートアップです。
グロース市場に上場している株式会社ジーニーのグループ会社として設立されました。
ジーニーは、プロダクト開発において積極的にAI技術を活用しており、自社プロダクトである「GENIEE SFA/CRM」や「GENIEE CHAT」において、ChatGPTを用いた議事録の自動要約やメールの自動作成など、お客様の業務効率化や生産性向上につながるAI関連機能を提供しています。
このような状況の中、ジーニーグループはAI技術に関わる導入コンサルティング、プロダクト提供、ならびに研究開発をさらに推進するために、2023年4月に戦略的子会社である「JAPAN AI株式会社」を設立いたしました。
当社は「AIで持続可能な未来の社会を創る」というPurposeを掲げ、日本企業の生産性向上や産業の活性化のための様々なAIプロダクトの開発と提供を行っています。高度なプロダクトを開発するために、ChatGPTをはじめとした各種大規模言語モデルやGenerative AIなどの分野の研究も進めています。
昨年2024年11月には、国内企業としても一早く「AIエージェント」をローンチし、多くの企業様に高評価を頂いており、国内市場を席捲しております。
当社はAI市場のトップランカーであると自負しております。
ポジション
【JAPAN AI】AI Quality Scientist / English
仕事内容
●Mission
"Science the quality of AI — prove agent reliability through evaluation research and development."
Quantitatively evaluate and improve LLM / AI agent output quality using methods from machine learning, statistics, and psychometrics. Establish "AI Evaluation Science" as a new research discipline within the company — from evaluation metric R&D to production deployment of automated evaluation pipelines — and scientifically guarantee the quality of products used in production by approximately 200 companies.
●Role&Expectations
As an AI Quality Scientist, you will lead both the research and implementation aspects of AI agent quality evaluation.
・Research and develop evaluation metrics — scientifically define "what constitutes quality" through LLM-as-Judge calibration, reward modeling, and benchmark design
・Design and build automated evaluation pipelines — integrate research outcomes into production CI/CD to deliver scalable quality gates
・Red teaming and safety verification — automate adversarial testing and build policy compliance verification frameworks
・Drive quality improvement through statistical experimental design — quantitatively verify the effectiveness of prompt strategies and model changes through A/B tests and significance testing
・Feed evaluation signals back to research and development teams — build a compound-interest loop for model improvement
・Ensure the quality of products used in production by ~200 companies through a "science of quality" approach
●Job Description
・Evaluation Metric Research & Development
・Research and implement LLM-as-Judge calibration methods (rubric design, bias detection, proper scoring rules)
・Design, build, and validate evaluation benchmarks (construct validity, contamination detection)
・Research the application of reward modeling / preference learning to evaluation
・Select and design evaluation metrics (win rate, task success, factuality, harm detection)
・Design, build, and maintain evaluation sets (synthetic data + real logs)
・Automated Evaluation Pipeline Design & Development
・Design and implement scalable automated evaluation pipelines
・Integrate evaluation pipelines into CI/CD and build quality gates
・Design agent evaluation harnesses (multi-turn, tool use, long-context support)
・Ensure reproducibility and reliability of evaluation pipelines
・Safety & Quality Verification
・Research and implement automated red teaming (automated adversarial testing)
・Build safety and policy compliance verification frameworks
・Research and implement hallucination detection and calibration methods
・Design and execute prompt / tool regression tests
・Statistical Analysis & Experimental Design
・Design and analyze statistical experiments (A/B tests, significance testing)
・Visualize quality trends and automate regression detection
・Create quality reports and improvement proposals
・Feed evaluation signals back to research and development teams
求める経験・スキル
You May Be a Good Fit If You
Education & Experience
Master's degree or higher (or equivalent practical experience) in Computer Science, Machine Learning, Statistics, Mathematics, Physics, Psychometrics, or related fields
3+ years of practical experience as an ML Engineer, Data Scientist, Research Engineer, or in ML/AI evaluation-related roles
Technical Skills
Deep knowledge of LLM / generative AI evaluation methods (benchmark design, LLM-as-Judge, quantitative output quality measurement, hallucination detection, etc.)
Practical knowledge of statistics and experimental design (hypothesis testing, A/B testing, confidence intervals, effect sizes, etc.)
Experience building ML / evaluation pipelines in Python
Practical experience with machine learning frameworks (PyTorch, JAX, TensorFlow, etc.)
Experience designing and implementing evaluation metrics (task-specific metric design beyond precision/recall)
Language requirement (at least one of the following):
Japanese: Fluent — able to discuss product development without friction
English: Business level
Strong Candidates May Also Have
Publication experience at top ML/NLP conferences (NeurIPS, ICML, ICLR, ACL, EMNLP, etc.)
Research or implementation experience with reward modeling / preference learning (RLHF, DPO, etc.)
Experience with LLM-as-Judge calibration and rubric design
Knowledge or experience in AI safety, Responsible AI, and red teaming
Experience with benchmark design and validity verification (IRT, construct validity)
Experience evaluating multi-agent workflows, tool use, and long-context scenarios
Large-scale data processing experience (Spark / BigQuery, etc.)
Experience integrating ML / evaluation pipelines into CI/CD
Ability to read, comprehend, and reproduce research papers
Technical communication ability in English
Tech Stack
Languages : Python (evaluation pipelines & analysis) , TypeScript / React / Next.js (frontend) / NX
Evaluation/QA : pytest, LangSmith, Weights & Biases, custom eval frameworks
Data : BigQuery, Spark, Pandas
Infrastructure : GCP (containers / K8s) , Docker, Terraform
CI/CD : GitHub Actions
Tools : Slack, Confluence, Linear, Google Workspace, GitHub, Notion
AI Dev Support: Claude Code MAX Plan, Cursor, ChatGPT, Devin
Work environment : Mac (Apple Silicon) , dual monitors available
Key Results (KR/Metrics)
Evaluation coverage rate (test case coverage)
Regression detection rate (pre-release quality degradation detection ≥ 95%)
Evaluation pipeline execution time (completed within CI/CD)
LLM-as-Judge and human evaluation agreement rate
False positive / false negative rate
Safety incident rate (post-release)