Posted: 2025/09/04
Rakuten Group, Inc.
Rakuten Group, Inc.: 1028057 Data Engineer, LLM – Rakuten Institute of Technology (RIT)
Not disclosed
Tokyo
Company name
Rakuten Group, Inc.
Company overview
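Believe in the future, and create a better tomorrow.
Empower people and society through innovation. Holding to that aspiration, we deliver joy and fun to people around the world.
Rakuten operates more than 70 services, including e-commerce, FinTech, digital content, and telecommunications, used by more than 1 billion users worldwide.
We connect these services organically around a membership program centered on Rakuten members, forming the unique "Rakuten Ecosystem." Promoting diversity is one of Rakuten's top-priority corporate strategies. Our employees come from more than 70 countries and regions. Talented people with unique and diverse cultural backgrounds and perspectives gather from around the world and drive our innovation. The in-house cafeteria offers vegetarian and halal menus, and there is also a prayer room.
We also actively support balancing work and childcare, and promote the employment and advancement of people with disabilities. We are strengthening information sharing and support for LGBT (※1) employees and allies (※2) within the company. Everyone can work as themselves and make full use of their abilities. That is diversity at Rakuten.
Rakuten provides more than 70 services, has service bases in 30 countries, and has employees from more than 100 countries and regions. Another attraction is the open position system, which lets employees pursue diverse career paths.
Flextime and, depending on circumstances, remote work are available. The head office has an on-site daycare center, a fitness gym, and a cafeteria serving three free meals a day, providing an environment that supports employees.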
Position
1028057 Data Engineer, LLM - Rakuten Institute of Technology (RIT)
Job description
Department Overview
The Rakuten Institute of Technology Worldwide (RIT), the AI R&D engine of Rakuten Group, Inc., is a global network of research labs spanning Tokyo, Singapore, Boston, San Mateo, Bengaluru, and Paris. We are dedicated to pioneering advancements in core AI technologies, with a focus on machine learning, deep learning, and generative AI. Our researchers are actively exploring use cases for large language models, intelligent agent systems, and other cutting-edge applications, driving innovation across Rakuten's diverse ecosystem.
Position:
Why We Hire
To establish and support domain-leading LLMs across critical sectors such as FinTech, booking services, and e-commerce, we are building a foundational Senior Data Engineering team. This team will play a critical role in designing, building, and maintaining the robust data infrastructure essential for the entire LLM lifecycle, from data collection and preparation for pre-training and fine-tuning to serving and monitoring. You will work closely with ML Engineers, Data Scientists, and Researchers to ensure data quality, accessibility, and scalability, directly impacting the success and performance of our in-house LLM initiatives.
Position Details
Data Pipeline Development for LLMs:
Design, develop, and maintain highly scalable, reliable, and efficient data pipelines (ETL/ELT) for ingesting, transforming, and loading diverse datasets critical for LLM pre-training, fine-tuning, and evaluation. This includes structured, semi-structured, and unstructured text data.
High-Quality Dataset Creation & Curation:
Implement advanced techniques for data cleaning and preprocessing, including deduplication, noise reduction, PII masking, tokenization, and formatting of large text corpora.
Explore and implement methods for expanding and enriching datasets for LLM training, such as data augmentation and synthesis.
Establish and enforce rigorous data quality standards, implement automated data validation checks, and ensure data privacy and security compliance (e.g., GDPR, CCPA).
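As a rough illustration of two of the preprocessing steps named above (deduplication and PII masking), here is a minimal Python sketch using only the standard library. The regular expressions, placeholder tokens, and function names are assumptions made for this example and are not taken from the posting; production PII detection and near-duplicate detection need far broader coverage than this.

```python
import hashlib
import re
import unicodedata

# Hypothetical patterns for illustration only; real PII detection is much broader.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def normalize(text: str) -> str:
    """NFKC-normalize, lowercase, and collapse whitespace to build a dedup key."""
    text = unicodedata.normalize("NFKC", text).lower()
    return " ".join(text.split())


def mask_pii(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def clean_corpus(docs):
    """Yield PII-masked documents, skipping exact duplicates after normalization."""
    seen = set()
    for doc in docs:
        key = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if key in seen:
            continue  # same content as an earlier document, up to case and spacing
        seen.add(key)
        yield mask_pii(doc)
```

Hashing the normalized text catches only exact duplicates up to case and whitespace; near-duplicate detection (for example MinHash) would be a separate step.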
Data Job Management:
Establish robust systems for data versioning, lineage tracking, and reproducibility of datasets used across the LLM development lifecycle.
Identify and resolve data-related performance bottlenecks within data pipelines, optimizing data storage, retrieval, and processing for efficiency and cost-effectiveness.
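One simple way to think about dataset versioning and lineage is content addressing: derive a version ID from the data itself and record which parent version and transformation produced it. The sketch below is an assumption-laden illustration in plain Python; the function names and the manifest fields are hypothetical and do not describe any specific tool or Rakuten's internal systems.

```python
import hashlib


def dataset_version(records):
    """Return a deterministic, order-independent version ID for text records."""
    digests = sorted(hashlib.sha256(r.encode("utf-8")).hexdigest() for r in records)
    return hashlib.sha256("\n".join(digests).encode("utf-8")).hexdigest()[:16]


def lineage_entry(parent_version, transform, records):
    """Describe a derived dataset: its ID, its parent, and the step that produced it."""
    return {
        "version": dataset_version(records),
        "parent": parent_version,      # None for a raw, root dataset
        "transform": transform,        # hypothetical step name, e.g. "deduplicate"
        "num_records": len(records),
    }
```

Because the ID depends only on content, rerunning the same transformation on the same input yields the same version, which is what makes a dataset reproducible and auditable.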
Data Infrastructure & Orchestration:
Build and maintain scalable data warehouses and data lakes specifically designed for LLM data in both on-premises and public cloud environments.
Implement and manage data orchestration tools (e.g., Apache Airflow, Prefect, Dagster) to automate and manage complex data workflows for LLM dataset preparation.
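Orchestration tools of this kind generally model a workflow as a directed acyclic graph of tasks. The sketch below shows only that core idea with Python's standard `graphlib` module; it is not the API of Airflow, Prefect, or Dagster, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each maps to the set of tasks it depends on.
PIPELINE = {
    "ingest": set(),
    "deduplicate": {"ingest"},
    "mask_pii": {"deduplicate"},
    "tokenize": {"mask_pii"},
    "validate": {"mask_pii"},
    "publish": {"tokenize", "validate"},
}


def run_pipeline(graph, tasks):
    """Run each task callable once, after all of its dependencies have run."""
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        tasks[name]()
    return order
```

A real orchestrator adds scheduling, retries, and monitoring on top of this dependency ordering.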
Required experience and skills
Mandatory Qualifications:
- Bachelor's or Master's degree in Computer Science, Data Science, Engineering, or a related quantitative field, plus 3+ years of professional experience in Data Engineering. That experience should focus significantly on building and managing data pipelines for large-scale machine learning or data science initiatives, especially those involving large text, image, or voice datasets.
- Direct experience with data engineering specifically for Large Language Models (LLMs), including pre-training, fine-tuning, and evaluation datasets.
- Familiarity with common challenges and techniques for preprocessing massive text corpora (e.g., handling noise, deduplication, PII detection/masking, tokenization at scale).
- Experience with data versioning and lineage tools/platforms (e.g., DVC, Pachyderm, lakeFS, or data versioning features within MLOps platforms like MLflow).
- Familiarity with deep learning frameworks (e.g., PyTorch, TensorFlow, JAX) from a data loading and preparation perspective.
- Experience designing and implementing data annotation workflows and pipelines.
- Strong proficiency in Python, and extensive experience with its data ecosystem.
- Proficiency in SQL and a good understanding of data warehousing concepts, data modeling, and schema design.
Other Information:
Additional information on English Qualification
English: Fluent