Engineering Manager - AI Reliability: Anthropic
Aug 21, 2025
Location: San Francisco, CA (Hybrid policy requiring at least 25% of time in the office).
Deadline: Not specified
Experience: Mid
Continent: North America
Salary: $405,000 - $485,000 USD per year
Anthropic is a public benefit corporation with a mission to create reliable, interpretable, and steerable AI systems that are safe and beneficial for society. The team is composed of researchers, engineers, policy experts, and business leaders. It approaches AI research as a large-scale empirical science, working as a single cohesive team on a few major research efforts and valuing long-term impact over smaller puzzles.
Responsibilities:
Lead and grow a team of reliability engineers responsible for large language model (LLM) serving.
Drive the development of Service Level Objectives (SLOs) that balance availability and latency with development velocity.
Oversee the design and implementation of comprehensive monitoring systems for critical metrics.
Guide the team in architecting high-availability LLM serving infrastructure for millions of customers.
Lead the strategy for automated failover and recovery systems across multiple regions and cloud providers.
Establish and manage incident response processes to ensure rapid recovery and systematic improvements.
Direct cost optimization initiatives for large-scale AI infrastructure, focusing on accelerator (GPU/TPU/Trainium) utilization.
Partner with cross-functional teams to align reliability engineering efforts with company objectives.
Requirements:
Required Qualifications:
Experience managing and scaling reliability or infrastructure engineering teams.
Deep technical knowledge of distributed systems observability and monitoring at scale.
Understanding of the unique challenges of operating AI infrastructure.
Track record of implementing SLO/SLA frameworks and driving their adoption.
Experience with both traditional infrastructure metrics and AI-specific performance indicators.
Excellent leadership, communication, and talent development skills.
Bachelor's degree in a related field or equivalent experience.
Preferred Qualifications (Strong candidates may also have):
Managed teams operating large-scale model training or serving infrastructure (>1000 GPUs).
Hands-on experience with ML hardware accelerators (GPUs, TPUs, Trainium, etc.).
Understanding of ML-specific networking optimizations.
Led teams through major reliability transformations or infrastructure migrations.
Experience building reliability engineering practices from the ground up.
Additional Information:
Visa Sponsorship: The company sponsors visas and will make every reasonable effort to secure one for a successful candidate.
Work Arrangement: This is a hybrid role, requiring employees to be in the San Francisco office at least 25% of the time.