Head of Platform/AI Cluster Management - System Integrator (San Francisco) Job at Hamilton Barnes Associates Limited, San Francisco, CA

  • Hamilton Barnes Associates Limited
  • San Francisco, CA

Job Description

Ready to lead innovation at the intersection of platforms and artificial intelligence?

Join a pioneering technology company driving advancements in cloud, AI, and data-driven solutions across global markets. The organization is recognized for fostering innovation, scalability, and collaboration through cutting-edge platforms that empower enterprises to evolve intelligently.

The team is hiring a Head of Platform/AI Cluster Management to oversee the strategic development, integration, and optimization of AI and platform initiatives. The role will focus on leading cross-functional teams, enhancing performance and scalability, and aligning technology strategy with long-term business goals.

Shape the future of intelligent platforms and transformative innovation. Apply now!

Responsibilities

  • Own the scheduler/runtime layer (Slurm, Kubernetes, Ray), including multi-tenancy, quotas, and GPU/host fleet management.
  • Lead cluster operations across images, CI/CD, repair/health, performance/telemetry, and incident response.
  • Deliver platform services that ensure workload SLOs and reliable runtime execution.
  • Define and implement namespace/tenancy design, node health automation, golden images, admission controls, on-call runbooks, and go-live gates.
  • Collaborate closely with infra, SRE, and network teams to optimize workload placement and cluster efficiency.
  • Provide hands-on expertise in NCCL behaviours, placement strategies, and congestion signal management.

Requirements

  • Deep expertise in cluster management, scheduling, and runtime environments for large-scale compute.
  • Hands-on background with Slurm, Kubernetes, Ray, or similar orchestration platforms.
  • Strong understanding of NCCL performance tuning, workload isolation, and congestion management.
  • Experience scaling multi-tenant, GPU-heavy clusters with strict SLOs.
  • Ability to thrive in a startup environment with full ownership over platform and cluster strategy.

Salary

  • $500,000 gross per year (Negotiable)

Job Tags

Full time
