Zefr is the leading global technology company enabling responsible marketing in walled garden social environments. Zefr’s solutions empower brands to manage their content adjacency on scaled platforms such as YouTube, Meta, TikTok, and Snap, in accordance with industry standard frameworks. Through its patented AI technology, Zefr offers brands and agencies more accurate and transparent solutions for social walled gardens. The company is headquartered in Los Angeles, California, with additional locations across the globe.
As a Site Reliability Engineer at Zefr, you’ll apply your expertise in cloud infrastructure, CI/CD, observability, and core SRE concepts to deliver high-quality, reliable, and scalable solutions. A significant aspect of this role involves working closely with Zefr's Machine Learning team, ensuring the specialized infrastructure required for model training, deployment, and serving is robust, efficient, and scalable.
We’re looking for someone who combines technical expertise with strong leadership and a passion for continuous improvement and innovation. By ensuring the continuous health and efficiency of our infrastructure, including the systems supporting critical ML workloads, you will directly contribute to Zefr’s commitment to providing a consistently high-quality user experience. This is a role where we expect to learn from you as much as you learn from us!
Support and build systems and tools that enable other engineers to generate, deploy, and manage product features and models both quickly and safely.
Deploy and support a multi-cloud, microservice architecture, including infrastructure tailored for ML workloads, deployed via GitHub Actions, Argo CD, and Kubernetes.
Collaborate with other engineers, particularly the Machine Learning team, to architect secure, resilient, scalable, and cost-efficient applications and ML systems/pipelines in AWS and GCP.
Foster and push our DevOps culture and philosophy by encouraging continuous improvement across all engineering teams.
Proactively maintain the health of production environments, including monitoring ML model performance and resource utilization.
Participate in a 24/7 on-call rotation and respond to system performance issues and outages.
Debug code at the application and infrastructure level.
Mature our CI/CD workflows and release process.
Maintain a forward-thinking approach, actively researching and proposing new solutions.
Propose and review Engineering Requests for Comments (RFCs) to drive Engineering architecture and practices.
Core Infrastructure & Cloud Platforms:
Cloud Providers: Google Cloud Platform (GCP), Amazon Web Services (AWS)
Infrastructure as Code (IaC): Terraform
Containerization & Orchestration: Docker, Kubernetes (experience with GKE and/or EKS expected), Helm, Kustomize
Service Mesh: Istio
CI/CD & Automation:
CI/CD Pipelines: GitHub Actions
GitOps / Continuous Delivery: Argo CD
Primary Scripting/Automation Language: Python
Observability & Monitoring:
Monitoring & Alerting: Prometheus, Datadog, PagerDuty
Telemetry Standards: OpenTelemetry
Application & Data Ecosystem (Supporting):
Application Languages/Frameworks: Python, FastAPI, Flask, Node.js, React
Data Streaming: Apache Kafka
Data Processing/Transformation: Pandas, dbt
Workflow Orchestration: Apache Airflow, Ray
Machine Learning Stack:
Serving: Triton Inference Server
MLOps/Experiment Tracking: Weights & Biases, DVC
Libraries/Frameworks: Transformers, HuggingFace
Model Optimization/Formats: ONNX, TensorRT
Data Stores & Databases:
Relational Databases: PostgreSQL (including managed versions like AWS Aurora, GCP Cloud SQL)
NoSQL Databases: DynamoDB
Search Databases: OpenSearch
Vector Databases: Qdrant
Caching: Redis
Data Warehousing: Snowflake
6+ years of experience designing, managing, deploying, and supporting cloud infrastructure in a production environment using major public cloud providers (one of AWS or GCP required)
Production experience designing, managing, deploying, and maintaining container-based workloads on Kubernetes clusters
1+ year of Machine Learning Infrastructure Development and Operations
Knowledge of GitOps, including an understanding of modern CI/CD pipelines, techniques, and technologies (GitHub Actions, GitLab, CircleCI, Argo CD, Flux)
Knowledge of IaC and configuration management tools (Terraform, OpenTofu, Crossplane, Pulumi, Ansible, CloudFormation)
Strong problem-solving experience, focusing on automation
Production experience with monitoring and observability tools (Prometheus, Grafana, Datadog, Thanos, New Relic, OpenTelemetry)
Understanding of cloud networking concepts (mesh networking, NAT, load balancers, SSL certificates and TLS termination, API gateways, proxies, etc.)
Strong written and verbal communication, organization, and documentation skills
Flexible PTO
Medical, dental, and vision insurance with FSA options
Company-paid life insurance
Paid parental leave
401(k) with company match
Professional development opportunities
13+ paid holidays off
Summer Fridays (we leave early)
Hybrid work schedule
In-office lunches and lots of free food
Optional in-person and virtual events (we like to celebrate!)
The anticipated salary for this position is between $150,000 and $170,000. Within the range, individual pay is determined by factors such as job-related skills, experience, and relevant education or training. If your compensation expectations fall outside of this range, it may still be worth having a conversation.
Zefr is an equal opportunity employer that embraces diversity and inclusion in the workplace. We are committed to building a team that represents a variety of backgrounds, skills, and perspectives because we know this only makes us better. We strongly encourage women, persons of color, LGBTQIA+ individuals, persons with disabilities, members of ethnic minorities, foreign-born residents, and veterans to apply even if you do not meet 100% of the qualifications.