[Remote] Site Reliability Engineer

Job summary

United States
Engineering

Work model

Fully remote
Only United States
5 days ago
Job description

About Veritone

Veritone is an AI company that offers machine learning models transforming data sources into actionable intelligence. Founded in 2014 and headquartered in Irvine, California, USA, Veritone employs 501-1000 people. Learn more at https://www.veritone.com.

Veritone has a track record of offering H1B sponsorships, with 2 in 2023, 4 in 2022, and 1 in 2021. Please note that this does not guarantee sponsorship for this specific role.

About the Role

Veritone, an AI-first company, is seeking a remote Site Reliability Engineer based in the United States. This role involves deploying and maintaining a resilient SaaS application platform, designing scalable infrastructure for AI/ML workloads, and automating monitoring and incident response systems.

Responsibilities

  • Deploy and maintain a resilient, secure, and efficient SaaS application platform to meet established SLAs
  • Build and maintain robust CI/CD pipelines and developer platforms to empower engineering teams to release features quickly and safely
  • Design and deploy scalable infrastructure specifically optimized for AI/ML workloads, including managing GPU resources and integrating MLOps tools
  • Automate monitoring, management, and incident response to achieve an auto-remediation system
  • Participate in on-call rotation to ensure stability and uptime for our platforms
  • Scale infrastructure to meet rapidly increasing demand
  • Independently design and develop tools to aid operations and automation for AI, and work jointly with other team members to deliver innovative solutions to complex business and technical challenges
  • Provide deployment and operations support for multi-tiered distributed software applications
  • Estimate engineering effort, plan implementation, and roll out system changes that meet requirements for functionality, performance, scalability, reliability, and adherence to development goals and principles
  • Collaborate in a fast-paced environment with multiple teams (software development, release management, build and release, etc.)
  • Define how the desired behavior of large-scale systems can be achieved
  • Measure and achieve reliability through engineering and operations automation
  • Develop monitoring, alerts, documentation, and management processes with the goal of creating an auto-remediation system that improves platform stability
  • Adapt security controls to products not typically native to GA releases
  • Develop automation methods to extend standard deployment pipelines for bespoke implementations
  • Patch, configure, manage, enforce policies, and audit production systems
  • Drive the Disaster Recovery process

Skills

  • 7+ years of experience in Linux systems and software management
  • Expertise with Terraform, Ansible, and cloud platforms like AWS, Azure, and GCP
  • Experience with large-scale distributed systems
  • Monitoring/alerting systems (Prometheus, Grafana)
  • CI/CD pipelines
  • Container orchestration (Docker, Kubernetes)
  • Programming languages (Go, Java, Python)
  • Engineering scalable infrastructure for machine learning workloads
  • GPU provisioning and MLOps integrations
  • Implementing security controls
  • Automating deployments
  • Troubleshooting complex systems
  • 7+ years of professional Linux and Windows systems and software management experience
  • Expertise with Infrastructure-as-Code tools such as Terraform and CloudFormation
  • Proficiency in programming languages including Python, Go, and Node.js
  • Experience managing infrastructure within Azure, GCP, and AWS
  • Expertise in Kubernetes management and upgrades
  • Strong scripting skills for systems and data-driven solutions
  • Strong GitOps and CI/CD experience with tools such as Jenkins, ArgoCD, and Helm
  • Proven ability to lead root-cause analysis (RCA) and blameless post-mortems
  • Act as an infrastructure consultant to software engineering teams
  • Identify systemic weaknesses across our multi-tiered applications
  • Drive a culture of observability
  • Comprehensive background in monitoring, alerting, and auto-remediation systems
  • Familiarity with deploying, scaling, and observing AI models, Vector Databases, or LLMs in production environments
  • Proven examples of standardizing security controls and configuration management across large-scale infrastructure
  • Comfort working within project/task management platforms
  • Bachelor's degree in Computer Science or related field
  • Experience provisioning and managing GPU infrastructure (e.g., NVIDIA CUDA)
  • Experience working in regulated or public sector environments through development and assessment of cloud-based solutions
  • Experience with the following languages, platforms, and tools: Perl, Java, VMware
  • Concrete examples of creating auto-remediation systems and infrastructure with agentic solutions

Benefits

  • Incentive compensation
  • Health benefits
  • Retirement benefits
  • Life insurance
  • Paid time off
  • Parental leave and benefits
  • Other employee perks and benefits

Company H1B Sponsorship

Veritone has a track record of offering H1B sponsorships, with 2 in 2023, 4 in 2022, 1 in 2021, 4 in 2020. Please note that this does not guarantee sponsorship for this role.
