[Remote] Site Reliability Engineer

Job summary

United States

Engineering

Work model

Fully remote

Only United States

5 days ago

Job description

About Veritone

Veritone is an AI company that offers machine learning models transforming data sources into actionable intelligence. Founded in 2014 and headquartered in Irvine, California, USA, Veritone employs 501-1000 people. Learn more at https://www.veritone.com.

Veritone has a track record of offering H1B sponsorships, with 2 in 2023, 4 in 2022, and 1 in 2021. Please note that this does not guarantee sponsorship for this specific role.

About the Role

Veritone, an AI-first company, is seeking a remote Site Reliability Engineer based in the United States. This role involves deploying and maintaining a resilient SaaS application platform, designing scalable infrastructure for AI/ML workloads, and automating monitoring and incident response systems.

Responsibilities

Deploy and maintain a resilient, secure, and efficient SaaS application platform to meet established SLAs
Build and maintain robust CI/CD pipelines and developer platforms to empower engineering teams to release features quickly and safely
Design and deploy scalable infrastructure specifically optimized for AI/ML workloads, including managing GPU resources and integrating MLOps tools
Automate monitoring, management, and incident response to achieve an auto-remediation system
Participate in on-call rotation to ensure stability and uptime for our platforms
Scale infrastructure to meet rapidly increasing demand
Independently design and develop tools that aid operations and AI automation, and work jointly with other team members to deliver innovative solutions to complex business and technical challenges
Provide deployment and operations support for multi-tiered distributed software applications
Estimate engineering effort, plan implementation, and roll out system changes that meet requirements for functionality, performance, scalability, reliability, and adherence to development goals and principles
Collaborate in a fast-paced environment with multiple teams (software development, release management, build and release, etc.)
Define how the desired behavior of large-scale systems can be achieved
Measure and achieve reliability through engineering and operations automation
Develop monitoring, alerting, documentation, and management practices with the goal of creating an auto-remediation system that improves platform stability
Adapt security controls to products not typically native to GA releases
Develop automation methods to extend standard deployment pipelines for bespoke implementations
Patch, configure, manage, enforce policies, and audit production systems
Drive the Disaster Recovery process

Skills

7+ years of experience in Linux systems and software management
Expertise with Terraform, Ansible, and cloud platforms like AWS, Azure, and GCP
Experience with large-scale distributed systems
Monitoring/alerting systems (Prometheus, Grafana)
CI/CD pipelines
Container orchestration (Docker, Kubernetes)
Programming languages (Go, Java, Python)
Engineering scalable infrastructure for machine learning workloads
GPU provisioning and MLOps integrations
Implementing security controls
Automating deployments
Troubleshooting complex systems
7+ years of professional Linux and Windows systems and software management experience
Expertise with Infrastructure-as-Code tools such as Terraform and CloudFormation
Knowledge of programming languages including Python, Go, and Node.js
Experience managing infrastructure within Azure, GCP, and AWS
Expertise in Kubernetes management and upgrades
Strong scripting skills for systems and data-driven solutions
Strong GitOps and CI/CD experience with tools such as Jenkins, ArgoCD, and Helm
Proven ability to lead root-cause analysis (RCA) and blameless post-mortems
Ability to act as an infrastructure consultant to software engineering teams
Ability to identify systemic weaknesses across multi-tiered applications
Ability to drive a culture of observability
Comprehensive background in monitoring, alerting, and auto-remediation systems
Familiarity with deploying, scaling, and observing AI models, Vector Databases, or LLMs in production environments
Proven examples of standardizing security controls and configuration management across large-scale infrastructure
Comfort working within project/task management platforms
Bachelor's degree in Computer Science or related field
Experience provisioning and managing GPU infrastructure (e.g., NVIDIA CUDA)
Experience working in regulated or public sector environments through development and assessment of cloud-based solutions
Experience with the following languages, platforms, and tools: Perl, Java, VMware
Concrete examples of creating auto-remediation systems and infrastructure with agentic solutions

Benefits

Incentive compensation
Health benefits
Retirement benefits
Life insurance
Paid time off
Parental leave and benefits
Other employee perks and benefits

Company H1B Sponsorship

Veritone has a track record of offering H1B sponsorships, with 2 in 2023, 4 in 2022, 1 in 2021, and 4 in 2020. Please note that this does not guarantee sponsorship for this role.
