Already filled

AI Infrastructure Engineer

Job summary

Naperville
Engineering

Work model

Fully remote
Only US
1 month ago

Job description

Location: 100% Remote (Continental United States)
Position Type: In-house Bright Vision Technologies SOW engagement
Experience: 6+ years
Sponsorship: No new H1B sponsorship available. H1B transfers welcomed.
Employment Type: Full-time, direct W2 with Bright Vision Technologies

Bright Vision Technologies is seeking a skilled AI Infrastructure Engineer to design, build, and operate the platform layer that powers large-scale AI training and inference workloads. This role focuses on GPU clusters, distributed training frameworks, scheduling, storage performance, and developer experience for ML engineers and researchers, with a strong emphasis on reliability, efficiency, and cost control.

Key Responsibilities

  • Design and operate GPU and accelerator infrastructure for training and inference.
  • Build scheduling, queueing, and resource-sharing systems to maximize accelerator utilization.
  • Integrate frameworks such as PyTorch, JAX, DeepSpeed, FSDP, Megatron-LM, and Ray Train.
  • Operate high-performance storage systems and data pipelines.
  • Design networking architectures supporting RDMA, InfiniBand, NCCL, and high-bandwidth collective communication.
  • Build observability for AI workloads.
  • Implement checkpointing, restart, and fault-tolerance patterns.
  • Drive cost optimization across compute, storage, and networking.
  • Build developer tooling and paved-road workflows.
  • Partner with research and applied ML teams to plan capacity.
  • Implement security controls, isolation, and access management.
  • Drive automation across cluster provisioning, lifecycle management, and configuration enforcement.
  • Maintain runbooks, capacity dashboards, and operational documentation.
  • Stay current with AI infrastructure research, accelerator hardware, and emerging open-source AI tooling.

Required Qualifications

  • Bachelor's or Master's degree in Computer Science or a related field.
  • Six or more years of experience in infrastructure, platform, or HPC engineering.
  • Hands-on experience operating GPU clusters or large-scale ML training infrastructure.
  • Strong proficiency in Python and at least one systems language (Go or C++).
  • Deep understanding of distributed training, accelerator architectures, and collective communication.
  • Experience with Kubernetes, Slurm, Ray, or similar scheduling systems.
  • Strong understanding of Linux internals, networking, and high-performance storage.
  • Experience with at least one major cloud provider's ML infrastructure offerings.
  • Strong software engineering practices (testing, CI/CD, code review).
  • Excellent communication and cross-functional collaboration skills.

Preferred Qualifications

  • Experience operating InfiniBand or RDMA networking at scale.
  • Contributions to open-source ML infrastructure projects.
  • Familiarity with custom orchestrators or research-grade training stacks.
  • Exposure to frontier model training operations.
  • Experience with FinOps for AI workloads.

How to Apply

For immediate consideration, please send your resume to [email protected] or contact us at +1 (908) 765-8199. Learn more about Bright Vision Technologies at www.bvteck.com.

Bright Vision Technologies is an equal opportunity employer and values diversity and inclusion.


Note: This is a direct W2 position with Bright Vision Technologies. No C2C, 1099, or third-party arrangements are accepted. H1B transfers are welcomed for qualified candidates. A technical coding assessment is mandatory.