AI Infrastructure Engineer
Job summary
Location: 100% Remote (Continental United States)
Position Type: In-house Bright Vision Technologies SOW engagement
Experience: 6+ years
Sponsorship: No new H1B sponsorship available. H1B transfers welcomed.
Employment Type: Full-time, direct W2 with Bright Vision Technologies
Bright Vision Technologies is seeking a skilled AI Infrastructure Engineer to design, build, and operate the platform layer that powers large-scale AI training and inference workloads. This role focuses on GPU clusters, distributed training frameworks, scheduling, storage performance, and developer experience for ML engineers and researchers, with a strong emphasis on reliability, efficiency, and cost control.
Key Responsibilities
- Design and operate GPU and accelerator infrastructure for training and inference.
- Build scheduling, queueing, and resource-sharing systems to maximize accelerator utilization.
- Integrate frameworks such as PyTorch, JAX, DeepSpeed, FSDP, Megatron-LM, and Ray Train.
- Operate high-performance storage systems and data pipelines.
- Design networking architectures supporting RDMA, InfiniBand, NCCL, and high-bandwidth collective communication.
- Build observability for AI workloads.
- Implement checkpointing, restart, and fault-tolerance patterns.
- Drive cost optimization across compute, storage, and networking.
- Develop developer tooling and paved-road workflows.
- Partner with research and applied ML teams to plan capacity.
- Implement security controls, isolation, and access management.
- Drive automation across cluster provisioning, lifecycle management, and configuration enforcement.
- Maintain runbooks, capacity dashboards, and operational documentation.
- Stay current with AI infrastructure research, accelerator hardware, and emerging open-source AI tooling.
Required Qualifications
- Bachelor's or Master's degree in Computer Science or a related field.
- Six or more years of experience in infrastructure, platform, or HPC engineering.
- Hands-on experience operating GPU clusters or large-scale ML training infrastructure.
- Strong proficiency in Python and at least one systems language (Go or C++).
- Deep understanding of distributed training, accelerator architectures, and collective communication.
- Experience with Kubernetes, Slurm, Ray, or similar scheduling systems.
- Strong understanding of Linux internals, networking, and high-performance storage.
- Experience with at least one major cloud provider's ML infrastructure offerings.
- Strong software engineering practices (testing, CI/CD, code review).
- Excellent communication and cross-functional collaboration skills.
Preferred Qualifications
- Experience operating InfiniBand or RDMA networking at scale.
- Contributions to open-source ML infrastructure projects.
- Familiarity with custom orchestrators or research-grade training stacks.
- Exposure to frontier model training operations.
- Experience with FinOps for AI workloads.
How to Apply
For immediate consideration, please send your resume to [email protected] or contact us at +1 (908) 765-8199. Learn more about Bright Vision Technologies at www.bvteck.com.
Bright Vision Technologies is an equal opportunity employer and values diversity and inclusion.
Note: This is a direct W2 position with Bright Vision Technologies. No C2C, 1099, or third-party arrangements are accepted. H1B transfers are welcomed for qualified candidates. A technical coding assessment is mandatory.