Member of Technical Staff - CI Engineer

RadixArk • Full-time • Palo Alto, CA, US • $120k - $160k / year

About the Role

RadixArk is hiring a Member of Technical Staff - CI Engineer to own the infrastructure that keeps SGLang moving. Our CI system runs 300+ GPU tests across NVIDIA, AMD, Intel, and Ascend hardware pools, gating every commit to one of the fastest-growing open-source LLM inference engines. When CI is green and fast, 100+ contributors ship with confidence. When it isn't, the entire project stalls. That bottleneck is your problem to solve.

You won't just maintain pipelines - you'll architect them. You'll replace brittle static thresholds with regression-based detection, harden runners against supply-chain attacks from fork PRs, and cut cycle times so contributors get feedback in minutes, not hours. You'll work directly with core maintainers, hardware partners, and the open-source community to keep the system that gates every pull request trustworthy, fast, and secure.

This is not a role for someone who wants to write CI YAML and walk away. It's for an engineer who treats CI infrastructure the way we treat serving infrastructure - as a system worth designing well.

What You'll Do

- Own CI reliability end-to-end - triage failures, distinguish real regressions from flaky tests and infra issues, keep main green
- Build regression-based CI - replace hardcoded static thresholds with automated baseline comparison (metrics pipeline, durable storage, detection logic)
- Harden runner infrastructure - ephemeral runners, container isolation, security hardening for fork PR execution
- Cut CI time - right-size eval suites, deduplicate server startups, separate PR smoke tests from nightly full runs
- Improve developer experience - faster feedback, clearer failure messages, workflow orchestration

Requirements

- 3+ years operating CI/CD at scale (GitHub Actions, Buildkite, Jenkins, GitLab CI, or similar)
- Deep knowledge of Linux, Docker, and GPU computing
- Self-hosted runner management experience
- Strong Bash and Python
- Security mindset - CI supply chain risks, fork PR attack vectors, runner hardening
- Experience with NVIDIA GPU drivers, CUDA, NCCL, and InfiniBand/RDMA in CI contexts
- Familiarity with ML inference workloads (model loading, KV cache, quantization)

Nice to Have

- Large open-source project CI experience (100+ contributors)
- AMD ROCm or Intel XPU CI pipelines

What Success Looks Like

- Day 20 - Full CI landscape understood, daily triage taken over, top recurring flaky tests fixed, PR CI time reduced 30%+
- Day 40 - Regression-based checks live on nightly CI, ephemeral runner prototype deployed, runner isolation in place
- Day 60 - Zero flaky tests. Main CI 100% green when no real regression exists

Compensation

We offer a competitive base salary with meaningful equity, comprehensive health benefits, and flexible work arrangements. Compensation is determined by location, level, and experience.

How to Apply

Reach out via Slack or email. CI fix PRs to major open-source projects are worth more than a resume.