The Opportunity
The Model Runtime team owns the systems that train, align, and serve Netflix's most critical ML models. We are a small, highly autonomous team with outsized impact: the infrastructure we build directly shapes what Netflix can do with AI.
We're looking for a Software Engineer who thrives at the intersection of systems engineering and ML. You will:
Build alignment and post-training infrastructure — Design infrastructure for reinforcement learning (GRPO, DPO, PPO), reward modeling, and preference optimization so Netflix can train recommendation models directly against what members actually value.
Enable next-generation GenAI workloads — Create infrastructure for multimodal and diffusion models, including distributed training, disaggregated serving, real-time, near-real-time and batch inference, and asynchronous GPU pipelines.
Scale distributed training — Engineer fault-tolerant training systems using FSDP, tensor/pipeline/context parallelism, and mixed-precision strategies across clusters of hundreds of GPUs.
Optimize across the full stack — Profile and tune from PyTorch operators down to GPU kernels, driving utilization improvements and building cost models that inform infrastructure strategy.
Evaluate emerging hardware and frameworks — Be the team's eyes on specialized accelerators, next-gen NVIDIA silicon, and the open-source ecosystem to keep Netflix at the efficiency frontier.
If you want to work on problems where the gap between "possible" and "deployed at scale" is the hard part, this is the role.
Minimum Job Qualifications
Experience in ML systems engineering: building infrastructure for training, fine-tuning, or inference of pre-LLM and post-LLM era models at scale
Strong systems programming skills with the ability to work across multiple layers of the stack, from high-level ML frameworks down to GPU kernels and memory management
Hands-on experience with PyTorch internals, large-scale distributed training and system-model codesign
Comfortable with ambiguity and working across multiple business and technical domains to execute on both 0-to-1 and 1-to-100 projects
Ability to adopt and promote operational best practices, including observability, logging, reporting, and on-call processes, to ensure engineering excellence
Experience with cloud computing providers, preferably AWS
Excellent written and verbal communication skills; effective across distributed time zones and remote environments
Preferred Qualifications
Deep experience with distributed training at scale (FSDP, parallelism strategies, checkpointing) or LLM post-training (SFT, RLHF, DPO/GRPO)
Experience with inference optimization (vLLM, TensorRT, quantization, continuous batching, KV-cache management)
GPU performance profiling and tuning (CUDA, NCCL, Nsight, PyTorch profiler)
Experience with multimodal or diffusion model architectures and generation pipelines
Track record building reusable ML libraries or contributing to open-source ML projects
Generally, our compensation structure consists solely of an annual salary; we do not have bonuses. You choose each year how much of your compensation you want in salary versus stock options. To determine your personal top of market compensation, we rely on market indicators and consider your specific job family, background, skills, and experience to determine your compensation in the market range. The range for this role is $466,000.00 - $750,000.00. This compensation range will vary based on location.
Netflix provides comprehensive benefits including Health Plans, Mental Health support, a 401(k) Retirement Plan with employer match, Stock Option Program, Disability Programs, Health Savings and Flexible Spending Accounts, Family-forming benefits, and Life and Serious Injury Benefits. We also offer paid leave of absence programs. Full-time hourly employees accrue 35 days annually for paid time off to be used for vacation, holidays, and sick paid time off. Full-time salaried employees are immediately entitled to flexible time off. See more details about our Benefits here.
Netflix is a unique culture and environment. Learn more here.
Job is open for no less than 7 days and will be removed when the position is filled.