Staff Machine Learning Engineer, GenAI Platform

RedditReddit·Remote(Remote - United States)
AI & Machine Learning

WFA Digital Insight

Demand for machine learning engineers with expertise in Generative AI has surged, with over 50% of companies investing in AI technologies. Reddit's commitment to innovation and community-driven approach makes this role stand out. As the job market for remote tech professionals continues to grow, candidates with experience in large-scale distributed systems and a strong technical leadership background are in high demand. With the rise of foundation models, the ability to architect and scale GenAI infrastructure is crucial. Before applying, candidates should be prepared to showcase their technical expertise and experience with complex distributed systems.

Job Description

## About the Role As a Staff Machine Learning Engineer on the Machine Learning Platform team at Reddit, you will be at the forefront of building the company's Generative AI capabilities. This role entails driving the technical strategy and architecture of Reddit's next-generation LLM platform, working closely with machine learning engineers and researchers to enable seamless training, evaluation, and iteration on large language models. The Machine Learning Platform team is a high-impact organization that owns the infrastructure powering recommendations, content discovery, and user quantification. The role is part of a larger effort to expand Reddit's platform to meet the unique demands of foundation models, building the foundational infrastructure to support massive-scale, long-running LLM workloads. This is a unique opportunity to make a significant impact on the company's growth and user experience. Reddit is a community of communities, home to the most open and authentic conversations on the internet, with over 100,000 active communities and approximately 121 million daily active unique visitors. ## What You Will Do - Drive GenAI Infrastructure Strategy: propose, design, and lead the architecture of Reddit's next-generation LLM platform - Design Resilient, Large-Scale Distributed Systems: architect highly fault-tolerant training infrastructure capable of supporting multi-week, distributed workloads across massive GPU clusters - Build Self-Serve LLM Workflows: design and implement robust, production-grade pipelines for LLM fine-tuning - Develop Comprehensive Evaluation & Benchmarking Infrastructure: build scalable systems for automated regression detection, structured metrics tracking, and complex inference-heavy evaluation patterns - Architect Advanced Data Ingestion Pipelines: extend Reddit's distributed data platforms to natively and efficiently handle massive, multimodal datasets - Provide Technical Leadership & Mentorship: analyze complex bottlenecks in distributed systems and mentor senior engineers - Collaborate with cross-functional teams: work closely with machine learning engineers, researchers, and product managers to develop and deploy GenAI models - Stay up-to-date with industry trends: participate in industry conferences, meetups, and workshops to stay current with the latest advancements in GenAI and machine learning ## What We Are Looking For - 8+ years of experience in software engineering, with a focus on machine learning, distributed systems, or related fields - Strong technical leadership background, with experience in architecting and scaling complex systems - Expertise in designing and implementing large-scale distributed systems, with a focus on fault tolerance and scalability - Experience with machine learning frameworks and technologies, such as TensorFlow, PyTorch, or JAX - Strong understanding of computer science fundamentals, including data structures, algorithms, and software design patterns - Excellent communication and collaboration skills, with experience working with cross-functional teams - Strong problem-solving skills, with the ability to analyze complex systems and identify areas for improvement ## Nice to Have - Experience with Generative AI models, such as foundation models or large language models - Familiarity with cloud-based infrastructure, such as AWS or GCP - Experience with containerization technologies, such as Docker or Kubernetes - Strong understanding of DevOps practices, including continuous integration and continuous deployment ## Benefits and Perks - Competitive salary and equity package - Comprehensive health, dental, and vision insurance - 401(k) matching and retirement plan - Flexible PTO policy and remote work stipend - Access to cutting-edge technologies and tools - Opportunities for professional growth and development - Collaborative and dynamic work environment

How to Stand Out

- Tip: Showcase your experience with large-scale distributed systems and technical leadership background in your resume and cover letter.

  • Tip: Be prepared to discuss your approach to architecting and scaling GenAI infrastructure during the interview process.
  • Tip: Highlight your ability to collaborate with cross-functional teams and communicate complex technical concepts to non-technical stakeholders.
  • Tip: Research Reddit's company culture and values, and be prepared to discuss how your skills and experience align with the company's mission.
  • Tip: Prepare examples of your experience with machine learning frameworks and technologies, and be ready to discuss your approach to solving complex technical problems.
  • Tip: Don't hesitate to ask about the company's approach to remote work and flexible work arrangements during the interview process.
  • Tip: Be prepared to negotiate your salary and benefits package based on your research and industry standards.

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere. Browse more remote jobs across all categories.