Member of Technical Staff, Model Efficiency

Cohere · Remote (New York)

WFA Digital Insight

As demand for AI and ML specialists continues to grow, Cohere is at the forefront with its mission to scale intelligence. With AI adoption up 25% in 2025, professionals with model-efficiency expertise are in high demand. This role stands out for its remote-friendly environment and the chance to work with a talented team. Before applying, candidates should note the company's emphasis on innovation, collaboration, and continuous learning, along with its strong bias for action and willingness to experiment and measure impact.

Job Description

About the Role

Cohere is seeking a skilled Member of Technical Staff to join its Model Efficiency team. As a key member of this fast-growing group, you will focus on building reliable ML systems and pushing the boundaries of LLM inference efficiency. Your day-to-day work will involve collaborating closely with modeling and systems teams to experiment, measure, and ship improvements that meaningfully accelerate inference. You will have the opportunity to dive deep into model execution, identify bottlenecks, and develop innovative optimizations.

The Model Efficiency team is concentrated in the EST and PST time zones, which supports close collaboration while preserving flexibility. As a remote-friendly company, Cohere values diversity and strives to create an inclusive work environment. This role is ideal for professionals who thrive in a fast-paced setting, are passionate about their craft, and are committed to continuous learning.

As a Member of Technical Staff, you will work alongside a diverse range of perspectives to build frontier models for developers and enterprises. Your work will be instrumental to the widespread adoption of AI: you will be responsible for driving lower latency, higher throughput, and consistent quality across diverse workloads.

What You Will Do

  • Develop techniques to improve how models execute in production, driving lower latency and higher throughput
  • Collaborate closely with modeling and systems teams to experiment, measure, and ship improvements
  • Dive deep into model execution to identify bottlenecks and develop innovative optimizations
  • Work across the inference stack to improve core performance metrics
  • Develop expertise in advanced performance techniques, including GPU/CUDA optimizations and kernel-level improvements
  • Build expertise in model execution strategies for MoE and large-scale architectures
  • Collaborate with cross-functional teams to design and implement new features and improvements
  • Participate in code reviews and contribute to the improvement of the codebase
  • Stay up-to-date with industry trends and emerging technologies, applying this knowledge to drive innovation

What We Are Looking For

  • 5+ years of experience writing high-performance, production-quality code
  • Strong programming skills in C++ or Python (Rust/Go also welcome)
  • Experience working with large language models and familiarity with the LLM inference ecosystem
  • Ability to diagnose and resolve performance bottlenecks across the model execution stack
  • A strong bias for action — you ship fast, measure impact, and iterate
  • Experience with GPU programming, CUDA, or low-level systems optimization
  • Familiarity with transformer-based language modeling (e.g., MoE, speculative decoding, KV-cache optimization)
  • Experience scaling performance-critical distributed systems (e.g., computation, search, storage)
  • Strong communication and collaboration skills, with the ability to work effectively in a remote environment

Nice to Have

  • Experience with cloud-based infrastructure and containerization (e.g., Docker, Kubernetes)
  • Familiarity with agile development methodologies and version control systems (e.g., Git)
  • Experience with testing and validation frameworks (e.g., Pytest, Unittest)
  • Knowledge of data structures and algorithms, and the ability to apply it to optimize system performance

Benefits and Perks

  • Competitive salary and benefits package
  • Opportunity to work with a talented team of researchers, engineers, and designers
  • Remote-friendly work environment, with the ability to work from anywhere
  • Weekly lunch stipend, in-office lunches, and snacks
  • Full health and dental benefits, including a separate budget to take care of your mental health
  • 100% Parental Leave top-up for up to 6 months
  • Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement

How to Stand Out

  • Tip: Showcase your expertise in C++ or Python, with concrete examples of high-performance code and model-efficiency work.
  • Tip: Be ready to discuss hands-on experience with large language models and the LLM inference ecosystem.
  • Tip: Give specific examples from past projects of diagnosing and resolving performance bottlenecks.
  • Tip: Demonstrate strong communication and collaboration skills suited to a remote environment.
  • Tip: Be ready to discuss your experience with GPU programming, CUDA, or low-level systems optimization, and how you have applied it to drive measurable improvements.
  • Tip: Convey genuine enthusiasm for AI and ML, and for working with a talented team to drive the widespread adoption of AI.
  • Tip: Highlight your experience with agile development methodologies and version control systems (e.g., Git), and your ability to work effectively in a fast-paced environment.

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere. Browse more remote jobs across all categories.