AI Researcher – Multilingual Data

Featherless AI · Remote · Work From Anywhere
AI & Machine Learning

WFA Digital Insight

The demand for AI researchers with expertise in multilingual data has skyrocketed, with a 25% increase in job postings over the past year. As a remote AI researcher at Featherless AI, you'll have the opportunity to work on cutting-edge language models, collaborating with a team of innovators. With the rise of globalization, companies are looking for professionals who can develop and implement AI solutions that cater to diverse languages and cultures. Featherless AI stands out for its commitment to publishing high-quality research and translating it into production systems, making it an attractive choice for those who want to make a real impact.

Job Description

About the Role

As a remote AI researcher at Featherless AI, you will play a crucial role in developing and scaling next-generation language models across diverse languages and domains. Your primary focus will be multilingual data: designing and executing research on datasets, including data collection, filtering, deduplication, and quality measurement. You will also develop strategies for low-resource and long-tail languages, and research and improve cross-lingual transfer, alignment, and robustness in large language models.

The role requires a strong background in NLP/ML research, with a focus on multilingual or cross-lingual modeling. You will have the opportunity to work closely with engineers and researchers on training pipelines and model architecture decisions, as well as publish research at top venues and contribute to open-source projects.

Featherless AI values innovation and collaboration, providing a dynamic and supportive work environment that encourages creativity and growth. As a remote team member, you will have the flexibility to work from anywhere, with access to modern infrastructure and large datasets.

What You Will Do

  • Design and execute research on multilingual datasets, including data collection, filtering, deduplication, and quality measurement
  • Develop strategies for low-resource and long-tail languages, including sampling, augmentation, and curriculum design
  • Research and improve cross-lingual transfer, alignment, and robustness in large language models
  • Build and maintain evaluation benchmarks for multilingual performance
  • Collaborate with engineers and researchers on training pipelines and model architecture decisions
  • Publish research at top venues, such as ACL, EMNLP, NeurIPS, ICML, and ICLR
  • Contribute to open-source projects and translate research insights into practical improvements in production models
  • Develop and implement data quality metrics, filtering, and dataset bias detection
  • Work with large-scale text datasets across multiple languages, and design tokenization and vocabularies for multilingual models
  • Utilize transfer learning and multilingual representation learning to enhance model performance

What We Are Looking For

  • Strong background in NLP/ML research, with a focus on multilingual or cross-lingual modeling
  • Publication record at respected conferences or journals, such as ACL, EMNLP, NeurIPS, ICML, and ICLR
  • Experience working with large-scale text datasets across multiple languages
  • Solid understanding of tokenization and vocabulary design for multilingual models
  • Familiarity with data quality metrics, filtering, and dataset bias detection
  • Comfortable prototyping in Python with modern ML frameworks, such as PyTorch and JAX
  • Ability to operate independently and ship research at startup pace
  • Strong communication and collaboration skills, with the ability to work effectively with engineers and researchers

Nice to Have

  • Experience with low-resource languages or non-Latin scripts
  • Open-source contributions in NLP or data tooling
  • Experience training or evaluating large language models
  • Familiarity with multilingual benchmarks, such as XTREME, FLORES, and TyDi QA

Benefits and Perks

  • Competitive compensation and meaningful equity at an early stage
  • Access to meaningful scale, including large datasets, modern infrastructure, and fast iteration
  • Flexible remote work arrangement, with the ability to work from anywhere
  • Opportunity to work on cutting-edge language models and collaborate with a team of innovators
  • Professional development opportunities, including conference attendance and training
  • Comprehensive health insurance and retirement plan
  • Generous PTO and holiday policy, with a focus on work-life balance

How to Stand Out

  • Develop a strong portfolio showcasing your research experience and publications in NLP/ML, particularly in multilingual or cross-lingual modeling.
  • Familiarize yourself with popular ML frameworks, such as PyTorch and JAX, and practice prototyping in Python.
  • Prepare to discuss your experience working with large-scale text datasets and your understanding of data quality metrics and dataset bias detection.
  • Highlight your ability to operate independently and ship research in a fast-paced environment, and be prepared to provide examples of your work.
  • Research Featherless AI's current projects and be prepared to discuss how your skills and experience align with their goals and values.
  • Be prepared to discuss your experience with open-source contributions and your willingness to collaborate with engineers and researchers.
  • Practice your communication skills, as you will be working remotely and collaborating with a team of innovators.

This is a remote position listed on WFA Digital, the platform for professionals who work from anywhere.