Staff Site Reliability Engineer at Wikimedia

Remote 🌍 Work from Anywhere | Full time | Lead | USD 129,347 - USD 200,824 | Apply before Oct 13, 2025

Job Description

Overview

The Wikimedia Foundation is hiring a Staff Site Reliability Engineer focused on Machine Learning Infrastructure. In this distributed role you will design, build, and scale the foundational infrastructure that enables Wikimedia's machine learning engineers and researchers to train, deploy, and monitor ML models in production. You will report to the Director of Machine Learning and collaborate with ML engineers, product teams, researchers, SREs, and community contributors across UTC-5 to UTC+3.

Key Responsibilities

  • Design and implement robust ML infrastructure for training, deployment, monitoring, and scaling of machine learning models.
  • Improve reliability, availability, and scalability of ML systems to enable efficient ML workflows for internal teams.
  • Collaborate with ML engineers, product teams, researchers, and SREs to identify infrastructure requirements and resolve operational issues.
  • Proactively monitor and optimize system performance, capacity, and security to maintain high service quality.
  • Provide documentation, guidance, and best practices so teams can effectively use ML infrastructure.
  • Mentor team members and share knowledge on infrastructure management, operational excellence, and reliability engineering.

Skills and Experience

  • Based within UTC-5 to UTC+3 time zones to ensure collaboration overlap with the team.
  • 7+ years of experience in SRE, DevOps, or infrastructure engineering roles, with significant exposure to production-grade ML systems.
  • Proven expertise with on-premises and cloud infrastructure for ML workloads, including Kubernetes, Docker, GPU acceleration, and distributed training systems.
  • Strong proficiency with infrastructure automation and configuration management tools such as Terraform, Ansible, Helm, or Argo CD.
  • Experience implementing observability, monitoring, and logging for ML systems, using tools like Prometheus, Grafana, or ELK.
  • Familiarity with Python-based ML frameworks such as PyTorch, TensorFlow, or scikit-learn.

Additional Technical and Personal Qualities

  • Deep knowledge of scalable ML infrastructure design, reliability, and operations for distributed ML systems.
  • Experience building tooling and automation to simplify deployment, management, and monitoring of ML infrastructure.
  • Strong English communication skills, ability to work asynchronously across global teams, and a collaborative, proactive mindset.
  • Commitment to open-source software and community collaboration.

Preferred Areas of Excellence

  • Scalable ML infrastructure design for high-performance training and inference.
  • Reliability and operations experience for complex, distributed ML systems at scale.
  • Tooling and automation expertise that reduces operational overhead and improves developer experience.

Work Model and Location

This is a distributed, remote position. Candidates must be based within UTC-5 to UTC+3 time zones to ensure overlap with the team. The Wikimedia Foundation is remote-first, with staff and contractors across many countries. Non-US employees are typically hired through a local third-party Employer of Record.

Compensation

Anticipated US pay range: USD 129,347 to USD 200,824 annually. Final offers are adjusted by location, skills, and experience. For applicants outside the US, the pay range will be adjusted to the country of hire. The Foundation does not consider salary history when making offers.

Hiring Locations

Please note that Wikimedia is currently able to hire in a defined list of countries. The Foundation periodically reviews this list to ensure alignment with hiring requirements. Candidates must confirm eligibility to be hired in their country of residence.

About the Wikimedia Foundation

The Wikimedia Foundation is the nonprofit organization that operates Wikipedia and the other Wikimedia free knowledge projects. Our vision is a world in which every human can freely share in the sum of all knowledge. We build software, support volunteer communities, and enable open data for research and public use.

Diversity, Equity, and Inclusion

The Wikimedia Foundation is an equal opportunity employer. We encourage applicants from diverse backgrounds and are committed to accessibility and reasonable accommodations during the hiring process. If you require accommodations, contact recruiting@wikimedia.org or the phone number listed in the job posting.

How to Apply

Apply via the Wikimedia jobs page. The application requests standard candidate details, including location, authorization to work, and technical experience. For questions or accommodation requests, contact recruiting@wikimedia.org.

About Wikimedia

The Wikimedia Foundation is a nonprofit organization that operates Wikipedia and 13 other free knowledge projects. Founded in 2003, it supports these platforms with technical infrastructure, legal advocacy, grants, and community growth, all funded by donations.
