Job Title

Machine Learning Infrastructure Specialist

  • Type: Full Time
  • Location: Toronto
  • Employer Type: Business
  • Salary: $100,600 - $125,800/year
  • Published on: 2026-03-23
  • Application Deadline: 2026-06-22
  • Job ID: 1056510326
  • Job Category: Other

Website: Vector Institute

  • Retrieved on: 2026 March 23 05:50:42 PM EDT

Job Description

Machine Learning Infrastructure Specialist

POSITION SUMMARY

As an ML Infrastructure Specialist focused on systems and scalable AI infrastructure, you will build and improve efficient, reusable systems to train, deploy, monitor, and serve large-scale machine learning models, including large language models (LLMs). Working at the intersection of applied research and production systems, you will collaborate with Vector’s AI Engineering team members, researchers, and industry partners to bring advanced AI capabilities into real-world use. You will contribute to initiatives that strengthen the software and systems supporting state-of-the-art AI development and deployment, owning well-scoped projects end to end.

KEY RESPONSIBILITIES

Design and implement distributed systems for scalable ML training, inference, and serving on multi-GPU/multi-node environments, with a focus on large foundation models;
Configure and maintain LLM inference systems using modern serving frameworks (e.g., vLLM, TGI, SGLang, TensorRT-LLM), including performance tuning;
Collaborate with researchers and Applied ML Scientists to turn model innovations into production-ready services (i.e., containerized, tested, and observable);
Develop reusable, open-source-friendly modules and tooling that scale ML experimentation and deployment across diverse environments (e.g., Slurm, Kubernetes, and major cloud providers);
Provide code reviews, documentation, and mentorship to junior team members; collaborate with partner teams on best practices for reliability, reproducibility, CI/CD, and hardware-aware optimization;
Contribute to technical design discussions and roadmapping related to ML infrastructure, serving pipelines, and research-to-production workflows;
Present technical work via demos and deep dives; contribute to open-source where appropriate; and,
Other responsibilities as assigned or amended from time to time.

KEY SUCCESS MEASURES

Delivery of high-performance, reliable, and maintainable ML infrastructure used across teams and projects;
Adoption and reuse of core infrastructure components across internal applications;
Clear design and technical documentation materials that improve developer experience; and,
Effective mentorship and collaboration that improves team capability and velocity.

PROFILE OF THE IDEAL CANDIDATE

Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or a related field; advanced degrees or relevant equivalent experience preferred;
3+ years of experience developing scalable systems or infrastructure for machine learning workflows, ideally involving large-scale models and GPU workloads;
Deep expertise in Python and systems programming; fluency with performance profiling, distributed computing, and containerized environments (e.g., Docker, Kubernetes);
Experience with one or more modern LLM inference/serving frameworks (e.g., vLLM, SGLang, TensorRT-LLM);
Familiarity with GPU-accelerated inference, memory optimization strategies, and batching/scheduling techniques;
Practical knowledge of cloud platforms (e.g., GCP, AWS, Azure) and orchestration of multi-node training or serving systems; and,
Strong understanding of software engineering best practices including CI/CD, testing, observability, and DevOps automation.

TOTAL REWARDS: The expected salary for this position will be $100,600 – $125,800 per year, plus benefits if applicable. The final salary offer will reflect the successful candidate’s experience, skills, and qualifications, in alignment with the Vector Institute’s Compensation Policy and may differ from above.

The Vector Institute’s Total Rewards approach extends beyond traditional compensation and benefits. Full-time employees are eligible for a comprehensive suite of supports that recognize and value employees, including vacation time, floater days, GRRSP, a Health Spending Account, a Summer Hours program, and flexible work arrangements.

POSITION STATUS: This posting is for an existing vacancy.

USE OF ARTIFICIAL INTELLIGENCE: Vector may use internal tools and external third-party AI-based tools to assist in screening applications for this posting. Any data collected will be used solely for recruitment purposes and handled in accordance with Vector’s External Privacy Policy and Use of AI-Based Tools in Recruitment and Selection Policy.

INCLUSION AND EQUAL OPPORTUNITY EMPLOYMENT: Vector believes AI powers possibility by advancing cutting-edge research and translating it into real-world impact through collaboration with research, industry, and government. Vector is committed to fostering a diverse and inclusive culture that reflects its values.

The Vector Institute welcomes applications from all qualified candidates, including those who are Indigenous, 2SLGBTQIA+, racialized persons/visible minorities, women, and people with disabilities.

If you require an accommodation at any stage of the recruitment or selection process, please contact [email protected]. The Vector Institute team will be happy to work with you to ensure your experience is as inclusive and accessible as possible.

JOIN OUR COMMUNITY: Check out the Vector Institute’s Careers Page to explore open opportunities at Vector, and follow Vector on X, LinkedIn, and Bluesky to stay connected with the latest developments in Ontario’s AI ecosystem and the Vector Institute.

Required languages: English

Education level: Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or a related field; advanced degrees or relevant equivalent experience preferred.

Required skills: 3+ years of experience developing scalable systems or infrastructure for machine learning workflows, ideally involving large-scale models and GPU workloads; deep expertise in Python and systems programming; fluency with performance profiling, distributed computing, and containerized environments (e.g., Docker, Kubernetes); experience with one or more modern LLM inference/serving frameworks (e.g., vLLM, SGLang, TensorRT-LLM); familiarity with GPU-accelerated inference, memory optimization strategies, and batching/scheduling techniques; practical knowledge of cloud platforms (e.g., GCP, AWS, Azure) and orchestration of multi-node training or serving systems; and a strong understanding of software engineering best practices, including CI/CD, testing, observability, and DevOps automation.

Closest intersection: College Street and University Avenue

To apply for this job please visit workforcenow.adp.com.
