As a Senior MLOps Engineer, you will be at the forefront of deploying, optimizing, and monitoring LLMs in production environments. Your role will involve building and maintaining scalable pipelines, ensuring low-latency inference, and implementing best practices in monitoring and observability. You will also work with state-of-the-art tools like Hugging Face and MLflow to fine-tune models and integrate them into robust AI solutions.
Responsibilities
- Model Deployment & Management
- Develop and maintain scalable pipelines for deploying LLMs, focusing on efficient, low-latency inference.
- Use tools like Hugging Face and MLflow for seamless model integration and version control.
- Automate deployment processes, including model validation and continuous integration.
- Monitoring & Observability
- Implement comprehensive monitoring frameworks to track performance and reliability of models in production.
- Use advanced observability tools to proactively detect and address performance issues.
- Deploy alerting systems to ensure rapid response to anomalies in model behavior.
- Infrastructure Optimization
- Architect and optimize cloud and on-premises infrastructure to support large-scale LLM operations.
- Collaborate with cloud providers like AWS, Azure, and GCP to optimize costs and performance.
- Work with backend engineers to ensure smooth integration of AI models into conversational platforms.
- Collaboration & Documentation
- Partner with AI engineers and data scientists to align on project objectives and deployment strategies.
- Document MLOps processes, best practices, and tools to maintain operational excellence.
- Provide training and support to team members on MLOps methodologies and tools.
Requirements
- Experience
- 5+ years of experience in MLOps, DevOps, or related fields, with a focus on deploying and managing LLMs or other large-scale machine learning models.
- Proven experience with tools like Hugging Face and MLflow, and with containerization technologies (Docker, Kubernetes).
- Strong experience with cloud platforms (AWS, Azure, GCP) and infrastructure as code (Terraform).
- Hands-on experience in reducing inference latency and optimizing AI infrastructure.
- Technical Skills
- Proficiency in Python, with experience in ML libraries such as TensorFlow, PyTorch, and related frameworks.
- Expertise in CI/CD pipelines, version control (Git), and orchestration tools.
- Familiarity with Generative AI, prompt engineering, and deploying models at scale.
- Soft Skills
- Excellent problem-solving skills with the ability to tackle complex challenges independently.
- Strong communication skills, with the ability to translate technical concepts for non-technical stakeholders.
- A proactive mindset with a focus on continuous learning and staying current with industry trends.