We are developing the world’s first enterprise-level Platform-as-a-Service (PaaS) for robots, creating a rare opportunity for an experienced, product-focused engineering professional. The PaaS aims to provide innovative features covering every part of the product lifecycle needed to deliver and support consumer-facing connected machines and services.
Site Reliability Engineering combines the skills of software and systems engineering. Your key responsibility is to optimize existing systems, build infrastructure, and eliminate manual work through automation, making those systems more reliable and ensuring the highest possible uptime for all users and developers on the rapyuta platform.
Your responsibilities will include, but are not limited to, the following:
- Support services before they go live through activities such as system design consulting, capacity planning, and launch reviews
- Maintain services once they are live by measuring and monitoring availability, latency, and overall system health
- Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement
- Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity
- Practice sustainable incident response and postmortems
- Build and evolve the operations handbook
Requirements
- Bachelor’s degree in Computer Science or a similar technical field of study, or equivalent practical experience with an outstanding track record
- At least 2 years of experience in product development and/or supporting operations
- Mastery of one or more programming languages, including but not limited to Python, Golang, Ruby, and Bash
- Experience with algorithms, data structures, complexity analysis and software design
- Familiarity with configuration management, Docker, IaaS, PaaS, Continuous Integration, Continuous Delivery, DevOps, and ChatOps
- Solid understanding of network fundamentals and practical experience troubleshooting networked services
- Demonstrated proficiency with Linux systems, public cloud platforms, and associated tools and technologies
Nice-to-haves
These aren’t required, but be sure to mention them in your application if you have them.
- Experience with container management platforms like Kubernetes, OpenShift or Mesos
- Experience with cloud platforms such as Google Cloud Platform, Amazon Web Services, or Azure
- Open source contributions and projects
- Experience with SQL and NoSQL databases, as well as queuing systems
- Experience in designing, analyzing and troubleshooting distributed systems
- Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive
- Ability to debug and optimize code and automate routine tasks