Micepad
The All-In-One Event Management System
Posted 7 months ago
As a Site Reliability Engineer (SRE), you are responsible for keeping all user-facing services and other Micepad production systems running smoothly. SREs specialize in systems while implementing best practices for availability, reliability, and scalability.
Job Description
- Design and implement high-availability infrastructure, and help improve reliability, stability, and scalability
- Everything as code: if anything was done manually, find out why and fix it to ensure it's never done manually again
- Build software to manage platform infrastructure and applications
- Take part in the design and implementation of incident management and on-call rotation processes
- Partner with development teams to gather and analyze metrics that assist in performance tuning
- Participate in CI/CD process design
Must Have
- Preferably a degree in computer science, software engineering, information technology or related fields
- 3 years of relevant experience with DevOps
- Experience with provisioning monitoring tools and setting up metrics in them (Prometheus, Grafana)
- Experience with provisioning log management services (ELK, Cloud Logging)
- Experience with designing and implementing CI/CD processes
- Experience with one or more cloud environments (GCP, AWS preferred)
- Basic understanding of Kubernetes and Docker
- Experience with Infrastructure as Code (Terraform, Ansible)
- Experience in one or more of: Python, Node.js, or shell scripting
Nice to Have
- Experience with high availability design
- Experience with incident management and incident recovery
- Experience with handling high-volume traffic
Hiring Process
- You will be invited to schedule a 30-minute screening call
- You will receive a technical questionnaire to complete
- You will discuss your technical skills for 60 minutes with at least one member of the Development Team and the Tech Lead