As a Reliability Engineer, you will design, build and operate features and services that make Xi, Nutanix's cloud services, secure, reliable, completely elastic, scalable, and self-healing.
To deliver reliable, high-performance services and features, Nutanix requires engineers with exceptional expertise and boundless creativity.
Work in concert with engineering teams to evolve services for better scalability, reliability and development velocity
Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews
Maintain services once they are live by measuring and monitoring availability, latency and overall system health, with a focus on improving reliability
Practice sustainable incident response and blameless postmortems
Define and develop software for tasks associated with developing, designing and debugging applications
Develop tools that improve the ability to rapidly deploy and effectively monitor custom applications in large-scale environments
Participate in a 24x7 on-call rotation
Highly skilled in one or more domains: Infrastructure as Code tools (Docker, Terraform, Puppet, Helm), monitoring tools (Prometheus, Datadog, New Relic), container orchestration tools (Kubernetes, Docker), database technologies (Cassandra, Postgres), CI/CD tools (Jenkins, Spinnaker)
Proficiency in GCP, Azure and AWS
Experience automating tasks with scripting languages such as Python, Go and/or Shell
Deep understanding of service metrics and alarms, gained through the development of dashboards, service KPIs and alarming systems
Knowledge of Apache Kafka and Druid
Strong understanding of Linux operating systems
Familiarity with the setup and architecture of queuing, caching and service mesh systems
Experience in applying SRE principles and best practices
3-8 years of relevant work experience
Experience working in an operational environment with mission-critical, tier-one services and associated on-call support
Experience designing monitoring, logging and reliability processes for systems at scale
How do I know if this role is for me?
Do you like thinking about large-scale problems that have a lot of moving parts?
Do you like thinking about how to make large systems more reliable?
Are you okay with working on software that will likely never be overtly seen by an external user?
Do you enjoy the process of diagnosing and fixing a problem?
Do you like looking through metrics and logs as if you were on a treasure hunt?
Do you enjoy thinking about system information (e.g. disk space, CPU, OS, kernel) and system-level functionality (e.g. ssh, proc, cron, swap)?
Are you comfortable with being on call, where you are likely to face high-stakes scenarios in which something needs to be fixed?
Are you able to stay calm under pressure?
Do you approach problems in a logical, process-oriented way?
Are you comfortable attempting a problem that has never been solved before?
Are you someone who thinks about how you can make things better?