Reliability Engineer (DevOps Engineer)
Bangalore, India
1d ago

As a Reliability Engineer, you will design, build and operate features and services that make Xi, Nutanix cloud services, secure, reliable, completely elastic, scalable, and self-healing.

To deliver reliable, high-performance services and features, Nutanix requires engineers with exceptional expertise and boundless creativity.


  • Work in concert with engineering teams to evolve services for better scalability, reliability and development velocity
  • Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews
  • Maintain services once they are live by measuring and monitoring availability, latency and overall system health, with a focus on improving reliability
  • Practice sustainable incident response and blameless postmortems
  • Define and develop software for tasks associated with developing, designing and debugging applications
  • Develop tools to improve the ability to rapidly deploy and effectively monitor custom applications in large-scale environments
  • Participate in a 24x7 on-call rotation

Skills

  • Highly skilled in one or more domains: Infrastructure as Code tools (Docker, Terraform, Puppet, Helm), monitoring tools (Prometheus, Datadog, NewRelic), container orchestration tools (Kubernetes, Docker), database technologies (Cassandra, Postgres), CI/CD tools (Jenkins, Spinnaker)
  • Proficiency in GCP, Azure and AWS
  • Experience automating tasks with scripting languages such as Python, Go and/or Shell
  • Deep understanding of service metrics and alarms, gained through the development of dashboards, service KPIs and alarming systems
  • Knowledge of Apache Kafka, Druid
  • Strong understanding of Linux operating systems
  • Familiarity with the setup and architecture of queuing, caching and service mesh systems
  • Experience in applying SRE principles and best practices
  • 3-8 years of relevant work experience
  • RHCE certified

Other Prerequisites

  • Experience working in an operational environment with mission-critical tier-one services and associated on-call support
  • Experience designing monitoring, logging and reliability processes for systems at scale

How do I know if this role is for me?

  • Do you like thinking about large-scale problems that have a lot of moving parts?
  • Do you like thinking about how to make large systems more reliable?
  • Are you okay with working on software that will likely never be overtly seen by an external user?
  • Do you enjoy the process of diagnosing and fixing a problem?
  • Do you like looking through metrics and logs as if it were a treasure hunt?
  • Do you enjoy thinking about system information (e.g. disk space, cpu, os, kernel, etc.) and system level functionality (e.g. ssh, proc, cron, swaps, etc.)?
  • Are you comfortable with the idea of being on call, where you are likely to be in high-stakes scenarios in which something needs to be fixed?
  • Are you able to stay calm under pressure?
  • Do you approach problems in a logical, process-oriented way?
  • Are you comfortable attempting a problem that has never been solved before?
  • Are you someone who thinks about how you can make things better?