Senior Member of Technical Staff (Site Reliability Engineer - Python Scripting, Kubernetes, AWS, Incident management)
VMware
Bangalore, India
5d ago

Job Description

The Cloud Services Business Unit delivers the full VMware portfolio of enterprise capabilities as an integrated set of cloud services, enabling consistent infrastructure and operations across every major public cloud or service provider environment.

Our team enables Cloud Providers across the globe to consume VMware products. By offering a wide range of VMware-based cloud services on a geographical basis, Providers can quickly and seamlessly extend their customers’ data centers into the cloud using the same VMware products and tools they already use on premises.

Role: As a Senior Member of Technical Staff, Site Reliability, you will collaborate closely with product development teams on the management and deployment of multiple SaaS offerings.

You have a background running large-scale applications in public clouds (AWS, GCP, Azure) deployed on Kubernetes. You are excited about helping teams succeed in building reliable, self-healing services.

Responsibilities:

  • Participate in architectural reviews with reliability and resiliency in mind.
  • Recommend preventive and corrective actions for incidents.
  • Collaborate with teams on improving deployment automation and the resiliency and security of our cloud products. You’re intimately familiar with CI / CD tools and methodologies and know how to get the most out of them.
  • Work comfortably with development teams to address reliability and scale concerns across the stack. You’re just as much dev as ops and flourish working in an Agile model.
  • Help teams improve the observability of their services through application and infrastructure instrumentation. Monitoring, alerting, metrics, and deep introspection of applications are a must and an area you’re passionate about.
  • Troubleshoot complex operational issues within a microservices-based architecture.
  • Develop tooling to enhance development and troubleshooting efficiency.
  • Participate in the on-call rotation to keep availability within the SLA.

Requirements:

  • 5-8 years of SRE / DevOps experience working on highly scalable distributed systems
  • Experience with metric and log aggregation tools (Prometheus, ELK, etc.)
  • Experience with monitoring tools such as Grafana / Wavefront
  • Experience working with Terraform / Ansible / Helm
  • Knowledge of relational and non-relational databases, networking, Linux internals, filesystems, web architecture, CI / CD principles
  • Experience with programming languages such as Python / Go / Java / Node.js
  • A solid understanding of cloud-based architectures and concepts, with hands-on experience using Public Clouds and Kubernetes
  • Experience using Git
  • Strong interpersonal communication skills (including listening, speaking, and writing) and ability to work well in a diverse, team-focused environment with other SREs, Engineers, Product Managers, etc.
  • A "team-player attitude": rather than celebrating heroic efforts to resolve an incident, you prefer engineering practices that avoid incidents in the first place