The Cloud Services Business Unit delivers the full VMware portfolio of enterprise capabilities as an integrated set of cloud services, enabling consistent infrastructure and operations across every major public cloud or service provider environment.
Our team enables Cloud Providers across the globe to consume VMware products. By offering a wide range of VMware-based cloud services on a geographical basis, Providers can quickly and seamlessly extend their customers’ data centers into the cloud using the same VMware products and tools they already use on premises.
Role: As a Senior Member of Technical Staff, Site Reliability, you will collaborate closely with product development teams on the management and deployment of multiple SaaS offerings.
You have a background running large-scale applications in the public cloud (AWS, GCP, Azure) deployed on Kubernetes. You are excited about helping teams succeed in building reliable, self-healing services.
Responsibilities:
Participate in architectural reviews with reliability and resiliency in mind.
Recommend preventive and corrective actions for incidents.
Collaborate with teams on improving deployment automation and the resiliency and security of our cloud products. You’re intimately familiar with CI/CD tools and methodologies and know how to get the most out of them.
Work with development teams to address reliability and scale concerns across the stack. You’re just as much dev as ops and flourish working in an Agile model.
Help teams improve the observability of their services through application and infrastructure instrumentation. Monitoring, alerting, metrics, and deep introspection of applications are a must and an area you’re passionate about.
Troubleshoot complex operational issues within a microservices-based architecture.
Develop tooling to enhance development and troubleshooting efficiency.
Participate in the on-call rotation to maintain availability as per the SLA.
Requirements:
5-8 years of SRE / DevOps experience working on highly scalable distributed systems
Experience with metric and log aggregation tools (Prometheus, ELK, etc.)
Experience with monitoring tools such as Grafana / Wavefront
Experience working with Terraform / Ansible / Helm
Knowledge of relational and non-relational databases, networking, Linux internals, filesystems, web architecture, and CI/CD principles
Experience with programming languages such as Python / Go / Java / Node.js
A solid understanding of cloud-based architectures and concepts, with hands-on experience using public clouds and Kubernetes
Experience using Git
Strong interpersonal communication skills (including listening, speaking, and writing) and ability to work well in a diverse, team-focused environment with other SREs, Engineers, Product Managers, etc.
A "team-player attitude": rather than celebrating heroic efforts to resolve an incident, you prefer engineering practices that prevent incidents in the first place