Site Reliability Engineer
Designation : Associate
Department : Enterprise Security Platforms (TECHNOLOGY & OPERATIONS RISK)
Job Location : Bengaluru
Morgan Stanley is a leading global financial services firm providing a wide range of investment banking, securities, investment management and wealth management services.
We advise, originate, trade, manage and distribute capital for governments, institutions and individuals. As a market leader, the talent and passion of our people are critical to our success.
Together, we share a common set of values rooted in integrity, excellence and a strong team ethic. We provide you a superior foundation for building a professional career where you can learn, achieve and grow.
Technology / Role / Department at Morgan Stanley
The mission of TECHNOLOGY & OPERATIONS RISK is to deliver first-line defenses to manage risks to Firm technology and information, as well as cyber threats, through risk identification, control management and assurance.
This allows the business to operate and grow in a secure and legally compliant manner.
Our vision is to deliver Programs that protect and enable the business, ensure secure delivery of services to our clients, adjust to address the risks presented by an evolving threat landscape, meet regulatory expectations, and offer highly attractive career opportunities.
The Enterprise Security Products (ESP) team is responsible, among other things, for developing and engineering the Firm's core security controls.
The technology and solution stack spans all Firm employees as well as external clients of the Institutional Security and Wealth Management Businesses.
It consists of home-grown software, 3rd party software, open source products, appliances, and auxiliary services and solutions.
We are looking for a versatile, enthusiastic technologist with excellent communication skills and a passion for reliable solutions and solving operational challenges.
The successful candidate will help to evolve the way ESP squads design, build, and maintain services, improving reliability by adopting SRE principles, practices, and methodologies.
This is an ideal role for someone looking to broaden their experience in application management and delivery while working in a complex, mission-critical, security-focused enterprise environment.
The candidate will be part of a global team, with global responsibilities split into two categories, as described below :
Responsibilities : Consulting
Working with ESP squads in a consulting capacity to help them identify and implement incremental improvements to the reliability and operability of their products.
Hosting workshops and gamedays with squads, and disseminating information, to develop a current-state view of their SRE maturity and iteratively identify improvement opportunities with an automation and cloud-native focus.
Collaborating with squads to develop a future-state design and prioritize tasks that make a measurable difference to their SRE maturity.
Growing the SRE skills of the individuals within the squad so that they can maintain any improvements beyond the scope of the engagement.
Working through problems methodically and applying rigor to decision making and documentation.
Evangelizing and teaching best practices and promoted standards to spread knowledge and standardization within the larger organization.
Responsibilities : Implementation
Implementing your SRE recommendations in collaboration with the squad, such as SLIs / SLOs, observability items like metrics or tracing, identifying and measuring toil, automation to enable zero-touch production and increased efficiency, and code changes for simplicity, scalability, and resilience improvements.
Developing Ansible Playbooks and other solutions to automate manual operational processes and integrate them into CICD pipelines.
Implementing automated testing for resilience and reliability, e.g. load testing, chaos engineering, synthetic monitoring, and real-user monitoring.
Developing configuration-as-code models that align to CICD Pipelines and SRE best practices.
Implementing software development lifecycle best practices through the adoption of standard tools / services for reliability.
Implementing self-service tools for standards and services that are shared across squads to reduce toil for the larger organization.
Building dashboards, alerts, and optimized queries for observability of system reliability, with a focus on SLOs, error budgets, and toil management.
Investigating, researching, and running proofs of concept for newer technologies and emerging patterns in SRE to continually evolve SRE capabilities within squads.
Building documentation for both tribal knowledge and SRE processes within the squad through a documentation-as-code framework.
Skills Required :
2+ years of experience with building, shipping, securing, monitoring and managing containerized applications from dev to production hosted in public cloud environments or managed Kubernetes platforms
2+ years of development experience using a scripting language at an OO level (e.g. Python) or an OO language (e.g. Java)
2+ years of solid SRE experience implementing metrics, alerting, SLOs, error budgets, automations for configuration-as-code or infra-as-code, CICD pipelines, scalability reviews and operations, incident management, postmortems, and GitOps
2+ years of experience using Prometheus and / or building dashboards using Grafana with Prometheus backend
2+ years of experience with the following CICD technologies : Git, Jenkins
Demonstrable experience with system configuration management tools (Chef, Puppet, Ansible, etc.) or with infrastructure as code tools : Terraform, Helm, Nomad, CloudFormation, Azure Resource Manager, etc.
Strong DevOps experience with gradually increasing responsibilities across the breadth and depth of application delivery : automation, orchestration, configuration management, CI / CD, monitoring, and operations
Strong interpersonal, written and verbal communication skills, ability to communicate at all levels and influence others
Strong design and architecture skills, e.g. configuration schema design, process re-engineering and automation
Automation-first approach, with a proven track record of automating large-scale, complex distributed software delivery systems
Experience working in Agile teams using Scrum, Kanban, or other Agile frameworks
Demonstrated time management skills, with the ability to prioritize tasks and meet deadlines across multiple projects
Confident and articulate self-starter
A team player with a high level of commitment and enthusiasm