Oracle is seeking a motivated Senior Site Reliability Engineer who thrives in a fast-paced, rapidly evolving technology environment.
This position requires broad knowledge of Linux administration, software development, cloud computing, networking, cloud security, performance analysis, and monitoring to provide stability, security, performance, and reliability for our infrastructure.
The Site Reliability Engineer is expected to work with multiple service and product development teams, identifying cross-team issues that create risk for operations across the organization and resolving those issues with a mixture of engineering, development, troubleshooting expertise, and general operational guidance.
This role also requires excellent communication and organizational skills. The candidate is expected to collaborate with service owners, other engineers, and developers to deliver a superior support experience to the development community.
Responsibilities
Solve complex problems related to cloud and Linux infrastructure, and build automation to prevent problem recurrence.
Identify opportunities and drive the implementation of automation to improve service health, availability and reliability
Architect, design, configure, deploy, and script end-to-end service monitoring, alerting and self-healing capabilities for production services
Understand the end-to-end configuration, technical dependencies, and characteristics of production infrastructure and services
Quickly grasp and analyze new technologies that are complex and rapidly changing and integrate those into automation and infrastructure support
Act as an escalation point for complex or critical issues that may not have a documented procedure and provide root cause analysis (RCA)
Author functional and technical documentation and standard operating procedures (SOPs)
Collaborate with development teams in defining and implementing improvements in service architecture.
Partner with DevOps, Oracle Cloud Infrastructure deployment, and development teams to identify and resolve issues.
Articulate technical characteristics of services and technology areas and guide cross-functional teams to engineer and add capabilities to internal tools.
Take responsibility for the design and delivery of mission-critical automation, with a focus on security, resiliency, scale, and performance.
Work with the global SRE team in Database Engineering and lead global projects.
Knowledge and Skills
6 to 12 years of experience in Site Reliability Engineering and in implementing automation.
Experience in Linux administration with good knowledge of kernel-level debugging
Experience in debugging operating system performance issues and performance tuning
Excellent troubleshooting skills for resolving critical application, networking and system administration issues
Experience working with fault tolerant, highly available, high throughput, distributed, scalable systems
Expertise in developing scripts, utilities, and tools to automate routine or manually intensive tasks
Experience in application, compute, storage, and database troubleshooting for improving application reliability, scalability, and availability
Experience in cloud infrastructure technologies
Experience in operations and problem management
Experience with ML- and AI-based development
Experience with monitoring tools such as Prometheus and Grafana
Development experience using Python and building infrastructure using Terraform
Solid experience with configuration management tools such as Ansible and Chef
Experience in managing 24/7 high-availability production applications
Experience working with global teams across different time zones.
Possesses and demonstrates strong logical-thinking skills, intellectual curiosity, and high motivation for self-development.
Aptitude to be a good team player and the desire to learn and implement new cloud technologies as needed
Good understanding of Agile software development principles, including the use of common tools such as JIRA
Good understanding of cloud security and compliance management, including patching
Multi-OS knowledge and expertise are preferred
Excellent organizational, verbal, and written communication skills
Qualifications required
6 to 12 years of experience working in an IT Operations Infrastructure team
Bachelor's degree in Computer Science, Computer Engineering, Software Engineering, or a related area is preferred