J.P. Morgan is a leader in financial services, offering innovative and intelligent solutions to clients in more than 100 countries with one of the most comprehensive global product platforms available.
We have been helping our clients to do business and manage their wealth for more than 200 years and we keep their interests foremost in our minds at all times.
This combination of product strength, intellectual capital and character sets us apart as an industry leader.
J.P. Morgan is part of J.P. Morgan Chase & Co. (NYSE: JPM), a global financial services firm with assets of $2.0 trillion.
The Chief Technology Office (CTO) aims to deliver technology efficiently and effectively with the right capabilities and the best talent for the firm, while removing friction that slows delivery.
The AI / Machine learning group within the Chief Technology Office owns the Strategy and Pattern for AI and Machine Learning for the firm.
We enable advanced analytics for the Lines of Business through the use of AI and Machine Learning design patterns and common services.
We are seeking an experienced software engineering lead in our global Site Reliability Engineering (SRE) team supporting our AI / ML platform.
integration tools. The toolsets developed must pass the rigor of JPMC's cyber security standards.
The SRE team runs, maintains and improves the AI / ML Platform against established Service Level Objectives by applying software engineering practices.
It is responsible for the availability, performance, change management, monitoring, and capacity management of its services, with special emphasis on automating the processes and workloads that support them.
The SRE team is also responsible for the operational support of the AI / ML infrastructure, with emphasis on feeding outage, issue and incident data into a design and SDLC feedback loop to ensure maximum automation and outage avoidance.
Manage a team of software engineers focused on improving and promoting reliability, availability and security of our infrastructure, systems and applications.
Accountable for supporting cloud and on-premises AI / ML systems and infrastructure
Key contributor to SRE, core infrastructure and functional development teams throughout the life cycle to help support software for reliability and scale, ensuring minimal refactoring or changes
Cultivates trust through personal and team relationships with key stakeholders and senior management (including MDs).
Troubleshoots priority incidents, conducts blameless post-mortems and ensures permanent closure of the incidents
Apply industry standard change management, incident management and problem management principles
Engages with development team throughout the life cycle to help develop software for reliability
Designs and conducts the performance tests, identifies the bottlenecks, opportunities for optimization and the capacity demand
Contributes to the definition of the strategic roadmap and its execution; inclusive of R&D of emerging industry trends
Work with Cyber team to ensure systems are safe and resolve / prioritize vulnerability fixes
Applies periodic analytics and reporting on incidents and patterns for issues and takes proactive actions
Defines and drives adoption of a best-in-class monitoring framework to accomplish end-to-end flow monitoring and noiseless alerting
Deploys the software and product upgrades
Adds value to team delivery and works with team to complete tasks to high quality and actively learns new skills
Facilitates maximum speed of delivery by objectively binding to error budgets of the service
Manages the effort split between manual operational work and engineering work
Be part of the 24x7 support coverage as needed
Embraces and promotes the culture of the group and the firm.
BS or MS degree in computer science
A minimum of 8 years of hands-on leadership of high-performing, agile-based engineering teams
6+ years of experience architecting integrated stack solutions (storage, network, compute) within an enterprise scale production environment
6+ years of experience in performance engineering and monitoring using tools such as AppDynamics, Splunk, Apica, JMeter, Datadog etc.
6+ years of incident management, change management and problem management experience in a large scale operations environment
Experience in Anaconda, Jupyter and open source frameworks.
Experience in conda packaging of python libraries
Cloud computing: Amazon Web Services, Azure, Docker, Kubernetes.
Experience working in an Agile Development environment
Experience in setting up CI / CD pipelines.
Proven ability to understand and troubleshoot complex problems under pressure
Familiarity with AWS ML / SageMaker, Azure ML, Google AI would be preferred.
Experience in big data technologies
Experienced in implementing Git, Bitbucket, Jenkins, Sonar, Splunk, Maven, AIM and / or Continuous Delivery tools