SRE Manager for AI/ML - Vice President
3d ago
source : Shine

J.P. Morgan is a leader in financial services, offering innovative and intelligent solutions to clients in more than 100 countries with one of the most comprehensive global product platforms available.

We have been helping our clients to do business and manage their wealth for more than 200 years and we keep their interests foremost in our minds at all times.

This combination of product strength, intellectual capital and character sets us apart as an industry leader. J.P. Morgan is part of J.P. Morgan Chase & Co. (NYSE: JPM), a global financial services firm with assets of $2.0 trillion.

The Chief Technology Office (CTO) aims to deliver technology efficiently and effectively with the right capabilities and the best talent for the firm, while removing friction that slows delivery.

The AI / Machine learning group within the Chief Technology Office owns the Strategy and Pattern for AI and Machine Learning for the firm.

We enable advanced analytics for the Lines of Business through the use of AI and Machine Learning design patterns and common services.

We are seeking an experienced software engineering lead in our global Site Reliability Engineering (SRE) team supporting our AI / ML platform.

  • This individual will be expected to lead a team of software engineers who will grow into subject matter experts, work with functional application development teams, and partner with infrastructure engineers and production support analysts to determine requirements for designing and developing automation, SDLC and development environment testing and integration tools. The toolsets developed must pass the rigor of JPMC's cyber security standards.

    The SRE team runs, maintains and improves the AI / ML Platform against established Service Level Objectives by applying software engineering practices.

    It is responsible for the availability, performance, change management, monitoring, and capacity management of its services, with special emphasis on automating the processes and workloads that support them.

    The SRE team is also responsible for the operational support of the AI / ML infrastructure, with emphasis being placed on the ability to submit outage / issue / incident data into a design and SDLC feedback loop to ensure maximum automation and outage avoidance.


    Manage a team of software engineers focused on improving and promoting reliability, availability and security of our infrastructure, systems and applications.

    Accountable for supporting cloud and on-premises AI / ML systems and infrastructure

    Key contributor to SRE, core infrastructure and functional development teams throughout the life cycle to help support software for reliability and scale, ensuring minimal refactoring or changes

    Cultivates trust through personal and team relationships with key stakeholders and senior management (MDs).

    Troubleshoots priority incidents, conducts blameless post-mortems and ensures permanent closure of the incidents

    Apply industry standard change management, incident management and problem management principles

    Engages with development team throughout the life cycle to help develop software for reliability

    Designs and conducts performance tests; identifies bottlenecks, optimization opportunities and capacity demand

    Contributes to the definition of the strategic roadmap and its execution; inclusive of R&D of emerging industry trends

    Work with Cyber team to ensure systems are safe and resolve / prioritize vulnerability fixes

    Applies periodic analytics and reporting on incidents and patterns for issues and takes proactive actions

    Defines and drives adoption of best-in-class monitoring frameworks to accomplish end-to-end flow monitoring and noiseless alerting

    Deploys the software and product upgrades

    Adds value to team delivery and works with team to complete tasks to high quality and actively learns new skills

    Facilitates maximum speed of delivery by objectively adhering to the service's error budgets

    Manages the effort split between manual operational work and engineering work

    Be part of the 24x7 support coverage as needed

    Embrace and promote the culture of the group and the firm.

    BS or MS degree in computer science

    A minimum of 8 years of hands-on leadership of high-performing, agile-based engineering teams

    6+ years of experience architecting integrated stack solutions (storage, network, compute) within an enterprise scale production environment

    6+ years of experience in performance engineering and monitoring using tools such as AppDynamics, Splunk, Apica, JMeter, Datadog, etc.

    6+ years of incident management, change management and problem management experience in a large-scale operations environment

    Experience with Anaconda, Jupyter and open source frameworks.

    Experience in conda packaging of Python libraries

    Cloud computing: Amazon Web Services, Azure, Docker, Kubernetes.

    Experience working in an Agile Development environment

    Experience in setting up CI / CD pipelines.

    Proven ability to understand and troubleshoot complex problems under pressure

    Familiarity with AWS ML / SageMaker, Azure ML, Google AI would be preferred.

    Experience in big data technologies

    Experience in implementing Git, Bitbucket, Jenkins, Sonar, Splunk, Maven, AIM and / or Continuous Delivery tools
