About the company
Acko is India's first and only all-digital insurtech product company. It was founded in 2016 by Mr. Varun Dua, a serial fintech entrepreneur who set simplicity, affordability and innovation as its cornerstones.
Through innovative digital products, customized pricing, and the use of data and technology, we are changing how insurance works and how it is perceived by users in India.
Although we are solving for the Indian market and are based out of Bangalore, we are part of a global wave of insurtech startups that are creating success through technology and business model disruption.
We are a well-funded Series D company backed by a slate of marquee investors including Binny Bansal, Amazon, Ascent Capital, Accel, SAIF and Catamaran.
Through partnerships with large internet players such as Amazon, Ola, RedBus, Oyo, Lendingkart, ZestMoney and the GOMMT group, our micro-insurance product has reached 50M unique users.
As a Reliability Engineer, you will own the production engineering of Acko’s online services across all business verticals.
You will also support and drive the reliability, availability and performance engineering processes, ensuring 99.9999% availability of all technology and business services.
Responsible for the availability, performance, scaling, monitoring and incident response of Acko’s technology platform and services.
Ensure the site and services are up 24x7 with no unplanned downtime.
Troubleshoot exceptions, performance issues, latency and errors across multiple technologies.
Debug code issues based on web service and API responses, errors, events, logs, etc.
Work on and triage the daily tickets related to uptime and production issues.
Automate the critical jobs across the entire platform to minimise manual errors and human intervention.
Work closely with technology stakeholders (Product, Application Development, DevOps, QA, etc.) and offer the right feedback on the Java stack or enterprise stack to make the services performant and reliable.
Implement effective monitoring for all events and logs, with the right alerting and escalations for critical alerts.
Plan capacity and carry out timely infrastructure upgrades for the best reliability of the site.
Ensure proper reviews are built to minimise the Mean Time to Recover (MTTR) and maximise the Mean Time to Failure (MTTF).
Implement ITIL processes such as incident management, problem management and change management.
Document run books, incident responses, post-mortem reports, RCAs, etc. with clear mitigation steps and action items.
Understand the business flow and map the technology problems to get the right solutions out.
Qualifications & skills
BS degree in Computer Science or a related engineering discipline.
2-14 years of relevant reliability engineering experience at an online technology company.
Proven 24x7 on-call / pager duty support of systems at scale.
Ability to understand the business services and map them to the reliability engineering design and review.
Support the technology and business services of the entire technology platform from a scaling and performance perspective.
Manage the uptime of each microservice by building and implementing the right monitoring and alerts.
Good understanding of object-oriented programming, relational databases, NoSQL, caching systems, etc.
Proven ITIL support on incident, problem and change management with low response and resolution times.
Strong problem management abilities: automating repeatable jobs and working with stakeholders to ensure incidents do not recur.
OS, database and Java stack administration at an intermediate level or above.
Ability to review code and provide input to the development teams.
Ability to use both open source and commercial APM, logging and monitoring tools to troubleshoot issues.