Sr. Lead Infrastructure SRE
Personal Qualities
To undertake this role, you will need:
- Keenness to learn new technologies, concepts and techniques.
- Demonstrable critical thinking skills – we deal with complex issues, and this requires clear thought processes.
- Ability to synthesise an approach to an issue from existing knowledge and the new techniques we teach you.
- Drive and determination – the issues we deal with often take twists and turns demanding real stamina from our SREs.
- Confidence to stand your ground using data to explain your conclusions and recommendations.
Qualifications
The following qualifications are essential for this role:
- At least 5 years’ experience in either:
  - Java EE / Jakarta EE application software development, or
  - Java EE / Jakarta EE application support.
- A demonstrable understanding of distributed systems.
- A working knowledge of containerised applications.
- A demonstrable basic understanding of TCP/IP.
- A demonstrable understanding of an application layer protocol such as HTTP.
Nice to Have
The following qualifications would be beneficial to this role:
- Experience developing or supporting applications based on Tomcat application servers.
- Experience developing or supporting applications based on WebLogic application servers.
- Experience developing or providing support in a microservice environment.
- Knowledge of a messaging technology such as MQ (Message Queue), Solace or Kafka.
- Experience in full stack support (application, data and infrastructure).
- Knowledge of Oracle or Microsoft SQL Server relational database technologies.
- Experience in analysing data logs using Elastic Kibana.
- Experience in analysing data logs using Azure Log Analytics.
- Experience in the use of Wireshark for the capture and analysis of network packet traces.
- Experience (past or present) in the use of an automation platform such as Ansible, Puppet, Chef, Salt or vRA.
- Experience developing or supporting applications based on Pivotal Cloud Foundry (Tanzu Application Service).
- Knowledge of SRE concepts and techniques.
- Experience with DevOps-related tasks; in particular, BAU support.
- Experience in using ServiceNow.
- An understanding of the regulatory landscape for financial services.
Tasks & Responsibilities
General
- Attend weekly team meetings.
- Submit time records at the end of each week.
- Undertake general tasks that may be allocated from time to time.
Recurring Problem Diagnosis (RPD)
Investigations are based on our RPR problem diagnosis method, which we will teach you. The tasks and responsibilities are:
- Conduct Discovery Calls to obtain:
  - a problem statement,
  - a high-level understanding of the moving parts of the system to investigate,
  - an understanding of how the data flows around the system, and
  - the diagnostic data sources available.
- Produce a Diagnostic Capture Plan that describes how the data needed will be captured.
- Help application and infrastructure teams execute the Diagnostic Capture Plan.
- Analyse the resulting data to determine the root cause of the problem or the next steps.
- Issue periodic email-based status reports.
- Attend investigation progress meetings with stakeholders.
- Notify the team leader of blockers or other issues that may arise.
- Assist other SREs in investigations.
- Handle multiple RPDs at any time – this is possible as there can be long pauses in the investigations.
- Undertake projects to improve our ability to solve problems.
Golden Signal Monitoring
- Use Site Reliability Core (SRC) to identify app and infrastructure services that are missing their Service Availability target or in danger of doing so.
- For the services identified as having a problem, investigate using SRC and other data sources.
- Assess the underlying issue against criteria that we have established and, where appropriate, create a ServiceNow problem record with details of the problem and assign it to the service owner.
- Work with the team that owns the service to help them understand our findings and explain them to vendors.
- Assist the service support team in determining the cause of the problem.
- Assist in the operation of our SRC system, including onboarding services, setting availability metrics and fine-tuning SLOs.
- Undertake projects to improve our ability to monitor systems and deliver service availability information.