Principal Software Site Reliability Engineer - Problem Management & RCA

Dallas, TX 75219
At AT&T, we're connecting the world through the latest tech, top-of-the-line communications and the best in entertainment. Our groundbreaking digital solutions provide an intuitive and integrated experience for customers across online, retail and care channels. Join our mission to deliver compelling communication and entertainment experiences to customers around the world. You'll drive how we deliver a seamless and fast customer experience with digital at the center of AT&T's distribution channels. We're offering an opportunity to revolutionize the digital space and the chance to create a career that will propel your future.


**Principal Software/Site Reliability Engineer - Problem Mgmt & RCA**


**Position Overview**


This position is responsible for driving 24x7 Problem/Incident Management impact and Root-Cause Analysis (RCA) assessment and communication for Consumer online Sales, Account Management, and Support websites and mobile apps. This position will also define Service Level Objectives (SLOs), track and drive availability and service metrics, and drive accomplishment of operational SLOs.


**Responsibilities**


+ Analysis of GTOC enterprise Incidents including implementing automated tracking and reporting of system, customer & business impacts from site outages, incidents, and critical defects.

+ Weekly and monthly analysis of progress & accomplishment against Service Level Objectives (SLOs) and identifying/driving gap closures where necessary.

+ Coordinating with GTOC, Digital Product Delivery (PO/PM, Dev, QA), Operations, Site Reliability Engineers, Infrastructure/Network & 3rd Party vendors to drive resolution of reported problems.

+ Leading Root-Cause Analysis (RCA) for complex outages, incidents, and critical/major defects, and tracking resolution through completion.

+ Providing training to teams and auditing RCAs to ensure blameless post-mortems are conducted per established principles and the resulting information is actionable, so that the same problems do not occur more than once.

+ Developing tools, scripts, and queries, and performing data analysis of weekly/monthly/YTD incidents/problems to determine chronic/recurring root causers and applications with a high frequency of incidents.

+ Partnering with Site Reliability Engineers (SREs), DevOps teams, Network, Infrastructure, Security & Fraud services to establish proactive and automated monitoring/alerting for chronic root causers, establish get-well/improvement plans, and drive those plans through to resolution.


**Minimum Qualifications**


+ 8+ years of related experience with a bachelor's degree in Computer Science, Information Systems or related field.

+ 6+ years of progressive experience in one or more of the following areas: application delivery; subject matter expertise in building Java-based high-volume/high-transaction e-commerce applications

+ 6+ years of experience building web applications using HTML5/CSS3/JavaScript

+ 3+ years of experience working with front end frameworks such as React, Angular


**Preferred Qualifications**


+ 4+ years of experience in architecture and design of systems using microservices architecture

+ 4+ years of experience in a leadership capacity - coaching and mentoring engineers, developers

+ 2+ years of experience working with SPA/PWA architectures

+ 2+ years of experience with server-side rendering technologies and architectures

+ 2+ years of experience in cloud technologies: AWS, Azure, OpenStack, Docker, Kubernetes, Ansible, Chef or Terraform

+ 2+ years of experience in build and CI/CD technologies: GitHub, Maven, Jenkins, Nexus or Sonar

+ 4+ years of experience in unit and functional testing using JUnit, Spock, Mockito/JMock, Selenium, Cucumber, SoapUI or Postman

+ Proficiency in Unix/Linux command line

+ Expert knowledge and experience working with asynchronous message processing, stream processing and event driven computing.

+ Experience working within an Agile/Scrum/Kanban development team

+ Excellent written and verbal communication skills with demonstrated ability to present complex technical information in a clear manner to peers, developers, and senior leaders


**Technical Skills**


HTML5, CSS3, JavaScript, React, Next.js, Angular, Node.js, REST services, NoSQL technologies (Cassandra/MongoDB), Kafka/MQ/Rabbit, Redis/Hazelcast, Git, Jira, Jenkins, Docker, Kubernetes


AT&T is leading the way to the future for customers, businesses and the industry. We're developing new technologies to make it easier for our customers to stay connected to their world. Together, we've built a premier integrated communications and entertainment company and an amazing place to work and grow. Team up with industry innovators every time you walk into work, creating the world you always imagined. Ready to #transformdigital with us? Apply now!
We expect employees to be honest, trustworthy, and operate with integrity. Discrimination and all unlawful harassment (including sexual harassment) in employment is not tolerated. We encourage success based on our individual merits and abilities without regard to race, color, religion, national origin, gender, sexual orientation, gender identity, age, disability, marital status, citizenship status, military status, protected veteran status or employment status.
Posted: 2021-05-20 Expires: 2021-06-19