Principal System Site Reliability Engineer - Chaos Engineering

Dallas, TX 75219

_At AT&T, we're connecting the world through the latest tech, top-of-the-line communications and the best in entertainment. Our groundbreaking digital solutions provide an intuitive and integrated experience for customers across online, retail and care channels. Join our mission to deliver compelling communication and entertainment experiences to customers around the world. You'll drive how we deliver a seamless and fast customer experience with digital at the center of AT&T's distribution channels. We're offering an opportunity to revolutionize the digital space and the chance to create a career that will propel your future._

**Sr. System/Site Reliability Engineer - Chaos Engineering**

**Position Overview**

This position is responsible for implementing and managing proactive Chaos Engineering and Chaos Testing practices to discover system behaviors, properties, and performance, enabling improvements that drive an optimal production site experience and operations even during higher-than-expected site traffic, network outages, security attacks, hardware/memory failures, or software defects. This position implements new capabilities to drive scale, resilience, performance and reliability at all times.


+ Evaluating & implementing best practices for Chaos Engineering and Chaos Testing to enable industry-leading reliability and resiliency for mission-critical customer experiences and back-end systems.

+ Applying software engineering to automate all aspects of the software release and operations process from build/test/deploy, monitoring and alerting, service level reporting, to automatic failover and capacity management.

+ Defining steady state that represents normal behavior of the site/system, hypothesizing expected outcomes when something goes wrong, and designing experiments with variables to reflect real-world events like dependency failures, server failures, network or memory malfunctions, etc.

+ Measuring the impact of tests and observing differences in steady state across test groups.

+ Based on learnings, developing results and architecture designs where individual components can fail without affecting the availability of the entire system.

+ Partnering with SREs, Architects, and Product Managers to ensure the software they produce meets the reliability, serviceability, and resiliency standards our customers deserve.

+ Driving best practices and patterns that will contribute to AT&T's reputation as an industry leader for running highly reliable Digital applications and experiences.

+ Solving the hard problems of running large-scale services at the highest levels of reliability and resiliency.

+ Establishing great rapport with other DevOps teams, Product Managers, and Operations teams to maintain high levels of visibility, efficiency, and collaboration.

**Minimum Qualifications**

+ 8+ years of related experience with a bachelor's degree in Computer Science, Information Systems or related field.

+ 6+ years of progressive experience in one or more of the following areas: application delivery; subject matter expertise in building Java-based high-volume/high-transaction e-commerce applications

+ 6+ years of experience building web applications using HTML5/CSS3/JavaScript

+ 3+ years of experience working with front-end frameworks such as React or Angular

**Preferred Qualifications**

+ 4+ years of experience in architecture and design of systems using microservices architecture

+ 4+ years of experience in a leadership capacity, coaching and mentoring engineers and developers

+ 2+ years of experience working with SPA/PWA architectures

+ 2+ years of experience with server-side rendering technologies and architectures

+ 2+ years of experience in cloud technologies: AWS, Azure, OpenStack, Docker, Kubernetes, Ansible, Chef or Terraform

+ 2+ years of experience in build and CI/CD technologies: GitHub, Maven, Jenkins, Nexus or Sonar

+ 4+ years of experience in unit and functional testing using JUnit, Spock, Mockito/JMock, Selenium, Cucumber, SoapUI or Postman

+ Proficiency in Unix/Linux command line

+ Expert knowledge and experience working with asynchronous message processing, stream processing and event driven computing.

+ Experience working within an Agile/Scrum/Kanban development team

+ Excellent written and verbal communication skills with demonstrated ability to present complex technical information in a clear manner to peers, developers, and senior leaders

**Technical Skills**

HTML5, CSS3, JavaScript, React, Next.js, Angular, Node.js, REST services, NoSQL technologies (Cassandra/MongoDB), Kafka/MQ/RabbitMQ, Redis/Hazelcast, Git, Jira, Jenkins, Docker, Kubernetes

We expect employees to be honest, trustworthy, and operate with integrity. Discrimination and all unlawful harassment (including sexual harassment) in employment are not tolerated. We encourage success based on our individual merits and abilities without regard to race, color, religion, national origin, gender, sexual orientation, gender identity, age, disability, marital status, citizenship status, military status, protected veteran status or employment status.

Posted: 2021-05-20 Expires: 2021-06-19