Senior Site Reliability Engineer

Humana
San Diego, CA 92108
Description

Site Reliability Engineers are software development experts responsible for improving the application lifecycle, evolving software systems to increase their reliability, monitoring application performance, and ensuring overall system health, including high availability, low latency, top performance, high efficiency, effective change management, continuous monitoring and alarming, emergency response, and capacity planning. They act as a bridge between development and operations teams by applying a software engineering mindset to system administration topics.

Responsibilities

+ Building software to help operations and support teams: SRE teams are in charge of proactively building and implementing services that make IT and support teams better at their jobs. This can be anything from adjustments to monitoring and alerting to code changes in production. A site reliability engineer can be tasked with building a homegrown tool from scratch to address weaknesses in software delivery or incident management.
+ Fixing support escalation issues: A site reliability engineer can expect to spend time fixing support escalation cases. Because an SRE team touches so many different parts of the engineering and IT organization, its members can be a great source of knowledge and can help route issues to the right people and teams.
+ Optimizing on-call rotations and processes: Site reliability engineers will need to take on on-call responsibilities. The SRE role has a lot of say in how the team can improve system reliability by optimizing on-call processes. SRE teams help add automation and context to alerts, leading to better real-time collaborative response from on-call responders. Additionally, site reliability engineers can update runbooks, tools, and documentation to help prepare on-call teams for future incidents.
+ Documenting tribal knowledge: SRE teams gain exposure to systems in both staging and production, as well as to all technical teams. They take part in work with software development, support, IT operations, and on-call duties, so they build up a great amount of historical knowledge over time. Instead of siloing this knowledge in the mind of one team or one person, site reliability engineers can be tasked with documenting much of what they know. Constant upkeep of documentation and runbooks ensures that teams get the information they need right when they need it.
+ Conducting post-incident reviews: SRE teams need to keep teams honest and ensure that everyone, software developers and IT professionals alike, is conducting post-incident reviews, documenting their findings, and taking action on their learnings. Site reliability engineers are then often tasked with action items for building or optimizing some part of the SDLC or incident lifecycle to bolster the reliability of their service.

Responsibilities (representative examples):

+ Capacity planning and management: create, use, and maintain a capacity model for cloud-based implementations.
+ Perform continuous integration and delivery; implement, test, and monitor new microservices; and troubleshoot related deployment issues on Linux systems.
+ Collect and maintain a complete inventory of all systems. Identify and retire unused systems to recycle resources and reduce maintenance costs.
+ Create and maintain documentation of systems and processes for existing and new systems; configure and maintain Puppet/Ansible/Chef cookbooks for all deployed environments.
+ Deploy and monitor instances and services in cloud-based environments; identify and correct the root cause of system alarms, and recommend changes to avoid their recurrence.
+ Provide systems support by participating in the on-call rotation, executing emergency recovery, maintenance, and upgrades during weekend and evening hours when required.
+ Serve as an escalation point for other Systems Administrators, Engineers, and other technology teams in the resolution of server and system problems.
+ Lead and contribute to the proof-of-concept, implementation, and maintenance of automation tools used in the management of our infrastructure.
+ Plan, schedule, test, and perform software installation and upgrades.
+ Build, administer, and troubleshoot all mission-critical environments (Production, Stage, Dev, Test, QA).
+ Leverage automation tools, especially Bash, PowerShell, and Puppet, to decrease end-to-end deployment times, reduce downtime, and increase reliability.
+ Implement and maintain monitoring solutions at the server and application level to increase visibility into day-to-day operations and issues, using Nagios and ELK/Splunk.
+ Lead initiatives to transition critical software services into the Cloud, and train other employees on the Cloud transition process for other portions of the product/organization.
+ Generate well-defined and documented standard processes for the enterprise.
+ Provide solutions for performance management, disaster recovery, monitoring, and access management.
+ Work with and support business users to understand issues, develop root cause analyses, and work with the team to develop enhancements and fixes.
+ Provide engineering design across different workloads, including incident and problem management, change management, security, and compliance.
+ Work with and lead other members of the team in staying on top of key industry innovation and technology, and assist in the team's development and growth.

Required Skills:

+ 5+ years of industry (post-graduation) experience designing, developing, testing, and supporting a highly scalable, highly available online service.
+ 5+ years of industry (post-graduation) experience working with cloud-based environments (AWS and/or Google and/or Azure).
+ 5+ years of industry (post-graduation) experience working with the Linux and Windows operating systems.
+ 2+ years of industry (post-graduation) experience with configuration management frameworks, using tools such as Puppet, Ansible, and Chef.
+ 2+ years of industry (post-graduation) experience with distributed processing frameworks like Spark and orchestration frameworks like Kubernetes and Docker Swarm for microservices.
+ 2+ years of industry (post-graduation) experience with scripting languages (Bash, Python, and PowerShell).
+ Working knowledge of TCP/IP, TCP/UDP, routers, switches, firewalls/VPNs, and higher-level protocols like HTTP and DNS.
+ Working knowledge of monitoring and alarming tools like Nagios and ELK/Splunk.
+ Working knowledge of relational and non-relational databases: MS SQL, MySQL, Postgres, Oracle, and Mongo.
+ Ability to troubleshoot run-time service issues (memory leaks, race conditions, etc.) with appropriate tools (Dynatrace, JMeter, etc.).
+ Ability to define, document, and explain the technical architecture of complex and highly scalable products.
Required Education:

+ Bachelor of Science in an engineering discipline (Preferred: Computer Science, Computer Engineering, Computer Technology, Software Engineering, etc.) or equivalent experience

Desired Certifications:

+ Linux: Linux Foundation Certified System Administrator (LFCS) and/or Linux Foundation Certified Engineer (LFCE); Red Hat Certified System Administrator (RHCSA) and/or Red Hat Certified Engineer (RHCE) and/or Red Hat Certified Technician (RHCT)
+ Windows: Microsoft Certified Systems Administrator (MCSA) and/or Microsoft Certified Systems Engineer (MCSE)
+ Cloud: AWS Certified SysOps Administrator and/or AWS DevOps Engineer
+ Cloud: Azure Solution Architect; Azure DevOps Engineer; Azure Administrator Associate; Azure Developer Associate; Azure Security Engineer Associate
+ Network: Cisco Certified Network Associate or Professional (CCNA/CCNP), MCITP Server
+ CompTIA Server+ and/or CompTIA Cloud+

Additional Information

Scheduled Weekly Hours: 40

About Us

Mission: At Humana, our cultural foundation is aligned to helping members achieve their best health by delivering personalized, simplified, whole-person healthcare experiences. Recognizing that healthcare needs continue to evolve for each person, each family, and each community, Humana continuously creates innovative solutions and resources that help people live their healthiest lives on their terms, when and where they need it. Our employees are at the heart of making this happen, and that's why we are dedicated to building an organization of dynamic talent whose experience and passion center on putting the customer first.

Equal Opportunity Employer

It is our policy to recruit, hire, train, and promote people without regard to race, color, religion, sex, national origin, age, sexual orientation, gender identity or expression, disability, or veteran status, except where age, sex, or physical status is a bona fide occupational qualification. View the "EEO is the Law" poster.
If you are an individual with a disability and require a reasonable accommodation to complete any part of the application process, or are limited in your ability or unable to access or use this online application process and need an alternative method for applying, you may contact mailbox_tas_recruit@humana.com for assistance.

Humana Safety and Security

Humana will never ask or require a candidate to provide money for work equipment and network access during the application process. If you become aware of any instance where you as a candidate are asked to provide information and do not believe it is a legitimate request from Humana or an affiliate, please contact mailbox_tas_recruit@humana.com to validate the request.
Posted: 2019-06-28 Expires: 2019-11-09
