11,309 Site Reliability jobs in the United States

DevOps/Site Reliability Engineer

32806 Orlando, Florida Insight Global

Posted 2 days ago

Job Description
The DevOps & Site Reliability Engineer position is responsible for implementing and maintaining the continuous integration / continuous deployment (CI/CD) pipeline that meets Development, IT Security, and Program Governance requirements. This role also leads the design and automation of infrastructure using tools like Terraform and Ansible, and ensures system reliability through proactive monitoring, observability, and incident response practices.
In addition, the role is responsible for implementing and supporting infrastructure and processes that enable comprehensive monitoring and observability for Servers, Applications, and Network using tools such as Prometheus, Elastic Stack, and Grafana. The engineer will also work with messaging and data systems including Redis, RabbitMQ, Kafka, and CouchDB, and contribute to the development of internal developer portals to enhance engineering productivity and platform usability.
Essential Duties and Responsibilities:
Infrastructure & Reliability Engineering (30%)
- Collaborate with leadership, technical leads, and teams to design and maintain infrastructure that maximizes availability and ensures configuration consistency across environments.
- Implement and maintain SRE best practices such as SLIs, SLOs, and error budgets.
CI/CD & Developer Enablement (25%)
- Maintain and enhance CI/CD pipelines and tooling.
- Analyze onboarding requests and automate developer workflows.
- Contribute to the development and maintenance of internal developer portals to streamline access to tools, documentation, and services.
Automation & Transformation Projects (25%)
- Lead design and development efforts that drive the transformation to a DevOps and SRE culture.
- Automate infrastructure provisioning, configuration management, and application deployment using tools like Terraform and Ansible.
Monitoring & Observability (10%)
- Maintain and improve observability tools and practices using Prometheus, Elastic Stack, Grafana, and other monitoring solutions.
- Ensure proactive alerting and actionable insights into system performance and reliability.
Support & Collaboration (10%)
- Assist technical staff in resolving complex issues.
- Participate in incident response and postmortem processes to improve system resilience.
- Understand and actively participate in Environmental, Health & Safety responsibilities by following established UO policy, procedures, training, and team member involvement activities.
- Perform other duties as assigned.
$47/hr - $75/hr - Exact compensation may vary based on several factors, including skills, experience, and education.
We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request. To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy.
Requirements:
4-6 years of experience in software engineering, DevOps, or SRE roles.
Strong understanding of Agile development methodologies and DevOps/SRE principles.
Specific Skills & Abilities:
- Extensive experience with CI/CD tools and pipelines (e.g., GitLab, GitHub Actions, Jenkins).
- Deep knowledge of containerization and orchestration (Kubernetes, Docker, Rancher).
- Proficient in scripting and automation (Ansible, Python, PowerShell).
- Experience with infrastructure as code tools such as Terraform.
- Experience with observability tools: Prometheus, Elastic Stack, Grafana, Elastic APM.
- Familiarity with message brokers and data streaming platforms: RabbitMQ, Kafka.
- Experience with caching and NoSQL databases: Redis, CouchDB.
- Proven ability to debug and troubleshoot distributed systems.
- Experience with internal developer portals and platform engineering concepts.
- Strong analytical, planning, and organizational skills.
- Excellent communication and collaboration skills across technical and non-technical teams.
- Experience with Linux systems; Windows/IIS experience is a plus.
- Familiarity with tools such as Atlassian Suite (Jira, Confluence), ServiceNow, SolarWinds, SQL.
- Agile certifications (e.g., Scrum Master) are a plus.

Site Reliability Engineer

30081 Smyrna, Georgia Insight Global

Posted 1 day ago

Job Description
A client of Insight Global is looking for an SRE to join its infrastructure team, working within an API-heavy environment. The individual will work on Home Depot's internal proxy, called Vantage, including work at the OSI networking layers and automating the deployment and visualization of the proxy. Additional responsibilities include debugging, test building, and occasional customer support. The pay rate for this position will be between $65 and $70/hr.
We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request. To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy.
Requirements:
· 7-10+ years of experience as an SRE in the networking space
· Heavy Terraform experience
o Must know how to use variables within Terraform
· Experience with Kubernetes
o Must be able to build, optimize and modify containers within Docker
· GitHub experience
· Ansible experience
o Building Ansible templates
· Experience with Bash/Shell scripting as well as low testing
· Experience building Grafana dashboards
· GCP experience
· Candidate must have a public GitHub account
· Golang experience
· Envoy proxy experience

Site Reliability Engineer

99811 Juneau, Alaska Oracle

Posted 2 days ago

**Job Description**
This role aligns to work done for the US Federal Government and requires US citizenship, among other qualifications outlined below, including a federal background investigation to gain Public Trust.
RTHS DevOps is responsible for the CareAware Cloud SaaS across all our cloud regions, both internal and client-facing. The team is responsible for keeping the lights on as well as for other needed deployments, projects, and new implementations.
**Responsibilities**
As a member of the RTHS DevOps team, you will be responsible for the daily operational tasks required to run it for all our cloud clients. You will monitor and maintain server performance and availability, and ensure compliance with Service Level Agreements. You will address operational systems issues as needed. You will deploy new code, onboard new clients or new solutions, and complete technology upgrades. Looking ahead, the team has critical involvement in our OCI cloud build-out and client migrations, giving you an opportunity to get involved in these new regions from the ground up and to apply DevOps thinking from the beginning.
Qualifications:
+ Deep Linux Knowledge
+ Strong knowledge of Kubernetes
+ System Monitoring and troubleshooting
+ Network monitoring and troubleshooting
+ Cloud experience (OCI or AWS preferred)
Disclaimer:
**Certain US customer or client-facing roles may be required to comply with applicable requirements, such as immunization and occupational health mandates.**
**Range and benefit information provided in this posting are specific to the stated locations only**
US: Hiring Range in USD from: $79,800 to $178,100 per annum. May be eligible for bonus and equity.
Oracle maintains broad salary ranges for its roles in order to account for variations in knowledge, skills, experience, market conditions and locations, as well as reflect Oracle's differing products, industries and lines of business.
Candidates are typically placed into the range based on the preceding factors as well as internal peer equity.
Oracle US offers a comprehensive benefits package which includes the following:
1. Medical, dental, and vision insurance, including expert medical opinion
2. Short term disability and long term disability
3. Life insurance and AD&D
4. Supplemental life insurance (Employee/Spouse/Child)
5. Health care and dependent care Flexible Spending Accounts
6. Pre-tax commuter and parking benefits
7. 401(k) Savings and Investment Plan with company match
8. Paid time off: Flexible Vacation is provided to all eligible employees assigned to a salaried (non-overtime eligible) position. Accrued Vacation is provided to all other employees eligible for vacation benefits. For employees working at least 35 hours per week, the vacation accrual rate is 13 days annually for the first three years of employment and 18 days annually for subsequent years of employment. Vacation accrual is prorated for employees working between 20 and 34 hours per week. Employees working fewer than 20 hours per week are not eligible for vacation.
9. 11 paid holidays
10. Paid sick leave: 72 hours of paid sick leave upon date of hire. Refreshes each calendar year. Unused balance will carry over each year up to a maximum cap of 112 hours.
11. Paid parental leave
12. Adoption assistance
13. Employee Stock Purchase Plan
14. Financial planning and group legal
15. Voluntary benefits including auto, homeowner and pet insurance
The role will generally accept applications for at least three calendar days from the posting date or as long as the job remains posted.
Career Level - IC3
**About Us**
As a world leader in cloud solutions, Oracle uses tomorrow's technology to tackle today's challenges. We've partnered with industry leaders in almost every sector, and continue to thrive after 40+ years of change by operating with integrity.
We know that true innovation starts when everyone is empowered to contribute. That's why we're committed to growing an inclusive workforce that promotes opportunities for all.
Oracle careers open the door to global opportunities where work-life balance flourishes. We offer competitive benefits based on parity and consistency and support our people with flexible medical, life insurance, and retirement options. We also encourage employees to give back to their communities through our volunteer programs.
We're committed to including people with disabilities at all stages of the employment process. If you require accessibility assistance or accommodation for a disability at any point, let us know by emailing or by calling in the United States.
Oracle is an Equal Employment Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability and protected veterans' status, or any other characteristic protected by law. Oracle will consider for employment qualified applicants with arrest and conviction records pursuant to applicable law.

Site Reliability Engineer

99811 Juneau, Alaska iCIMS

Posted 2 days ago

Job Description

**Job Overview**
We are seeking a skilled Site Reliability Engineer (SRE) to contribute to the reliability, scalability, and performance of our multi-cloud SaaS platform serving thousands of customers worldwide. This role involves hands-on technical work in incident response, system monitoring, automation, and continuous improvement of our platform reliability. The successful candidate will work within a global SRE team to ensure optimal system performance and customer satisfaction.
**About Us**
When you join iCIMS, you join the team helping global companies transform business and the world through the power of talent. Our customers do amazing things: design rocket ships, create vaccines, deliver consumer goods globally, overnight, with a smile. As the Talent Cloud company, we empower these organizations to attract, engage, hire, and advance the right talent. We're passionate about helping companies build a diverse, winning workforce and about building our home team. We're dedicated to fostering an inclusive, purpose-driven, and innovative work environment where everyone belongs.
**Responsibilities**
+ **System Monitoring & Reliability:**
    + Monitor multi-cloud infrastructure (AWS, Azure, GCP) using New Relic, Grafana, and Sumo Logic
    + Maintain reliability of AWS resources, Auth0/Okta authentication, databases, and legacy applications
    + Implement monitoring, alerting, and dashboards for assigned systems
+ **Incident Management & Response:**
    + Respond to alerts and incidents within SLA timeframes
    + Perform root cause analysis and document findings
    + Create and maintain runbooks and troubleshooting procedures
    + Participate in 24/7 on-call rotation
+ **Automation & Improvement:**
    + Develop scripts to reduce manual operational overhead
    + Build monitoring and alerting solutions
    + Support infrastructure-as-code initiatives
    + Implement automated remediation where possible
+ **Success Metrics:**
    + **Customer Impact:** Reduced MTTR and improved customer satisfaction scores
    + **Reliability:** Achievement of 99.9%+ uptime SLAs across all products and regions
    + **Proactive Prevention:** Reduction in incident frequency through automated detection and prevention
    + **Cross-functional Collaboration:** Improved partnership metrics with Product, Engineering, and Customer Success teams
    + **Automation Delivery:** Complete assigned automation projects to reduce manual tasks
    + **Knowledge Sharing:** Contribute to team knowledge base and mentor junior engineers
**Qualifications**
+ 4+ years of experience in SRE, DevOps, or Infrastructure Engineering
+ Hands-on experience with AWS (required) and Azure (preferred)
+ Strong Linux system administration skills
+ Experience with monitoring tools (New Relic, Grafana, Prometheus)
+ Scripting skills in Python, Bash, or similar
+ Knowledge of databases (SQL Server, PostgreSQL, MongoDB)
**Preferred**
**Technical Experience:**
+ SaaS experience in a global environment
+ Authentication and identity management systems knowledge
+ Cloud certifications (AWS, Azure, or Google Cloud)
+ Infrastructure-as-code tools (Terraform, CloudFormation)
**Education/Certifications/Licenses:**
+ Bachelor's degree in Computer Science, Engineering, Information Systems, or a related technical field
+ Equivalent combination of education and experience will be considered
**Working Conditions:**
+ Global role requiring flexibility for incident response and team coordination across time zones
+ Occasional client-facing responsibilities during critical incidents
+ Travel may be required for team building
+ Hybrid work environment with team members distributed globally
**EEO Statement**
iCIMS is a place where everyone belongs. We celebrate diversity and are committed to creating an inclusive environment for all employees. Our approach helps us to build a winning team that represents a variety of backgrounds, perspectives, and abilities. So, regardless of how your diversity expresses itself, you can find a home here at iCIMS.
We are proud to be an equal opportunity and affirmative action employer. We prohibit discrimination and harassment of any kind based on race, color, religion, national origin, sex (including pregnancy), sexual orientation, gender identity, gender expression, age, veteran status, genetic information, disability, or other applicable legally protected characteristics. If you would like to request an accommodation due to a disability, please contact us at
**Compensation and Benefits**
We accept applications for this position on an ongoing basis until the position is filled. Applications will be reviewed as they are received, and qualified candidates may be contacted throughout the posting period.
The anticipated base pay range for this position is $100,000-$140,000 annually. Final compensation will be based on factors such as relevant experience, skills, education, internal equity, and market data. This range aligns with our commitment to equitable and transparent compensation practices, as required by applicable law.
Competitive health and wellness benefits include medical, dental, vision, 401(k), dependent care, short-term and long-term disability, life and AD&D insurance, bonding and parental leave, mindfulness resources, an open vacation policy, sick days, paid holidays, quiet hours each workday, and tuition reimbursement. Benefits and eligibility may vary by location, role, and tenure.
