Cloud Operations Lead and SRE Manager

Job ID 2026-8622
Job Locations
US-MD-Camp Springs
Category
IT: Administrator / Analyst / Architect / Engineer
Type
Regular Full-Time

Overview

Empower AI is AI for government. Empower AI gives federal agency leaders the tools to elevate the potential of their workforce with a direct path for meaningful transformation. Headquartered in Reston, Va., Empower AI leverages three decades of experience solving complex challenges in Health, Defense, and Civilian missions. Our proven Empower AI Platform® provides a practical, sustainable path for clients to achieve transformation that is true to who they are, what they do, how they work, with the resources they have. The result is a government workforce that is exponentially more creative and productive. For more information, visit www.Empower.ai.

Empower AI is proud to be recognized as a 2024 Military Friendly Employer by Viqtory, the publisher of G.I. Jobs. This designation reflects the company’s commitment to hiring and supporting active-duty and veteran employees.

Responsibilities

The Cloud Operations Lead / SRE Manager (Cloud/SRE Mgr) provides enterprise-level operational management of cloud operations and Site Reliability Engineering (SRE) leadership for the Department of Homeland Security (DHS), U.S. Citizenship and Immigration Services (USCIS) information technology (IT) infrastructure.  USCIS has over 27,000 Government employees and contractors working at over 250 offices worldwide.  

The USCIS Enterprise Infrastructure Division (EID) of the Office of Information Technology (OIT) provides IT infrastructure engineering, design, testing, implementation, and operational support services for all USCIS enterprise components, including networks, server rooms, data storage, telecommunications, video conferencing services, and infrastructure security. The Cloud/SRE Mgr directly supports EID to coordinate, direct, manage, and oversee the design, development, integration, standards, operation, and maintenance of cloud operations and SRE for the enterprise IT infrastructure that supports USCIS operations.

The Cloud/SRE Mgr shall oversee the Cloud Operation Team (est. 5 technicians) responsible for executing cloud operations of the USCIS IT infrastructure. This position is responsible for delivering the reliability, availability, and operational excellence of USCIS cloud platforms. The role combines hands-on technical leadership with people management. The Cloud/SRE Mgr will also apply Site Reliability Engineering (SRE) principles to ensure highly available, secure, and compliant production systems. The ideal candidate brings a strong background in cloud infrastructure, automation, and DevOps, paired with proven experience leading operational teams, managing incidents, and driving reliability at scale in regulated environments.

Overall Responsibilities:

  • Own the reliability, availability, and performance of production cloud platforms and services.
  • Define, monitor, and improve Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets for critical systems.
  • Lead incident response, including coordination during outages, root cause analysis, and blameless postmortems.
  • Establish and manage on-call rotations, escalation paths, and operational readiness standards.
  • Drive continuous reduction of operational toil through automation and process improvement.

Cloud Architecture & Platform Engineering

  • Design and operate secure, scalable, and highly available AWS infrastructure, including multi-AZ and multi-region architectures.
  • Ensure platforms are resilient, fault-tolerant, and aligned with disaster recovery and business continuity requirements.
  • Partner with application teams to ensure production readiness and reliability by design.

Security & Compliance

  • Implement and enforce cloud security best practices, including IAM, encryption, logging, and audit controls.
  • Ensure compliance with government and regulatory frameworks such as FedRAMP.
  • Collaborate closely with security and compliance stakeholders to meet accreditation and audit requirements.

Automation, DevOps & Observability

  • Lead development of infrastructure-as-code (IaC) using Terraform and/or AWS CloudFormation.
  • Build and maintain CI/CD pipelines supporting reliable, repeatable deployments.
  • Design and operate monitoring, alerting, logging, and observability solutions to ensure actionable insights and reduce alert fatigue.

Team Leadership & Management

  • Lead, mentor, and develop a team of Cloud / SRE engineers.
  • Support hiring, onboarding, performance feedback, and career growth.
  • Set technical direction, operational priorities, and reliability goals for the team.
  • Foster a culture of ownership, learning, and continuous improvement.

Collaboration & Communication

  • Partner with development, security, compliance, and business stakeholders to align reliability goals with delivery timelines.
  • Communicate reliability risks, incident outcomes, and improvement plans to senior leadership.
  • Produce and maintain clear operational documentation, runbooks, and architectural standards.

Qualifications

Minimum Requirements

  • Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience.
  • 10+ years of experience in cloud engineering, SRE, or cloud operations roles, with deep hands-on expertise in AWS.
  • Proven experience operating production, mission-critical systems at scale.
  • Strong background in automation and IaC (Terraform and/or CloudFormation).
  • Experience with CI/CD pipelines (e.g., AWS CodePipeline, GitLab, Jenkins).
  • Deep knowledge of cloud security, IAM, encryption, monitoring, and compliance frameworks.
  • Experience designing high availability, disaster recovery, and fault tolerance.
  • Excellent communication skills with the ability to influence technical and non-technical stakeholders.

Additional Experience

  • Prior experience managing SRE or Cloud Operations teams.
  • Experience supporting regulated or government environments (FedRAMP, SOC 2, ISO 27001).
  • Familiarity with SRE practices such as error budgets, capacity planning, and toil reduction.
  • Multi-cloud awareness (Azure, GCP).
  • Strong scripting or programming skills (e.g., Python, Bash).

Work Environment:

  • May require occasional travel between work centers or to client sites.
  • Some after-hours or on-call support may be necessary.
  • On-site work in a secure government or contractor facility.
  • Must be able to work in a high-security environment and comply with all relevant security protocols.
  • Standing for long periods.
  • Ambulate throughout an office and between several buildings
  • Stoop, kneel, crouch, or crawl as required
  • Repeatedly lift and carry weight up to 40 pounds
  • Travel by land or air transportation up to 25%

About Empower AI

All hiring and promotion decisions at Empower AI are based on merit to bring the best talent available to contribute to our firm’s overall success. It is the policy of Empower AI not to discriminate against any applicant for employment, or employee because of age, color, sex, disability, national origin, race, religion, or veteran status. Empower AI is a VEVRAA Federal Contractor.
