Principal / Staff Site Reliability Engineer, Federal

  • Full Time
  • Anywhere
  • Dec 1, 2021

Website Okta Inc

The World’s #1 Identity Platform

We are looking for an experienced Site Reliability Engineer (SRE) to join Okta’s Technical Operations (TechOps) Team. At Okta, our motto is “Always On”, and nowhere do we embrace that more than in Technical Operations. We strive to build the most reliable and performant systems on the planet through the skillful use of automation. We’ve created an integrated system that securely connects any person via any device to the technologies they need to do their most significant work.

This role is ideal for someone who not only enjoys working on large-scale cloud production systems but also takes pride in seeing those systems handle anything the internet can throw at them.

If you like to be challenged and have a passion for solving problems at scale with automation, testing, and tuning, then we would love to hear from you. The ideal candidate is someone who exemplifies the ethic of “If you have to do something more than once, automate it” and who can rapidly self-educate on new concepts and tools. This engineer can be based in our Bellevue, San Francisco, or San Jose office, or can work remotely from anywhere in the United States!


Responsibilities

Design, build and monitor Okta’s global production infrastructure
Respond to production incidents and determine preventive solutions
Troubleshoot complex reliability and performance issues
Automate manual processes, evolve our monitoring tools, and develop technical documentation
Support a highly available, large-scale online environment as part of an on-call rotation once per quarter

Qualifications & Requirements

US Person Status (e.g. a U.S. Citizen, National, Lawful Permanent Resident, Refugee, or Asylee)*
Experience with Federal and DoD compliance requirements such as FedRAMP and DoD Impact Levels (IL)
Background using and supporting Splunk, Zabbix, Wavefront, Elasticsearch, Logstash, Kibana, or related tools
Passionate about automation
Experience with ChatOps tooling, Slack automation, and PagerDuty integration
Background with Linux systems administration and strong scripting skills in Bash, Ruby, Python, Go, etc.
Experience supporting Docker containers and web applications running on Java / Apache / Tomcat in a live production environment
Strong expertise with production services in AWS such as EC2, ECS, KMS, Kinesis, CloudWatch
Previous experience with automating systems and infrastructure via Ansible, Chef or Terraform
Solid understanding of networking concepts and IP protocols
Experience with multi-cloud infrastructure is desired

Education and Training:

B.S. Computer Science (plus) or relevant experience

*This position requires the ability to access Impact Level 4 (IL4) data, as defined by the Department of Defense (DoD) Cloud Computing Security Requirements Guide. As a condition of employment for this position, the successful candidate must be able to submit documentation establishing U.S. Person status (e.g. a U.S. Citizen, National, Lawful Permanent Resident, Refugee, or Asylee; see 22 CFR 120.15) upon hire.
