Site Reliability Engineer (Cloud Foundations Team)
Job role insights
Date posted: May 14, 2026
Closing date: June 7, 2026
Offered salary: Negotiable
Career level: Junior, Middle, Senior
Description
Company: PostHog
Job Category: Site Reliability Engineering / Platform / AWS / Kubernetes
Contract Type: Full-Time, Permanent
Location: Remote, Americas (GMT -3 to GMT -8)
Salary: Location-adjusted via public calculator; open to exceeding ranges for top talent
Application Link: https://posthog.com/careers/site-reliability-engineer
Posted: Live as of May 14, 2026
Job Description: This is not a keep-the-lights-on SRE role. You'll turn a fast-growing, stateful system processing petabytes of data across thousands of cores into a predictable, well-automated platform. The work is about designing safe automation for traffic-heavy workloads, reducing operational stress, and building the tooling that lets the system scale without scaling human effort.
Key Responsibilities:
- Operate EKS clusters across multiple environments with Karpenter autoscaling, Cilium networking, and ArgoCD-driven GitOps deployments
- Manage and evolve a multi-account AWS organization: provisioning, networking, access control, and cross-account connectivity
- Maintain the Terraform/Terragrunt IaC platform, including modules and automated plan-on-PR/apply-on-merge pipelines
- Improve operational tooling around deploys, schema changes, backups, restores, and incident response
- Reduce operational load by identifying repeat pain points and eliminating them through code and self-healing automation
- Optimize cloud spend continuously
- Participate in on-call and incident response, with a strong focus on making incidents rarer over time
- Build AI agent-enabled infrastructure services using LLM tooling to automate alert management and observability
Requirements:
- Deep hands-on Kubernetes production experience (EKS preferred), including debugging node pressure, networking issues, and deployment failures at scale (thousands of nodes)
- Strong AWS infrastructure experience across multi-account organizations, IAM, and cross-account networking
- Experience automating infrastructure with Terraform or Terragrunt at scale, including module design and state management
- Solid Linux systems knowledge: disk, memory, networking, failure modes
- Experience supporting stateful systems (databases, queues, storage)
- Comfortable owning systems end-to-end, including on-call responsibilities
Nice to have: GitOps with ArgoCD; multi-region infrastructure experience; AI-enabled infra services.