Site Reliability Engineer (SRE), Automation & Incident Response
Ready to redefine offensive security? At XBOW, we're not just reacting to threats – we're building the future, leveraging cutting-edge AI to stay ahead of the curve. While attackers wield AI to accelerate their exploits, our platform empowers defenders with an AI-powered system that autonomously discovers, validates, and even exploits vulnerabilities. The result? Proof-backed insights delivered in hours, not weeks, giving organizations an unparalleled advantage.
Founded by Oege de Moor, the visionary behind GitHub Copilot, and backed by industry giants like Sequoia and Altimeter, XBOW is tackling one of the world's most critical challenges. In just over a year, our world-class AI team and legendary security researchers have unleashed an AI that has uncovered thousands of real-world zero-days in software relied upon by billions, catapulting us to the #1 ranking on HackerOne’s global leaderboard.
We are a collective of builders, hackers, and researchers united by a passion for solving the impossible. If you're driven to push the boundaries of AI, fundamentally reshape cybersecurity, and join the elite group defining this new era of defense, we want to hear from you.
Your Mission: Architecting Resilience for AI-Powered Security
As a Site Reliability Engineer (SRE) focused on Automation and Incident Response, you will be the guardian of XBOW’s production systems. Your daily work will ensure our groundbreaking platform remains stable, observable, and resilient as we scale our impact globally. You’ll play a pivotal role in constructing and maintaining the automated tooling that underpins our reliability—spanning monitoring, alerting, and self-healing systems. Beyond the code, you'll define and track critical Service Level Objectives (SLOs) across both production and development environments, ensuring our systems consistently meet ambitious performance and availability targets.
This role demands deep collaboration with our infrastructure and feature teams. You’ll manage cloud systems through robust Infrastructure-as-Code (IaC) practices and review architectural changes for their reliability and capacity implications. You’ll also respond to incidents during your local working hours as part of our "follow-the-sun" model, which gives us continuous coverage without the burden of off-hours paging.
When issues arise, you’ll be at the forefront, leading or contributing to thorough root-cause investigations. Your analytical prowess will turn incident trends into actionable insights, driving proactive improvements that fortify our defenses and drastically reduce future risk. Furthermore, you’ll be instrumental in maintaining both internal and customer-facing status dashboards, providing transparent and clear communication on system health and uptime.
Key Responsibilities:
Pioneer Automation: Design, implement, and maintain advanced site reliability infrastructure, monitoring, and self-healing systems that empower our platform to operate with minimal human intervention.
Define & Own SLOs: Establish and rigorously manage Service Level Objectives (SLOs) for all production and development deployments, fostering a culture of performance and reliability.
IaC Mastery: Drive Infrastructure-as-Code (IaC) initiatives for our production and development environments, collaborating closely with the infrastructure engineering team to build scalable and reproducible systems.
Strategic Incident Response:
Respond to in-hours alerts promptly and effectively, contributing to our seamless "follow-the-sun" operational model.
Lead and collaborate on comprehensive Root Cause Analysis (RCA) investigations with feature teams, transforming outages into learning opportunities.
Proactively build resilience into our systems to prevent future outages and enhance platform robustness.
Incident Intelligence: Conduct organization-wide analysis of incident causes, frequency, and severity, translating data into strategic insights that prioritize future reliability improvements.
Architectural Stewardship: Participate in design reviews for architectural changes, providing critical input on scalability, reliability, and capacity planning.
Transparent Communication: Develop and maintain public and internal status and uptime dashboards, ensuring clear communication of system health to all stakeholders.
Skills & Qualifications:
Essential:
TypeScript Expertise: Strong hands-on experience developing robust systems with TypeScript.
AWS Proficiency: Demonstrable experience designing, deploying, and managing infrastructure within Amazon Web Services (AWS).
Linux & DevOps Acumen: Solid expertise in Linux environments coupled with practical experience across core infrastructure & DevOps tooling such as Kubernetes, Docker, Terraform, and CI/CD pipelines (especially GitHub Actions).
Reliability Foundation: A background in infrastructure automation and/or incident response that shows your dedication to system stability. We are open to candidates with varying depths of experience.
Observability & Monitoring Tools: Familiarity with leading monitoring and observability tools including OpenTelemetry, Prometheus, VictoriaMetrics, Grafana, and Datadog.
Advantageous:
Polyglot Programmer: Experience with Python and/or Go.
Multi-Cloud Experience: Exposure to additional cloud providers beyond AWS.
What We Offer:
Impactful Compensation & Equity: A highly competitive salary combined with a generous equity package, making you a true owner and partner in our collective success.
Uncharted Career Growth: Seize the unique opportunity to shape your role, lead critical functions, and grow exponentially with a company that is fundamentally redefining the cybersecurity landscape.
Meaningful & Challenging Work: Tackle technically complex challenges at the forefront of AI and cybersecurity. Play a pivotal role in scaling our business alongside an amazing team and some of the world’s foremost experts.
What Else You Should Know:
Location: This is a fully remote position. While our team is distributed globally, we foster strong connections through regular virtual collaboration and support travel for in-person team gatherings.
Contract: Full-time.
At XBOW, we value impact and capability over rigid seniority titles. Don't let traditional "leveling" concerns hold you back—we care deeply about mission fit, your problem-solving prowess, and the tangible impact you can make.
We thrive on curiosity and a relentless willingness to learn. Even if you don't check every single box, if you're excited by this role and our audacious mission, we strongly encourage you to apply!