Sr Cloud Engineer

  • San José
  • Zuora

Over the past 15 years, we have seen a shift in the focus of business models across every industry - from selling physical products via one-time transactions to monetizing services via ongoing customer (aka subscriber) relationships. This is the "Subscription Economy," a phrase coined by our CEO, Tien Tzuo, who even wrote the book on it: Subscribed.

Companies have realized that the path to growth going forward is to establish direct, digital relationships with their customers, and to monetize these relationships through an ever-growing set of digital services.

Our vision is simple: we call it "The World Subscribed." It's the idea that one day every company will join the Subscription Economy - a $1.5 trillion opportunity by 2025, according to UBS.

Our mission: to power the world's best companies to win in the Subscription Economy.

The Team & Role

Zuora's TechOps teams are responsible for Data Center and Cloud infrastructure, monitoring performance and uptime, managing internal and external shared services, infrastructure services and more - for Zuora's customer-facing SaaS products and platforms. Our technologists sit across the US, Beijing, India and remote locations, using a follow-the-sun model to provide 24x7x365 coverage for critical functions. We partner closely with our Engineering, Customer Support, Security, Global Services and Sales teams on a daily basis to keep our customers front and center.

In this role you'll get to

  • Ensure Service Availability, Scalability, Security & Capacity
  • Run our global infrastructure using Ansible, Terraform, CI/CD & Kubernetes in a multi-cloud platform
  • Automate, and continue to push for new levels of efficiency
  • Drive high reliability through proactive, preventative enablement
  • Architect and enable preventative, proactive solutions & infrastructure services
  • Be on an on-call (PagerDuty) rotation to respond to incidents that impact the availability of Zuora's products and services, providing leadership and driving restoration outcomes for service engineers handling customer incidents.
  • Drive and coordinate bridges for critical-impact issues, collaborating through to root cause & restoration.
  • Use your on-call shift to prevent incidents from ever happening.
  • Run our infrastructure with Puppet, Ansible, Terraform, Git CI/CD, Jenkins, ECS, and Kubernetes.
  • Incorporate feedback from incidents back into monitoring that alerts on symptoms rather than on outages.
  • Work with engineering teams on maintaining and improving runbooks, including documenting cases where runbooks are missing and needed.
  • Support and maintain core infrastructure that enables Zuora to scale to support all of our customers' needs.
  • Help debug production issues across services and levels of the stack.

Job Involves

  • Take every task that requires a person to execute it, strip it down & automate it
  • Tackle capacity planning head-on, shaping the multi-cloud world
  • Resolve complex and critical issues, participating in major incidents as an SME
  • Configure monitoring and alerting to ensure integrity, reliability & performance at a level skeptics thought couldn't be achieved
  • Act as a service expert, ensuring that expertise is reflected in SOPs and that documentation is shared
  • End-to-end tuning, optimizing resource utilization as load patterns fluctuate
  • Instrumentation and metrics that clearly describe the service behaviors
  • Consult on new capabilities ensuring a scalable infrastructure
  • Resiliency and recoverability, ensuring that backup / restore and disaster recovery capabilities are implemented, tested and maintained

Who we're looking for

  • 3+ years of overall experience
  • You bring your excellent communication, problem solving, critical thinking & passion to the table each day to disrupt, make an impact & rewrite the rulebook
  • You have expertise in one or more Cloud providers: AWS, Azure, GCP
  • Infrastructure as Code tools: Terraform, Ansible, Kubernetes, Helm
  • Programming experience for IaC: Python, Golang
  • Understanding of and experience with observability stacks: OpenTelemetry, OpenTracing, Grafana, Prometheus
  • IT Security and compliance: Access control policy, security groups, ACLs
  • AWS Services like EC2, EKS, RDS, EBS
  • Azure Services: Cosmos DB, Azure VM, AKS, Blob
  • CI/CD pipelines using tools such as Git, Jenkins, Spinnaker, Argo
  • Experience with the following operating systems: CentOS, RHEL, Ubuntu
  • Experience supporting storage and networking subsystems (AWS and Azure)
  • Experience in infrastructure services (DNS, Mail Relays, NTP, CDN, SSL Certificates)
  • Experience running and leading command center bridges
  • Experience driving incident issues to isolation and aligning them with the corresponding service

Benefits

  • Competitive compensation, company equity, and retirement programs
  • Medical, dental and vision insurance
  • Paid holidays and "wellness" days