Site Reliability Engineer, Big Data Platform
Consumer Electronics | Shanghai, China
Our client is the world's No. 1 electric car and powertrain designer, developer, manufacturer, and distributor, headed by a legendary entrepreneur. They are now hiring top talent globally to work in the R&D center located at their Gigafactory in Shanghai, China. It is a top company with a top work environment and top compensation and benefits.
Tesla's mission is to accelerate the world's transition to sustainable energy. We are looking for a Site Reliability Engineer to join our data platform team. We are a small, expert, cross-functional team made up of data engineers and SREs. We are building a fairly large platform that other teams, ranging from Autopilot to service technicians, use to collect and access data. We already process trillions of events per day, but with more and more cars on the road, more Superchargers, Powerwalls, and solar installations, new factories in China and Europe, and, of course, new products coming to market, we need to scale our hybrid infrastructure to the next level.
What we are building
- We are building a hybrid platform with on-prem datacenters and public clouds in multiple regions.
- Our stack is fairly modern. Our services run in containers and we are taking advantage of everything we can in the cloud.
- The infrastructure is completely described and managed using infrastructure-as-code (IaC) and configuration management tools.
- We build and deploy over a hundred microservices using Bazel.
- At the data layer, the stack is built on technologies including Spark, Kafka, and related tools.
- We develop tools in Python and Go to support predominantly Java/Scala-based applications.
- For monitoring, we use Prometheus, Grafana, Splunk, and CloudWatch.
You’ll work on high impact projects that improve data availability, scalability, and reliability of our data infrastructure.
- These days, we are expanding our cloud infrastructure footprint.
- We are also continuing our work on our build and deployment pipelines.
- Occasionally, you will assist the datacenter operations team with our on-prem presence as we plan for expansion and handle day-to-day operations.
- We also need to think about the next steps for our metrics system, which is currently under heavy load.
- Of course, you will also design, architect, improve and support new and existing tools to help us operate at scale.
- And, finally, you will join us in our on-call rotation.
- You have a strong understanding of Linux, networking and production systems
- You have 3+ years of experience building and maintaining infrastructure and services or are a quick learner
- You are proficient at scripting and programming
- You have strong problem-solving skills, optimizing for the simplest, most robust yet practical solutions
- You are reliable, dependable, trustworthy, and a participating team member
- You are smart but humble, with a bias for action
- You are proficient in both Mandarin and English so you can communicate freely with teammates in the US and China
- You have strong proficiency with Go, Python, and/or Java
- You have experience with AWS or other cloud providers
- You have experience with large-scale infrastructure
- You have used infrastructure-as-code tools such as Terraform, CloudFormation, Ansible, Puppet or Chef
- You have experience with observability tools such as Prometheus, Thanos, Cortex or Sensu
- You have experience with orchestration systems such as Kubernetes, Mesos or ECS
- You have worked in a service-oriented or microservice architecture
- You have experience with security and hardening, especially in a large or complex environment.
- You have experience designing, building and operating distributed systems at scale and/or are willing to learn the stack: Kafka, Hadoop, HBase, Spark, etc.