Production / SRE / infrastructure engineer focused on reliable systems, Kubernetes/cloud operations, observability, and automation that is safe to run in production.
I like work where the result is reviewable: smaller diffs, clear failure modes, tests or gates that prove behavior, and evidence a future on-call engineer can trust.
Best fit: SRE, Production Engineering, Infrastructure, Platform, Cloud/DevOps, and backend infrastructure roles close to real operations.
- Verified upstream PRs merged into repositories maintained by Google and Google Cloud Platform.
- Recent upstream work across gVisor, syzkaller, KHI, go-containerregistry, google/benchmark, stellar-engine, and vertex-ai-creative-studio.
- Built a GKE AI inference reliability lab with OpenTelemetry traces, Kubernetes manifests, incident replay, and SLO-style evidence gates.
- Production context includes Meta monetization data infrastructure and SHEIN gateway infrastructure work.
- Hands-on experience with production gateways, Kubernetes/AKS-style platforms, Kafka, ZooKeeper, Elasticsearch, Terraform, runbooks, dashboards, and operational automation.
Projects where my upstream PRs have been merged: google/gvisor, google/syzkaller, GoogleCloudPlatform/khi, google/go-containerregistry, google/benchmark, google/stellar-engine, and GoogleCloudPlatform/vertex-ai-creative-studio.
| Area | Evidence |
|---|---|
| Container/runtime reliability | google/gvisor#13276 - set swap for precreated cgroups |
| Kernel fuzzing / report parsing | google/syzkaller#7420, google/syzkaller#7376 |
| Kubernetes troubleshooting | GoogleCloudPlatform/khi#708, GoogleCloudPlatform/khi#692 |
| Container image tooling | google/go-containerregistry#2318 |
| C++ build/test infrastructure | google/benchmark#2198, #2199, #2204 |
| Safer cloud defaults | google/stellar-engine#68, GoogleCloudPlatform/vertex-ai-creative-studio#1445 |
Live searches: org:google merged PRs / org:GoogleCloudPlatform merged PRs
A runnable infrastructure lab for AI inference reliability:
- OpenTelemetry trace collection and Kubernetes resource context
- incident replay for baseline traffic, cache-miss latency, dependency timeout, and rollout regression
- SLO-style reliability gate with published evidence reports
- GKE-shaped manifests for collector RBAC, PVC-backed queue storage, and sample workloads
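
For illustration, an SLO-style gate of this kind can be sketched in a few lines of Python. This is a minimal, hypothetical sketch and not the lab's actual code: the names `ScenarioResult`, `SloTarget`, and `evaluate_gate`, the threshold values, and the sample numbers are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioResult:
    """Aggregated measurements from one replayed scenario (hypothetical shape)."""
    name: str
    total_requests: int
    failed_requests: int
    p95_latency_ms: float


@dataclass(frozen=True)
class SloTarget:
    """Illustrative thresholds; the values are assumptions, not the lab's."""
    max_error_rate: float = 0.01
    max_p95_latency_ms: float = 800.0


def evaluate_gate(results, target=SloTarget()):
    """Return (passed, report_lines) for a list of ScenarioResult."""
    report = []
    # An empty result set is treated as missing evidence: fail closed.
    passed = len(results) > 0
    for r in results:
        if r.total_requests <= 0:
            # No traffic means nothing was proven for this scenario.
            passed = False
            report.append(f"FAIL {r.name}: no requests recorded")
            continue
        error_rate = r.failed_requests / r.total_requests
        ok = (error_rate <= target.max_error_rate
              and r.p95_latency_ms <= target.max_p95_latency_ms)
        passed = passed and ok
        status = "PASS" if ok else "FAIL"
        report.append(
            f"{status} {r.name}: error_rate={error_rate:.4f} "
            f"p95={r.p95_latency_ms:.0f}ms"
        )
    return passed, report


# Made-up sample data named after two of the replay scenarios above.
sample = [
    ScenarioResult("baseline-traffic", 1000, 2, 310.0),
    ScenarioResult("dependency-timeout", 1000, 45, 1900.0),
]
gate_passed, report_lines = evaluate_gate(sample)
print("\n".join(report_lines))
```

The gate fails closed: a scenario with no recorded requests, or an empty result set, counts as a failure rather than a pass, so missing evidence cannot be mistaken for a healthy system.
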
- Production changes that can be rolled out, observed, and rolled back.
- Automation with explicit inputs, validation, state, side effects, and retry boundaries.
- Reliability evidence: runbooks, dashboards, audit trails, tests, and incident reports.
- Practical open-source changes that reduce ambiguity for maintainers and users.
Python, Go, C++, Java, SQL, Bash, Linux, Kubernetes, AKS, GKE, OpenTelemetry, Terraform, Ansible, Nginx/APISIX, Kafka, ZooKeeper, Elasticsearch, CMake, pkg-config, GitHub Actions
- GitHub: Haihan-Jiang
- Engineering profile: haihan-jiang.github.io
- LinkedIn: haihan-jiang
- Email: haihanj99@gmail.com
Merged PR status was verified from GitHub on 2026-06-14. I keep merged work separate from review-in-progress work.



