feat(tools): Slack monitor on SR votes #105
Conversation
|
|
||
| body, err := io.ReadAll(resp.Body) | ||
| if err != nil { | ||
| return "" |
There was a problem hiding this comment.
It's better to handle error messages by writing them to logs.
There was a problem hiding this comment.
Already add logs file in logs/sr_monitor.log
|
|
||
| go 1.25.5 | ||
|
|
||
| require github.com/joho/godotenv v1.5.1 |
There was a problem hiding this comment.
Please check the security of this package
There was a problem hiding this comment.
This package is the de-facto standard for environment variable management in the Go community (10k+ Stars), repo: https://github.com/joho/godotenv. Its logic is minimal and transparent (reading local files only).
I've scanned the project using the official Go vulnerability tool govulncheck, and no vulnerabilities were detected in this package.
| FROM golang:1.25-alpine AS builder | ||
|
|
||
| WORKDIR /app | ||
|
|
There was a problem hiding this comment.
[MUST] Builder base image missing digest pinning
Problem:
FROM golang:1.25-alpine AS builder lacks @sha256 digest.
Why this is P0 (Critical):
- Non-reproducible builds — Docker Hub may update golang:1.25-alpine tag without notice. Building today vs tomorrow produces different binaries.
- Supply chain traceability lost — Cannot audit which exact version was used in each build. Critical for security compliance.
- Security vulnerability — New golang versions may have breaking changes or removed dependencies. Without pinned digest, application may fail to start in production.
- No rollback capability — If an image breaks production, cannot roll back to previous version because the tag floats.
- Historical precedent — tron-docker has SA-002 (base image not pinned), this is a known vulnerability in the project.
Fix:
docker pull golang:1.25-alpine
docker inspect golang:1.25-alpine --format='{{index .RepoDigests 0}}'
# Replace: FROM golang:1.25-alpine@sha256:abc123...There was a problem hiding this comment.
Fixed in f330c1a. Pinned to golang:1.25-alpine@sha256:8d22e29d960bc50cd025d93d5b7c7d220b1ee9aa7a239b3c8f55a57e987e8d45. Thanks for the catch.
| FROM alpine:latest | ||
|
|
||
| RUN apk --no-cache add ca-certificates tzdata | ||
|
|
There was a problem hiding this comment.
[MUST] Runtime base image using :latest tag (no digest)
Problem:
FROM alpine:latest is a floating tag with no @sha256 digest.
Why this is P0 (Critical):
- Worst-case version management — :latest is the most unstable tag. Rebuild tomorrow and get completely different image.
- Zero image consistency guarantee — Same Dockerfile produces different image on each rebuild.
- Supply chain attack surface — Anyone (including attackers) could upload new alpine:latest to Docker Hub. Containers auto-pull the poisoned image.
- Production safety compromised — Cannot rollback to previous version if new alpine breaks the app. Cannot diagnose which version caused the problem.
- Deployment uncertainty — Image contents are non-deterministic. Violates infrastructure-as-code principle.
Fix:
docker pull alpine:3.20
docker inspect alpine:3.20 --format='{{index .RepoDigests 0}}'
# Replace: FROM alpine:3.20@sha256:xyz789...There was a problem hiding this comment.
Fixed in f330c1a. Switched to alpine:3.20@sha256:d9e853e87e55526f6b2917df91a2115c36dd7c696a35be12163d44e6e2a4b6bc. Agree that :latest provides no reproducibility guarantee.
|
|
||
| // Build rank map for previous SRs | ||
| prevRankMap := make(map[string]int) | ||
| for i, addr := range prevSRs { |
There was a problem hiding this comment.
[MUST] JSON marshal error discarded (blank import)
Problem:
jsonPayload, _ := json.Marshal(payload) // Error ignored!Why this is P0 (Critical):
- Silent failure — worst error handling pattern — Not a panic, not an error return. Silently continues execution.
- Sends nil payload — If payload contains non-serializable field (circular reference, unexported field, etc.), marshal fails and jsonPayload becomes nil.
- Slack request fails invisibly — nil payload creates invalid HTTP request. Slack returns 400/422, but root cause (marshal failure) is invisible in logs.
- Nearly impossible to diagnose — Developer sees "failed to send to Slack: status 400" but cannot see the real cause (json.Marshal failed).
- Future breaking change risk — As code evolves and payload structure becomes complex, this bug suddenly explodes in production.
Standard Go error handling:
// ❌ Current (P0)
jsonPayload, _ := json.Marshal(payload) // Error disappears
// ✅ Correct
jsonPayload, err := json.Marshal(payload)
if err != nil {
log.Printf("ERROR: failed to marshal Slack payload: %v", err)
return fmt.Errorf("marshal failed: %w", err)
}Fix: Handle the error explicitly or return (string, error) from the function.
There was a problem hiding this comment.
Fixed in f330c1a. The marshal error is now logged and the function returns an empty name, consistent with the existing error paths further down in getAccountName. Agreed this was a silent-failure anti-pattern.
| container_name: slack-sr-monitor | ||
| restart: always | ||
| environment: | ||
| - SLACK_WEBHOOK=${SLACK_WEBHOOK} |
There was a problem hiding this comment.
[SHOULD] Restart policy 'always' risks infinite restart loop
Problem:
restart: alwaysContainer will be restarted infinitely, even on persistent failures.
Why this is P1 (Should Fix):
Scenario: Invalid Slack Webhook
10:00:00 - Container starts → checks SLACK_WEBHOOK env var → invalid → os.Exit(1)
10:00:01 - Docker: container crashed, restart: always → restart container
10:00:02 - Container starts → checks SLACK_WEBHOOK → invalid → os.Exit(1)
10:00:03 - Docker: container crashed → restart
...
10:05:00 - 300+ restart attempts, logs flooded, CPU spinning
Consequences:
- Problem masking — Error is hidden in restart loop. Developer sees "container restarting" but doesn't see the root cause (invalid webhook URL).
- Log spam — Logs filled with restart attempts, making debugging harder. Real errors buried under restart messages.
- Resource waste — Container keeps starting/stopping, wasting CPU and I/O.
- Delayed incident response — Takes longer to discover and diagnose the actual problem.
- Hard to stop gracefully — Docker stop doesn't prevent auto-restart; need
docker update --restart=noto halt it.
Comparison of restart policies:
| Policy | Behavior | Use Case |
|---|---|---|
always |
Restart forever, even after intentional stop | ❌ Masks failures |
unless-stopped |
Restart unless manually stopped (recommended) | ✅ Failures visible, can diagnose |
on-failure:5 |
Restart max 5 times | ✅ Expose persistent failures after N retries |
Fix:
services:
slack-sr-monitor:
# Option 1 (Recommended): Stop on failure, allow manual restart
restart: unless-stopped
# Option 2: Auto-restart but stop after 5 failed attempts
# restart: on-failure:5Benefit of 'unless-stopped': When the service crashes, it stays down until manual intervention. This forces developers to investigate instead of ignoring the logs.
There was a problem hiding this comment.
Agreed, switched to restart: unless-stopped in f330c1a. Your reasoning about failure visibility makes sense — a misconfigured webhook should surface as a stopped container rather than be hidden in a restart loop.
- Handle json.Marshal error in getAccountName instead of discarding it - Pin base image digests (golang:1.25-alpine, alpine:3.20) for reproducible builds - Run container as non-root user 'monitor' under /home/monitor - Tighten log file permissions from 0666 to 0600 - Mask Slack webhook token in startup log - Switch docker-compose restart policy to unless-stopped
The loop polls getnextmaintenancetime in a 2-minute cycle. When the node is lagging behind the chain head (e.g. after a connection outage), the API returns a past maintenance time over and over, causing the monitor to send a duplicate Slack message every 2 minutes until the node recovers. Before triggering a push, check the timestamp of the latest block on the node and skip the cycle if it is more than 30 seconds old. Once the node catches up, getnextmaintenancetime returns the next future maintenance time and the normal wait path resumes.
What does this PR do?
Add an SR votes monitor and push the results to Slack channel
Why are these changes required?
To get SR votes information ontime
This PR has been tested by:
Follow up
Extra details