Skip to content

feat(tools): Slack monitor on SR votes #105

Open
warku123 wants to merge 15 commits into
tronprotocol:developfrom
warku123:Slack_SR_post
Open

feat(tools): Slack monitor on SR votes #105
warku123 wants to merge 15 commits into
tronprotocol:developfrom
warku123:Slack_SR_post

Conversation

@warku123
Copy link
Copy Markdown

What does this PR do?
Add an SR votes monitor and push the results to Slack channel

Why are these changes required?
To get SR votes information ontime

This PR has been tested by:

  • Manual Testing

Follow up

Extra details


body, err := io.ReadAll(resp.Body)
if err != nil {
return ""
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to handle error messages by writing them to logs.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already add logs file in logs/sr_monitor.log


go 1.25.5

require github.com/joho/godotenv v1.5.1
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check the security of this package

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This package is the de-facto standard for environment variable management in the Go community (10k+ Stars), repo: https://github.com/joho/godotenv. Its logic is minimal and transparent (reading local files only).

I've scanned the project using the official Go vulnerability tool govulncheck, and no vulnerabilities were detected in this package.

@warku123 warku123 changed the title feat(toolkit): monitor on SR votes feat(tools): monitor on SR votes Mar 27, 2026
@warku123 warku123 changed the title feat(tools): monitor on SR votes feat(tools): Slack monitor on SR votes Apr 1, 2026
bladehan1

This comment was marked as off-topic.

FROM golang:1.25-alpine AS builder

WORKDIR /app

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MUST] Builder base image missing digest pinning

Problem:

FROM golang:1.25-alpine AS builder lacks @sha256 digest.

Why this is P0 (Critical):

  1. Non-reproducible builds — Docker Hub may update golang:1.25-alpine tag without notice. Building today vs tomorrow produces different binaries.
  2. Supply chain traceability lost — Cannot audit which exact version was used in each build. Critical for security compliance.
  3. Security vulnerability — New golang versions may have breaking changes or removed dependencies. Without pinned digest, application may fail to start in production.
  4. No rollback capability — If an image breaks production, cannot roll back to previous version because the tag floats.
  5. Historical precedent — tron-docker has SA-002 (base image not pinned), this is a known vulnerability in the project.

Fix:

docker pull golang:1.25-alpine
docker inspect golang:1.25-alpine --format='{{index .RepoDigests 0}}'
# Replace: FROM golang:1.25-alpine@sha256:abc123...

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f330c1a. Pinned to golang:1.25-alpine@sha256:8d22e29d960bc50cd025d93d5b7c7d220b1ee9aa7a239b3c8f55a57e987e8d45. Thanks for the catch.

FROM alpine:latest

RUN apk --no-cache add ca-certificates tzdata

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MUST] Runtime base image using :latest tag (no digest)

Problem:

FROM alpine:latest is a floating tag with no @sha256 digest.

Why this is P0 (Critical):

  1. Worst-case version management — :latest is the most unstable tag. Rebuild tomorrow and get completely different image.
  2. Zero image consistency guarantee — Same Dockerfile produces different image on each rebuild.
  3. Supply chain attack surface — Anyone (including attackers) could upload new alpine:latest to Docker Hub. Containers auto-pull the poisoned image.
  4. Production safety compromised — Cannot rollback to previous version if new alpine breaks the app. Cannot diagnose which version caused the problem.
  5. Deployment uncertainty — Image contents are non-deterministic. Violates infrastructure-as-code principle.

Fix:

docker pull alpine:3.20
docker inspect alpine:3.20 --format='{{index .RepoDigests 0}}'
# Replace: FROM alpine:3.20@sha256:xyz789...

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f330c1a. Switched to alpine:3.20@sha256:d9e853e87e55526f6b2917df91a2115c36dd7c696a35be12163d44e6e2a4b6bc. Agree that :latest provides no reproducibility guarantee.


// Build rank map for previous SRs
prevRankMap := make(map[string]int)
for i, addr := range prevSRs {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MUST] JSON marshal error discarded (blank import)

Problem:

jsonPayload, _ := json.Marshal(payload)  // Error ignored!

Why this is P0 (Critical):

  1. Silent failure — worst error handling pattern — Not a panic, not an error return. Silently continues execution.
  2. Sends nil payload — If payload contains non-serializable field (circular reference, unexported field, etc.), marshal fails and jsonPayload becomes nil.
  3. Slack request fails invisibly — nil payload creates invalid HTTP request. Slack returns 400/422, but root cause (marshal failure) is invisible in logs.
  4. Nearly impossible to diagnose — Developer sees "failed to send to Slack: status 400" but cannot see the real cause (json.Marshal failed).
  5. Future breaking change risk — As code evolves and payload structure becomes complex, this bug suddenly explodes in production.

Standard Go error handling:

// ❌ Current (P0)
jsonPayload, _ := json.Marshal(payload)  // Error disappears

// ✅ Correct
jsonPayload, err := json.Marshal(payload)
if err != nil {
    log.Printf("ERROR: failed to marshal Slack payload: %v", err)
    return fmt.Errorf("marshal failed: %w", err)
}

Fix: Handle the error explicitly or return (string, error) from the function.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f330c1a. The marshal error is now logged and the function returns an empty name, consistent with the existing error paths further down in getAccountName. Agreed this was a silent-failure anti-pattern.

container_name: slack-sr-monitor
restart: always
environment:
- SLACK_WEBHOOK=${SLACK_WEBHOOK}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SHOULD] Restart policy 'always' risks infinite restart loop

Problem:

restart: always

Container will be restarted infinitely, even on persistent failures.

Why this is P1 (Should Fix):

Scenario: Invalid Slack Webhook

10:00:00 - Container starts → checks SLACK_WEBHOOK env var → invalid → os.Exit(1)
10:00:01 - Docker: container crashed, restart: always → restart container
10:00:02 - Container starts → checks SLACK_WEBHOOK → invalid → os.Exit(1)
10:00:03 - Docker: container crashed → restart
...
10:05:00 - 300+ restart attempts, logs flooded, CPU spinning

Consequences:

  1. Problem masking — Error is hidden in restart loop. Developer sees "container restarting" but doesn't see the root cause (invalid webhook URL).
  2. Log spam — Logs filled with restart attempts, making debugging harder. Real errors buried under restart messages.
  3. Resource waste — Container keeps starting/stopping, wasting CPU and I/O.
  4. Delayed incident response — Takes longer to discover and diagnose the actual problem.
  5. Hard to stop gracefully — Docker stop doesn't prevent auto-restart; need docker update --restart=no to halt it.

Comparison of restart policies:

Policy Behavior Use Case
always Restart forever, even after intentional stop ❌ Masks failures
unless-stopped Restart unless manually stopped (recommended) ✅ Failures visible, can diagnose
on-failure:5 Restart max 5 times ✅ Expose persistent failures after N retries

Fix:

services:
  slack-sr-monitor:
    # Option 1 (Recommended): Stop on failure, allow manual restart
    restart: unless-stopped
    
    # Option 2: Auto-restart but stop after 5 failed attempts
    # restart: on-failure:5

Benefit of 'unless-stopped': When the service crashes, it stays down until manual intervention. This forces developers to investigate instead of ignoring the logs.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, switched to restart: unless-stopped in f330c1a. Your reasoning about failure visibility makes sense — a misconfigured webhook should surface as a stopped container rather than be hidden in a restart loop.

warku123 added 3 commits May 14, 2026 16:33
- Handle json.Marshal error in getAccountName instead of discarding it
- Pin base image digests (golang:1.25-alpine, alpine:3.20) for reproducible builds
- Run container as non-root user 'monitor' under /home/monitor
- Tighten log file permissions from 0666 to 0600
- Mask Slack webhook token in startup log
- Switch docker-compose restart policy to unless-stopped
The loop polls getnextmaintenancetime in a 2-minute cycle. When the node is
lagging behind the chain head (e.g. after a connection outage), the API
returns a past maintenance time over and over, causing the monitor to send
a duplicate Slack message every 2 minutes until the node recovers.

Before triggering a push, check the timestamp of the latest block on the
node and skip the cycle if it is more than 30 seconds old. Once the node
catches up, getnextmaintenancetime returns the next future maintenance
time and the normal wait path resumes.
@warku123 warku123 requested a review from bladehan1 May 18, 2026 03:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants