fix: allow GitHub scraper to work without a token#2264

Draft
adityathebe wants to merge 1 commit into main from feat/github-tokenless-scrape

@adityathebe adityathebe commented Jun 25, 2026

Member

Public GitHub repository scraping can run without a token, but unauthenticated clients only get 60 core requests/hour.

The scraper used a fixed 100-request pause threshold, so tokenless clients were treated as rate-limited immediately.

Compute the pause threshold from the actual core limit, clamped between 1 and 100, and handle missing rate-limit data defensively.
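
The threshold computation described above might look roughly like the following. This is a minimal sketch, not the code in this PR: the name `pauseThreshold` and the one-tenth scaling factor are assumptions made for illustration. Only the clamp to 1..100 and the 60-request unauthenticated limit come from the description.

```go
package main

import "fmt"

const (
	minPauseThreshold = 1   // lower clamp, from the PR description
	maxPauseThreshold = 100 // upper clamp; equals the old fixed threshold
	scaleDivisor      = 10  // ASSUMPTION: pause once about 10% of the limit remains
)

// pauseThreshold is a hypothetical helper. It scales the pause threshold
// with the reported core limit and clamps the result to [1, 100].
func pauseThreshold(limit int) int {
	t := limit / scaleDivisor
	if t < minPauseThreshold {
		return minPauseThreshold
	}
	if t > maxPauseThreshold {
		return maxPauseThreshold
	}
	return t
}

func main() {
	// Unauthenticated: the limit is 60, so `remaining` can never exceed a
	// fixed threshold of 100 and the old check paused immediately.
	fmt.Println(pauseThreshold(60)) // 6
	// Authenticated example: 5000 scales to 500 and is clamped to 100.
	fmt.Println(pauseThreshold(5000)) // 100
}
```

With a scaled threshold, a tokenless client at 60 remaining requests is no longer treated as rate-limited, while a client with a large limit keeps the previous 100-request behaviour.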

Summary by CodeRabbit

  • Bug Fixes

    • Improved GitHub rate-limit handling so the app pauses more intelligently when API quota is running low.
    • Updated status logging to show the current limit, remaining requests, and the pause threshold for clearer visibility during syncs.
  • Tests

    • Added coverage for several rate-limit scenarios, including unauthenticated, authenticated, and edge-case values.

Public repository scraping can run without a token, but unauthenticated GitHub clients only have a 60-request core limit.

A fixed 100-request pause threshold caused tokenless scrapes to pause immediately even when requests remained.

Compute the pause threshold from the reported core limit with min and max bounds, and skip pausing when rate-limit data is missing.

github-actions Bot commented Jun 25, 2026


Benchstat

Base: d602f199b424599a20e9ea7797e913c3e68d5ad6
Head: 34a403794caf66ffc710defef806aabe6aaa97ce

📊 1 minor regression(s) (all within 5% threshold)

Benchmark                    Metric  Base     Head     Change  p-value
RunTemplateBool/smallEnv-4   B/op    9.000Ki  9.001Ki  +0.01%  0.015

✅ 2 improvement(s)

Benchmark                    Metric  Base     Head     Change  p-value
LocationFilter/largeEnv-4    sec/op  47.96µ   46.78µ   -2.46%  0.002
RunTemplateBool/smallEnv-4   sec/op  11.20µ   11.05µ   -1.36%  0.015

Full benchstat output
goos: linux
goarch: amd64
pkg: github.com/flanksource/config-db/bench
cpu: AMD EPYC 7763 64-Core Processor                
                                         │ bench-base.txt │           bench-head.txt           │
                                         │     sec/op     │    sec/op     vs base              │
LocationFilter/smallEnv-4                    19.20µ ± 21%   20.83µ ± 13%       ~ (p=0.937 n=6)
LocationFilter/largeEnv-4                    47.96µ ± 16%   46.78µ ±  1%  -2.46% (p=0.002 n=6)
RunTemplateBool/smallEnv-4                   11.20µ ±  2%   11.05µ ±  2%  -1.36% (p=0.015 n=6)
RunTemplateBool/largeEnv-4                   20.54µ ±  1%   20.56µ ±  2%       ~ (p=0.461 n=6)
BenchSaveResultsSeed/N=1000-4                 4.100 ± 11%    4.139 ±  9%       ~ (p=0.937 n=6)
BenchSaveResultsUpdateUnchanged/N=1000-4      3.415 ±  3%    3.429 ±  2%       ~ (p=0.589 n=6)
BenchSaveResultsUpdateChanged/N=1000-4        7.537 ±  4%    7.486 ±  3%       ~ (p=0.818 n=6)
geomean                                      4.182m         4.212m        +0.72%

                                         │ bench-base.txt │            bench-head.txt             │
                                         │      B/op      │     B/op       vs base                │
LocationFilter/smallEnv-4                    15.61Ki ± 0%    15.61Ki ± 0%       ~ (p=1.000 n=6) ¹
LocationFilter/largeEnv-4                    21.03Ki ± 0%    21.03Ki ± 0%       ~ (p=1.000 n=6) ¹
RunTemplateBool/smallEnv-4                   9.000Ki ± 0%    9.001Ki ± 0%  +0.01% (p=0.015 n=6)
RunTemplateBool/largeEnv-4                   10.80Ki ± 0%    10.80Ki ± 0%       ~ (p=1.000 n=6) ¹
BenchSaveResultsSeed/N=1000-4                1.277Gi ± 0%    1.277Gi ± 0%       ~ (p=1.000 n=6)
BenchSaveResultsUpdateUnchanged/N=1000-4     32.32Mi ± 0%    32.32Mi ± 0%       ~ (p=0.485 n=6)
BenchSaveResultsUpdateChanged/N=1000-4       796.6Mi ± 1%    797.5Mi ± 1%       ~ (p=0.065 n=6)
geomean                                     1020.6Ki        1020.8Ki       +0.02%
¹ all samples are equal

                                         │ bench-base.txt │            bench-head.txt            │
                                         │   allocs/op    │  allocs/op    vs base                │
LocationFilter/smallEnv-4                     284.0 ±  0%    284.0 ±  0%       ~ (p=1.000 n=6) ¹
LocationFilter/largeEnv-4                     528.0 ±  0%    528.0 ±  0%       ~ (p=1.000 n=6) ¹
RunTemplateBool/smallEnv-4                    222.0 ±  0%    222.0 ±  0%       ~ (p=1.000 n=6) ¹
RunTemplateBool/largeEnv-4                    303.0 ±  0%    303.0 ±  0%       ~ (p=1.000 n=6) ¹
BenchSaveResultsSeed/N=1000-4                446.7k ±  0%   446.7k ±  0%       ~ (p=0.974 n=6)
BenchSaveResultsUpdateUnchanged/N=1000-4     404.9k ±  0%   404.9k ±  0%       ~ (p=0.195 n=6)
BenchSaveResultsUpdateChanged/N=1000-4       1.003M ± 13%   1.003M ± 13%       ~ (p=0.061 n=6)
geomean                                      7.846k         7.846k        +0.00%
¹ all samples are equal

                                         │ bench-base.txt │           bench-head.txt           │
                                         │      MB/s      │    MB/s     vs base                │
BenchSaveResultsSeed/N=1000-4                0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
BenchSaveResultsUpdateUnchanged/N=1000-4     0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
BenchSaveResultsUpdateChanged/N=1000-4       0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
geomean                                                 ²               +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

coderabbitai Bot commented Jun 25, 2026

Contributor

Walkthrough

GitHub rate-limit pause checks now compute their threshold from the API limit, clamp it to 1..100, and use that value in the pause decision and log output. A table-driven test covers helper edge cases.

Changes

GitHub rate-limit pause logic

Layer / File(s): Computed pause threshold (scrapers/github/client.go, scrapers/github/client_test.go)
Summary: githubRateLimitPauseThreshold derives a clamped threshold from the reported rate limit, ShouldPauseForRateLimit uses it in the pause check and log message, and the new table-driven test covers zero and negative inputs.
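
A table-driven check of the edge cases mentioned in the summary could be sketched as below. Everything here is illustrative: `thresholdFor` is a hypothetical stand-in for githubRateLimitPauseThreshold, the one-tenth scaling is an assumption, and the loop runs from `main` so the sketch is self-contained, whereas the test in client_test.go would normally use Go's `testing` package.

```go
package main

import (
	"fmt"
	"os"
)

// thresholdFor is a hypothetical stand-in for githubRateLimitPauseThreshold.
// ASSUMPTION: the threshold is one tenth of the limit, clamped to 1..100.
func thresholdFor(limit int) int {
	switch t := limit / 10; {
	case t < 1:
		return 1
	case t > 100:
		return 100
	default:
		return t
	}
}

// thresholdCase is one row of the table: an input limit and the expected result.
type thresholdCase struct {
	name  string
	limit int
	want  int
}

// failures runs every case and returns a message for each mismatch.
func failures(cases []thresholdCase) []string {
	var out []string
	for _, tc := range cases {
		if got := thresholdFor(tc.limit); got != tc.want {
			out = append(out, fmt.Sprintf("%s: thresholdFor(%d) = %d, want %d",
				tc.name, tc.limit, got, tc.want))
		}
	}
	return out
}

func main() {
	cases := []thresholdCase{
		{"negative limit", -1, 1},
		{"zero limit", 0, 1},
		{"tiny limit", 5, 1},
		{"unauthenticated", 60, 6},
		{"authenticated", 5000, 100},
	}
	if f := failures(cases); len(f) > 0 {
		for _, msg := range f {
			fmt.Println(msg)
		}
		os.Exit(1)
	}
	fmt.Println("all cases pass")
}
```
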
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)

  • Description Check: ✅ Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Title check: ✅ Passed. The title clearly matches the main change: improving GitHub scraping for unauthenticated, tokenless use.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scrapers/github/client.go`:
- Around line 184-186: The pause check in the GitHub client rate-limit logic
returns a possibly negative wait duration from time.Until(core.Reset.Time),
which can make downstream reset timing invalid. Update the logic in the
rate-limit helper that compares core.Remaining against threshold so it only
returns shouldPause=true when the computed waitDuration is positive; if the
duration is zero or negative, return no pause instead and avoid passing a past
reset time onward.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 73098b4a-5427-449b-a63e-50875e2adaaf

📥 Commits

Reviewing files that changed from the base of the PR and between d602f19 and 34a4037.

📒 Files selected for processing (2)
  • scrapers/github/client.go
  • scrapers/github/client_test.go

Comment thread scrapers/github/client.go
@adityathebe adityathebe changed the title from "fix: scale GitHub rate-limit pause threshold" to "fix: allow GitHub scraper to work without a token" Jun 25, 2026
@adityathebe adityathebe requested a review from moshloop June 25, 2026 15:07
@adityathebe adityathebe marked this pull request as draft June 25, 2026 15:33