feat(workers): implement Reddit OAuth client credentials flow to bypa…#2887

Open
kunal-rathore-111 wants to merge 2 commits into karakeep-app:main from kunal-rathore-111:fix/2885-reddit-crawling-auth
Conversation

@kunal-rathore-111

Description

Fixes #2885

Reddit has recently updated its crawler policies, so our unauthenticated .json requests are now frequently blocked.

To resolve this, this PR implements the official Reddit OAuth Client Credentials flow for the metascraper-reddit plugin.

Specifically:

  • Added optional REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET environment variables to serverConfig.
  • The scraper now fetches an access token and uses an in-memory cache (with a 5-minute safety buffer before expiration) to avoid rate limits on the authentication endpoint.
  • If the credentials are provided, requests are routed through oauth.reddit.com using the Bearer token and a custom User-Agent.
  • If credentials are not configured, it gracefully falls back to the previous unauthenticated .json polling, ensuring the change is fully backward-compatible.
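
The routing decision in the last two bullets can be sketched as a small pure function. This is only an illustration, not the PR's code: `buildRedditRequest`, the `RedditRequest` shape, and the User-Agent string are hypothetical, and it assumes the OAuth host serves the same resource path as the public site.

```typescript
// Hypothetical helper; names and the User-Agent value are illustrative only.
interface RedditRequest {
  url: string;
  headers: Record<string, string>;
}

function buildRedditRequest(
  postUrl: string,
  accessToken: string | null,
): RedditRequest {
  const parsed = new URL(postUrl);
  // Strip trailing slashes so the path can be reused or extended cleanly.
  const path = parsed.pathname.replace(/\/+$/, "");

  if (accessToken) {
    // Authenticated path: same resource path, requested from the OAuth host
    // (assumption: oauth.reddit.com accepts the same path).
    return {
      url: `https://oauth.reddit.com${path}`,
      headers: {
        Authorization: `Bearer ${accessToken}`,
        "User-Agent": "example-crawler/1.0 (hypothetical)",
      },
    };
  }

  // Fallback: the unauthenticated .json endpoint on the original host.
  return { url: `${parsed.origin}${path}.json`, headers: {} };
}
```

Passing `null` as the token models the case where REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are not configured, so callers would not need a separate code path for the fallback.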

How Has This Been Tested?

  • Verified that if REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are not present, it correctly falls back to unauthenticated .json requests.
  • Verified that when credentials are provided, the system correctly fetches a token, routes the request to oauth.reddit.com, and successfully fetches the Reddit metadata without being blocked.
  • Verified that the token cache correctly caches the token and reuses it until expiration.

Screenshots (if appropriate)

Checklist:

  • I have carefully read CONTRIBUTING.md
  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation if applicable
  • I have no unrelated changes in the PR.
  • I have confirmed that any new dependencies are strictly necessary.
  • I have written tests for new code (if applicable)

Please describe to which degree, if any, an LLM was used in creating this pull request.

I collaborated with an AI coding assistant to help design the caching logic and integrate the standard OAuth Client Credentials flow.

@greptile-apps

greptile-apps Bot commented Jun 14, 2026

Contributor

Greptile Summary

This PR adds Reddit OAuth client-credentials support to the metascraper-reddit plugin to bypass the recent API blocking of unauthenticated .json requests, with a graceful fallback when credentials are absent.

  • Introduces getRedditAccessToken with an in-memory token cache; fetches from oauth.reddit.com when credentials are present, otherwise falls back to the existing unauthenticated path.
  • Adds optional REDDIT_CLIENT_ID / REDDIT_CLIENT_SECRET env vars to serverConfig following existing patterns.

Confidence Score: 3/5

The fallback path is safe, but the OAuth token cache has two bugs that need fixing before enabling credentials in production.

The buffer subtraction in redditAccessTokenExpiresAt can produce a negative offset if Reddit ever returns a short-lived token, permanently bypassing the cache and hammering the auth endpoint. Separately, the token-refresh function has no in-flight deduplication, so concurrent scrape workers will each issue their own refresh request when the token expires — the URL-level cache already solves this correctly with a stored Promise, but that pattern wasn't applied to the token refresh. Both issues are in the OAuth path only; the unauthenticated fallback is unaffected.

apps/workers/metascraper-plugins/metascraper-reddit.ts — specifically the token caching logic around lines 109–152

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/workers/metascraper-plugins/metascraper-reddit.ts | Adds OAuth client-credentials token caching and oauth.reddit.com routing; has an underflow bug in the buffer calculation and a concurrent-refresh race condition. |
| packages/shared/config.ts | Adds optional REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET env vars following existing patterns; no issues. |

Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
apps/workers/metascraper-plugins/metascraper-reddit.ts:144-146
If `expires_in` is less than 300 (a short-lived token or unexpected server response), `(data.expires_in - 300)` is negative, making `redditAccessTokenExpiresAt` a timestamp in the past. Every subsequent call would skip the cache and issue a new token request, likely triggering Reddit's rate limit on the auth endpoint.

```suggestion
    redditAccessToken = data.access_token;
    // Expire 5 minutes before the actual expiration to be safe
    redditAccessTokenExpiresAt = now + Math.max(0, data.expires_in - 300) * 1000;
```
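
One caveat worth checking: with the clamp at zero, a token whose `expires_in` is below 300 is stored with an expiry equal to `now`, so a strict `redditAccessTokenExpiresAt > now` check would still miss on the next call. A possible variant, sketched below under assumptions, caps the buffer at half of the token's lifetime instead. `computeExpiresAt` and its parameters are hypothetical names, not part of the PR.

```typescript
// Hypothetical helper, not taken from the PR.
// Returns the cache expiry timestamp (ms) for a token that lives
// `expiresInSec` seconds, keeping a safety buffer before real expiry.
function computeExpiresAt(
  nowMs: number,
  expiresInSec: number,
  bufferSec: number = 300,
): number {
  // Guard against negative or nonsensical server values.
  const lifetime = Math.max(0, expiresInSec);
  // Never let the buffer consume more than half of the token's lifetime,
  // so short-lived tokens remain cacheable for a while.
  const effectiveBuffer = Math.min(bufferSec, lifetime / 2);
  return nowMs + (lifetime - effectiveBuffer) * 1000;
}
```

For a typical one-hour token this behaves like the original 5-minute buffer; for a 120-second token it would cache for 60 seconds rather than not at all.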

### Issue 2 of 2
apps/workers/metascraper-plugins/metascraper-reddit.ts:112-152
**Concurrent token refresh race condition**

`getRedditAccessToken` has no concurrency guard. When multiple scrape jobs run in parallel and the cached token has just expired, all of them simultaneously pass the `redditAccessTokenExpiresAt > now` check before any one has written the new token. Each will then issue its own token-refresh request to Reddit's auth endpoint, potentially triggering rate limiting.

The existing URL-level cache in `fetchRedditPostData` avoids this correctly by storing the `Promise` before it resolves. The same pattern should be applied here — store a single in-flight `Promise<string | null>` and return it to all concurrent callers until it resolves.
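
A minimal sketch of that in-flight pattern follows. It is not the PR's implementation: `createTokenProvider`, `fetchToken`, and the injectable `now` clock are hypothetical, and the 5-minute expiry buffer is left out to keep the example short.

```typescript
// Hypothetical sketch; `fetchToken` stands in for the real token request.
type TokenFetcher = () => Promise<{ access_token: string; expires_in: number }>;

function createTokenProvider(
  fetchToken: TokenFetcher,
  now: () => number = Date.now,
): () => Promise<string | null> {
  let token: string | null = null;
  let expiresAt = 0;
  let inFlight: Promise<string | null> | null = null;

  return function getToken(): Promise<string | null> {
    // Fast path: the cached token is still valid.
    if (token && expiresAt > now()) return Promise.resolve(token);
    // A refresh is already pending: share it instead of starting another.
    if (inFlight) return inFlight;

    inFlight = fetchToken()
      .then((data) => {
        token = data.access_token;
        expiresAt = now() + data.expires_in * 1000; // buffer omitted here
        return token;
      })
      .catch(() => null) // treat a failed refresh as "no token"
      .finally(() => {
        inFlight = null; // let the next caller retry or refresh
      });
    return inFlight;
  };
}
```

Because the pending promise is stored before it resolves, any number of callers arriving during a refresh would receive the same promise, and only one request reaches the auth endpoint.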

Reviews (1): Last reviewed commit: "feat(workers): implement Reddit OAuth cl..."

Development

Successfully merging this pull request may close these issues.

[Crawler] Reddit crawling is now getting blocked

1 participant