refactor: optimize trial user monitor with batch query and deactivate helper#214
refactor: optimize trial user monitor with batch query and deactivate helper#214dan2k3k4 wants to merge 4 commits into
Conversation
afeecec to
c382ea3
Compare
alagoa
left a comment
There was a problem hiding this comment.
Sorry for the late review, need to figure out my GH notifications 🤓
Mostly just questions to make sure I understand why we're doing what we are doing here, no major blockers on my side.
Some more doubts (related and unrelated) that came up during review:
- This job is not being scheduled in
main, is that coming on a later PR? - From what I've seen this does not clash with the other recon jobs we have running, right?
2a. The main monitor only updatesPRODUCTorSYSTEMlimits and the trial limits are marked asMANUAL.
2b. Is the trial team exempt from the soft delete that runs on the main monitor? Even if it isn't explicitly, it probably is in practice, since the team soft delete requires all users of a team to be quiet for >76 days.
| users = db.query(DBUser).filter( | ||
| DBUser.team_id == trial_team.id, | ||
| DBUser.is_active, | ||
| DBUser.role == "user" |
There was a problem hiding this comment.
thought (separate PR): could be worth to have an enum for DBUser.role.
There was a problem hiding this comment.
Pull request overview
Adds a “trial recon” workflow to automatically deactivate trial users who have exhausted their budget and expire their LiteLLM keys, along with a manual script to trigger the job and tests covering core scenarios.
Changes:
- Added
monitor_trial_users(db)background task to deactivate over-budget trial users and expire their keys in LiteLLM. - Added
scripts/trigger_trial_recon_job.pyto manually run the new trial recon job with a DB lock. - Added
tests/test_monitor_trial_users.pyto validate over/under budget behavior and admin-skipping.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
app/core/worker.py |
Introduces monitor_trial_users to find trial-team users over budget, deactivate them, and expire their keys. |
scripts/trigger_trial_recon_job.py |
Adds a CLI script to run the job manually with locking to avoid concurrent execution. |
tests/test_monitor_trial_users.py |
Adds unit tests for “no overage”, “overage disables user + expires key”, and “admin skipped”. |
Comments suppressed due to low confidence (3)
app/core/worker.py:1198
monitor_trial_usersswallows exceptions: in the outerexceptyou rollback but never re-raise/return an error. This will make callers (including the manual trigger script) think the job succeeded even when it failed, and can hide production issues. Afterdb.rollback()eitherraisethe exception (matching other worker jobs in this file) or return a failure signal that the caller checks.
# Calculate cutoff date (60 days ago)
cutoff_date = datetime.now(UTC) - timedelta(days=60)
# Query all teams that have been soft-deleted for 60+ days
app/core/worker.py:1150
- The trial team lookup doesn’t filter on
DBTeam.is_active. Elsewhere trial-team selection includesDBTeam.is_active(e.g.app/api/auth.py:745), so this job may deactivate users for an inactive/archived trial team. AddDBTeam.is_activeto the query filter to align behavior.
last_spend_calculation=current_time,
regions=region_names,
last_updated=current_time,
)
db.add(team_metrics)
app/core/worker.py:1188
- External side effects (expiring keys via LiteLLM) are performed before
db.commit(). If the DB commit later fails and the session rolls back, keys may already be expired while the user remains active in the DB, leaving the system in an inconsistent state. Consider committing the user deactivation before calling LiteLLM (or otherwise separating DB state changes from external calls) to make failures recoverable and consistent.
team_id=old_label[0], team_name=old_label[1]
).set(0)
# Update active team labels for next run
active_team_labels.clear()
active_team_labels.update(current_team_labels)
except Exception as e:
logger.error(f"Error in team monitoring task: {str(e)}")
raise e
@hard_delete_teams_duration.time()
async def hard_delete_expired_teams(db: Session):
"""
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
… helper Agent-Logs-Url: https://github.com/amazeeio/amazee.ai/sessions/0b7151c4-cab2-42b4-b53b-59968f4f4f14 Co-authored-by: dan2k3k4 <158704+dan2k3k4@users.noreply.github.com>
dspachos
left a comment
There was a problem hiding this comment.
🔴 High: Exception swallowed in monitor_trial_users
# worker.py line ~1442
except Exception as e:
logger.error(f"Error in trial user monitoring: {e}")
db.rollback()
# ⬆️ exception is NOT re-raisedOther worker jobs in this file re-raise (raise e). The trigger script expects monitor_trial_users to raise on failure (it has its own try/except/raise around
the call), but the exception will never propagate — the script will always log "completed successfully" even when the job failed. Should add raise after
db.rollback().
🟡 Medium: External side effects before commit (ordering issue)
In deactivate_trial_user(), LiteLLM keys are expired via the external API before the DB commit happens in monitor_trial_users. If the commit fails:
- Keys are already expired in LiteLLM
- But the user remains active in the DB
- Inconsistent state
Suggestion: Either commit the user deactivation first, then expire keys (accepting that key expiry might fail but can be retried), or at minimum add a comment
acknowledging the trade-off.
🟡 Medium: Missing DBTeam.is_active filter
trial_team = db.query(DBTeam).filter(
DBTeam.admin_email == settings.AI_TRIAL_TEAM_EMAIL
).first()Copilot flagged this — elsewhere (e.g. app/api/auth.py:745) the trial team lookup also filters on is_active. If the trial team is ever deactivated, this job
would still process its users. Add DBTeam.is_active == True to the filter.
| except Exception as e: | ||
| logger.error(f"Error in trial recon job execution: {str(e)}") | ||
| raise | ||
| finally: | ||
| # Always release the lock when done | ||
| release_lock(lock_name, db) | ||
| logger.info("Released monitor_trial_users lock") | ||
| else: | ||
| logger.warning("Another process has the monitor_trial_users lock, cannot execute job") | ||
| return False | ||
|
|
||
| except Exception as e: | ||
| logger.error(f"Error in trial recon job trigger: {str(e)}") | ||
| # Try to release lock in case of error | ||
| try: | ||
| release_lock(lock_name, db) | ||
| logger.info("Released lock after error") | ||
| except Exception as release_error: | ||
| logger.error(f"Error releasing lock: {str(release_error)}") | ||
| raise | ||
| finally: |
There was a problem hiding this comment.
Double lock release on error path
When monitor_trial_users raises, the inner finally block (line 48) already calls release_lock before re-raising. The outer except block (line 55) then calls release_lock a second time. release_lock sets lock.value = "false" and commits, so two calls are idempotent in practice — but the second call triggers a superfluous DB round-trip and makes the intent unclear. The outer except release guard was presumably added for cases where the lock was acquired but the inner finally never ran, which cannot actually happen here given the control flow.
| async def deactivate_trial_user(db: Session, user: DBUser): | ||
| """Deactivate a trial user and expire all their LiteLLM keys.""" | ||
| user.is_active = False | ||
| user.updated_at = datetime.now(UTC) | ||
|
|
||
| keys = db.query(DBPrivateAIKey).filter(DBPrivateAIKey.owner_id == user.id).all() | ||
| for key in keys: | ||
| if key.litellm_token and key.region: | ||
| try: | ||
| litellm_service = LiteLLMService( | ||
| api_url=key.region.litellm_api_url, | ||
| api_key=key.region.litellm_api_key | ||
| ) | ||
| await litellm_service.update_key_duration(key.litellm_token, "0d") | ||
| logger.info(f"Set duration to 0d for key {key.id}") | ||
| except Exception as e: | ||
| logger.error(f"Failed to expire key {key.id}: {e}") |
There was a problem hiding this comment.
Residual per-user key query inside the helper
The batch query PR fixed the N+1 for DBLimitedResource, but deactivate_trial_user still issues one db.query(DBPrivateAIKey) per deactivated user. For normal operation this is fine, but if a large cohort of trial users hits their limit simultaneously (end of a promo period, bulk import), this becomes an N-query sequence again. Prefetching keys in the same batch pattern used for limits — then passing the relevant keys into the helper — would make the function fully O(1) at the DB level.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Review feedback on the trial recon job PR addressing three items: missing
db.refresh()in test fixtures, per-user N+1 query, and inline deactivation logic.Changes
db.refresh()after everydb.commit()in all five fixtures (trial_team,trial_user,trial_region,trial_key,user_budget_limit), consistent withconftest.pypatternDBLimitedResourcelookup inside the loop with a single query + dict lookup:deactivate_trial_user()helper — extracted disable-user + expire-keys logic into a reusableasync def deactivate_trial_user(db, user), keepingmonitor_trial_usersfocused on orchestrationGreptile Summary
This PR refactors the trial user monitor by extracting a
deactivate_trial_userhelper, replacing the per-userDBLimitedResourcelookup with a single batch query + dict, and addingdb.refresh()to all five test fixtures.worker.py:monitor_trial_usersnow fetches all relevant budget limits in one query and dispatches todeactivate_trial_user; the helper modifies user state and expires LiteLLM keys, leaving the singledb.commit()to the caller.trigger_trial_recon_job.py: New standalone script that acquires the distributed lock, invokesmonitor_trial_users, then releases the lock; the outer exception handler redundantly callsrelease_lockeven though the innerfinallyalready covers that path.tests/test_monitor_trial_users.py: Three async tests cover the no-overage, over-budget, and admin-skip scenarios with properdb.refresh()in every fixture.Confidence Score: 4/5
Safe to merge; the batch-query refactor is correct and the new tests cover the main scenarios.
The core logic is sound — the batch budget-limit query eliminates the N+1,
deactivate_trial_usercleanly separates concerns, and all fixtures properly calldb.refresh(). Two small issues remain: the trigger script has a redundant secondrelease_lockcall in its error path (harmless but confusing), and the key-expiry query insidedeactivate_trial_useris still per-user, which could become noticeable if many users are deactivated at once.scripts/trigger_trial_recon_job.pydeserves a second look at the lock-release logic in the exception handler.Important Files Changed
deactivate_trial_userhelper andmonitor_trial_usersfunction with a batch budget-limit query; residual per-user key N+1 inside the helper is a minor concern.release_lockcall in the error path is redundant but idempotent.db.refresh()in all fixtures; missing newline at EOF.Sequence Diagram
sequenceDiagram participant S as trigger_trial_recon_job.py participant L as Locking (DB) participant W as monitor_trial_users participant DB as Database participant LLM as LiteLLMService S->>L: try_acquire_lock("monitor_trial_users") L-->>S: True (lock acquired) S->>W: await monitor_trial_users(db) W->>DB: query DBTeam (trial team) DB-->>W: trial_team W->>DB: "query DBUser (active, role=user)" DB-->>W: users[] W->>DB: batch query DBLimitedResource IN (user IDs) DB-->>W: user_limits[] loop for each user over budget W->>W: deactivate_trial_user(db, user) W->>DB: "query DBPrivateAIKey (owner_id=user.id)" DB-->>W: keys[] loop for each key with litellm_token + region W->>LLM: update_key_duration(token, "0d") LLM-->>W: "ok / error (caught & logged)" end Note over W: user.is_active = False (not yet committed) end W->>DB: db.commit() S->>L: release_lock("monitor_trial_users")Reviews (1): Last reviewed commit: "refactor: optimize trial user monitor wi..." | Re-trigger Greptile