Skip to content

refactor: optimize trial user monitor with batch query and deactivate helper#214

Open
dan2k3k4 wants to merge 4 commits into
devfrom
add-cleanup-trial-recon-job
Open

refactor: optimize trial user monitor with batch query and deactivate helper#214
dan2k3k4 wants to merge 4 commits into
devfrom
add-cleanup-trial-recon-job

Conversation

@dan2k3k4
Copy link
Copy Markdown
Member

@dan2k3k4 dan2k3k4 commented Dec 23, 2025

Review feedback on the trial recon job PR addressing three items: missing db.refresh() in test fixtures, per-user N+1 query, and inline deactivation logic.

Changes

  • Test fixtures — added db.refresh() after every db.commit() in all five fixtures (trial_team, trial_user, trial_region, trial_key, user_budget_limit), consistent with conftest.py pattern
  • Batch query — replaced per-user DBLimitedResource lookup inside the loop with a single query + dict lookup:
    user_limits = db.query(DBLimitedResource).filter(
        DBLimitedResource.owner_type == OwnerType.USER,
        DBLimitedResource.owner_id.in_([user.id for user in users]),
        DBLimitedResource.resource == ResourceType.BUDGET
    ).all()
    user_limit_map = {limit.owner_id: limit for limit in user_limits}
  • deactivate_trial_user() helper — extracted disable-user + expire-keys logic into a reusable async def deactivate_trial_user(db, user), keeping monitor_trial_users focused on orchestration

Greptile Summary

This PR refactors the trial user monitor by extracting a deactivate_trial_user helper, replacing the per-user DBLimitedResource lookup with a single batch query + dict, and adding db.refresh() to all five test fixtures.

  • worker.py: monitor_trial_users now fetches all relevant budget limits in one query and dispatches to deactivate_trial_user; the helper modifies user state and expires LiteLLM keys, leaving the single db.commit() to the caller.
  • trigger_trial_recon_job.py: New standalone script that acquires the distributed lock, invokes monitor_trial_users, then releases the lock; the outer exception handler redundantly calls release_lock even though the inner finally already covers that path.
  • tests/test_monitor_trial_users.py: Three async tests cover the no-overage, over-budget, and admin-skip scenarios with proper db.refresh() in every fixture.

Confidence Score: 4/5

Safe to merge; the batch-query refactor is correct and the new tests cover the main scenarios.

The core logic is sound — the batch budget-limit query eliminates the N+1, deactivate_trial_user cleanly separates concerns, and all fixtures properly call db.refresh(). Two small issues remain: the trigger script has a redundant second release_lock call in its error path (harmless but confusing), and the key-expiry query inside deactivate_trial_user is still per-user, which could become noticeable if many users are deactivated at once.

scripts/trigger_trial_recon_job.py deserves a second look at the lock-release logic in the exception handler.

Important Files Changed

Filename Overview
app/core/worker.py Adds deactivate_trial_user helper and monitor_trial_users function with a batch budget-limit query; residual per-user key N+1 inside the helper is a minor concern.
scripts/trigger_trial_recon_job.py New manual-trigger script for the trial recon job; double release_lock call in the error path is redundant but idempotent.
tests/test_monitor_trial_users.py New test file covering no-overage, overage, and admin-skip scenarios with proper db.refresh() in all fixtures; missing newline at EOF.

Sequence Diagram

sequenceDiagram
    participant S as trigger_trial_recon_job.py
    participant L as Locking (DB)
    participant W as monitor_trial_users
    participant DB as Database
    participant LLM as LiteLLMService

    S->>L: try_acquire_lock("monitor_trial_users")
    L-->>S: True (lock acquired)

    S->>W: await monitor_trial_users(db)

    W->>DB: query DBTeam (trial team)
    DB-->>W: trial_team

    W->>DB: "query DBUser (active, role=user)"
    DB-->>W: users[]

    W->>DB: batch query DBLimitedResource IN (user IDs)
    DB-->>W: user_limits[]

    loop for each user over budget
        W->>W: deactivate_trial_user(db, user)
        W->>DB: "query DBPrivateAIKey (owner_id=user.id)"
        DB-->>W: keys[]
        loop for each key with litellm_token + region
            W->>LLM: update_key_duration(token, "0d")
            LLM-->>W: "ok / error (caught & logged)"
        end
        Note over W: user.is_active = False (not yet committed)
    end

    W->>DB: db.commit()

    S->>L: release_lock("monitor_trial_users")
Loading

Reviews (1): Last reviewed commit: "refactor: optimize trial user monitor wi..." | Re-trigger Greptile

Greptile also left 2 inline comments on this PR.

@dan2k3k4 dan2k3k4 force-pushed the add-cleanup-trial-recon-job branch from afeecec to c382ea3 Compare December 23, 2025 19:50
@dan2k3k4 dan2k3k4 requested a review from alagoa December 30, 2025 15:32
Copy link
Copy Markdown

@alagoa alagoa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late review, need to figure out my GH notifications 🤓

Mostly just questions to make sure I understand why we're doing what we are doing here, no major blockers on my side.

Some more doubts (related and unrelated) that came up during review:

  1. This job is not being scheduled in main, is that coming on a later PR?
  2. From what I've seen this does not clash with the other recon jobs we have running, right?
    2a. The main monitor only updates PRODUCT or SYSTEM limits and the trial limits are marked as MANUAL.
    2b. Is the trial team exempt from the soft delete that runs on the main monitor? Even if it isn't explicitly, it probably is in practice, since the team soft delete requires all users of a team to be quiet for >76 days.

Comment thread app/core/worker.py
Comment thread app/core/worker.py
users = db.query(DBUser).filter(
DBUser.team_id == trial_team.id,
DBUser.is_active,
DBUser.role == "user"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought (separate PR): could be worth to have an enum for DBUser.role.

Comment thread app/core/worker.py Outdated
Comment thread app/core/worker.py Outdated
@dan2k3k4 dan2k3k4 requested a review from Copilot April 7, 2026 16:52
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a “trial recon” workflow to automatically deactivate trial users who have exhausted their budget and expire their LiteLLM keys, along with a manual script to trigger the job and tests covering core scenarios.

Changes:

  • Added monitor_trial_users(db) background task to deactivate over-budget trial users and expire their keys in LiteLLM.
  • Added scripts/trigger_trial_recon_job.py to manually run the new trial recon job with a DB lock.
  • Added tests/test_monitor_trial_users.py to validate over/under budget behavior and admin-skipping.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
app/core/worker.py Introduces monitor_trial_users to find trial-team users over budget, deactivate them, and expire their keys.
scripts/trigger_trial_recon_job.py Adds a CLI script to run the job manually with locking to avoid concurrent execution.
tests/test_monitor_trial_users.py Adds unit tests for “no overage”, “overage disables user + expires key”, and “admin skipped”.
Comments suppressed due to low confidence (3)

app/core/worker.py:1198

  • monitor_trial_users swallows exceptions: in the outer except you rollback but never re-raise/return an error. This will make callers (including the manual trigger script) think the job succeeded even when it failed, and can hide production issues. After db.rollback() either raise the exception (matching other worker jobs in this file) or return a failure signal that the caller checks.
        # Calculate cutoff date (60 days ago)
        cutoff_date = datetime.now(UTC) - timedelta(days=60)

        # Query all teams that have been soft-deleted for 60+ days

app/core/worker.py:1150

  • The trial team lookup doesn’t filter on DBTeam.is_active. Elsewhere trial-team selection includes DBTeam.is_active (e.g. app/api/auth.py:745), so this job may deactivate users for an inactive/archived trial team. Add DBTeam.is_active to the query filter to align behavior.
                        last_spend_calculation=current_time,
                        regions=region_names,
                        last_updated=current_time,
                    )
                    db.add(team_metrics)

app/core/worker.py:1188

  • External side effects (expiring keys via LiteLLM) are performed before db.commit(). If the DB commit later fails and the session rolls back, keys may already be expired while the user remains active in the DB, leaving the system in an inconsistent state. Consider committing the user deactivation before calling LiteLLM (or otherwise separating DB state changes from external calls) to make failures recoverable and consistent.
                team_id=old_label[0], team_name=old_label[1]
            ).set(0)

        # Update active team labels for next run
        active_team_labels.clear()
        active_team_labels.update(current_team_labels)

    except Exception as e:
        logger.error(f"Error in team monitoring task: {str(e)}")
        raise e


@hard_delete_teams_duration.time()
async def hard_delete_expired_teams(db: Session):
    """

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/test_monitor_trial_users.py
Copilot AI changed the title chore: add recon job to cleanup trial accounts refactor: optimize trial user monitor with batch query and deactivate helper Apr 7, 2026
@dan2k3k4 dan2k3k4 requested a review from a team April 10, 2026 17:40
@dan2k3k4 dan2k3k4 requested a review from dspachos May 4, 2026 06:35
Copy link
Copy Markdown
Contributor

@dspachos dspachos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 High: Exception swallowed in monitor_trial_users

  # worker.py line ~1442
  except Exception as e:
      logger.error(f"Error in trial user monitoring: {e}")
      db.rollback()
      # ⬆️ exception is NOT re-raised

Other worker jobs in this file re-raise (raise e). The trigger script expects monitor_trial_users to raise on failure (it has its own try/except/raise around
the call), but the exception will never propagate — the script will always log "completed successfully" even when the job failed. Should add raise after
db.rollback().

🟡 Medium: External side effects before commit (ordering issue)

In deactivate_trial_user(), LiteLLM keys are expired via the external API before the DB commit happens in monitor_trial_users. If the commit fails:

  • Keys are already expired in LiteLLM
  • But the user remains active in the DB
  • Inconsistent state

Suggestion: Either commit the user deactivation first, then expire keys (accepting that key expiry might fail but can be retried), or at minimum add a comment
acknowledging the trade-off.

🟡 Medium: Missing DBTeam.is_active filter

  trial_team = db.query(DBTeam).filter(
      DBTeam.admin_email == settings.AI_TRIAL_TEAM_EMAIL
  ).first()

Copilot flagged this — elsewhere (e.g. app/api/auth.py:745) the trial team lookup also filters on is_active. If the trial team is ever deactivated, this job
would still process its users. Add DBTeam.is_active == True to the filter.

Comment on lines +42 to +62
except Exception as e:
logger.error(f"Error in trial recon job execution: {str(e)}")
raise
finally:
# Always release the lock when done
release_lock(lock_name, db)
logger.info("Released monitor_trial_users lock")
else:
logger.warning("Another process has the monitor_trial_users lock, cannot execute job")
return False

except Exception as e:
logger.error(f"Error in trial recon job trigger: {str(e)}")
# Try to release lock in case of error
try:
release_lock(lock_name, db)
logger.info("Released lock after error")
except Exception as release_error:
logger.error(f"Error releasing lock: {str(release_error)}")
raise
finally:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Double lock release on error path

When monitor_trial_users raises, the inner finally block (line 48) already calls release_lock before re-raising. The outer except block (line 55) then calls release_lock a second time. release_lock sets lock.value = "false" and commits, so two calls are idempotent in practice — but the second call triggers a superfluous DB round-trip and makes the intent unclear. The outer except release guard was presumably added for cases where the lock was acquired but the inner finally never ran, which cannot actually happen here given the control flow.

Comment thread app/core/worker.py
Comment on lines +1383 to +1399
async def deactivate_trial_user(db: Session, user: DBUser):
"""Deactivate a trial user and expire all their LiteLLM keys."""
user.is_active = False
user.updated_at = datetime.now(UTC)

keys = db.query(DBPrivateAIKey).filter(DBPrivateAIKey.owner_id == user.id).all()
for key in keys:
if key.litellm_token and key.region:
try:
litellm_service = LiteLLMService(
api_url=key.region.litellm_api_url,
api_key=key.region.litellm_api_key
)
await litellm_service.update_key_duration(key.litellm_token, "0d")
logger.info(f"Set duration to 0d for key {key.id}")
except Exception as e:
logger.error(f"Failed to expire key {key.id}: {e}")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Residual per-user key query inside the helper

The batch query PR fixed the N+1 for DBLimitedResource, but deactivate_trial_user still issues one db.query(DBPrivateAIKey) per deactivated user. For normal operation this is fine, but if a large cohort of trial users hits their limit simultaneously (end of a promo period, bulk import), this becomes an N-query sequence again. Prefetching keys in the same batch pattern used for limits — then passing the relevant keys into the helper — would make the function fully O(1) at the DB level.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants