refactor: optimize trial user monitor with batch query and deactivate helper by dan2k3k4 · Pull Request #214 · amazeeio/amazee.ai

dan2k3k4 · 2025-12-23T19:41:58Z

Review feedback on the trial recon job PR addressing three items: missing db.refresh() in test fixtures, per-user N+1 query, and inline deactivation logic.

Changes

Test fixtures — added db.refresh() after every db.commit() in all five fixtures (trial_team, trial_user, trial_region, trial_key, user_budget_limit), consistent with conftest.py pattern

Batch query — replaced per-user DBLimitedResource lookup inside the loop with a single query + dict lookup:

user_limits = db.query(DBLimitedResource).filter(
    DBLimitedResource.owner_type == OwnerType.USER,
    DBLimitedResource.owner_id.in_([user.id for user in users]),
    DBLimitedResource.resource == ResourceType.BUDGET
).all()
user_limit_map = {limit.owner_id: limit for limit in user_limits}

deactivate_trial_user() helper — extracted disable-user + expire-keys logic into a reusable async def deactivate_trial_user(db, user), keeping monitor_trial_users focused on orchestration
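As a rough illustration of the batch query + dict lookup pattern described above, here is a minimal standalone sketch using the standard-library `sqlite3` module instead of SQLAlchemy. The table name `limited_resource`, its columns, and the function `fetch_budget_limits` are hypothetical and are not taken from the PR.

```python
import sqlite3


def fetch_budget_limits(conn, user_ids):
    """Return {owner_id: max_value} for the given users with a single IN query."""
    if not user_ids:
        return {}  # avoid emitting "IN ()", which is invalid SQL
    placeholders = ",".join("?" for _ in user_ids)
    rows = conn.execute(
        "SELECT owner_id, max_value FROM limited_resource "
        "WHERE owner_type = 'user' AND resource = 'budget' "
        f"AND owner_id IN ({placeholders})",
        list(user_ids),
    ).fetchall()
    return {owner_id: max_value for owner_id, max_value in rows}


# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE limited_resource "
    "(owner_type TEXT, owner_id INTEGER, resource TEXT, max_value REAL)"
)
conn.executemany(
    "INSERT INTO limited_resource VALUES (?, ?, ?, ?)",
    [
        ("user", 1, "budget", 5.0),
        ("user", 2, "budget", 10.0),
        ("team", 1, "budget", 99.0),  # filtered out: not a user-owned limit
    ],
)

limits = fetch_budget_limits(conn, [1, 2, 3])
for user_id in [1, 2, 3]:
    limit = limits.get(user_id)  # None when the user has no budget row
    print(user_id, limit)
```

The loop performs only dictionary lookups, so the number of queries stays at one regardless of how many users are checked.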

Greptile Summary

This PR refactors the trial user monitor by extracting a deactivate_trial_user helper, replacing the per-user DBLimitedResource lookup with a single batch query + dict, and adding db.refresh() to all five test fixtures.

worker.py: monitor_trial_users now fetches all relevant budget limits in one query and dispatches to deactivate_trial_user; the helper modifies user state and expires LiteLLM keys, leaving the single db.commit() to the caller.
trigger_trial_recon_job.py: New standalone script that acquires the distributed lock, invokes monitor_trial_users, then releases the lock; the outer exception handler redundantly calls release_lock even though the inner finally already covers that path.
tests/test_monitor_trial_users.py: Three async tests cover the no-overage, over-budget, and admin-skip scenarios with proper db.refresh() in every fixture.

Confidence Score: 4/5

Safe to merge; the batch-query refactor is correct and the new tests cover the main scenarios.

The core logic is sound — the batch budget-limit query eliminates the N+1, deactivate_trial_user cleanly separates concerns, and all fixtures properly call db.refresh(). Two small issues remain: the trigger script has a redundant second release_lock call in its error path (harmless but confusing), and the key-expiry query inside deactivate_trial_user is still per-user, which could become noticeable if many users are deactivated at once.

scripts/trigger_trial_recon_job.py deserves a second look at the lock-release logic in the exception handler.

Important Files Changed

Filename	Overview
app/core/worker.py	Adds `deactivate_trial_user` helper and `monitor_trial_users` function with a batch budget-limit query; residual per-user key N+1 inside the helper is a minor concern.
scripts/trigger_trial_recon_job.py	New manual-trigger script for the trial recon job; double `release_lock` call in the error path is redundant but idempotent.
tests/test_monitor_trial_users.py	New test file covering no-overage, overage, and admin-skip scenarios with proper `db.refresh()` in all fixtures; missing newline at EOF.

Sequence Diagram

sequenceDiagram
    participant S as trigger_trial_recon_job.py
    participant L as Locking (DB)
    participant W as monitor_trial_users
    participant DB as Database
    participant LLM as LiteLLMService

    S->>L: try_acquire_lock("monitor_trial_users")
    L-->>S: True (lock acquired)

    S->>W: await monitor_trial_users(db)

    W->>DB: query DBTeam (trial team)
    DB-->>W: trial_team

    W->>DB: "query DBUser (active, role=user)"
    DB-->>W: users[]

    W->>DB: batch query DBLimitedResource IN (user IDs)
    DB-->>W: user_limits[]

    loop for each user over budget
        W->>W: deactivate_trial_user(db, user)
        W->>DB: "query DBPrivateAIKey (owner_id=user.id)"
        DB-->>W: keys[]
        loop for each key with litellm_token + region
            W->>LLM: update_key_duration(token, "0d")
            LLM-->>W: "ok / error (caught & logged)"
        end
        Note over W: user.is_active = False (not yet committed)
    end

    W->>DB: db.commit()

    S->>L: release_lock("monitor_trial_users")

Reviews (1): Last reviewed commit: "refactor: optimize trial user monitor wi..."

Greptile also left 2 inline comments on this PR.

alagoa

Sorry for the late review, need to figure out my GH notifications 🤓

Mostly just questions to make sure I understand why we're doing what we are doing here, no major blockers on my side.

Some more doubts (related and unrelated) that came up during review:

1. This job is not being scheduled in main; is that coming in a later PR?
2. From what I've seen, this does not clash with the other recon jobs we have running, right?
   2a. The main monitor only updates PRODUCT or SYSTEM limits, and the trial limits are marked as MANUAL.
   2b. Is the trial team exempt from the soft delete that runs on the main monitor? Even if it isn't explicitly, it probably is in practice, since the team soft delete requires all users of a team to be quiet for >76 days.

alagoa · 2025-12-31T05:52:45Z

+        users = db.query(DBUser).filter(
+            DBUser.team_id == trial_team.id,
+            DBUser.is_active,
+            DBUser.role == "user"


thought (separate PR): it could be worth having an enum for DBUser.role.
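A minimal sketch of what such an enum might look like, assuming the role values are plain strings. The name `UserRole` and the `"admin"` value are assumptions for illustration; only `"user"` appears in the diff above.

```python
from enum import Enum


class UserRole(str, Enum):
    """Hypothetical role enum; the str mixin keeps comparisons with raw strings working."""
    USER = "user"
    ADMIN = "admin"  # assumed value, not confirmed by the PR


def is_regular_user(role: str) -> bool:
    # Works for both stored string values and UserRole members.
    return role == UserRole.USER
```

Because the enum mixes in `str`, existing comparisons against string literals such as `"user"` would keep working while call sites migrate.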

Copilot

Pull request overview

Adds a “trial recon” workflow to automatically deactivate trial users who have exhausted their budget and expire their LiteLLM keys, along with a manual script to trigger the job and tests covering core scenarios.

Changes:

Added monitor_trial_users(db) background task to deactivate over-budget trial users and expire their keys in LiteLLM.
Added scripts/trigger_trial_recon_job.py to manually run the new trial recon job with a DB lock.
Added tests/test_monitor_trial_users.py to validate over/under budget behavior and admin-skipping.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File	Description
`app/core/worker.py`	Introduces `monitor_trial_users` to find trial-team users over budget, deactivate them, and expire their keys.
`scripts/trigger_trial_recon_job.py`	Adds a CLI script to run the job manually with locking to avoid concurrent execution.
`tests/test_monitor_trial_users.py`	Adds unit tests for “no overage”, “overage disables user + expires key”, and “admin skipped”.

Comments suppressed due to low confidence (3)

app/core/worker.py:1198

monitor_trial_users swallows exceptions: in the outer except you rollback but never re-raise/return an error. This will make callers (including the manual trigger script) think the job succeeded even when it failed, and can hide production issues. After db.rollback() either raise the exception (matching other worker jobs in this file) or return a failure signal that the caller checks.

        # Calculate cutoff date (60 days ago)
        cutoff_date = datetime.now(UTC) - timedelta(days=60)

        # Query all teams that have been soft-deleted for 60+ days

app/core/worker.py:1150

The trial team lookup doesn’t filter on DBTeam.is_active. Elsewhere trial-team selection includes DBTeam.is_active (e.g. app/api/auth.py:745), so this job may deactivate users for an inactive/archived trial team. Add DBTeam.is_active to the query filter to align behavior.

                        last_spend_calculation=current_time,
                        regions=region_names,
                        last_updated=current_time,
                    )
                    db.add(team_metrics)

app/core/worker.py:1188

External side effects (expiring keys via LiteLLM) are performed before db.commit(). If the DB commit later fails and the session rolls back, keys may already be expired while the user remains active in the DB, leaving the system in an inconsistent state. Consider committing the user deactivation before calling LiteLLM (or otherwise separating DB state changes from external calls) to make failures recoverable and consistent.

                team_id=old_label[0], team_name=old_label[1]
            ).set(0)

        # Update active team labels for next run
        active_team_labels.clear()
        active_team_labels.update(current_team_labels)

    except Exception as e:
        logger.error(f"Error in team monitoring task: {str(e)}")
        raise e


@hard_delete_teams_duration.time()
async def hard_delete_expired_teams(db: Session):
    """



dspachos

🔴 High: Exception swallowed in monitor_trial_users

  # worker.py line ~1442
  except Exception as e:
      logger.error(f"Error in trial user monitoring: {e}")
      db.rollback()
      # ⬆️ exception is NOT re-raised

Other worker jobs in this file re-raise (raise e). The trigger script expects monitor_trial_users to raise on failure (it has its own try/except/raise around
the call), but the exception will never propagate — the script will always log "completed successfully" even when the job failed. Should add raise after
db.rollback().
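The suggested fix can be sketched in isolation as follows. This is a simplified stand-in, not the code from worker.py: `FakeSession`, `monitor_job`, `failing_work`, and `run_trigger` are hypothetical names used only to show that a bare `raise` after the rollback lets the caller observe the failure.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class FakeSession:
    """Stand-in for a DB session that records whether rollback() was called."""

    def __init__(self):
        self.rolled_back = False

    def rollback(self):
        self.rolled_back = True


async def monitor_job(db, work):
    try:
        await work()
    except Exception as e:
        logger.error("Error in trial user monitoring: %s", e)
        db.rollback()
        raise  # propagate so the caller can report the failure


async def failing_work():
    raise RuntimeError("simulated failure")


def run_trigger(db):
    """Mimics a trigger script: returns False when the job raised."""
    try:
        asyncio.run(monitor_job(db, failing_work))
    except RuntimeError:
        return False
    return True
```

Without the `raise`, `run_trigger` would return `True` even though the work failed, which is the behaviour the review comment warns about.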

🟡 Medium: External side effects before commit (ordering issue)

In deactivate_trial_user(), LiteLLM keys are expired via the external API before the DB commit happens in monitor_trial_users. If the commit fails:

Keys are already expired in LiteLLM
But the user remains active in the DB
Inconsistent state

Suggestion: Either commit the user deactivation first, then expire keys (accepting that key expiry might fail but can be retried), or at minimum add a comment
acknowledging the trade-off.
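One way the commit-first ordering could look, as a sketch under stated assumptions: the user is a plain dict, `FakeSession` and `fake_expire` are stand-ins, and `deactivate_then_expire` is a hypothetical name. Keys that fail to expire are returned so that a later run could retry them.

```python
import asyncio


class FakeSession:
    """Stand-in for a DB session; records when commit() is called."""

    def __init__(self, events):
        self.events = events

    def commit(self):
        self.events.append("commit")


async def deactivate_then_expire(db, user, keys, expire_key):
    user["is_active"] = False
    db.commit()  # persist the deactivation before any external call

    failed = []
    for key in keys:
        try:
            await expire_key(key)
        except Exception:
            failed.append(key)  # leave for a retry; DB state is already consistent
    return failed


events = []


async def fake_expire(key):
    events.append(f"expire:{key}")
    if key == "k2":
        raise ConnectionError("simulated LiteLLM outage")


user = {"id": 1, "is_active": True}
failed_keys = asyncio.run(
    deactivate_then_expire(FakeSession(events), user, ["k1", "k2"], fake_expire)
)
```

The trade-off is the reverse of the original ordering: a failed external call leaves a deactivated user with a still-valid key, which is recoverable by retrying the expiry.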

🟡 Medium: Missing DBTeam.is_active filter

  trial_team = db.query(DBTeam).filter(
      DBTeam.admin_email == settings.AI_TRIAL_TEAM_EMAIL
  ).first()

Copilot flagged this — elsewhere (e.g. app/api/auth.py:745) the trial team lookup also filters on is_active. If the trial team is ever deactivated, this job
would still process its users. Add DBTeam.is_active == True to the filter.

greptile-apps · 2026-06-02T15:58:58Z

+            except Exception as e:
+                logger.error(f"Error in trial recon job execution: {str(e)}")
+                raise
+            finally:
+                # Always release the lock when done
+                release_lock(lock_name, db)
+                logger.info("Released monitor_trial_users lock")
+        else:
+            logger.warning("Another process has the monitor_trial_users lock, cannot execute job")
+            return False
+
+    except Exception as e:
+        logger.error(f"Error in trial recon job trigger: {str(e)}")
+        # Try to release lock in case of error
+        try:
+            release_lock(lock_name, db)
+            logger.info("Released lock after error")
+        except Exception as release_error:
+            logger.error(f"Error releasing lock: {str(release_error)}")
+        raise
+    finally:


Double lock release on error path

When monitor_trial_users raises, the inner finally block (line 48) already calls release_lock before re-raising. The outer except block (line 55) then calls release_lock a second time. release_lock sets lock.value = "false" and commits, so two calls are idempotent in practice — but the second call triggers a superfluous DB round-trip and makes the intent unclear. The outer except release guard was presumably added for cases where the lock was acquired but the inner finally never ran, which cannot actually happen here given the control flow.
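A simplified sketch of the control flow with a single release point. The function `run_locked` and its callable parameters are hypothetical; the real script's `try_acquire_lock` and `release_lock` signatures may differ.

```python
import asyncio


async def run_locked(lock_name, try_acquire, release, job):
    """Run job under a lock; the lock is released exactly once, in finally."""
    if not try_acquire(lock_name):
        return False  # lock never acquired, so nothing to release
    try:
        await job()
        return True
    finally:
        release(lock_name)


released = []


async def failing_job():
    raise RuntimeError("job failed")


async def ok_job():
    return None


def attempt(job, can_acquire=True):
    try:
        return asyncio.run(
            run_locked(
                "monitor_trial_users",
                lambda name: can_acquire,
                released.append,
                job,
            )
        )
    except RuntimeError:
        return "raised"
```

Since the `finally` block runs on both the success and error paths, no second release in an outer handler is needed.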

greptile-apps · 2026-06-02T15:58:59Z

+async def deactivate_trial_user(db: Session, user: DBUser):
+    """Deactivate a trial user and expire all their LiteLLM keys."""
+    user.is_active = False
+    user.updated_at = datetime.now(UTC)
+
+    keys = db.query(DBPrivateAIKey).filter(DBPrivateAIKey.owner_id == user.id).all()
+    for key in keys:
+        if key.litellm_token and key.region:
+            try:
+                litellm_service = LiteLLMService(
+                    api_url=key.region.litellm_api_url,
+                    api_key=key.region.litellm_api_key
+                )
+                await litellm_service.update_key_duration(key.litellm_token, "0d")
+                logger.info(f"Set duration to 0d for key {key.id}")
+            except Exception as e:
+                logger.error(f"Failed to expire key {key.id}: {e}")


Residual per-user key query inside the helper

The batch query in this PR fixed the N+1 for DBLimitedResource, but deactivate_trial_user still issues one db.query(DBPrivateAIKey) per deactivated user. For normal operation this is fine, but if a large cohort of trial users hits their limit simultaneously (end of a promo period, bulk import), this becomes an N-query sequence again. Prefetching keys with the same batch pattern used for limits, then passing the relevant keys into the helper, would keep the number of key queries constant regardless of how many users are deactivated.
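The grouping step of that suggestion might look like the sketch below. It assumes the keys have already been fetched in one query (for example with an `owner_id.in_(...)` filter like the one in the PR description); the `Key` dataclass and `group_keys_by_owner` are hypothetical stand-ins for DBPrivateAIKey rows.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Key:
    """Minimal stand-in for a DBPrivateAIKey row."""
    id: int
    owner_id: int


def group_keys_by_owner(keys):
    """Group prefetched keys by owner so each user's keys are a dict lookup."""
    grouped = defaultdict(list)
    for key in keys:
        grouped[key.owner_id].append(key)
    return grouped


# Keys as they might come back from a single batch query.
all_keys = [Key(10, 1), Key(11, 1), Key(12, 2)]
keys_by_owner = group_keys_by_owner(all_keys)

for user_id in [1, 2, 3]:
    user_keys = keys_by_owner.get(user_id, [])  # empty list when a user has no keys
    print(user_id, [k.id for k in user_keys])
```

The helper could then accept the per-user list as an argument instead of querying for it.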


dan2k3k4 added 2 commits December 23, 2025 19:49

Trigger a build...

9f116de

chore: add recon job to cleanup trial accounts

c382ea3

dan2k3k4 force-pushed the add-cleanup-trial-recon-job branch from afeecec to c382ea3 Compare December 23, 2025 19:50

dan2k3k4 requested a review from alagoa December 30, 2025 15:32

alagoa approved these changes Dec 31, 2025

dan2k3k4 requested a review from Copilot April 7, 2026 16:52

Merge branch 'dev' into add-cleanup-trial-recon-job

f0cfe74

Copilot started reviewing on behalf of dan2k3k4 April 7, 2026 16:53

Copilot AI reviewed Apr 7, 2026

Comment thread tests/test_monitor_trial_users.py

Copilot started work on behalf of dan2k3k4 April 7, 2026 19:52

refactor: optimize trial user monitor with batch query and deactivate…

0b066e8

… helper Agent-Logs-Url: https://github.com/amazeeio/amazee.ai/sessions/0b7151c4-cab2-42b4-b53b-59968f4f4f14 Co-authored-by: dan2k3k4 <158704+dan2k3k4@users.noreply.github.com>

Copilot AI changed the title ~~chore: add recon job to cleanup trial accounts~~ refactor: optimize trial user monitor with batch query and deactivate helper Apr 7, 2026

Copilot finished work on behalf of dan2k3k4 April 7, 2026 19:56

dan2k3k4 requested a review from a team April 10, 2026 17:40

dan2k3k4 requested a review from dspachos May 4, 2026 06:35

dspachos reviewed May 26, 2026

greptile-apps Bot reviewed Jun 2, 2026

dan2k3k4 wants to merge 4 commits into dev from add-cleanup-trial-recon-job

5 participants
