Skip to content

feat(jobs): Add data retention jobs#4128

Open
TheodoreSpeaks wants to merge 15 commits intostagingfrom
feat/auto-redaction
Open

feat(jobs): Add data retention jobs#4128
TheodoreSpeaks wants to merge 15 commits intostagingfrom
feat/auto-redaction

Conversation

@TheodoreSpeaks
Copy link
Copy Markdown
Collaborator

@TheodoreSpeaks TheodoreSpeaks commented Apr 13, 2026

Summary

Add data retention jobs. 3 jobs created:

  1. Clean up soft deleted resources (7 days free, 30 days paid, customizable enterprise)
  2. Log retention cleanup (7 days free, infinite paid, customizable enterprise)
  3. Task cleanup (7 days free, infinite paid, customizable enterprise)

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation
  • Other: ___________

Testing

  • Tested locally. Validated that data is deleted from sim and copilot dbs. Validated that s3 buckets clean up data as well.

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

Screenshots/Videos

@vercel
Copy link
Copy Markdown

vercel bot commented Apr 13, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped Apr 18, 2026 10:29pm

Request Review

@waleedlatif1
Copy link
Copy Markdown
Collaborator

@TheodoreSpeaks let's consolidate the migrations into a single one, just delete the existing ones and run it once over all the changes in shcema.ts

@TheodoreSpeaks TheodoreSpeaks changed the title Feat/auto redaction (wip) feat(jobs): Add data retention jobs Apr 18, 2026
@TheodoreSpeaks
Copy link
Copy Markdown
Collaborator Author

@BugBot review

@cursor
Copy link
Copy Markdown

cursor bot commented Apr 18, 2026

PR Summary

High Risk
High risk because it introduces new automated deletion jobs across multiple tables and storage backends, plus new workspace retention columns and dispatch logic that could delete data incorrectly if scopes/retention are misconfigured.

Overview
Introduces a data retention system backed by new workspace columns (logRetentionHours, softDeleteRetentionHours, taskCleanupHours) and a central CLEANUP_CONFIG + dispatcher that enqueues plan-scoped jobs and per-workspace enterprise jobs (Trigger.dev batch trigger when available, otherwise the internal job queue with optional inline execution).

Replaces the previous monolithic /api/logs/cleanup behavior with a lightweight cron-triggered dispatcher, and adds new cron endpoints for cleanup-soft-deletes and cleanup-tasks.

Adds three new background cleanup tasks (cleanup-logs, cleanup-soft-deletes, cleanup-tasks) that batch-delete expired DB rows and also remove associated cloud storage objects and Copilot backend chat artifacts, plus shared helpers (batchDeleteByWorkspaceAndTimestamp, prepareChatCleanup).

Adds an Enterprise settings surface for configuring retention: a new /api/workspaces/[id]/data-retention GET/PUT (admin + enterprise-gated, audited) and a new Settings tab/UI (Data Retention) with a locked read-only view for non-enterprise plans.

Reviewed by Cursor Bugbot for commit 4d76165. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread apps/sim/background/cleanup-soft-deletes.ts Outdated
Comment thread packages/db/migrations/meta/_journal.json
Comment thread packages/db/migrations/meta/0191_snapshot.json Outdated
Comment thread apps/sim/background/cleanup-tasks.ts
@TheodoreSpeaks
Copy link
Copy Markdown
Collaborator Author

@BugBot review

Comment thread apps/sim/lib/billing/cleanup-dispatcher.ts Outdated
@TheodoreSpeaks TheodoreSpeaks marked this pull request as ready for review April 18, 2026 19:34
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 18, 2026

Greptile Summary

Adds three data retention background jobs (soft-delete cleanup, log cleanup, task/chat cleanup) dispatched via Trigger.dev or an inline fallback, with an enterprise-gated UI and API for per-workspace configuration. The migration replaces full soft-delete indexes with partial indexes and adds three retention columns to workspace.

  • P1 — S3 objects orphaned for workspace_file rows: cleanupWorkspaceFileStorage only queries workspaceFiles (plural) for S3 cleanup, but CLEANUP_TARGETS also includes workspaceFile (singular), which has its own key column. When cleanupTable hard-deletes those rows the corresponding S3 objects are never removed.
  • P2 — task cleanup defaults mismatch: The PR description states "7 days free" for task cleanup, but getRetentionDefaultHours returns null for both free and paid tiers (confirmed by the in-code comment "No task cleanup"). Clarify whether this is intentional or a description oversight.

Confidence Score: 3/5

Not safe to merge as-is: the S3 cleanup gap will permanently orphan workspace_file objects in object storage on every cleanup run.

One confirmed P1 data-integrity bug (workspace_file S3 objects never deleted) that will silently accumulate orphaned cloud storage objects on each cron execution. Everything else — batching logic, auth, migration, Trigger.dev wiring, enterprise UI — is well-structured.

apps/sim/background/cleanup-soft-deletes.ts — cleanupWorkspaceFileStorage must also cover the workspaceFile (singular) table

Important Files Changed

Filename Overview
apps/sim/background/cleanup-soft-deletes.ts Batched soft-delete cleanup with partial-index support; P1 bug: workspaceFile S3 objects are never deleted when DB rows are purged
apps/sim/background/cleanup-tasks.ts Task/chat/run cleanup with correct deletion ordering; duplicate copilotChats query for feedback deletion (addressed in prior thread)
apps/sim/background/cleanup-logs.ts Batched execution log cleanup with S3 file handling; dynamic import inside loop (addressed in prior thread)
apps/sim/lib/billing/cleanup-dispatcher.ts Dispatches free/paid/enterprise cleanup jobs; taskCleanupHours returns null for free/paid despite PR description claiming 7-day free default
apps/sim/lib/cleanup/chat-cleanup.ts Collects file refs from workspaceFiles and JSONB messages, calls copilot backend, deletes S3 files with correct context per file
apps/sim/app/api/workspaces/[id]/data-retention/route.ts GET/PUT API for data retention config; properly gates writes to enterprise plan with admin permission check and audit logging
packages/db/migrations/0193_unknown_franklin_richards.sql Adds log/soft-delete/task retention columns to workspace; replaces full indexes with partial indexes on deleted_at/archived_at for query efficiency
apps/sim/ee/data-retention/components/data-retention-settings.tsx Enterprise-gated UI for configuring retention periods; correctly renders locked vs. editable views based on plan
apps/sim/ee/data-retention/hooks/data-retention.ts React Query hooks for data retention with correct staleTime, signal forwarding, and onSettled invalidation
apps/sim/lib/core/async-jobs/backends/trigger-dev.ts Adds cleanup-logs, cleanup-soft-deletes, cleanup-tasks to the Trigger.dev job type mapping; no issues

Sequence Diagram

sequenceDiagram
    participant Cron as Cron (GET /api/cron/*)
    participant Dispatcher as dispatchCleanupJobs
    participant Queue as JobQueue (Trigger.dev / DB)
    participant Task as Background Task
    participant DB as Database
    participant S3 as Object Storage
    participant Copilot as Copilot Backend

    Cron->>Dispatcher: dispatchCleanupJobs(jobType, retentionColumn)
    Dispatcher->>Queue: enqueue free-tier job
    Dispatcher->>Queue: enqueue paid-tier job
    Dispatcher->>DB: query enterprise workspaces with non-NULL retention
    Dispatcher->>Queue: batchTrigger enterprise jobs

    Queue->>Task: run(payload)
    Task->>DB: resolveTierWorkspaceIds or lookup workspace retention
    Task->>DB: SELECT expiring rows (batched, LIMIT 2000)
    Task->>S3: delete associated files (pre-deletion)
    Task->>Copilot: POST /api/tasks/cleanup (chat IDs)
    Task->>DB: DELETE rows by ID
    Task-->>Queue: complete
Loading

Reviews (2): Last reviewed commit: "fix lint" | Re-trigger Greptile

Comment thread apps/sim/background/cleanup-soft-deletes.ts
Comment thread apps/sim/background/cleanup-tasks.ts
Comment thread apps/sim/lib/billing/cleanup-dispatcher.ts Outdated
Comment thread apps/sim/background/cleanup-logs.ts Outdated
@TheodoreSpeaks
Copy link
Copy Markdown
Collaborator Author

@greptile review

Comment on lines +99 to +130
async function cleanupWorkspaceFileStorage(
workspaceIds: string[],
retentionDate: Date
): Promise<{ filesDeleted: number; filesFailed: number }> {
const stats = { filesDeleted: 0, filesFailed: 0 }

if (!isUsingCloudStorage() || workspaceIds.length === 0) return stats

const filesToDelete = await db
.select({ key: workspaceFiles.key })
.from(workspaceFiles)
.where(
and(
inArray(workspaceFiles.workspaceId, workspaceIds),
isNotNull(workspaceFiles.deletedAt),
lt(workspaceFiles.deletedAt, retentionDate)
)
)
.limit(BATCH_SIZE * MAX_BATCHES_PER_TABLE)

for (const file of filesToDelete) {
try {
await StorageService.deleteFile({ key: file.key, context: 'workspace' })
stats.filesDeleted++
} catch (error) {
stats.filesFailed++
logger.error(`Failed to delete storage file ${file.key}:`, { error })
}
}

return stats
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 workspaceFile S3 objects orphaned on hard-delete

cleanupWorkspaceFileStorage queries only the workspaceFiles (plural) table, but CLEANUP_TARGETS also includes workspaceFile (singular — workspace_file table). That table has a key column pointing directly to an S3 object. When cleanupTable purges those DB rows, the S3 objects are never removed, leaving permanent storage orphans.

The function needs a parallel query over the workspaceFile table and should call StorageService.deleteFile for each matching key before cleanupTable deletes the rows.

Comment thread apps/sim/lib/billing/cleanup-dispatcher.ts Outdated
logRetentionHours: ws.logRetentionHours,
softDeleteRetentionHours: ws.softDeleteRetentionHours,
taskCleanupHours: ws.taskCleanupHours,
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enterprise effective values ignore configured defaults when null

Medium Severity

For enterprise workspaces, the effective retention values are set directly from the workspace's configured columns (ws.logRetentionHours, etc.) with no fallback. When an enterprise workspace hasn't configured any retention, all effective values are null (meaning "forever"/no cleanup). This means the dispatcher's isNotNull(retentionCol) check never matches, so no cleanup job is ever dispatched. Enterprise workspaces that want the system's enterprise defaults (which are also null) behave identically to those who explicitly chose "Forever" — there's no way to distinguish "not yet configured" from "intentionally set to never." This prevents an enterprise admin from expecting any out-of-the-box cleanup behavior.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit adb8f71. Configure here.

Comment thread apps/sim/background/cleanup-tasks.ts
Comment thread apps/sim/background/cleanup-soft-deletes.ts Outdated
Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 3 total unresolved issues (including 1 from previous review).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 4d76165. Configure here.

.where(eq(workspace.id, payload.workspaceId))
.limit(1)

if (!ws?.hours) return null
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Falsy check skips enterprise cleanup for zero-hour retention

Low Severity

resolveCleanupScope uses !ws?.hours which is a falsy check that treats 0 as equivalent to null/not-configured. While the Zod validation on the PUT endpoint enforces min(24), the database column is an unconstrained integer. If a value of 0 were ever stored (e.g., via migration or direct DB access), cleanup would silently skip that workspace instead of applying zero-hour retention. A strict null check like ws?.hours == null would be more correct.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 4d76165. Configure here.

function hoursToDisplayDays(hours: number | null): string {
if (hours === null) return 'never'
return String(Math.round(hours / 24))
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Select displays broken state for non-standard hour values

Low Severity

hoursToDisplayDays converts hours to a day string via Math.round(hours / 24), but the result may not match any entry in DAY_OPTIONS. The API accepts any integer between 24 and 43800, so values like 36 hours would produce '2' — not in the select options. This causes the Select component to have a value with no matching SelectItem, resulting in a blank or broken display for enterprise workspaces whose retention was set via direct API call.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 4d76165. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants