Background Jobs
- Version: 2.0
- Last Updated: 2026-01-15
- Status: Ready for Stakeholder Approval
- Change Log: Converted from technical specification to business-focused requirements with measurable outcomes
Overview
Purpose
Enable reliable, automated processing of time-intensive and scheduled business operations without impacting user-facing application performance.
Problem Statement
Critical business operations fail or degrade user experience when processed synchronously:
- Timeout Failures: Complex operations (company bans, bulk updates) exceed API response limits, causing incomplete transactions
- Lost Revenue: Payment processing failures without retries result in missed revenue capture
- Customer Churn: Missed automated touchpoints (birthday reminders) reduce engagement
- Manual Overhead: Staff manually re-trigger failed operations, wasting 5-10 hours/week
- Data Inconsistencies: Multi-step operations that fail mid-process leave data in corrupted states
Business Value
| Value Claim | Measurable Outcome | Baseline | Target | Timeframe |
|---|---|---|---|---|
| Reliability | Successful completion rate for critical operations | 85% (estimated; transient failures cause drops) | 99.5%+ with automatic retries | 3 months post-launch |
| Revenue Protection | Payment capture rate from webhook events | Manual intervention required for ~5% failures | <0.5% requiring intervention | 3 months |
| Operational Efficiency | Staff hours spent on manual job retriggers | 5-10 hours/week | <1 hour/week | 3 months |
| User Experience | API response time for complex operations | 8-15 seconds (blocking) | <500ms (immediate response) | Immediate |
| Customer Engagement | Automated touchpoint delivery rate (birthday, reminders) | 70% (manual/inconsistent) | 98%+ automated delivery | 6 months |
| Data Integrity | Incomplete multi-step operations requiring cleanup | 2-3 incidents/month | Zero incomplete operations | 3 months |
Success Metrics
| Metric | Definition | Baseline | Target | Measurement Method |
|---|---|---|---|---|
| Job Success Rate | (Completed jobs / Total jobs) × 100 | 85% (estimated) | 99.5% | Job status tracking in database |
| Retry Resolution Rate | Jobs that succeed after retry / Jobs that needed retry | N/A (no retry system) | 95%+ | Job tries counter analysis |
| API Response Time | P95 response time for endpoints that enqueue jobs | 8-15 seconds | <500ms | APM monitoring (Vercel Analytics) |
| Scheduled Task Reliability | (Executed cron jobs / Expected cron jobs) × 100 | 90% (manual verification) | 99.9% | Cron execution logs |
| Manual Intervention Hours | Staff time spent re-triggering failed operations | 5-10 hours/week | <1 hour/week | Time tracking + support tickets |
| Payment Capture Rate | Successful payment processing from webhooks | 95% | 99.5% | Stripe dashboard + job logs |
| Customer Touchpoint Delivery | Birthday reminders and notifications sent on schedule | 70% | 98%+ | Mailgun delivery reports |
| Mean Time to Recovery | Average time from job failure to successful completion | 24+ hours (manual) | <15 minutes (automatic retry) | Job timestamp analysis |
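The Job Success Rate and Retry Resolution Rate definitions in the table above can be sketched as two small calculations over job records. This is an illustration only: the `JobRecord` shape and all function names are hypothetical, and it assumes that `tries` counts every attempt including the first and that only finished jobs (completed or failed) count toward the totals.

```typescript
type JobStatus = "pending" | "processing" | "completed" | "failed";

interface JobRecord {
  status: JobStatus;
  tries: number; // assumption: counts every attempt, including the first
}

/** Percentage helper that returns null when there is nothing to measure. */
function percentage(part: number, whole: number): number | null {
  return whole === 0 ? null : (part / whole) * 100;
}

/** (Completed jobs / Total jobs) x 100, counting only finished jobs. */
function jobSuccessRate(jobs: JobRecord[]): number | null {
  const finished = jobs.filter((j) => j.status === "completed" || j.status === "failed");
  const completed = finished.filter((j) => j.status === "completed");
  return percentage(completed.length, finished.length);
}

/** Jobs that succeeded after a retry / jobs that needed a retry, as a percentage. */
function retryResolutionRate(jobs: JobRecord[]): number | null {
  const retried = jobs.filter(
    (j) => j.tries > 1 && (j.status === "completed" || j.status === "failed"),
  );
  const resolved = retried.filter((j) => j.status === "completed");
  return percentage(resolved.length, retried.length);
}

const sampleJobs: JobRecord[] = [
  { status: "completed", tries: 1 },
  { status: "completed", tries: 2 }, // recovered by a retry
  { status: "failed", tries: 4 }, // first attempt plus 3 retries, all failed
  { status: "completed", tries: 1 },
  { status: "pending", tries: 0 }, // not finished, so excluded from both rates
];

const sampleSuccessRate = jobSuccessRate(sampleJobs); // 3 of 4 finished jobs completed
const sampleRetryResolution = retryResolutionRate(sampleJobs); // 1 of 2 retried jobs recovered
```

For the sample data the success rate is 75 and the retry resolution rate is 50. Whether pending and in-progress jobs belong in the denominator is a reporting decision the requirements leave open.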
User Stories
P0 - Critical
As a Finance Manager
- I want payment processing to complete reliably with automatic retries so revenue is captured without manual intervention
- Acceptance Criteria:
  - GIVEN a Stripe payment webhook event
  - WHEN processing fails due to a transient error
  - THEN the system automatically retries up to 3 times
  - AND payment is captured within 15 minutes of the original event
  - AND I receive an alert only if all retries fail
As an Operations Manager
- I want company ban/unban operations to complete fully so data remains consistent
- Acceptance Criteria:
  - GIVEN a company ban request affecting 50+ vehicles and orders
  - WHEN the operation is initiated
  - THEN all vehicles are unpublished AND all orders are deactivated
  - AND the operation completes within 5 minutes regardless of volume
  - AND no partial states exist if any step fails
P1 - High Priority
As a Customer Success Manager
- I want birthday reminders sent automatically so customers feel valued without manual tracking
- Acceptance Criteria:
  - GIVEN customers with birthdays in the system
  - WHEN their birthday arrives
  - THEN a reminder email is sent before 9 AM local time
  - AND delivery confirmation is logged
  - AND I can verify delivery rates in reports
As a Fleet Manager
- I want calendar synchronization to happen automatically so availability is always accurate
- Acceptance Criteria:
  - GIVEN external calendar integrations configured
  - WHEN synchronization runs on schedule
  - THEN availability is updated within 30 minutes of external changes
  - AND conflicts are flagged for manual review
As an IT Administrator
- I want visibility into job execution status so I can monitor system health proactively
- Acceptance Criteria:
  - GIVEN active background jobs in the system
  - WHEN I access the monitoring dashboard
  - THEN I see job counts by status (pending, processing, completed, failed)
  - AND I can identify stuck or repeatedly failing jobs
  - AND I receive alerts for jobs that fail all retry attempts
P2 - Medium Priority
As a Support Agent
- I want failed jobs to retry automatically so I don’t need to manually re-trigger operations
- Acceptance Criteria:
  - GIVEN a job that fails due to a temporary issue
  - WHEN automatic retries are exhausted without success
  - THEN I receive a notification with job details
  - AND I can manually retry with one click
As a Tenant Administrator
- I want plan limit checks to run automatically so I’m notified before exceeding limits
- Acceptance Criteria:
  - GIVEN tenant usage approaching plan limits
  - WHEN the scheduled check runs
  - THEN I receive a warning notification at 80% usage
  - AND service degrades gracefully at 100%
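The 80% and 100% thresholds in the plan limit criteria can be sketched as a simple classification. The function and type names are hypothetical, and the sketch assumes usage and limit are counts of the same unit; what graceful degradation actually does at the limit is not specified here.

```typescript
type LimitState = "ok" | "warning" | "limit-reached";

/**
 * Classify tenant usage against its plan limit.
 * Integer arithmetic avoids floating-point rounding at the 80% boundary.
 */
function classifyUsage(used: number, limit: number): LimitState {
  if (limit <= 0) {
    throw new RangeError("limit must be positive");
  }
  if (used >= limit) return "limit-reached"; // 100% or more: degrade gracefully
  if (used * 100 >= limit * 80) return "warning"; // 80% or more: notify the tenant administrator
  return "ok";
}

const usageAt79 = classifyUsage(79, 100); // "ok"
const usageAt80 = classifyUsage(80, 100); // "warning"
const usageAt100 = classifyUsage(100, 100); // "limit-reached"
```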
Functional Requirements
FR-1: Reliable Job Processing
- System must process jobs asynchronously without blocking API responses
- System must retry failed jobs automatically up to 3 times with exponential backoff
- System must track job status (pending, processing, completed, failed) with timestamps
- System must complete 99.5% of jobs successfully including retries
FR-2: Scheduled Task Automation
- System must execute recurring tasks on defined schedules (daily, hourly, custom intervals)
- System must authenticate scheduled task triggers to prevent unauthorized execution
- System must support the following automated tasks:
  - Birthday reminder notifications (daily)
  - Calendar synchronization (configurable intervals)
  - Tenant plan limit checks (daily)
  - Vehicle reminder notifications
  - Vehicle purchase status checks
FR-3: Complex Workflow Support
- System must support multi-step operations that maintain consistency across all steps
- System must roll back or report partial failures without leaving data in inconsistent states
- System must support the following workflows:
  - Company ban (unpublish all vehicles, deactivate all orders)
  - Company unban (republish vehicles, reactivate orders)
  - Marketplace integration toggle (update companies and vehicle models)
  - Bulk search index updates
FR-4: Event-Driven Processing
- System must process external events (payment webhooks, notifications) reliably
- System must handle the following event types:
  - Payment intent processing
  - Charge succeeded confirmations
  - Email notification delivery
  - External marketplace events
FR-5: Batch Operations
- System must process large datasets in batches without overwhelming system resources
- System must support batch sizes of 50-100 items depending on operation complexity
- System must track batch progress and report completion status
FR-6: Security and Authorization
- System must verify that job triggers originate from authorized sources
- System must encrypt sensitive data during transmission
- System must reject unauthorized job execution attempts
FR-7: Monitoring and Alerting
- System must log job execution with sufficient detail for debugging
- System must report errors to monitoring systems with context
- System must alert operations team when jobs fail all retry attempts
Acceptance Criteria
AC-1: Job Reliability
- GIVEN a critical business operation (payment, ban, notification)
- WHEN the operation encounters a transient failure
- THEN the system automatically retries
- AND 95%+ of retried jobs succeed within 15 minutes
- AND operations team is notified only for persistent failures
AC-2: Non-Blocking Operations
- GIVEN a user initiating a complex operation (company ban with 100+ entities)
- WHEN the request is submitted
- THEN the API responds within 500ms with confirmation
- AND the operation completes asynchronously within 5 minutes
- AND the user can check status at any time
AC-3: Scheduled Task Execution
- GIVEN a configured daily task (birthday reminders)
- WHEN the scheduled time arrives
- THEN the task executes within 5 minutes of scheduled time
- AND 99.9% of scheduled executions occur as expected over 30 days
AC-4: Data Consistency
- GIVEN a multi-step workflow (company ban)
- WHEN any step fails
- THEN either all steps complete OR none complete
- AND no partial states exist requiring manual cleanup
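One way to read the all-or-nothing rule in AC-4 is as a compensation pattern: each step carries an undo action, and a failure undoes every step already applied. The sketch below is illustrative, not the specified design. All names are hypothetical, the state is an in-memory object standing in for real data, and a database transaction could serve the same purpose where every step touches the same database.

```typescript
interface WorkflowStep {
  label: string;
  apply: () => void; // performs the step
  revert: () => void; // undoes the step if a later one fails
}

type WorkflowResult =
  | { outcome: "completed" }
  | { outcome: "rolled-back"; failedStep: string };

/** Run steps in order; on the first failure, undo completed steps in reverse order. */
function runAllOrNothing(steps: WorkflowStep[]): WorkflowResult {
  const applied: WorkflowStep[] = [];
  for (const step of steps) {
    try {
      step.apply();
      applied.push(step);
    } catch {
      for (const done of applied.reverse()) {
        done.revert();
      }
      return { outcome: "rolled-back", failedStep: step.label };
    }
  }
  return { outcome: "completed" };
}

// Simulated company ban in which the second step fails.
const company = { vehiclesPublished: true, ordersActive: true };

const banSteps: WorkflowStep[] = [
  {
    label: "unpublish vehicles",
    apply: () => { company.vehiclesPublished = false; },
    revert: () => { company.vehiclesPublished = true; },
  },
  {
    label: "deactivate orders",
    apply: () => { throw new Error("simulated transient failure"); },
    revert: () => { company.ordersActive = true; },
  },
];

const banResult = runAllOrNothing(banSteps);
```

After the simulated failure, the vehicles are published again and the orders were never touched, so no partial state remains. A production version would also need to handle a revert that itself fails, which this sketch does not attempt.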
AC-5: Security Enforcement
- GIVEN a job execution request without valid authorization
- WHEN the system receives the request
- THEN the request is rejected
- AND no job processing occurs
- AND the attempt is logged for security review
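The reject-and-log behavior in AC-5 can be sketched for a scheduled trigger as follows. This assumes the caller presents a shared secret as a bearer token in an Authorization header. The function names, the in-memory log, and the HTTP status codes are hypothetical, and the signature verification required for queue-delivered jobs is a separate mechanism not shown here.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

/** Compare two strings in constant time by comparing their SHA-256 digests. */
function safeEqual(a: string, b: string): boolean {
  const digestA = createHash("sha256").update(a).digest();
  const digestB = createHash("sha256").update(b).digest();
  return timingSafeEqual(digestA, digestB);
}

type RejectionReason = "missing-secret" | "missing-header" | "bad-token";

const rejectedAttempts: RejectionReason[] = []; // stand-in for a security audit log

/** Returns null when the trigger is authorized, otherwise the reason it was rejected. */
function checkTrigger(
  authorizationHeader: string | undefined,
  expectedSecret: string | undefined,
): RejectionReason | null {
  if (!expectedSecret) return "missing-secret"; // fail closed if the secret is not configured
  if (!authorizationHeader) return "missing-header";
  return safeEqual(authorizationHeader, `Bearer ${expectedSecret}`) ? null : "bad-token";
}

/** Run the job only for authorized triggers; log and reject everything else. */
function handleTrigger(
  authorizationHeader: string | undefined,
  expectedSecret: string | undefined,
  runJob: () => void,
): number {
  const rejection = checkTrigger(authorizationHeader, expectedSecret);
  if (rejection !== null) {
    rejectedAttempts.push(rejection); // logged for security review; no processing occurs
    return 401;
  }
  runJob();
  return 200;
}
```

Hashing both values before comparing keeps the comparison constant-time even when the lengths differ, since `timingSafeEqual` requires equal-length inputs.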
AC-6: Batch Processing
- GIVEN 250 items requiring processing
- WHEN a batch job is triggered
- THEN items are processed in manageable batches (50 each)
- AND overall completion status is tracked
- AND partial progress is preserved if interrupted
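The AC-6 scenario (250 items, batches of 50, resumable progress) can be sketched as follows. The names are hypothetical, the per-batch handler is a placeholder, and the checkpoint is a plain object where a real system would persist it.

```typescript
/** Split items into consecutive batches of at most `batchSize`. */
function toBatches<T>(items: readonly T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let start = 0; start < items.length; start += batchSize) {
    batches.push(items.slice(start, start + batchSize));
  }
  return batches;
}

interface BatchProgress {
  totalItems: number;
  processedItems: number; // checkpoint so an interrupted run can resume
}

/** Process the items after the checkpoint, advancing it once per finished batch. */
function runBatches<T>(
  items: readonly T[],
  batchSize: number,
  checkpoint: BatchProgress,
  handleBatch: (batch: T[]) => void,
): BatchProgress {
  let current = checkpoint;
  for (const batch of toBatches(items.slice(current.processedItems), batchSize)) {
    handleBatch(batch);
    current = { ...current, processedItems: current.processedItems + batch.length };
  }
  return current;
}

const workItems: number[] = [];
for (let i = 0; i < 250; i += 1) {
  workItems.push(i);
}

const batchSizes = toBatches(workItems, 50).map((batch) => batch.length); // five batches of 50
```

Resuming from a checkpoint of 100 processed items handles only the remaining 150 items, in three batches. If `handleBatch` throws, the checkpoint from the last finished batch is the one to persist, so a retry starts from that batch.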
Business Rules
BR-1: Retry Policy
- Jobs must retry up to 3 times before being marked as failed
- Retry delays must increase between attempts (exponential backoff)
- Jobs with explicit delays must wait before first execution
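The retry policy in BR-1 can be sketched as a delay schedule. Only the limit of 3 retries comes from the rule itself. The 30-second base delay and the doubling factor are assumptions for illustration, and the function names are hypothetical.

```typescript
const MAX_RETRIES = 3; // from BR-1
const BASE_DELAY_SECONDS = 30; // assumption, not specified in the requirements
const BACKOFF_FACTOR = 2; // assumption: the delay doubles on each retry

/** Delay before retry number `retry` (1-based). */
function retryDelaySeconds(retry: number): number {
  if (!Number.isInteger(retry) || retry < 1 || retry > MAX_RETRIES) {
    throw new RangeError(`retry must be an integer between 1 and ${MAX_RETRIES}`);
  }
  return BASE_DELAY_SECONDS * BACKOFF_FACTOR ** (retry - 1);
}

/** A job is retried until it has used all of its retries, then marked as failed. */
function shouldRetry(retriesUsed: number): boolean {
  return retriesUsed < MAX_RETRIES;
}

const retrySchedule = [1, 2, 3].map((retry) => retryDelaySeconds(retry)); // [30, 60, 120]
const totalWaitSeconds = retrySchedule.reduce((sum, delay) => sum + delay, 0); // 210
```

With these assumed values the three retries wait 30, 60, and 120 seconds, 210 seconds in total, which leaves room inside the 15-minute recovery window in AC-1.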
BR-2: Job Retention
- Completed job records must be retained for 90 days for audit purposes
- Failed job records must be retained until manually resolved
- Temporary workflow state may expire after 24 hours
BR-3: Batch Size Limits
- Standard batch size is 50-100 items depending on operation complexity
- Batches must not exceed resource limits that would cause timeouts
BR-4: Duplicate Prevention
- Idempotent operations must not be duplicated if retriggered
- Duplicate detection must be based on a content hash where appropriate
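Content-hash duplicate detection as described in BR-4 can be sketched by hashing the job type together with a canonical form of the payload. SHA-256, the key format, and all names below are assumptions for illustration. The in-memory set stands in for whatever uniqueness check the job store provides.

```typescript
import { createHash } from "node:crypto";

type JsonValue =
  | string
  | number
  | boolean
  | null
  | JsonValue[]
  | { [key: string]: JsonValue };

/** Serialize with sorted object keys so equal payloads always produce the same string. */
function canonicalize(value: JsonValue): string {
  if (value === null || typeof value !== "object") {
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    return `[${value.map((item) => canonicalize(item)).join(",")}]`;
  }
  const record = value;
  const entries = Object.keys(record)
    .sort()
    .map((key) => `${JSON.stringify(key)}:${canonicalize(record[key])}`);
  return `{${entries.join(",")}}`;
}

/** Hypothetical deduplication key: SHA-256 over the job type and canonical payload. */
function dedupKey(jobType: string, payload: JsonValue): string {
  return createHash("sha256")
    .update(`${jobType}\n${canonicalize(payload)}`)
    .digest("hex");
}

const seenKeys = new Set<string>(); // stand-in for a unique constraint in the job store

/** Returns true if the job was accepted, false if it duplicates an earlier one. */
function enqueueOnce(jobType: string, payload: JsonValue): boolean {
  const key = dedupKey(jobType, payload);
  if (seenKeys.has(key)) return false;
  seenKeys.add(key);
  return true;
}
```

Calling `enqueueOnce` twice with the same job type and payload accepts the first call and rejects the second, regardless of the order in which the payload's keys were written.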
BR-5: Environment Behavior
- Development environments may bypass certain security checks for testing
- Production environments must enforce all security requirements
BR-6: Priority Handling
- Payment-related jobs must have highest retry priority
- Customer-facing notifications must complete before internal tasks when resources are constrained
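The ordering in BR-6 can be sketched as a sort comparator over a backlog. The three job classes follow the rule. The numeric ranks, the oldest-first tie-break, and all names are assumptions for illustration.

```typescript
// Lower rank is handled first. The numeric values are an assumption.
const PRIORITY_RANK = { payment: 0, customerNotification: 1, internal: 2 } as const;
type JobClass = keyof typeof PRIORITY_RANK;

interface QueuedJob {
  id: string;
  jobClass: JobClass;
  enqueuedAt: number; // epoch milliseconds
}

/** Order by class rank first, then oldest first within a class. */
function byPriority(a: QueuedJob, b: QueuedJob): number {
  return PRIORITY_RANK[a.jobClass] - PRIORITY_RANK[b.jobClass] || a.enqueuedAt - b.enqueuedAt;
}

const backlog: QueuedJob[] = [
  { id: "reindex-1", jobClass: "internal", enqueuedAt: 1 },
  { id: "birthday-1", jobClass: "customerNotification", enqueuedAt: 2 },
  { id: "charge-2", jobClass: "payment", enqueuedAt: 4 },
  { id: "charge-1", jobClass: "payment", enqueuedAt: 3 },
];

// Copy before sorting so the original backlog order is preserved.
const processingOrder = [...backlog].sort(byPriority).map((job) => job.id);
// charge-1, charge-2, birthday-1, reindex-1
```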
Dependencies
Upstream Dependencies
| System | Purpose | Impact if Unavailable |
|---|---|---|
| QStash/Upstash | Job queue infrastructure | All background processing halts |
| Vercel Cron | Scheduled task triggers | Automated tasks stop executing |
| PostgreSQL | Job status persistence | Cannot track or recover job state |
Downstream Dependencies
| System | Purpose | Jobs Affected |
|---|---|---|
| Stripe | Payment processing | Payment intent, charge handling |
| Mailgun | Email delivery | Birthday reminders, notifications |
| Google Calendar | Availability sync | Calendar synchronization |
| Search Service | Entity indexing | Search re-indexing workflows |
Integration Points
- Payment Processing: Stripe webhook events trigger payment jobs
- Email Notifications: Notification jobs invoke Mailgun API
- Marketplace Sync: Billion integration jobs update external listings
- Calendar Sync: Google Calendar API for availability updates
Non-Functional Requirements
Performance
- Job enqueueing must complete in <100ms
- API endpoints that enqueue jobs must respond in <500ms
- Batch processing must handle 1,000+ items without timeout
Reliability
- 99.5%+ job success rate including retries
- 99.9%+ scheduled task execution rate
- Zero data loss from job failures
Scalability
- Support 10,000+ jobs per day
- Handle burst loads of 500+ concurrent job triggers
- Scale batch sizes dynamically based on system load
Security
- All job triggers must be cryptographically verified
- Sensitive data encrypted in transit
- Job execution logs must not expose PII
Monitoring
- Real-time job status visibility
- Alerting within 5 minutes for stuck jobs
- Historical metrics retained for 90 days
Glossary
| Term | Business Definition |
|---|---|
| Background Job | A business operation that runs automatically without making users wait, like sending birthday emails or processing payments |
| Batch Processing | Handling many items in groups to avoid overwhelming the system, like updating 250 vehicles 50 at a time |
| Cron Job | An automated task that runs on a schedule, like daily birthday reminder checks |
| Retry | Automatic re-attempt of a failed operation, ensuring temporary glitches don’t cause permanent failures |
| Workflow | A multi-step business process where all steps must succeed together, like banning a company (unpublish vehicles AND deactivate orders) |
| Webhook | An automatic notification from an external service (like Stripe) that triggers a job in our system |
| Job Queue | A waiting line for operations to be processed in order |
| Idempotent | An operation that produces the same result if run multiple times, preventing duplicate actions |
Appendix: Technical Implementation Notes
For engineering reference only - not part of business requirements
- Infrastructure: QStash/Upstash for message queue, Vercel Cron for scheduling
- Job persistence: PostgreSQL jobs table
- Signature verification required for all job endpoints
- Environment variables: QSTASH_TOKEN, QSTASH_CURRENT_SIGNING_KEY, QSTASH_NEXT_SIGNING_KEY, CRON_SECRET