Background Jobs

Version: 2.0
Last Updated: 2026-01-15
Status: Ready for Stakeholder Approval
Change Log: Converted from technical specification to business-focused requirements with measurable outcomes


Overview

Purpose

Enable reliable, automated processing of time-intensive and scheduled business operations without impacting user-facing application performance.

Problem Statement

Critical business operations fail or degrade user experience when processed synchronously:

  • Timeout Failures: Complex operations (company bans, bulk updates) exceed API response limits, causing incomplete transactions
  • Lost Revenue: Payment processing failures without retries result in missed revenue capture
  • Customer Churn: Missed automated touchpoints (birthday reminders) reduce engagement
  • Manual Overhead: Staff manually re-trigger failed operations, wasting 5-10 hours/week
  • Data Inconsistencies: Multi-step operations that fail mid-process leave data in corrupted states

Business Value

| Value Claim | Measurable Outcome | Baseline | Target | Timeframe |
|---|---|---|---|---|
| Reliability | Successful completion rate for critical operations | 85% (estimated, transient failures cause drops) | 99.5%+ with automatic retries | 3 months post-launch |
| Revenue Protection | Payment capture rate from webhook events | Manual intervention required for ~5% failures | <0.5% requiring intervention | 3 months |
| Operational Efficiency | Staff hours spent on manual job retriggers | 5-10 hours/week | <1 hour/week | 3 months |
| User Experience | API response time for complex operations | 8-15 seconds (blocking) | <500ms (immediate response) | Immediate |
| Customer Engagement | Automated touchpoint delivery rate (birthday, reminders) | 70% (manual/inconsistent) | 98%+ automated delivery | 6 months |
| Data Integrity | Incomplete multi-step operations requiring cleanup | 2-3 incidents/month | Zero incomplete operations | 3 months |

Success Metrics

| Metric | Definition | Baseline | Target | Measurement Method |
|---|---|---|---|---|
| Job Success Rate | (Completed jobs / Total jobs) × 100 | 85% (estimated) | 99.5% | Job status tracking in database |
| Retry Resolution Rate | Jobs that succeed after retry / Jobs that needed retry | N/A (no retry system) | 95%+ | Job tries counter analysis |
| API Response Time | P95 response time for endpoints that enqueue jobs | 8-15 seconds | <500ms | APM monitoring (Vercel Analytics) |
| Scheduled Task Reliability | (Executed cron jobs / Expected cron jobs) × 100 | 90% (manual verification) | 99.9% | Cron execution logs |
| Manual Intervention Hours | Staff time spent re-triggering failed operations | 5-10 hours/week | <1 hour/week | Time tracking + support tickets |
| Payment Capture Rate | Successful payment processing from webhooks | 95% | 99.5% | Stripe dashboard + job logs |
| Customer Touchpoint Delivery | Birthday reminders and notifications sent on schedule | 70% | 98%+ | Mailgun delivery reports |
| Mean Time to Recovery | Average time from job failure to successful completion | 24+ hours (manual) | <15 minutes (automatic retry) | Job timestamp analysis |
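
For engineering reference, the two ratio metrics above could be computed from job records roughly as follows. This is a minimal sketch: the `JobRecord` shape and its field names are assumptions made for illustration, and it assumes that only jobs in a terminal state (completed or failed) count toward the totals and that a `tries` value above 1 means the job needed a retry.

```typescript
// Hypothetical job record; the field names are assumptions, not the real schema.
type JobStatus = "pending" | "processing" | "completed" | "failed";

interface JobRecord {
  status: JobStatus;
  tries: number; // attempts made so far; 1 means no retry was needed
}

// Only jobs that have reached a terminal state are counted (assumption).
function isFinished(job: JobRecord): boolean {
  return job.status === "completed" || job.status === "failed";
}

// Job Success Rate = (Completed jobs / Total jobs) × 100.
// Returns null when there are no finished jobs to measure.
function jobSuccessRate(jobs: readonly JobRecord[]): number | null {
  const finished = jobs.filter(isFinished);
  if (finished.length === 0) return null;
  const completed = finished.filter((j) => j.status === "completed").length;
  return (completed / finished.length) * 100;
}

// Retry Resolution Rate = jobs that succeed after retry / jobs that needed retry,
// expressed as a percentage. Returns null when no job needed a retry.
function retryResolutionRate(jobs: readonly JobRecord[]): number | null {
  const retried = jobs.filter((j) => isFinished(j) && j.tries > 1);
  if (retried.length === 0) return null;
  const recovered = retried.filter((j) => j.status === "completed").length;
  return (recovered / retried.length) * 100;
}
```

Returning `null` instead of 0 for an empty denominator keeps "nothing to measure" distinct from "everything failed" in reports.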

User Stories

P0 - Critical

As a Finance Manager

  • I want payment processing to complete reliably with automatic retries so revenue is captured without manual intervention
  • Acceptance Criteria:
    • GIVEN a Stripe payment webhook event
    • WHEN processing fails due to a transient error
    • THEN the system automatically retries up to 3 times
    • AND payment is captured within 15 minutes of original event
    • AND I receive an alert only if all retries fail

As an Operations Manager

  • I want company ban/unban operations to complete fully so data remains consistent
  • Acceptance Criteria:
    • GIVEN a company ban request affecting 50+ vehicles and orders
    • WHEN the operation is initiated
    • THEN all vehicles are unpublished AND all orders are deactivated
    • AND the operation completes within 5 minutes regardless of volume
    • AND no partial states exist if any step fails

P1 - High Priority

As a Customer Success Manager

  • I want birthday reminders sent automatically so customers feel valued without manual tracking
  • Acceptance Criteria:
    • GIVEN customers with birthdays in the system
    • WHEN their birthday arrives
    • THEN a reminder email is sent before 9 AM local time
    • AND delivery confirmation is logged
    • AND I can verify delivery rates in reports

As a Fleet Manager

  • I want calendar synchronization to happen automatically so availability is always accurate
  • Acceptance Criteria:
    • GIVEN external calendar integrations configured
    • WHEN synchronization runs on schedule
    • THEN availability is updated within 30 minutes of external changes
    • AND conflicts are flagged for manual review

As an IT Administrator

  • I want visibility into job execution status so I can monitor system health proactively
  • Acceptance Criteria:
    • GIVEN active background jobs in the system
    • WHEN I access the monitoring dashboard
    • THEN I see job counts by status (pending, processing, completed, failed)
    • AND I can identify stuck or repeatedly failing jobs
    • AND I receive alerts for jobs that fail all retry attempts

P2 - Medium Priority

As a Support Agent

  • I want failed jobs to retry automatically so I don’t need to manually re-trigger operations
  • Acceptance Criteria:
    • GIVEN a job that fails due to a temporary issue
    • WHEN automatic retries are exhausted without success
    • THEN I receive a notification with job details
    • AND I can manually retry with one click

As a Tenant Administrator

  • I want plan limit checks to run automatically so I’m notified before exceeding limits
  • Acceptance Criteria:
    • GIVEN tenant usage approaching plan limits
    • WHEN the scheduled check runs
    • THEN I receive a warning notification at 80% usage
    • AND service degradation occurs gracefully at 100%
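
For engineering reference, the 80% warning and 100% limit thresholds in the Tenant Administrator story could be evaluated as in the sketch below. The function name and the state labels are hypothetical; only the two thresholds come from the acceptance criteria.

```typescript
// Hypothetical state labels for a tenant's usage against its plan limit.
type PlanLimitState = "ok" | "warning" | "limit_reached";

// Classifies usage using integer arithmetic to avoid floating-point edge cases
// at exactly 80%.
function planLimitState(used: number, limit: number): PlanLimitState {
  if (limit <= 0) throw new RangeError("limit must be positive");
  if (used >= limit) return "limit_reached"; // 100%: degrade gracefully
  if (used * 100 >= limit * 80) return "warning"; // 80%: send warning notification
  return "ok";
}
```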

Functional Requirements

FR-1: Reliable Job Processing

  • System must process jobs asynchronously without blocking API responses
  • System must retry failed jobs automatically up to 3 times with exponential backoff
  • System must track job status (pending, processing, completed, failed) with timestamps
  • System must complete 99.5% of jobs successfully including retries
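
For engineering reference, the retry rule in FR-1 could be expressed as a delay schedule like the sketch below. The 60-second base delay and the doubling factor are assumptions chosen for illustration; only the limit of 3 retries and the use of exponential backoff come from the requirement.

```typescript
const MAX_RETRIES = 3; // from FR-1

// Returns the delay before the given retry (1-based), or null when the retry
// budget is exhausted and the job should be marked as failed.
// baseDelayMs is an assumed default, not a value taken from the requirements.
function retryDelayMs(retryNumber: number, baseDelayMs = 60_000): number | null {
  if (!Number.isInteger(retryNumber) || retryNumber < 1) {
    throw new RangeError("retryNumber must be a positive integer");
  }
  if (retryNumber > MAX_RETRIES) return null;
  // Exponential backoff: the delay doubles with each retry.
  return baseDelayMs * 2 ** (retryNumber - 1);
}
```

With these assumed values the three retries wait 1, 2, and 4 minutes, 7 minutes in total, which leaves headroom inside the 15-minute recovery window used elsewhere in this document.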

FR-2: Scheduled Task Automation

  • System must execute recurring tasks on defined schedules (daily, hourly, custom intervals)
  • System must authenticate scheduled task triggers to prevent unauthorized execution
  • System must support the following automated tasks:
    • Birthday reminder notifications (daily)
    • Calendar synchronization (configurable intervals)
    • Tenant plan limit checks (daily)
    • Vehicle reminder notifications
    • Vehicle purchase status checks
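
For engineering reference, authenticating a scheduled task trigger could look like the sketch below. It assumes the trigger carries the `CRON_SECRET` value as a bearer token in the `Authorization` header, which is how Vercel documents securing cron jobs; the function names are hypothetical, and the check fails closed when no secret is configured.

```typescript
// Compares two strings without returning early on the first differing character.
// A length mismatch is rejected immediately; the length is not treated as secret.
function constantTimeEqual(a: string, b: string): boolean {
  if (a.length !== b.length) return false;
  let diff = 0;
  for (let i = 0; i < a.length; i++) {
    diff |= a.charCodeAt(i) ^ b.charCodeAt(i);
  }
  return diff === 0;
}

// Returns true only when a secret is configured and the header matches it.
function isAuthorizedCronRequest(
  authorizationHeader: string | null | undefined,
  cronSecret: string | undefined,
): boolean {
  if (!cronSecret) return false; // fail closed: no secret means no access
  if (!authorizationHeader) return false;
  return constantTimeEqual(authorizationHeader, `Bearer ${cronSecret}`);
}
```

A route handler would call this with the incoming header and `process.env.CRON_SECRET`, then reject the request and log the attempt when it returns false.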

FR-3: Complex Workflow Support

  • System must support multi-step operations that maintain consistency across all steps
  • System must roll back or report partial failures without leaving data in inconsistent states
  • System must support the following workflows:
    • Company ban (unpublish all vehicles, deactivate all orders)
    • Company unban (republish vehicles, reactivate orders)
    • Marketplace integration toggle (update companies and vehicle models)
    • Bulk search index updates
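
For engineering reference, one way to keep a multi-step workflow consistent is to pair every step with a compensating action and undo completed steps in reverse order when a later step fails. The sketch below illustrates that pattern only; it is not a description of the actual implementation. All names are hypothetical, the steps are synchronous for brevity (real steps would be asynchronous), and a single database transaction may be a simpler choice when all affected data lives in one PostgreSQL database.

```typescript
interface WorkflowStep {
  name: string;
  run: () => void; // applies the step; throws on failure
  compensate: () => void; // undoes the step
}

interface WorkflowResult {
  ok: boolean;
  failedStep?: string;
  compensated: string[]; // steps rolled back, in rollback order
}

// Runs steps in order. On the first failure, undoes the completed steps in
// reverse order so that no partial state remains. A failing compensate() would
// propagate to the caller; a real system would need to alert on that case.
function runWorkflow(steps: readonly WorkflowStep[]): WorkflowResult {
  const done: WorkflowStep[] = [];
  for (const step of steps) {
    try {
      step.run();
      done.push(step);
    } catch {
      const compensated: string[] = [];
      for (const previous of done.reverse()) {
        previous.compensate();
        compensated.push(previous.name);
      }
      return { ok: false, failedStep: step.name, compensated };
    }
  }
  return { ok: true, compensated: [] };
}

// Usage: a company ban whose second step fails (in-memory stand-in for real data).
const banState = { vehiclesPublished: true, ordersActive: true };
const banResult = runWorkflow([
  {
    name: "unpublish-vehicles",
    run: () => { banState.vehiclesPublished = false; },
    compensate: () => { banState.vehiclesPublished = true; },
  },
  {
    name: "deactivate-orders",
    run: () => { throw new Error("simulated failure"); },
    compensate: () => { banState.ordersActive = true; },
  },
]);
```

In this usage example the vehicles are republished after the order step fails, so the company ends up in its original state.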

FR-4: Event-Driven Processing

  • System must process external events (payment webhooks, notifications) reliably
  • System must handle the following event types:
    • Payment intent processing
    • Charge succeeded confirmations
    • Email notification delivery
    • External marketplace events

FR-5: Batch Operations

  • System must process large datasets in batches without overwhelming system resources
  • System must support batch sizes of 50-100 items depending on operation complexity
  • System must track batch progress and report completion status

FR-6: Security and Authorization

  • System must verify that job triggers originate from authorized sources
  • System must encrypt sensitive data during transmission
  • System must reject unauthorized job execution attempts

FR-7: Monitoring and Alerting

  • System must log job execution with sufficient detail for debugging
  • System must report errors to monitoring systems with context
  • System must alert operations team when jobs fail all retry attempts

Acceptance Criteria

AC-1: Job Reliability

  • GIVEN a critical business operation (payment, ban, notification)
  • WHEN the operation encounters a transient failure
  • THEN the system automatically retries
  • AND 95%+ of retried jobs succeed within 15 minutes
  • AND operations team is notified only for persistent failures

AC-2: Non-Blocking Operations

  • GIVEN a user initiating a complex operation (company ban with 100+ entities)
  • WHEN the request is submitted
  • THEN the API responds within 500ms with confirmation
  • AND the operation completes asynchronously within 5 minutes
  • AND the user can check status at any time

AC-3: Scheduled Task Execution

  • GIVEN a configured daily task (birthday reminders)
  • WHEN the scheduled time arrives
  • THEN the task executes within 5 minutes of scheduled time
  • AND 99.9% of scheduled executions occur as expected over 30 days

AC-4: Data Consistency

  • GIVEN a multi-step workflow (company ban)
  • WHEN any step fails
  • THEN either all steps complete OR none complete
  • AND no partial states exist requiring manual cleanup

AC-5: Security Enforcement

  • GIVEN a job execution request without valid authorization
  • WHEN the system receives the request
  • THEN the request is rejected
  • AND no job processing occurs
  • AND the attempt is logged for security review

AC-6: Batch Processing

  • GIVEN 250 items requiring processing
  • WHEN a batch job is triggered
  • THEN items are processed in manageable batches (50 each)
  • AND overall completion status is tracked
  • AND partial progress is preserved if interrupted
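
For engineering reference, the batching behavior in AC-6 could be sketched as below: items are handled in fixed-size batches, progress is tracked as a count, and an interrupted run can resume from the last completed batch. The function and field names are hypothetical, and the handler is synchronous for brevity.

```typescript
interface BatchProgress {
  total: number;
  processed: number; // items handled so far; also the index to resume from
  done: boolean;
}

// Processes items in batches of batchSize, starting at index startAt.
// If a batch fails, returns the progress made so far instead of losing it.
function processInBatches<T>(
  items: readonly T[],
  batchSize: number,
  handleBatch: (batch: readonly T[]) => void,
  startAt = 0,
): BatchProgress {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  let processed = startAt;
  while (processed < items.length) {
    const batch = items.slice(processed, processed + batchSize);
    try {
      handleBatch(batch);
    } catch {
      // Completed batches stay counted; the failed batch is retried on resume.
      return { total: items.length, processed, done: false };
    }
    processed += batch.length;
  }
  return { total: items.length, processed, done: true };
}
```

With 250 items and a batch size of 50 the handler runs five times. Because a failed batch is retried in full on resume, the handler should be idempotent.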

Business Rules

BR-1: Retry Policy

  • Jobs must retry up to 3 times before marking as failed
  • Retry delays must increase between attempts (exponential backoff)
  • Jobs with explicit delays must wait for the configured delay before first execution

BR-2: Job Retention

  • Completed job records must be retained for 90 days for audit purposes
  • Failed job records must be retained until manually resolved
  • Temporary workflow state may expire after 24 hours

BR-3: Batch Size Limits

  • Standard batch size is 50-100 items depending on operation complexity
  • Batches must not exceed resource limits that would cause timeouts

BR-4: Duplicate Prevention

  • Idempotent operations must not be duplicated if retriggered
  • Duplicate detection should be based on a content hash where appropriate
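
For engineering reference, a content hash for duplicate detection could be derived as in the sketch below. It assumes a Node.js runtime (for `node:crypto`) and JSON-serializable payloads; the function names and the in-memory `seenKeys` set are hypothetical stand-ins, and a real system would more likely rely on a unique database column or the queue's own deduplication support.

```typescript
import { createHash } from "node:crypto";

// Recursively sorts object keys so that logically equal payloads serialize
// identically regardless of property order.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const sorted: Record<string, unknown> = {};
    for (const key of Object.keys(source).sort()) {
      sorted[key] = canonicalize(source[key]);
    }
    return sorted;
  }
  return value;
}

// SHA-256 over the job type plus the canonical payload, as a hex string.
function deduplicationKey(jobType: string, payload: unknown): string {
  const canonical = JSON.stringify([jobType, canonicalize(payload)]);
  return createHash("sha256").update(canonical).digest("hex");
}

// In-memory stand-in for a persistent uniqueness check (assumption).
const seenKeys = new Set<string>();

// Returns true the first time a (type, payload) pair is seen, false afterwards.
function shouldEnqueue(jobType: string, payload: unknown): boolean {
  const key = deduplicationKey(jobType, payload);
  if (seenKeys.has(key)) return false;
  seenKeys.add(key);
  return true;
}
```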

BR-5: Environment Behavior

  • Development environments may bypass certain security checks for testing
  • Production environments must enforce all security requirements

BR-6: Priority Handling

  • Payment-related jobs must have highest retry priority
  • Customer-facing notifications must complete before internal tasks when resources are constrained
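
For engineering reference, the ordering implied by BR-6 could be sketched as a simple ranking applied when resources are constrained. The category names and the `QueuedJob` shape are assumptions; the rule only states that payment jobs rank highest and that customer-facing notifications precede internal tasks.

```typescript
// Hypothetical job categories, ranked from most to least urgent.
type JobCategory = "payment" | "customer_notification" | "internal";

interface QueuedJob {
  id: string;
  category: JobCategory;
}

const CATEGORY_RANK: Record<JobCategory, number> = {
  payment: 0,
  customer_notification: 1,
  internal: 2,
};

// Returns a new array ordered by category rank. Array.prototype.sort is stable,
// so jobs in the same category keep their original (arrival) order.
function orderByPriority(jobs: readonly QueuedJob[]): QueuedJob[] {
  return [...jobs].sort((a, b) => CATEGORY_RANK[a.category] - CATEGORY_RANK[b.category]);
}
```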

Dependencies

Upstream Dependencies

| System | Purpose | Impact if Unavailable |
|---|---|---|
| QStash/Upstash | Job queue infrastructure | All background processing halts |
| Vercel Cron | Scheduled task triggers | Automated tasks stop executing |
| PostgreSQL | Job status persistence | Cannot track or recover job state |

Downstream Dependencies

| System | Purpose | Jobs Affected |
|---|---|---|
| Stripe | Payment processing | Payment intent, charge handling |
| Mailgun | Email delivery | Birthday reminders, notifications |
| Google Calendar | Availability sync | Calendar synchronization |
| Search Service | Entity indexing | Search re-indexing workflows |

Integration Points

  • Payment Processing: Stripe webhook events trigger payment jobs
  • Email Notifications: Notification jobs invoke Mailgun API
  • Marketplace Sync: Billion integration jobs update external listings
  • Calendar Sync: Google Calendar API for availability updates

Non-Functional Requirements

Performance

  • Job enqueueing must complete in <100ms
  • API endpoints that enqueue jobs must respond in <500ms
  • Batch processing must handle 1,000+ items without timeout

Reliability

  • 99.5%+ job success rate including retries
  • 99.9%+ scheduled task execution rate
  • Zero data loss from job failures

Scalability

  • Support 10,000+ jobs per day
  • Handle burst loads of 500+ concurrent job triggers
  • Scale batch sizes dynamically based on system load

Security

  • All job triggers must be cryptographically verified
  • Sensitive data must be encrypted in transit
  • Job execution logs must not expose PII

Monitoring

  • Real-time job status visibility
  • Alerting within 5 minutes for stuck jobs
  • Historical metrics retained for 90 days
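
For engineering reference, status visibility and stuck-job detection could be derived from job records as in the sketch below. The `TrackedJob` shape is hypothetical, and "stuck" is assumed here to mean a job that has stayed in the processing state longer than a configurable threshold; the requirements do not define the term.

```typescript
type JobStatus = "pending" | "processing" | "completed" | "failed";

// Hypothetical job record; updatedAt is the time of the last status change.
interface TrackedJob {
  id: string;
  status: JobStatus;
  updatedAt: Date;
}

// Job counts by status, as shown on a monitoring dashboard.
function countByStatus(jobs: readonly TrackedJob[]): Record<JobStatus, number> {
  const counts: Record<JobStatus, number> = { pending: 0, processing: 0, completed: 0, failed: 0 };
  for (const job of jobs) counts[job.status] += 1;
  return counts;
}

// Jobs that have been processing for longer than maxProcessingMs (assumed definition).
function findStuckJobs(jobs: readonly TrackedJob[], now: Date, maxProcessingMs: number): TrackedJob[] {
  return jobs.filter(
    (job) => job.status === "processing" && now.getTime() - job.updatedAt.getTime() > maxProcessingMs,
  );
}
```

Running `findStuckJobs` on a schedule shorter than the alerting window is one way to meet the 5-minute alerting requirement.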

Glossary

| Term | Business Definition |
|---|---|
| Background Job | A business operation that runs automatically without making users wait, like sending birthday emails or processing payments |
| Batch Processing | Handling many items in groups to avoid overwhelming the system, like updating 250 vehicles 50 at a time |
| Cron Job | An automated task that runs on a schedule, like daily birthday reminder checks |
| Retry | Automatic re-attempt of a failed operation, ensuring temporary glitches don’t cause permanent failures |
| Workflow | A multi-step business process where all steps must succeed together, like banning a company (unpublish vehicles AND deactivate orders) |
| Webhook | An automatic notification from an external service (like Stripe) that triggers a job in our system |
| Job Queue | A waiting line for operations to be processed in order |
| Idempotent | An operation that produces the same result if run multiple times, preventing duplicate actions |

Appendix: Technical Implementation Notes

For engineering reference only - not part of business requirements

  • Infrastructure: QStash/Upstash for message queue, Vercel Cron for scheduling
  • Job persistence: PostgreSQL jobs table
  • Signature verification required for all job endpoints
  • Environment variables: QSTASH_TOKEN, QSTASH_CURRENT_SIGNING_KEY, QSTASH_NEXT_SIGNING_KEY, CRON_SECRET
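
A startup check for the environment variables listed above could look like the following sketch. The function name is hypothetical; the variable names are taken from the list above, and blank values are treated as missing.

```typescript
const REQUIRED_ENV_VARS = [
  "QSTASH_TOKEN",
  "QSTASH_CURRENT_SIGNING_KEY",
  "QSTASH_NEXT_SIGNING_KEY",
  "CRON_SECRET",
] as const;

// Returns the names of required variables that are unset or blank.
function missingEnvVars(env: Record<string, string | undefined>): string[] {
  return REQUIRED_ENV_VARS.filter((name) => !env[name]?.trim());
}
```

In production, a caller could pass `process.env` and refuse to start when the returned list is non-empty, which matches the rule that production environments must enforce all security requirements.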