vt-c-error-monitoring

Integrate error monitoring (Sentry, DataDog, LogRocket) to track errors, performance, and user sessions in production. Essential for validating fixes and detecting regressions.

Plugin: core-standards
Category: Operations
Command: /vt-c-error-monitoring


Error Monitoring Skill

Purpose: Ensure production visibility through proper error tracking, performance monitoring, and alerting. You cannot validate fixes or detect regressions without observability.

Why This Matters

Without error monitoring:

  • Bugs in production go unnoticed until users complain
  • Performance regressions are invisible
  • You can't validate that fixes actually work
  • Root cause analysis is guesswork

Core Principle: Instrument Before Deploying

NEVER deploy a new feature without:

  1. Error tracking configured
  2. Key metrics instrumented
  3. Alerts set up for critical paths
  4. Baseline performance recorded
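
The principle can be enforced mechanically. A minimal sketch of a pre-deploy gate, assuming the `SENTRY_DSN` and `GIT_COMMIT_SHA` environment variables used in the configuration examples below (the function name is illustrative):

```typescript
// Hypothetical pre-deploy gate: refuse to run in production unless
// monitoring is configured. SENTRY_DSN and GIT_COMMIT_SHA match the
// configuration examples later in this document.
export function assertMonitoringConfigured(env: Record<string, string | undefined>): void {
  if (env.NODE_ENV !== 'production') return; // only enforced in production
  const missing = ['SENTRY_DSN', 'GIT_COMMIT_SHA'].filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Refusing to start without monitoring: missing ${missing.join(', ')}`);
  }
}
```

Calling this at process startup (before binding the server) turns a silent observability gap into a loud deploy failure.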


Tool Recommendations

Error Tracking

| Tool | Best For | Pricing |
|------|----------|---------|
| Sentry | Error tracking, release tracking | Free tier, then usage-based |
| Bugsnag | Mobile + web errors | Per-device pricing |
| Rollbar | Real-time error tracking | Usage-based |

Performance Monitoring

| Tool | Best For | Pricing |
|------|----------|---------|
| DataDog | Full-stack APM + infrastructure | Per-host pricing |
| New Relic | APM + browser monitoring | Usage-based |
| Grafana Cloud | Open-source friendly, metrics | Free tier available |

Session Replay

| Tool | Best For | Pricing |
|------|----------|---------|
| LogRocket | Session replay + error context | Session-based |
| FullStory | UX analytics + replay | Session-based |
| Hotjar | Heatmaps + recordings | Usage-based |

Implementation Patterns

Backend (Node.js/Express)

// src/lib/monitoring.ts
import * as Sentry from '@sentry/node';
import type { Request, Response, NextFunction } from 'express';

export function initMonitoring() {
  if (process.env.NODE_ENV === 'production') {
    Sentry.init({
      dsn: process.env.SENTRY_DSN,
      environment: process.env.NODE_ENV,
      release: process.env.GIT_COMMIT_SHA,

      // Performance monitoring
      tracesSampleRate: 0.1, // 10% of transactions

      // Filter sensitive data
      beforeSend(event) {
        // Remove sensitive headers
        if (event.request?.headers) {
          delete event.request.headers['authorization'];
          delete event.request.headers['cookie'];
        }
        return event;
      },
    });
  }
}

// Error handler middleware — must be registered after all routes.
// Express identifies error handlers by the 4-argument signature,
// so keep the unused `_next` parameter.
export function errorHandler(err: Error, req: Request, res: Response, _next: NextFunction) {
  Sentry.captureException(err, {
    extra: {
      url: req.url,
      method: req.method,
      userId: (req as any).user?.id, // assumes auth middleware attaches req.user
    },
  });

  // Don't expose error details in production
  const message = process.env.NODE_ENV === 'production'
    ? 'Internal server error'
    : err.message;

  res.status(500).json({ error: message });
}
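
The `beforeSend` header filter is easier to trust if it is extracted as a pure function that can be unit-tested without the Sentry SDK. A sketch under that assumption (the `x-api-key` entry is an addition, not from the original config):

```typescript
// Pure version of the beforeSend header scrubber, testable without Sentry.
type EventLike = { request?: { headers?: Record<string, string> } };

// 'x-api-key' is an assumed extra entry; 'authorization' and 'cookie'
// match the beforeSend example above.
const SENSITIVE_HEADERS = ['authorization', 'cookie', 'x-api-key'];

export function scrubHeaders(event: EventLike): EventLike {
  const headers = event.request?.headers;
  if (headers) {
    for (const name of SENSITIVE_HEADERS) {
      delete headers[name];
    }
  }
  return event;
}
```

Inside `Sentry.init`, `beforeSend` would then just delegate: `beforeSend(event) { return scrubHeaders(event); }`.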

Frontend (React)

// src/lib/monitoring.ts
import * as Sentry from '@sentry/react';

export function initFrontendMonitoring() {
  if (process.env.NODE_ENV === 'production') {
    Sentry.init({
      dsn: process.env.REACT_APP_SENTRY_DSN,
      environment: process.env.NODE_ENV,
      release: process.env.REACT_APP_VERSION,

      integrations: [
        // Browser tracing for performance (Sentry SDK v8+ functional style)
        Sentry.browserTracingIntegration(),
        // Capture console.error calls as Sentry events
        Sentry.captureConsoleIntegration({ levels: ['error'] }),
      ],

      // Outgoing requests to these targets get trace headers attached
      tracePropagationTargets: ['localhost', /^https:\/\/api\.yourapp\.com/],

      tracesSampleRate: 0.1,
    });
  }
}

// Error boundary wrapper
export const SentryErrorBoundary = Sentry.ErrorBoundary;

// Manual error capture
export function captureError(error: Error, context?: Record<string, any>) {
  Sentry.captureException(error, { extra: context });
}

Rails

# config/initializers/sentry.rb
Sentry.init do |config|
  config.dsn = ENV['SENTRY_DSN']
  config.environment = Rails.env
  config.release = ENV['GIT_COMMIT_SHA']

  # Performance monitoring
  config.traces_sample_rate = 0.1

  # Filter sensitive params
  config.before_send = lambda do |event, hint|
    event.request.data = '[FILTERED]' if event.request&.data
    event
  end
end

What to Monitor

Critical Metrics

| Metric | Why | Alert Threshold |
|--------|-----|-----------------|
| Error rate | Detect regressions | > 1% of requests |
| P95 latency | Performance degradation | > 2x baseline |
| Apdex score | User satisfaction | < 0.9 |
| Crash-free sessions | App stability | < 99% |
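
To make these thresholds concrete, here is a sketch of how P95 latency and error rate could be computed from raw samples (nearest-rank percentile; the helper names are illustrative, not from any monitoring SDK):

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are <= it. Used to establish the P95 baseline in the table above.
export function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Error rate as a fraction of requests; 0.01 corresponds to the 1% threshold.
export function errorRate(errors: number, requests: number): number {
  return requests === 0 ? 0 : errors / requests;
}
```

In practice your APM tool computes these for you; the point of the sketch is that a "2x baseline" alert only means something once the baseline P95 has actually been recorded.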

Business Metrics

| Metric | Example |
|--------|---------|
| Conversion funnel | Signup → activation → purchase |
| Feature adoption | % users using new feature |
| API usage | Requests per endpoint |

Infrastructure

| Metric | Alert Threshold |
|--------|-----------------|
| CPU usage | > 80% sustained |
| Memory usage | > 85% |
| Disk space | < 20% free |
| Queue depth | > 1000 jobs |

Alerting Best Practices

Alert Severity Levels

# Example PagerDuty/Opsgenie configuration
alerts:
  critical:
    - name: "Error rate spike"
      condition: "error_rate > 5%"
      action: "page on-call"

  warning:
    - name: "Elevated latency"
      condition: "p95_latency > 2s"
      action: "slack notification"

  info:
    - name: "New error type"
      condition: "new_issue_created"
      action: "slack notification"

Avoid Alert Fatigue

  • Group related alerts - Don't page for every instance of same error
  • Set appropriate thresholds - Not too sensitive, not too loose
  • Auto-resolve - Clear alerts when condition resolves
  • Scheduled quiet hours - For non-critical alerts
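
The "group related alerts" rule can be sketched as a de-duplicator that pages at most once per error fingerprint within a cooldown window. This is not a real PagerDuty/Opsgenie API; every name here is an assumption for illustration:

```typescript
// Hypothetical alert de-duplicator: one page per fingerprint per cooldown
// window, so repeated instances of the same error are grouped, not paged.
export class AlertDeduper {
  private lastFiredAt = new Map<string, number>();

  constructor(private readonly cooldownMs: number) {}

  // True → send the page; false → fold into the already-fired alert.
  shouldFire(fingerprint: string, nowMs: number): boolean {
    const last = this.lastFiredAt.get(fingerprint);
    if (last !== undefined && nowMs - last < this.cooldownMs) {
      return false;
    }
    this.lastFiredAt.set(fingerprint, nowMs);
    return true;
  }
}
```

Real tools implement this as "alert grouping" or "notification throttling"; the sketch just shows why a fingerprint (error type + location) is the natural grouping key.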

Debugging Production Issues

When an Error Occurs

  1. Check the error in Sentry/monitoring tool
     • Stack trace
     • Request context
     • User affected
     • Frequency
  2. Check if this is new or recurring
     • First seen date
     • Trend (increasing/decreasing)
     • Related issues
  3. Gather context
     • Session replay (if available)
     • Related logs
     • Recent deployments
  4. Document in session-journal
     • What you learned
     • How you diagnosed it
     • Root cause

Pre-Deployment Checklist

Before deploying any feature:

  • [ ] Error tracking SDK initialized
  • [ ] Key user flows instrumented
  • [ ] Performance metrics baselined
  • [ ] Alerts configured for new endpoints
  • [ ] Release tagged in monitoring tool
  • [ ] Rollback procedure documented

Integration with Toolkit

With finalization-orchestrator

The finalization-orchestrator should verify:

### Monitoring Verification
- [ ] Sentry/monitoring tool configured
- [ ] Release version tagged
- [ ] Baseline metrics recorded
- [ ] Alerts active for new features

With bugfix-orchestrator

After fixing a bug:

### Post-Fix Monitoring
- [ ] Error no longer appearing in monitoring
- [ ] No new errors introduced
- [ ] Performance metrics stable

With continuous-learning

Document monitoring insights:

### Learning: [Error Pattern]
- How it was detected: [Sentry alert, user report, etc.]
- Time to detection: [minutes/hours/days]
- Improvement: [What to monitor better next time]


Common Mistakes to Avoid

1. Logging Sensitive Data

// ❌ BAD - logs password
Sentry.captureMessage(`Login failed for ${email} with password ${password}`);

// ✅ GOOD - sanitized
Sentry.captureMessage(`Login failed for ${email}`);

2. Missing Context

// ❌ BAD - no context
Sentry.captureException(error);

// ✅ GOOD - includes context
Sentry.captureException(error, {
  extra: { userId, action, input },
  tags: { feature: 'checkout' },
});

3. Alerting on Every Error

# ❌ BAD - alert on every error
alert: error_count > 0

# ✅ GOOD - alert on significant increase
alert: error_rate > 1% AND error_count > 10
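
The "good" rule above, expressed as a predicate (thresholds are the example's values, not universal defaults):

```typescript
// Alert only when BOTH a relative threshold (rate > 1%) and an absolute
// threshold (count > 10) are exceeded, so low-traffic blips don't page.
export function shouldAlert(errorCount: number, requestCount: number): boolean {
  if (requestCount === 0) return false;
  return errorCount / requestCount > 0.01 && errorCount > 10;
}
```

The absolute-count clause is what prevents a single failed request out of twenty from looking like a "5% error rate" incident.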

4. No Baseline

// ❌ BAD - deploying without baseline
deploy();

// ✅ GOOD - record baseline first
recordBaselineMetrics();
deploy();
compareWithBaseline();
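
A sketch of what the `compareWithBaseline` step might check, using the "> 2x baseline" P95 threshold from the Critical Metrics table (the function name and signature are assumptions):

```typescript
// Flag a regression when the post-deploy P95 latency exceeds the recorded
// baseline by the given factor (2x, per the Critical Metrics table).
export function isRegression(baselineP95Ms: number, currentP95Ms: number, factor = 2): boolean {
  return currentP95Ms > baselineP95Ms * factor;
}
```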

Resources