vt-c-error-monitoring

Integrate error monitoring (Sentry, DataDog, LogRocket) to track errors, performance, and user sessions in production. Essential for validating fixes and detecting regressions.

Plugin: core-standards
Category: Operations
Command: /vt-c-error-monitoring


Error Monitoring Skill

Purpose: Ensure production visibility through proper error tracking, performance monitoring, and alerting. You cannot validate fixes or detect regressions without observability.

Why This Matters

Without error monitoring:

  • Bugs in production go unnoticed until users complain
  • Performance regressions are invisible
  • You can't validate that fixes actually work
  • Root cause analysis is guesswork

Core Principle: Instrument Before Deploying

NEVER deploy a new feature without:

  1. Error tracking configured
  2. Key metrics instrumented
  3. Alerts set up for critical paths
  4. Baseline performance recorded
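
The principle can be enforced mechanically. A minimal sketch of a pre-deploy gate, assuming the `SENTRY_DSN` and `GIT_COMMIT_SHA` environment variables used in the configuration examples below (the function name is illustrative):

```typescript
// Hypothetical pre-deploy gate: refuse to run in production unless
// monitoring is configured. SENTRY_DSN and GIT_COMMIT_SHA match the
// configuration examples later in this document.
export function assertMonitoringConfigured(env: Record<string, string | undefined>): void {
  if (env.NODE_ENV !== 'production') return; // only enforced in production
  const missing = ['SENTRY_DSN', 'GIT_COMMIT_SHA'].filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Refusing to start without monitoring: missing ${missing.join(', ')}`);
  }
}
```

Calling this at process startup (before binding the server) turns a silent observability gap into a loud deploy failure.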


Tool Recommendations

Error Tracking

| Tool | Best For | Pricing |
|------|----------|---------|
| Sentry | Error tracking, release tracking | Free tier, then usage-based |
| Bugsnag | Mobile + web errors | Per-device pricing |
| Rollbar | Real-time error tracking | Usage-based |

Performance Monitoring

| Tool | Best For | Pricing |
|------|----------|---------|
| DataDog | Full-stack APM + infrastructure | Per-host pricing |
| New Relic | APM + browser monitoring | Usage-based |
| Grafana Cloud | Open-source friendly, metrics | Free tier available |

Session Replay

| Tool | Best For | Pricing |
|------|----------|---------|
| LogRocket | Session replay + error context | Session-based |
| FullStory | UX analytics + replay | Session-based |
| Hotjar | Heatmaps + recordings | Usage-based |

Implementation Patterns

Backend (Node.js/Express)

// src/lib/monitoring.ts
import * as Sentry from '@sentry/node';
import type { Request, Response, NextFunction } from 'express';

export function initMonitoring() {
  if (process.env.NODE_ENV === 'production') {
    Sentry.init({
      dsn: process.env.SENTRY_DSN,
      environment: process.env.NODE_ENV,
      release: process.env.GIT_COMMIT_SHA,

      // Performance monitoring
      tracesSampleRate: 0.1, // 10% of transactions

      // Filter sensitive data
      beforeSend(event) {
        // Remove sensitive headers
        if (event.request?.headers) {
          delete event.request.headers['authorization'];
          delete event.request.headers['cookie'];
        }
        return event;
      },
    });
  }
}

// Error handler middleware — must be registered after all routes.
// Express identifies error handlers by the 4-argument signature,
// so keep the unused `_next` parameter.
export function errorHandler(err: Error, req: Request, res: Response, _next: NextFunction) {
  Sentry.captureException(err, {
    extra: {
      url: req.url,
      method: req.method,
      userId: (req as any).user?.id, // assumes auth middleware attaches req.user
    },
  });

  // Don't expose error details in production
  const message = process.env.NODE_ENV === 'production'
    ? 'Internal server error'
    : err.message;

  res.status(500).json({ error: message });
}
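
The `beforeSend` header filter is easier to trust if it is extracted as a pure function that can be unit-tested without the Sentry SDK. A sketch under that assumption (the `x-api-key` entry is an addition, not from the original config):

```typescript
// Pure version of the beforeSend header scrubber, testable without Sentry.
type EventLike = { request?: { headers?: Record<string, string> } };

// 'x-api-key' is an assumed extra entry; 'authorization' and 'cookie'
// match the beforeSend example above.
const SENSITIVE_HEADERS = ['authorization', 'cookie', 'x-api-key'];

export function scrubHeaders(event: EventLike): EventLike {
  const headers = event.request?.headers;
  if (headers) {
    for (const name of SENSITIVE_HEADERS) {
      delete headers[name];
    }
  }
  return event;
}
```

Inside `Sentry.init`, `beforeSend` would then just delegate: `beforeSend(event) { return scrubHeaders(event); }`.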

Frontend (React)

// src/lib/monitoring.ts
import * as Sentry from '@sentry/react';

export function initFrontendMonitoring() {
  if (process.env.NODE_ENV === 'production') {
    Sentry.init({
      dsn: process.env.REACT_APP_SENTRY_DSN,
      environment: process.env.NODE_ENV,
      release: process.env.REACT_APP_VERSION,

      integrations: [
        // Browser tracing for performance (Sentry SDK v8+ functional style)
        Sentry.browserTracingIntegration(),
        // Capture console.error calls as Sentry events
        Sentry.captureConsoleIntegration({ levels: ['error'] }),
      ],

      // Outgoing requests to these targets get trace headers attached
      tracePropagationTargets: ['localhost', /^https:\/\/api\.yourapp\.com/],

      tracesSampleRate: 0.1,
    });
  }
}

// Error boundary wrapper
export const SentryErrorBoundary = Sentry.ErrorBoundary;

// Manual error capture
export function captureError(error: Error, context?: Record<string, any>) {
  Sentry.captureException(error, { extra: context });
}

Rails

# config/initializers/sentry.rb
Sentry.init do |config|
  config.dsn = ENV['SENTRY_DSN']
  config.environment = Rails.env
  config.release = ENV['GIT_COMMIT_SHA']

  # Performance monitoring
  config.traces_sample_rate = 0.1

  # Filter sensitive params
  config.before_send = lambda do |event, hint|
    event.request.data = '[FILTERED]' if event.request&.data
    event
  end
end

What to Monitor

Critical Metrics

| Metric | Why | Alert Threshold |
|--------|-----|-----------------|
| Error rate | Detect regressions | > 1% of requests |
| P95 latency | Performance degradation | > 2x baseline |
| Apdex score | User satisfaction | < 0.9 |
| Crash-free sessions | App stability | < 99% |
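
To make these thresholds concrete, here is a sketch of how P95 latency and error rate could be computed from raw samples (nearest-rank percentile; the helper names are illustrative, not from any monitoring SDK):

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are <= it. Used to establish the P95 baseline in the table above.
export function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Error rate as a fraction of requests; 0.01 corresponds to the 1% threshold.
export function errorRate(errors: number, requests: number): number {
  return requests === 0 ? 0 : errors / requests;
}
```

In practice your APM tool computes these for you; the point of the sketch is that a "2x baseline" alert only means something once the baseline P95 has actually been recorded.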

Business Metrics

| Metric | Example |
|--------|---------|
| Conversion funnel | Signup → activation → purchase |
| Feature adoption | % users using new feature |
| API usage | Requests per endpoint |

Infrastructure

| Metric | Alert Threshold |
|--------|-----------------|
| CPU usage | > 80% sustained |
| Memory usage | > 85% |
| Disk space | < 20% free |
| Queue depth | > 1000 jobs |

Alerting Best Practices

Alert Severity Levels

# Example PagerDuty/Opsgenie configuration
alerts:
  critical:
    - name: "Error rate spike"
      condition: "error_rate > 5%"
      action: "page on-call"

  warning:
    - name: "Elevated latency"
      condition: "p95_latency > 2s"
      action: "slack notification"

  info:
    - name: "New error type"
      condition: "new_issue_created"
      action: "slack notification"

Avoid Alert Fatigue

  • Group related alerts - Don't page for every instance of same error
  • Set appropriate thresholds - Not too sensitive, not too loose
  • Auto-resolve - Clear alerts when condition resolves
  • Scheduled quiet hours - For non-critical alerts
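
The "group related alerts" rule can be sketched as a de-duplicator that pages at most once per error fingerprint within a cooldown window. This is not a real PagerDuty/Opsgenie API; every name here is an assumption for illustration:

```typescript
// Hypothetical alert de-duplicator: one page per fingerprint per cooldown
// window, so repeated instances of the same error are grouped, not paged.
export class AlertDeduper {
  private lastFiredAt = new Map<string, number>();

  constructor(private readonly cooldownMs: number) {}

  // True → send the page; false → fold into the already-fired alert.
  shouldFire(fingerprint: string, nowMs: number): boolean {
    const last = this.lastFiredAt.get(fingerprint);
    if (last !== undefined && nowMs - last < this.cooldownMs) {
      return false;
    }
    this.lastFiredAt.set(fingerprint, nowMs);
    return true;
  }
}
```

Real tools implement this as "alert grouping" or "notification throttling"; the sketch just shows why a fingerprint (error type + location) is the natural grouping key.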

Debugging Production Issues

When an Error Occurs

  1. Check the error in Sentry/monitoring tool
     • Stack trace
     • Request context
     • User affected
     • Frequency
  2. Check if this is new or recurring
     • First seen date
     • Trend (increasing/decreasing)
     • Related issues
  3. Gather context
     • Session replay (if available)
     • Related logs
     • Recent deployments
  4. Document in session-journal
     • What you learned
     • How you diagnosed it
     • Root cause

Pre-Deployment Checklist

Before deploying any feature:

  • [ ] Error tracking SDK initialized
  • [ ] Key user flows instrumented
  • [ ] Performance metrics baselined
  • [ ] Alerts configured for new endpoints
  • [ ] Release tagged in monitoring tool
  • [ ] Rollback procedure documented

Integration with Toolkit

With finalization-orchestrator

The finalization-orchestrator should verify:

### Monitoring Verification
- [ ] Sentry/monitoring tool configured
- [ ] Release version tagged
- [ ] Baseline metrics recorded
- [ ] Alerts active for new features

With bugfix-orchestrator

After fixing a bug:

### Post-Fix Monitoring
- [ ] Error no longer appearing in monitoring
- [ ] No new errors introduced
- [ ] Performance metrics stable

With continuous-learning

Document monitoring insights:

### Learning: [Error Pattern]
- How it was detected: [Sentry alert, user report, etc.]
- Time to detection: [minutes/hours/days]
- Improvement: [What to monitor better next time]


Common Mistakes to Avoid

1. Logging Sensitive Data

// ❌ BAD - logs password
Sentry.captureMessage(`Login failed for ${email} with password ${password}`);

// ✅ GOOD - sanitized
Sentry.captureMessage(`Login failed for ${email}`);

2. Missing Context

// ❌ BAD - no context
Sentry.captureException(error);

// ✅ GOOD - includes context
Sentry.captureException(error, {
  extra: { userId, action, input },
  tags: { feature: 'checkout' },
});

3. Alerting on Every Error

# ❌ BAD - alert on every error
alert: error_count > 0

# ✅ GOOD - alert on significant increase
alert: error_rate > 1% AND error_count > 10
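
The "good" rule above, expressed as a predicate (thresholds are the example's values, not universal defaults):

```typescript
// Alert only when BOTH a relative threshold (rate > 1%) and an absolute
// threshold (count > 10) are exceeded, so low-traffic blips don't page.
export function shouldAlert(errorCount: number, requestCount: number): boolean {
  if (requestCount === 0) return false;
  return errorCount / requestCount > 0.01 && errorCount > 10;
}
```

The absolute-count clause is what prevents a single failed request out of twenty from looking like a "5% error rate" incident.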

4. No Baseline

// ❌ BAD - deploying without baseline
deploy();

// ✅ GOOD - record baseline first
recordBaselineMetrics();
deploy();
compareWithBaseline();
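
A sketch of what the `compareWithBaseline` step might check, using the "> 2x baseline" P95 threshold from the Critical Metrics table (the function name and signature are assumptions):

```typescript
// Flag a regression when the post-deploy P95 latency exceeds the recorded
// baseline by the given factor (2x, per the Critical Metrics table).
export function isRegression(baselineP95Ms: number, currentP95Ms: number, factor = 2): boolean {
  return currentP95Ms > baselineP95Ms * factor;
}
```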

Resources