vt-c-error-monitoring¶
Integrate error monitoring (Sentry, DataDog, LogRocket) to track errors, performance, and user sessions in production. Essential for validating fixes and detecting regressions.
Plugin: core-standards
Category: Operations
Command: /vt-c-error-monitoring
Error Monitoring Skill¶
Purpose: Ensure production visibility through proper error tracking, performance monitoring, and alerting. You cannot validate fixes or detect regressions without observability.
Why This Matters¶
Without error monitoring:

- Bugs in production go unnoticed until users complain
- Performance regressions are invisible
- You can't validate that fixes actually work
- Root cause analysis is guesswork
Core Principle: Instrument Before Deploying¶
NEVER deploy a new feature without:

1. Error tracking configured
2. Key metrics instrumented
3. Alerts set up for critical paths
4. Baseline performance recorded
Tool Recommendations¶
Error Tracking¶
| Tool | Best For | Pricing |
|---|---|---|
| Sentry | Error tracking, release tracking | Free tier, then usage-based |
| Bugsnag | Mobile + web errors | Per-device pricing |
| Rollbar | Real-time error tracking | Usage-based |
Performance Monitoring¶
| Tool | Best For | Pricing |
|---|---|---|
| DataDog | Full-stack APM + infrastructure | Per-host pricing |
| New Relic | APM + browser monitoring | Usage-based |
| Grafana Cloud | Open-source friendly, metrics | Free tier available |
Session Replay¶
| Tool | Best For | Pricing |
|---|---|---|
| LogRocket | Session replay + error context | Session-based |
| FullStory | UX analytics + replay | Session-based |
| Hotjar | Heatmaps + recordings | Usage-based |
Implementation Patterns¶
Sentry Setup (Recommended Starting Point)¶
Backend (Node.js/Express)¶
// src/lib/monitoring.ts
import * as Sentry from '@sentry/node';
import type { Request, Response, NextFunction } from 'express';

export function initMonitoring() {
  if (process.env.NODE_ENV === 'production') {
    Sentry.init({
      dsn: process.env.SENTRY_DSN,
      environment: process.env.NODE_ENV,
      release: process.env.GIT_COMMIT_SHA,
      // Performance monitoring
      tracesSampleRate: 0.1, // 10% of transactions
      // Filter sensitive data
      beforeSend(event) {
        // Remove sensitive headers
        if (event.request?.headers) {
          delete event.request.headers['authorization'];
          delete event.request.headers['cookie'];
        }
        return event;
      },
    });
  }
}

// Error handler middleware (Express recognizes error handlers by the four-argument signature)
export function errorHandler(err: Error, req: Request, res: Response, next: NextFunction) {
  Sentry.captureException(err, {
    extra: {
      url: req.url,
      method: req.method,
      userId: req.user?.id, // assumes auth middleware has attached req.user
    },
  });
  // Don't expose error details in production
  const message = process.env.NODE_ENV === 'production'
    ? 'Internal server error'
    : err.message;
  res.status(500).json({ error: message });
}
Frontend (React)¶
// src/lib/monitoring.ts
import * as Sentry from '@sentry/react';
import { CaptureConsole } from '@sentry/integrations';

export function initFrontendMonitoring() {
  if (process.env.NODE_ENV === 'production') {
    Sentry.init({
      dsn: process.env.REACT_APP_SENTRY_DSN,
      environment: process.env.NODE_ENV,
      release: process.env.REACT_APP_VERSION,
      integrations: [
        // Browser tracing for performance
        new Sentry.BrowserTracing({
          tracePropagationTargets: ['localhost', /^https:\/\/api\.yourapp\.com/],
        }),
        // Capture console.error calls (pluggable integration from @sentry/integrations)
        new CaptureConsole({
          levels: ['error'],
        }),
      ],
      tracesSampleRate: 0.1,
    });
  }
}
// Error boundary wrapper
export const SentryErrorBoundary = Sentry.ErrorBoundary;
// Manual error capture
export function captureError(error: Error, context?: Record<string, any>) {
Sentry.captureException(error, { extra: context });
}
Rails¶
# config/initializers/sentry.rb
Sentry.init do |config|
  config.dsn = ENV['SENTRY_DSN']
  config.environment = Rails.env
  config.release = ENV['GIT_COMMIT_SHA']
  # Performance monitoring
  config.traces_sample_rate = 0.1
  # Filter sensitive params
  config.before_send = lambda do |event, hint|
    event.request.data = '[FILTERED]' if event.request&.data
    event
  end
end
What to Monitor¶
Critical Metrics¶
| Metric | Why | Alert Threshold |
|---|---|---|
| Error rate | Detect regressions | > 1% of requests |
| P95 latency | Performance degradation | > 2x baseline |
| Apdex score | User satisfaction | < 0.9 |
| Crash-free sessions | App stability | < 99% |
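The Apdex row above uses the standard formula: samples at or under the target latency T count as satisfied, samples up to 4T count half as tolerating, everything slower counts zero. A small sketch:

```typescript
// Apdex = (satisfied + tolerating / 2) / total samples.
// "satisfied": latency <= T; "tolerating": latency <= 4T.
export function apdex(latenciesMs: number[], thresholdMs: number): number {
  if (latenciesMs.length === 0) return 1; // no samples: treat as fully satisfied
  let satisfied = 0;
  let tolerating = 0;
  for (const ms of latenciesMs) {
    if (ms <= thresholdMs) satisfied++;
    else if (ms <= 4 * thresholdMs) tolerating++;
  }
  return (satisfied + tolerating / 2) / latenciesMs.length;
}
```

With the table's alert threshold, a score below 0.9 means a meaningful share of requests fall outside the satisfied band.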
Business Metrics¶
| Metric | Example |
|---|---|
| Conversion funnel | Signup → activation → purchase |
| Feature adoption | % users using new feature |
| API usage | Requests per endpoint |
Infrastructure¶
| Metric | Alert Threshold |
|---|---|
| CPU usage | > 80% sustained |
| Memory usage | > 85% |
| Disk space | < 20% free |
| Queue depth | > 1000 jobs |
Alerting Best Practices¶
Alert Severity Levels¶
# Example PagerDuty/Opsgenie configuration
alerts:
critical:
- name: "Error rate spike"
condition: "error_rate > 5%"
action: "page on-call"
warning:
- name: "Elevated latency"
condition: "p95_latency > 2s"
action: "slack notification"
info:
- name: "New error type"
condition: "new_issue_created"
action: "slack notification"
Avoid Alert Fatigue¶
- Group related alerts - Don't page for every instance of same error
- Set appropriate thresholds - Not too sensitive, not too loose
- Auto-resolve - Clear alerts when condition resolves
- Scheduled quiet hours - For non-critical alerts
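The "group related alerts" advice above can be sketched as a fingerprint-based cooldown: repeated alerts with the same fingerprint are suppressed until a window elapses. All names here are illustrative, not from any alerting SDK:

```typescript
// Suppress duplicate pages for the same error fingerprint within a cooldown window.
export class AlertDeduper {
  private lastSent = new Map<string, number>();

  constructor(private cooldownMs: number) {}

  shouldSend(fingerprint: string, nowMs: number): boolean {
    const last = this.lastSent.get(fingerprint);
    if (last !== undefined && nowMs - last < this.cooldownMs) {
      return false; // still in cooldown: suppress the duplicate
    }
    this.lastSent.set(fingerprint, nowMs);
    return true;
  }
}
```

Hosted tools implement this for you (Sentry groups by stack-trace fingerprint; PagerDuty and Opsgenie dedupe by alert key), but the mechanism is the same.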
Debugging Production Issues¶
When an Error Occurs¶
1. Check the error in Sentry/monitoring tool
   - Stack trace
   - Request context
   - Users affected
   - Frequency
2. Check whether it is new or recurring
   - First seen date
   - Trend (increasing/decreasing)
   - Related issues
3. Gather context
   - Session replay (if available)
   - Related logs
   - Recent deployments
4. Document in session-journal
   - What you learned
   - How you diagnosed it
   - Root cause
Pre-Deployment Checklist¶
Before deploying any feature:
- [ ] Error tracking SDK initialized
- [ ] Key user flows instrumented
- [ ] Performance metrics baselined
- [ ] Alerts configured for new endpoints
- [ ] Release tagged in monitoring tool
- [ ] Rollback procedure documented
Integration with Toolkit¶
With finalization-orchestrator¶
The finalization-orchestrator should verify:
### Monitoring Verification
- [ ] Sentry/monitoring tool configured
- [ ] Release version tagged
- [ ] Baseline metrics recorded
- [ ] Alerts active for new features
With bugfix-orchestrator¶
After fixing a bug:
### Post-Fix Monitoring
- [ ] Error no longer appearing in monitoring
- [ ] No new errors introduced
- [ ] Performance metrics stable
With continuous-learning¶
Document monitoring insights:
### Learning: [Error Pattern]
- How it was detected: [Sentry alert, user report, etc.]
- Time to detection: [minutes/hours/days]
- Improvement: [What to monitor better next time]
Common Mistakes to Avoid¶
1. Logging Sensitive Data¶
// ❌ BAD - logs password
Sentry.captureMessage(`Login failed for ${email} with password ${password}`);
// ✅ GOOD - sanitized (the email is still PII; prefer a user ID where possible)
Sentry.captureMessage(`Login failed for ${email}`);
2. Missing Context¶
// ❌ BAD - no context
Sentry.captureException(error);
// ✅ GOOD - includes context
Sentry.captureException(error, {
extra: { userId, action, input },
tags: { feature: 'checkout' },
});
3. Alerting on Every Error¶
# ❌ BAD - alert on every error
alert: error_count > 0
# ✅ GOOD - alert on significant increase
alert: error_rate > 1% AND error_count > 10
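The "good" condition above combines a relative rate with an absolute count so low-traffic noise never pages anyone. Expressed directly in code, with the same thresholds:

```typescript
// Alert only when the error rate exceeds 1% AND the absolute count exceeds 10,
// mirroring the alert expression above.
export function shouldAlert(errors: number, requests: number): boolean {
  const ratePct = requests > 0 ? (errors / requests) * 100 : 0;
  return ratePct > 1 && errors > 10;
}
```

Either condition alone misfires: rate-only pages on one error out of ten requests, count-only pages during any traffic spike.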
4. No Baseline¶
// ❌ BAD - deploying without baseline
deploy();
// ✅ GOOD - record baseline first
recordBaselineMetrics();
deploy();
compareWithBaseline();
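A concrete version of the comparison step, using the "> 2x baseline" P95 latency threshold from the metrics table; the function name is hypothetical:

```typescript
// Flag a regression when post-deploy P95 latency exceeds the recorded
// baseline by the given factor (default 2x, per the metrics table).
export function isLatencyRegression(
  baselineP95Ms: number,
  currentP95Ms: number,
  factor = 2,
): boolean {
  return currentP95Ms > baselineP95Ms * factor;
}
```

In practice the baseline and current values would come from your APM tool's query API, recorded before and after the deploy.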
Resources¶
- Sentry Documentation
- DataDog APM Guide
- OpenTelemetry - Vendor-neutral instrumentation
- Google SRE Book - Monitoring