Getting Started
What is Nova AI Ops? #
Nova AI Ops is an AI-powered incident management and observability platform built for engineering teams that operate production infrastructure. It unifies metrics, logs, traces, synthetic monitoring, and incident workflows into a single pane of glass — with AI-driven detection, automated runbooks, and real-time collaboration.
Unlike traditional monitoring tools that require manual correlation across dozens of dashboards, Nova AI Ops automatically connects signals across your entire stack and surfaces the most likely root cause within seconds of an anomaly.
Core capabilities
- Real-time Observability — Live metrics, logs, and traces from your infrastructure via the Nova AI Agent
- AI-Powered Incident Detection — Machine learning models detect anomalies before they become outages
- Automated Runbooks — Pre-defined remediation playbooks execute automatically on incident trigger
- Synthetic Monitoring — Proactive endpoint checks with configurable failure thresholds
- Certificate Management — Track, renew, and auto-manage TLS certificates across your domains
- Sprinta Task Management — Built-in Kanban workflow with AI-assisted task creation
- Service Map — Automatic dependency visualization across your microservices
- Nova AI Copilot — Natural language interface to navigate, query, and manage your platform
Quick Start Guide #
Get from zero to your first dashboard in under 10 minutes.
1. Create your account

   Visit app.novaaiops.com and sign up with your work email. You'll receive a verification link. Your organization is created automatically on first login.

2. Install the Nova AI Agent

   The agent is a lightweight daemon that runs on your infrastructure and streams metrics to Nova AI Ops. Install it on any Linux host:

   ```bash
   # Download and install the Nova AI Agent
   curl -sSL https://get.novaaiops.com/agent | bash

   # Configure with your API key (found in Settings > API Keys)
   nova-agent config set api-key YOUR_API_KEY

   # Start the agent
   sudo systemctl start nova-ai-agent
   sudo systemctl enable nova-ai-agent
   ```

3. Connect your integrations

   Navigate to Settings > Integrations and connect your cloud providers, container platforms, and notification channels. Nova AI Ops supports AWS, Docker, Grafana, Slack, and more.

4. View your first dashboard

   Once the agent is reporting, navigate to /dashboard. You'll see real-time CPU, memory, disk, and network metrics within 30 seconds of agent startup.
The agent begins collecting metrics immediately upon startup. No additional configuration is required for system-level metrics (CPU, memory, disk, network). Application-level metrics require collector plugins — see the Agent Configuration section.
System Requirements #
Nova AI Agent
| Requirement | Minimum | Recommended |
|---|---|---|
| Operating System | Linux (kernel 4.14+) | Amazon Linux 2023, Ubuntu 22.04+ |
| CPU | 1 core | 2 cores |
| Memory | 128 MB | 256 MB |
| Disk | 50 MB | 200 MB (for metric buffering) |
| Network | HTTPS outbound to *.novaaiops.com | Same |
Supported browsers
| Browser | Minimum Version |
|---|---|
| Chrome / Edge | 90+ |
| Firefox | 90+ |
| Safari | 15+ |
| Mobile Safari / Chrome | iOS 15+ / Android 10+ |
Real-time Metrics Dashboard #
The main dashboard at /dashboard is your operational command center. It displays live metrics streamed from all connected agents via WebSocket, updating every second without page refresh.
Dashboard panels
- CPU Utilization — Current usage percentage across all cores, with sparkline history
- Memory Usage — Used, available, and cached memory with trend indicator
- Disk I/O — Read/write throughput and IOPS per volume
- Network Traffic — Inbound/outbound bandwidth with packet error rates
- Active Incidents — Count badge showing open SEV-1/2/3 incidents, color-coded by highest severity
- Service Health — Aggregate status of all registered services
All dashboard metrics originate from the Nova AI Agent. The agent pushes data to backend ingestion endpoints, which persist to the database and emit real-time Socket.IO events to connected frontends.
The data pipeline is: nova-ai-agent → API ingestion → Database → Socket.IO → Dashboard. There are no synthetic or mocked metrics. Every data point originates from your actual infrastructure.
System Status #
The System Status page at / (the homepage) provides a high-level operational overview using a donut chart and service group cards.
Health score
The central donut displays an aggregate health score from 0 to 100, calculated from service availability, active incident severity, and synthetic monitor pass rates. The donut color reflects overall status:
| Score | Color | Meaning |
|---|---|---|
| 90 – 100 | Green | All systems operational |
| 70 – 89 | Amber | Degraded performance or minor incidents |
| 0 – 69 | Red | Major incident or significant outage |
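The band-to-color mapping in the table can be expressed as a small helper (illustrative only — the score itself is computed by the platform from service availability, incident severity, and monitor pass rates):

```python
def health_color(score: int) -> str:
    """Map a 0-100 health score to the donut color bands above."""
    if score >= 90:
        return "green"  # all systems operational
    if score >= 70:
        return "amber"  # degraded performance or minor incidents
    return "red"        # major incident or significant outage
```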
Service groups
Below the donut, services are grouped by category (API, Database, Cache, Frontend, etc.). Each group shows the number of healthy vs. unhealthy services. Clicking a group navigates to the Service Catalog filtered to that category.
When an incident occurs, the dashboard and overview page colors update dynamically from green to yellow to red based on the highest active incident severity.
Performance Trends #
The Performance Trends page at /trends provides historical analysis of key metrics over configurable time ranges.
Time range selection
Use the time picker in the top-right corner to select from preset ranges (Last 1h, 6h, 24h, 7d, 30d) or define a custom range. All charts update simultaneously when the time range changes.
Available trend views
- Resource Utilization — CPU, memory, and disk usage over time with percentile bands (P50, P95, P99)
- Request Rate — HTTP requests per second with breakdown by status code class (2xx, 4xx, 5xx)
- Error Rate — Error percentage with anomaly detection overlay
- Latency Distribution — Response time percentiles shown as area charts
Golden Signals #
The Golden Signals page at /golden-signals implements the four golden signals of monitoring defined in the Google SRE handbook. These are the most critical indicators of system health.
| Signal | What It Measures | Key Metric |
|---|---|---|
| Latency | Time to serve a request | P95 response time |
| Traffic | Demand on your system | Requests per second |
| Errors | Rate of failed requests | 5xx error percentage |
| Saturation | How full your resources are | CPU/memory/disk utilization |
Each signal is displayed as a real-time gauge with sparkline history. When any signal crosses its configured threshold, the gauge changes color and (if alerting is enabled) triggers an alert through your configured notification channels.
Google's SRE methodology recommends alerting on symptoms (the golden signals) rather than causes. If your latency P95 exceeds your SLO, that's an actionable signal — the root cause (CPU, GC, database, network) is secondary to the user impact.
Incident Timeline #
The Incident Timeline at /incidents/timeline displays all active and recently resolved incidents in chronological order. Each incident card shows:
- Severity — Color-coded badge (SEV-1 red, SEV-2 amber, SEV-3 blue)
- Title — Auto-generated or manually set description
- Duration — Time since detection (or total duration if resolved)
- Affected Services — List of impacted services from the Service Catalog
- Current Phase — Detection, Triage, Mitigation, or Resolved
Severity levels
| Level | Impact | Response Expectation |
|---|---|---|
| SEV-1 | Complete service outage, data loss risk, or security breach | Immediate response, all-hands |
| SEV-2 | Significant degradation affecting many users | Respond within 15 minutes |
| SEV-3 | Minor issue, limited user impact | Respond within 1 hour |
Incident Lifecycle #
Every incident in Nova AI Ops follows a structured lifecycle:
-
Detection
An incident is created automatically when an alert fires, a synthetic monitor exceeds its consecutive failure threshold, or an AI anomaly is detected. Incidents can also be created manually.
-
Triage
The on-call engineer is notified. They assess the incident scope, assign a severity level, and determine affected services. The AI Copilot suggests related past incidents and relevant runbooks.
-
Mitigation
The team works to reduce user impact. This might involve rolling back a deployment, scaling up resources, or activating an automated runbook. All actions are logged in the incident timeline.
-
Resolution
The root cause is addressed and services are restored. The incident is marked resolved, which triggers a recovery notification and resets associated monitor states.
-
Postmortem
Within 48 hours, the team completes a blameless postmortem documenting the timeline, root cause, impact, and action items. See the Postmortem section.
Incident History & Archive #
The Incident Archive at /incidents/history provides a searchable, filterable record of all past incidents. Use the filters to narrow by severity, date range, affected service, or resolution time. This data feeds into trend analysis — helping you identify recurring failure patterns and track MTTR (Mean Time to Resolution) improvements.
AI Runbooks #
AI Runbooks at /runbooks are automated remediation playbooks that execute predefined steps when triggered by an incident. Unlike static runbooks, Nova's AI Runbooks adapt their execution based on the specific context of each incident.
Creating a runbook
```yaml
name: High CPU Remediation
trigger:
  metric: cpu_utilization
  condition: "> 90%"
  duration: 5m
steps:
  - action: identify_top_processes
    params: { limit: 10 }
  - action: check_recent_deployments
    params: { window: "2h" }
  - action: scale_horizontally
    params: { increment: 2 }
    requires_approval: true
```
Runbooks support approval gates for destructive actions, audit logging for every step executed, and automatic rollback if a step fails.
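A runbook engine of this shape might execute steps roughly as follows (a Python sketch; `execute`, `approve`, and `log` are hypothetical callables, not the platform's actual API):

```python
def run_runbook(steps, execute, approve, log):
    """Run steps in order: gate approval-required steps, log every action,
    and roll back completed steps if one fails. Illustrative only."""
    completed = []
    for step in steps:
        if step.get("requires_approval") and not approve(step):
            log(f"skipped {step['action']}: approval denied")
            return False
        try:
            execute(step)
            completed.append(step)
            log(f"executed {step['action']}")
        except Exception as exc:
            log(f"failed {step['action']}: {exc}; rolling back")
            # Undo completed steps in reverse order (hypothetical rollback actions)
            for done in reversed(completed):
                execute({"action": f"rollback_{done['action']}"})
            return False
    return True
```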
Postmortem #
The Postmortem page at /postmortem provides a structured, blameless review template. Each postmortem includes:
- Summary — One-paragraph description of what happened
- Timeline — Minute-by-minute account of detection, response, and resolution
- Root Cause — The fundamental reason the incident occurred
- Impact — Quantified user impact (requests affected, downtime duration, revenue loss)
- Action Items — Concrete tasks with owners and due dates to prevent recurrence
- Lessons Learned — What went well and what needs improvement
Nova AI automatically populates the postmortem timeline from incident events, reducing manual effort. Focus your writing on root cause analysis and action items — these are the parts that drive real improvement.
Creating Alert Rules #
Navigate to /alerts to configure alert rules. Nova AI Ops supports three types of alerts:
Threshold alerts
Trigger when a metric crosses a static threshold. Best for well-understood metrics with predictable ranges.
```json
{
  "name": "High Memory Usage",
  "type": "threshold",
  "metric": "memory.used_percent",
  "condition": "above",
  "threshold": 85,
  "duration": "5m",
  "severity": "SEV-2",
  "channels": ["slack-ops", "email-oncall"]
}
```
Anomaly detection alerts
Use machine learning to detect unusual patterns without requiring manual thresholds. Nova AI builds a baseline from historical data and alerts when behavior deviates significantly. Ideal for metrics with seasonal patterns or variable baselines.
Heartbeat alerts
Trigger when an expected signal stops arriving. Use these to detect silent failures — if a service should be reporting metrics every 60 seconds, a heartbeat alert will fire after the configured silence window (e.g., 5 minutes of no data).
Alert conditions
| Condition | Behavior |
|---|---|
| above | Triggers when metric exceeds threshold |
| below | Triggers when metric drops below threshold |
| equal | Triggers when metric equals a specific value |
| absent | Triggers when no data is received within the window |
| anomaly | Triggers on statistical deviation from baseline |
Notification Channels #
Configure where and how alerts are delivered:
- Email — Individual or distribution list. Supports HTML-formatted incident summaries.
- Slack — Posts to a channel with severity-colored attachments and action buttons (Acknowledge, Resolve).
- Microsoft Teams — Adaptive card notifications with direct links to the incident.
- Webhooks — POST JSON payloads to any HTTP endpoint. Use this to integrate with PagerDuty, Opsgenie, or custom tooling.
Each alert rule can target multiple channels. You can also configure escalation policies — if an alert isn't acknowledged within N minutes, it escalates to the next tier (e.g., from Slack to phone call).
Silencing & Maintenance Windows #
Suppress alerts during planned maintenance or known noisy periods:
- Silence by rule — Mute a specific alert rule for a defined duration
- Silence by service — Mute all alerts for a service (useful during deployments)
- Maintenance window — Schedule a recurring window (e.g., every Sunday 02:00-04:00 UTC) that auto-silences matching alerts
Silenced alerts are still evaluated and recorded in the alert history. They simply don't trigger notifications. Review your silences regularly — stale silences can mask real incidents.
Log Explorer #
The Log Explorer at /logs provides full-text search across all ingested log data with sub-second response times.
Search syntax
Log queries support a structured search syntax:
```
# Simple text search
connection refused

# Field-specific search
service:api-gateway level:error

# Wildcard patterns
message:timeout* host:prod-web-*

# Numeric comparisons
status_code:>=500 response_time:>2000

# Boolean operators
(level:error OR level:fatal) AND service:payment-api

# Exclude terms
level:error NOT "health check"
```
Filters
Use the left sidebar filters to narrow by time range, log level (DEBUG, INFO, WARN, ERROR, FATAL), service name, host, or custom tags. Applied filters are reflected in the URL — share the URL with teammates to reproduce exact search context.
Synthetic Monitoring #
Synthetic Monitoring at /synthetic performs automated HTTP checks against your endpoints at regular intervals, detecting outages before your users do.
Monitor statuses
| Status | Color | Meaning |
|---|---|---|
| UP | Green | Endpoint responding within expected latency and returning expected status code |
| DEGRADED | Amber | Endpoint responding but with elevated latency (above P95 threshold) |
| DOWN | Red | Endpoint not responding, returning errors, or exceeding timeout |
Consecutive failure thresholds
To prevent alert fatigue from transient network issues, synthetic monitors use a consecutive failure threshold. An incident is only created after the configured number of consecutive failures (default: 5). A single successful check resets the failure counter to zero.
Key metrics
- Availability — Percentage of successful checks over the selected time window
- P95 Latency — 95th percentile response time, the standard measure for user-perceived performance
- Uptime — Continuous availability duration since last outage
- Certificate Expiry — Days until TLS certificate expiration (if HTTPS)
Cooldown period
After an incident is created for a failing monitor, a 6-hour cooldown prevents duplicate incidents for the same monitor. The monitor continues to be checked during cooldown, but no new incidents are created until the cooldown expires or the monitor recovers and fails again.
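Taken together, the failure threshold and cooldown behave like the state machine below (a Python sketch with assumed names; the defaults follow the values in the docs: 5 consecutive failures, 6-hour cooldown):

```python
class SyntheticMonitorState:
    """Sketch of consecutive-failure counting plus the incident cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 6 * 3600):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.consecutive_failures = 0
        self.cooldown_until = 0.0

    def record_check(self, ok: bool, now: float) -> bool:
        """Return True if this check should create a new incident."""
        if ok:
            self.consecutive_failures = 0  # a single success resets the counter
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures < self.failure_threshold:
            return False  # transient failure, below the threshold
        if now < self.cooldown_until:
            return False  # still in cooldown: keep checking, no duplicate incident
        self.cooldown_until = now + self.cooldown_s
        return True
```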
Session Replay #
Session Replay at /replay records and plays back real user sessions on your web application, allowing you to see exactly what users saw when they encountered an error.
How it works
The replay recorder captures DOM mutations, mouse movements, clicks, scrolls, and console errors — not screenshots. This means recordings are lightweight (typically 50-200 KB per minute) and don't capture sensitive input field values.
Key features
- Error-linked replays — Jump directly to the moment an error occurred
- Console overlay — See JavaScript errors and network failures alongside the visual replay
- Timeline scrubbing — Scrub forward/backward through the session timeline
- Speed control — Play at 1x, 2x, 4x, or 8x speed
Distributed Tracing #
The Tracing page at /tracing visualizes request flows across microservices. Each trace shows the complete journey of a request from ingress to response, with timing breakdowns per service hop.
Trace anatomy
- Trace — The complete end-to-end journey of a single request
- Span — A single unit of work within a trace (e.g., an HTTP call, database query, or cache lookup)
- Parent-child relationships — Spans are nested to show causality (Service A called Service B, which called the database)
Use tracing to identify slow service dependencies, detect N+1 query patterns, and understand cascading failure paths.
Service Catalog #
The Service Catalog at /services is your central registry of all monitored services. Services can be auto-discovered by the agent or manually registered.
Service properties
- Name — Human-readable service name
- Tier — Criticality level (Tier 1: Revenue-critical, Tier 2: User-facing, Tier 3: Internal)
- Owner — Team or individual responsible for the service
- Status — Current operational status (Operational, Degraded, Outage)
- Dependencies — Upstream and downstream service connections
- SLO — Service level objective (e.g., 99.9% availability, P95 latency < 200ms)
Auto-discovery
When the Nova AI Agent detects a new process listening on a network port, it can automatically register it as a service. Auto-discovered services appear with a "Discovered" badge and require manual confirmation to be promoted to the active catalog.
Service Map #
The Service Map at /service-map provides a real-time, interactive visualization of your service dependency graph. Each node represents a service, and edges represent network communication between them.
- Node color — Reflects service health (green = healthy, amber = degraded, red = failing)
- Edge thickness — Proportional to request volume between services
- Edge color — Changes to red when error rate between services exceeds threshold
- Click a node — Opens a detail panel showing metrics, recent incidents, and dependencies for that service
On-Call Management #
Configure on-call rotations and escalation policies to ensure the right person is notified when incidents occur.
Rotation types
- Weekly rotation — Handoff every Monday at a configurable time
- Daily rotation — Handoff every 24 hours
- Custom schedule — Define specific shifts with overlap periods
Escalation policies
Define multi-tier escalation for each severity level:
- Tier 1 — Notify primary on-call via Slack (immediate)
- Tier 2 — If not acknowledged in 10 minutes, notify secondary on-call + email
- Tier 3 — If not acknowledged in 20 minutes, notify engineering manager + phone call
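The tiers above amount to a simple delay schedule (illustrative Python; the tier strings and function name are assumptions, not the platform's API):

```python
# Minutes-since-alert thresholds paired with notification targets,
# matching the three tiers described above.
ESCALATION_TIERS = [
    (0, "primary on-call via Slack"),
    (10, "secondary on-call + email"),
    (20, "engineering manager + phone call"),
]

def active_tier(minutes_since_alert: float, acknowledged: bool):
    """Return the current notification target for an unacknowledged alert,
    or None once it has been acknowledged."""
    if acknowledged:
        return None
    tier = None
    for delay, target in ESCALATION_TIERS:
        if minutes_since_alert >= delay:
            tier = target
    return tier
```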
Certificate Manager Overview #
The Certificate Manager provides a centralized view of all TLS/SSL certificates across your domains. Track expiration dates, health scores, and renewal status from a single dashboard.
Managing Certificates #
Viewing certificates
The certificate list displays each certificate's domain, issuer, expiration date, and health score. Certificates are color-coded by health:
| Health Score | Color | Meaning |
|---|---|---|
| 80 – 100 | Green | Healthy — more than 80 days until expiration |
| 30 – 79 | Amber | Attention — certificate will expire within 80 days |
| 0 – 29 | Red | Critical — certificate expires within 30 days or already expired |
Adding a new certificate
- Click Add Certificate in the top-right corner
- Enter the domain name (e.g., api.example.com)
- Select the certificate type (single domain, wildcard, or multi-domain SAN)
- Choose verification method (DNS TXT record or HTTP file verification)
- Submit and complete the verification challenge
Renewing certificates
Click the Renew button on any certificate card. For certificates with auto-renew enabled, renewal happens automatically 30 days before expiration. Manual renewal is available at any time.
Auto-renew toggle
Enable auto-renew on individual certificates to eliminate manual renewal. When enabled, Nova AI Ops will attempt renewal 30 days before expiry. If DNS verification is configured, the entire process is automatic. You'll receive a notification when auto-renewal succeeds or if it requires manual intervention.
Revoking and deleting
- Revoke — Invalidates the certificate immediately. Use this if the private key is compromised. Revocation is irreversible.
- Delete — Removes the certificate from Nova AI Ops tracking. The certificate itself remains valid until its natural expiration date.
DNS verification
For domain validation, add the provided TXT record to your DNS zone. Nova AI Ops checks for the record automatically and completes verification once detected (typically within 1-5 minutes).
```
# Add this TXT record to your DNS zone
_acme-challenge.api.example.com TXT "dGVzdC12ZXJpZmljYXRpb24..."
```
Certificate Health Score #
The health score (0-100) is a composite metric based on:
- Days until expiration (primary factor) — Certificates expiring sooner score lower
- Certificate chain validity — Broken chains reduce the score
- Key strength — RSA 2048+ or ECC P-256+ required for full score
- Protocol support — TLS 1.2+ required, TLS 1.3 preferred
The aggregate certificate health across all domains contributes to your overall System Status health score.
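As a rough illustration of how such a composite score could combine these factors — the actual weighting is not documented, so the 70/10/10/10 split below is purely an assumption:

```python
def certificate_health_score(days_to_expiry: float, chain_valid: bool,
                             key_strong: bool, tls_modern: bool) -> int:
    """Hypothetical composite: expiry dominates (up to 70 points), with
    10 points each for chain validity, key strength, and TLS version."""
    # Clamp expiry contribution so ~110+ days to expiry earns full marks
    expiry_component = max(0.0, min(1.0, days_to_expiry / 110)) * 70
    score = expiry_component
    score += 10 if chain_valid else 0
    score += 10 if key_strong else 0
    score += 10 if tls_modern else 0
    return round(score)
```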
Integrations Overview #
Nova AI Ops integrates with your existing infrastructure and tooling. Navigate to Settings > Integrations to configure connections.
Infrastructure Integrations #
Grafana
Connect your Grafana instance to import dashboards, sync alert rules, and display Grafana panels natively within Nova AI Ops. Configure at /integrations/grafana.
```json
{
  "grafana_url": "https://grafana.internal.example.com",
  "api_key": "glsa_...",
  "sync_dashboards": true,
  "sync_alerts": true
}
```
AWS
The AWS integration at /integrations/aws pulls CloudWatch metrics, ECS/EKS container health, RDS performance insights, and S3 bucket metrics. Requires an IAM role with read-only access to CloudWatch, ECS, EC2, and RDS.
Docker
The Docker integration at /integrations/docker monitors container lifecycle events, resource utilization per container, image vulnerability status, and network connectivity between containers. The Nova AI Agent detects Docker automatically when installed on a host running the Docker daemon.
Redis
Monitor Redis memory usage, connected clients, keyspace hit/miss ratio, replication lag, and slow log entries.
MongoDB
Track MongoDB operation counters, replication set status, connection pool utilization, and slow query patterns.
PostgreSQL
Monitor active connections, transaction throughput, index hit ratios, table bloat, and replication lag.
Databricks / Splunk
Ingest analytics and log data from Databricks workspaces and Splunk indexes. Configure forwarding rules to stream relevant events into Nova AI Ops for correlation with infrastructure metrics.
AI Model Integrations #
Nova AI Ops can monitor and manage AI model endpoints:
- Anthropic (Claude) — Monitor API latency, token usage, and error rates
- OpenAI (GPT) — Track request volumes, rate limit utilization, and cost per request
- Google Gemini — Monitor multimodal request throughput and response quality
- DeepSeek — Track inference latency and throughput across model versions
AI model integrations provide specialized dashboards showing token consumption trends, cost projections, and performance comparison across providers.
Webhook Gateway #
The Webhook Gateway allows external tools to push alerts and events into Nova AI Ops via HTTP POST. Each organization gets a unique webhook URL:
```bash
curl -X POST https://app.novaaiops.com/api/webhooks/ingest \
  -H "Authorization: Bearer YOUR_WEBHOOK_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "source": "custom-monitor",
    "severity": "warning",
    "title": "Disk space low on db-primary",
    "message": "Disk usage at 92% on /data volume",
    "tags": {"host": "db-primary", "region": "us-east-1"}
  }'
```
Incoming webhooks are processed through the standard incident pipeline — they can trigger alerts, create incidents, and appear on the incident timeline alongside native detections.
Nova AI Copilot #
The Nova AI Copilot is a natural language interface accessible from any page in the platform. Open it by clicking the chat icon in the bottom-right corner or pressing Ctrl + /.
The Copilot can:
- Navigate to any page in the platform
- Create, update, and manage Sprinta tasks
- Query metrics and incident data
- Manage certificates (view, renew, check status)
- Show system stats and service health
- Explain incidents and suggest remediation steps
Copilot Commands #
Navigation
"Navigate to dashboard"
"Go to incident timeline"
"Open the service map"
"Show me the golden signals page"
Certificates
"Show my certificates"
"Which certificates expire this month?"
"Renew the certificate for api.example.com"
"What's the health score of my certificates?"
Tasks (Sprinta)
"Create task Fix login bug on the auth service"
"Show my open tasks"
"Move task SPRINT-42 to In Progress"
"What tasks are in review?"
Metrics and incidents
"What's the current CPU usage?"
"Are there any active incidents?"
"Show me the P95 latency for the API gateway"
"How many incidents did we have this week?"
The Copilot understands context. If you're viewing the Incident Timeline and ask "Tell me about this one," it will describe the most recent or selected incident.
Sprinta Task Management #
Sprinta is Nova AI's built-in task management system, designed for engineering teams that want to track work alongside their operational tooling. Access it from the sidebar or via /sprinta.
Kanban Board #
The Kanban board organizes tasks into five columns:
| Column | Purpose |
|---|---|
| To Do | Tasks that are planned but not started |
| In Progress | Tasks currently being worked on |
| Needs Support | Blocked tasks awaiting input from another team member |
| Review | Tasks completed and awaiting peer review or QA |
| Completed | Done and verified tasks |
Drag and drop tasks between columns. Each task card shows the title, assignee avatar, priority badge, and due date. Click a card to open the detail view with description, comments, attachments, and activity log.
Creating tasks
- Manual — Click "New Task" and fill in the title, description, priority, and assignee
- From incident — Action items from postmortems automatically create Sprinta tasks
- Via AI Copilot — Say "Create task [description]" and the Copilot creates it instantly
- Import from Jira — Bulk import existing Jira tickets via CSV or API integration
Bulk operations
Select multiple tasks using checkboxes and apply bulk actions: assign, change priority, move to column, or archive. Useful for sprint planning and backlog grooming.
Nova AI in Sprinta #
The Sprinta board includes a dedicated AI chat panel. Use it for context-aware assistance:
- Summarize all tasks in a column
- Suggest task breakdown for large features
- Draft task descriptions from a short prompt
- Identify blocked tasks and suggest unblocking steps
- Generate sprint reports and velocity metrics
User Roles #
Nova AI Ops uses a six-role hierarchy. Each role inherits all permissions of the roles below it.
| Role | Level | Capabilities |
|---|---|---|
| Founder | 5 | Full platform access. Manage tenants, users, billing, and all admin pages. Cannot be deleted. |
| Nova Admin | 4 | Platform-wide administration. Manage integrations, agents, and organization settings. |
| Organization Owner | 3 | Full control over their organization. Manage team members, services, and billing. |
| Organization Admin | 3 | Manage team members, configure alerts, and administer organization settings. |
| Engineer | 2 | View all dashboards, manage incidents, configure monitors, and create runbooks. |
| Viewer | 1 | Read-only access to dashboards, incidents, and metrics. Cannot modify configurations. |
Team Invitations #
Invite team members from /team-members:
- Click Invite Member
- Enter the email address and select a role
- The invitee receives an email with a signup link pre-associated with your organization
- Once accepted, they appear in your team roster and can access pages permitted by their role
Pending invitations can be revoked. Invitations expire after 7 days if not accepted.
Two-Factor Authentication #
Enable 2FA from Settings > Security:
- Click Enable Two-Factor Authentication
- Scan the QR code with your authenticator app (Google Authenticator, Authy, 1Password)
- Enter the 6-digit verification code to confirm setup
- Save your recovery codes in a secure location
Recovery codes are shown only once during setup. Store them securely — they are the only way to regain access if you lose your authenticator device. Each recovery code can be used exactly once.
Profile & Preferences #
Access your profile from /settings. Customize:
- Display name — Shown across all platform interactions
- Profile photo — Uploaded photos are used everywhere an avatar is displayed (sidebar, comments, team roster, incident assignments)
- Timezone — All timestamps in the UI are converted to your selected timezone
- Theme — Dark (default) or light mode
- Notification preferences — Configure which events trigger email, in-app, or push notifications
Page Control #
Available to Founder role at /page-control. This admin tool lets you show or hide navigation items for your organization. Use it to:
- Enable new feature pages as they become relevant
- Hide pages that aren't applicable to your team's workflow
- Customize the sidebar navigation for your organization
Hidden pages are only removed from navigation — direct URL access still respects role-based permissions.
Billing & Plans #
| Plan | Price | Includes |
|---|---|---|
| Free | $0/month | 3 users, 5 services, 7-day data retention, community support |
| Team | $29/user/month | Unlimited services, 30-day retention, Slack integration, synthetic monitoring |
| Pro | $59/user/month | Everything in Team + AI Runbooks, Session Replay, 90-day retention, SSO |
| Enterprise | Custom | Unlimited retention, dedicated support, SLA, custom integrations, on-premise option |
Manage your subscription from Settings > Billing. All plans include a 14-day free trial of Pro features. Upgrades take effect immediately; downgrades take effect at the end of the current billing cycle.
API Authentication #
The Nova AI Ops API uses JWT (JSON Web Token) authentication. Tokens are issued on login and included in all subsequent requests.
Session-based (browser)
The web application uses HTTP-only secure cookies for session management. Tokens are automatically refreshed before expiration.
API key (programmatic)
For programmatic access, generate an API key from Settings > API Keys. Include it in the Authorization header:
```bash
curl -H "Authorization: Bearer YOUR_API_KEY" \
  https://app.novaaiops.com/api/services
```
Never expose API keys in client-side code, public repositories, or logs. If a key is compromised, revoke it immediately from the API Keys settings page.
Key Endpoints #
| Endpoint | Method | Description |
|---|---|---|
| /api/health | GET | Platform health check. Returns {"status":"ok"} |
| /api/incidents | GET | List incidents. Supports ?status=active&severity=SEV-1 filters |
| /api/incidents | POST | Create a new incident manually |
| /api/alerts | GET | List configured alert rules |
| /api/alerts | POST | Create a new alert rule |
| /api/services | GET | List all registered services |
| /api/synthetic/monitors | GET | List synthetic monitors with current status |
| /api/synthetic/monitors | POST | Create a new synthetic monitor |
| /api/certificates | GET | List all tracked certificates |
| /api/certificates | POST | Add a new certificate to track |
| /api/metrics/ingest | POST | Ingest metrics from the Nova AI Agent |
Response format
All API responses follow a consistent JSON envelope:
```json
{
  "success": true,
  "data": [ ... ],
  "meta": {
    "total": 42,
    "page": 1,
    "per_page": 20
  }
}
```
Error responses include a human-readable message and machine-parseable error code:
```json
{
  "success": false,
  "error": {
    "code": "RATE_LIMIT_EXCEEDED",
    "message": "Rate limit exceeded. Try again in 30 seconds."
  }
}
```
For the complete API reference with request/response schemas, visit api.novaaiops.com.
Rate Limiting #
| Plan | Rate Limit | Burst |
|---|---|---|
| Free | 60 requests/minute | 10 requests/second |
| Team | 300 requests/minute | 30 requests/second |
| Pro | 1,000 requests/minute | 100 requests/second |
| Enterprise | Custom | Custom |
Rate limit headers are included in every response:
```
X-RateLimit-Limit: 300
X-RateLimit-Remaining: 297
X-RateLimit-Reset: 1679529600
```
When rate limited, the API returns 429 Too Many Requests. Implement exponential backoff in your client code to handle this gracefully.
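A minimal backoff wrapper for 429 responses might look like this (a Python sketch; `call` is any zero-argument function returning a response-like object with a `status_code` attribute — not an official client):

```python
import random
import time

def with_backoff(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry `call` with exponential backoff plus jitter while it returns 429."""
    for attempt in range(max_attempts):
        resp = call()
        if resp.status_code != 429:
            return resp
        # Double the delay each attempt, randomized to avoid thundering herds
        delay = base_delay * (2 ** attempt) * (0.5 + random.random() / 2)
        sleep(delay)
    return resp  # give up and return the last rate-limited response
```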
Nova AI Agent Overview #
The Nova AI Agent is a lightweight, zero-dependency metrics collector that runs on your infrastructure. It is the single source of truth for all metrics in Nova AI Ops — every data point displayed in dashboards originates from the agent.
What it collects
- CPU — Per-core utilization, load averages (1m, 5m, 15m), steal time, iowait
- Memory — Used, available, cached, buffered, swap usage
- Disk — Space utilization per mount, read/write IOPS, throughput, latency
- Network — Bytes in/out per interface, packet errors, connection states
- Processes — Top processes by CPU and memory, process count, zombie detection
- Containers — Docker/Podman container metrics (CPU, memory, network per container)
- Custom metrics — Application-level metrics via collector plugins
Architecture
The agent uses a collector registry pattern. Each metric type has a dedicated collector that runs on a configurable interval (default: 10 seconds). Collected metrics are batched and sent to the Nova AI Ops backend via HTTPS. If the backend is unreachable, metrics are buffered to disk and retried with exponential backoff.
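The collector-registry pattern described above can be sketched as follows (illustrative Python, not the agent's actual code; the point is that collectors are registered by name and one failing collector cannot break the batch):

```python
class CollectorRegistry:
    """Named collectors whose samples are gathered into a single batch."""

    def __init__(self):
        self._collectors = {}

    def register(self, name, collect_fn):
        """Add a collector: a zero-argument callable returning a sample."""
        self._collectors[name] = collect_fn

    def collect_all(self):
        """Run every collector once and batch the results for sending."""
        batch = {}
        for name, collect in self._collectors.items():
            try:
                batch[name] = collect()
            except Exception:
                batch[name] = None  # a failing collector must not break the batch
        return batch
```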
Agent Installation #
Linux (systemd)
```bash
# One-line install
curl -sSL https://get.novaaiops.com/agent | bash

# Set your API key
nova-agent config set api-key YOUR_API_KEY

# Start and enable on boot
sudo systemctl start nova-ai-agent
sudo systemctl enable nova-ai-agent

# Verify it's running
sudo systemctl status nova-ai-agent
```
Docker
```bash
docker run -d \
  --name nova-ai-agent \
  --restart unless-stopped \
  --pid host \
  --net host \
  -v /proc:/host/proc:ro \
  -v /sys:/host/sys:ro \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -e NOVA_API_KEY=YOUR_API_KEY \
  -e NOVA_API_URL=https://app.novaaiops.com/api/metrics/ingest \
  novaaiops/agent:latest
```
Agent Configuration #
The agent reads configuration from multiple sources (highest priority first):
- CLI flags
- Environment variables (NOVA_*)
- Config file (/etc/nova-agent/config.yml)
- Built-in defaults
```yaml
# /etc/nova-agent/config.yml
api_key: "your-api-key-here"
api_url: "https://app.novaaiops.com/api/metrics/ingest"
collect_interval: 10  # seconds
send_interval: 30     # seconds (batch sends)
buffer_dir: "~/.nova-agent/buffer"

collectors:
  cpu: true
  memory: true
  disk: true
  network: true
  process: true
  docker: true  # auto-detected

health_server:
  enabled: true
  port: 9100  # /health, /ready, /metrics

resilience:
  retry_max_attempts: 5
  circuit_breaker_threshold: 5
  buffer_max_size_mb: 100
```
Health endpoints
When the health server is enabled (default port 9100), the agent exposes:
| Endpoint | Purpose |
|---|---|
| /health | Liveness check — returns 200 if the agent process is running |
| /ready | Readiness check — returns 200 if collectors are initialized and backend is reachable |
| /metrics | Prometheus-format metrics about the agent itself (send success/failure counts, buffer size, etc.) |
| /status | JSON status report with uptime, collector states, and connection status |
The agent includes a circuit breaker that prevents flooding the backend during outages, exponential backoff with jitter for retries, and a disk-backed metric buffer (default: 100 MB) that persists unsent metrics across agent restarts.
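The circuit-breaker behavior can be sketched as follows (illustrative Python; the threshold default mirrors the `circuit_breaker_threshold: 5` setting shown earlier, while the recovery timeout value is an assumption):

```python
class CircuitBreaker:
    """Open after N consecutive send failures; allow a half-open probe
    once a recovery timeout has elapsed."""

    def __init__(self, threshold: int = 5, recovery_timeout_s: float = 60.0):
        self.threshold = threshold
        self.recovery_timeout_s = recovery_timeout_s
        self.failures = 0
        self.opened_at = None  # None while the circuit is closed

    def allow_request(self, now: float) -> bool:
        if self.opened_at is None:
            return True  # circuit closed: send normally
        # Circuit open: permit a probe only after the recovery timeout
        return now - self.opened_at >= self.recovery_timeout_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None  # close the circuit again

    def record_failure(self, now: float):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = now  # trip the breaker: stop flooding the backend
```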