Getting Started
What is Nova AI Ops? #
Nova AI Ops is an AI-powered incident management and observability platform built for engineering teams that operate production infrastructure. It unifies metrics, logs, traces, synthetic monitoring, and incident workflows into a single pane of glass — with AI-driven detection, automated runbooks, and real-time collaboration.
Unlike traditional monitoring tools that require manual correlation across dozens of dashboards, Nova AI Ops automatically connects signals across your entire stack and surfaces the most likely root cause within seconds of an anomaly.
Core capabilities
- Real-time Observability — Live metrics, logs, and traces from your infrastructure via the Nova AI Agent
- AI-Powered Incident Detection — Machine learning models detect anomalies before they become outages
- Automated Runbooks — Pre-defined remediation playbooks execute automatically on incident trigger
- Synthetic Monitoring — Proactive endpoint checks with configurable failure thresholds
- Certificate Management — Track, renew, and auto-manage TLS certificates across your domains
- Sprinta Task Management — Built-in Kanban workflow with AI-assisted task creation
- Service Map — Automatic dependency visualization across your microservices
- Nova AI Copilot — Natural language interface to navigate, query, and manage your platform
Quick Start Guide #
Get from zero to your first dashboard in under 10 minutes.
1. Create your account

   Visit app.novaaiops.com and sign up with your work email. You'll receive a verification link. Your organization is created automatically on first login.

2. Install the Nova AI Agent

   The agent is a lightweight daemon that runs on your infrastructure and streams metrics to Nova AI Ops. Install it on any Linux host:

   ```bash
   # Download and install the Nova AI Agent
   curl -sSL https://get.novaaiops.com/agent | bash

   # Configure with your API key (found in Settings > API Keys)
   nova-agent config set api-key YOUR_API_KEY

   # Start the agent
   sudo systemctl start nova-ai-agent
   sudo systemctl enable nova-ai-agent
   ```

3. Connect your integrations

   Navigate to Settings > Integrations and connect your cloud providers, container platforms, and notification channels. Nova AI Ops supports AWS, Docker, Grafana, Slack, and more.

4. View your first dashboard

   Once the agent is reporting, navigate to /dashboard. You'll see real-time CPU, memory, disk, and network metrics within 30 seconds of agent startup.
The agent begins collecting metrics immediately upon startup. No additional configuration is required for system-level metrics (CPU, memory, disk, network). Application-level metrics require collector plugins — see the Agent Configuration section.
System Requirements #
Nova AI Agent
| Requirement | Minimum | Recommended |
|---|---|---|
| Operating System | Linux (kernel 4.14+) | Amazon Linux 2023, Ubuntu 22.04+ |
| CPU | 1 core | 2 cores |
| Memory | 128 MB | 256 MB |
| Disk | 50 MB | 200 MB (for metric buffering) |
| Network | HTTPS outbound to *.novaaiops.com | Same |
Supported browsers
| Browser | Minimum Version |
|---|---|
| Chrome / Edge | 90+ |
| Firefox | 90+ |
| Safari | 15+ |
| Mobile Safari / Chrome | iOS 15+ / Android 10+ |
Real-time Metrics Dashboard #
The main dashboard at /dashboard is your operational command center. It displays live metrics streamed from all connected agents via WebSocket, updating every second without page refresh.
Dashboard panels
- CPU Utilization — Current usage percentage across all cores, with sparkline history
- Memory Usage — Used, available, and cached memory with trend indicator
- Disk I/O — Read/write throughput and IOPS per volume
- Network Traffic — Inbound/outbound bandwidth with packet error rates
- Active Incidents — Count badge showing open SEV-1/2/3 incidents, color-coded by highest severity
- Service Health — Aggregate status of all registered services
All dashboard metrics originate from the Nova AI Agent. The agent pushes data to backend ingestion endpoints, which persist to the database and emit real-time Socket.IO events to connected frontends.
The data pipeline is: nova-ai-agent → API ingestion → Database → Socket.IO → Dashboard. There are no synthetic or mocked metrics. Every data point originates from your actual infrastructure.
System Status #
The System Status page at / (the homepage) provides a high-level operational overview using a donut chart and service group cards.
Health score
The central donut displays an aggregate health score from 0 to 100, calculated from service availability, active incident severity, and synthetic monitor pass rates. The donut color reflects overall status:
| Score | Color | Meaning |
|---|---|---|
| 90 – 100 | Green | All systems operational |
| 70 – 89 | Amber | Degraded performance or minor incidents |
| 0 – 69 | Red | Major incident or significant outage |
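The band-to-color mapping in the table can be expressed as a small helper (illustrative only — the score itself is computed by the platform from service availability, incident severity, and monitor pass rates):

```python
def health_color(score: int) -> str:
    """Map a 0-100 health score to the donut color bands above."""
    if score >= 90:
        return "green"  # all systems operational
    if score >= 70:
        return "amber"  # degraded performance or minor incidents
    return "red"        # major incident or significant outage
```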
Service groups
Below the donut, services are grouped by category (API, Database, Cache, Frontend, etc.). Each group shows the number of healthy vs. unhealthy services. Clicking a group navigates to the Service Catalog filtered to that category.
When an incident occurs, the dashboard and overview page colors update dynamically from green to yellow to red based on the highest active incident severity.
Performance Trends #
The Performance Trends page at /trends provides historical analysis of key metrics over configurable time ranges.
Time range selection
Use the time picker in the top-right corner to select from preset ranges (Last 1h, 6h, 24h, 7d, 30d) or define a custom range. All charts update simultaneously when the time range changes.
Available trend views
- Resource Utilization — CPU, memory, and disk usage over time with percentile bands (P50, P95, P99)
- Request Rate — HTTP requests per second with breakdown by status code class (2xx, 4xx, 5xx)
- Error Rate — Error percentage with anomaly detection overlay
- Latency Distribution — Response time percentiles shown as area charts
Golden Signals #
The Golden Signals page at /golden-signals implements the four golden signals of monitoring defined in the Google SRE handbook. These are the most critical indicators of system health.
| Signal | What It Measures | Key Metric |
|---|---|---|
| Latency | Time to serve a request | P95 response time |
| Traffic | Demand on your system | Requests per second |
| Errors | Rate of failed requests | 5xx error percentage |
| Saturation | How full your resources are | CPU/memory/disk utilization |
Each signal is displayed as a real-time gauge with sparkline history. When any signal crosses its configured threshold, the gauge changes color and (if alerting is enabled) triggers an alert through your configured notification channels.
Google's SRE methodology recommends alerting on symptoms (the golden signals) rather than causes. If your latency P95 exceeds your SLO, that's an actionable signal — the root cause (CPU, GC, database, network) is secondary to the user impact.
Incident Timeline #
The Incident Timeline at /incidents/timeline displays all active and recently resolved incidents in chronological order. Each incident card shows:
- Severity — Color-coded badge (SEV-1 red, SEV-2 amber, SEV-3 blue)
- Title — Auto-generated or manually set description
- Duration — Time since detection (or total duration if resolved)
- Affected Services — List of impacted services from the Service Catalog
- Current Phase — Detection, Triage, Mitigation, or Resolved
Severity levels
| Level | Impact | Response Expectation |
|---|---|---|
| SEV-1 | Complete service outage, data loss risk, or security breach | Immediate response, all-hands |
| SEV-2 | Significant degradation affecting many users | Respond within 15 minutes |
| SEV-3 | Minor issue, limited user impact | Respond within 1 hour |
Incident Lifecycle #
Every incident in Nova AI Ops follows a structured lifecycle:
-
Detection
An incident is created automatically when an alert fires, a synthetic monitor exceeds its consecutive failure threshold, or an AI anomaly is detected. Incidents can also be created manually.
-
Triage
The on-call engineer is notified. They assess the incident scope, assign a severity level, and determine affected services. The AI Copilot suggests related past incidents and relevant runbooks.
-
Mitigation
The team works to reduce user impact. This might involve rolling back a deployment, scaling up resources, or activating an automated runbook. All actions are logged in the incident timeline.
-
Resolution
The root cause is addressed and services are restored. The incident is marked resolved, which triggers a recovery notification and resets associated monitor states.
-
Postmortem
Within 48 hours, the team completes a blameless postmortem documenting the timeline, root cause, impact, and action items. See the Postmortem section.
Incident History & Archive #
The Incident Archive at /incidents/history provides a searchable, filterable record of all past incidents. Use the filters to narrow by severity, date range, affected service, or resolution time. This data feeds into trend analysis — helping you identify recurring failure patterns and track MTTR (Mean Time to Resolution) improvements.
AI Runbooks #
AI Runbooks at /runbooks are automated remediation playbooks that execute predefined steps when triggered by an incident. Unlike static runbooks, Nova's AI Runbooks adapt their execution based on the specific context of each incident.
Creating a runbook
```yaml
name: High CPU Remediation
trigger:
  metric: cpu_utilization
  condition: "> 90%"
  duration: 5m
steps:
  - action: identify_top_processes
    params: { limit: 10 }
  - action: check_recent_deployments
    params: { window: "2h" }
  - action: scale_horizontally
    params: { increment: 2 }
    requires_approval: true
```
Runbooks support approval gates for destructive actions, audit logging for every step executed, and automatic rollback if a step fails.
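A runbook engine of this shape might execute steps roughly as follows (a Python sketch; `execute`, `approve`, and `log` are hypothetical callables, not the platform's actual API):

```python
def run_runbook(steps, execute, approve, log):
    """Run steps in order: gate approval-required steps, log every action,
    and roll back completed steps if one fails. Illustrative only."""
    completed = []
    for step in steps:
        if step.get("requires_approval") and not approve(step):
            log(f"skipped {step['action']}: approval denied")
            return False
        try:
            execute(step)
            completed.append(step)
            log(f"executed {step['action']}")
        except Exception as exc:
            log(f"failed {step['action']}: {exc}; rolling back")
            # Undo completed steps in reverse order (hypothetical rollback actions)
            for done in reversed(completed):
                execute({"action": f"rollback_{done['action']}"})
            return False
    return True
```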
Postmortem #
The Postmortem page at /postmortem provides a structured, blameless review template. Each postmortem includes:
- Summary — One-paragraph description of what happened
- Timeline — Minute-by-minute account of detection, response, and resolution
- Root Cause — The fundamental reason the incident occurred
- Impact — Quantified user impact (requests affected, downtime duration, revenue loss)
- Action Items — Concrete tasks with owners and due dates to prevent recurrence
- Lessons Learned — What went well and what needs improvement
Nova AI automatically populates the postmortem timeline from incident events, reducing manual effort. Focus your writing on root cause analysis and action items — these are the parts that drive real improvement.
Creating Alert Rules #
Navigate to /alerts to configure alert rules. Nova AI Ops supports three types of alerts:
Threshold alerts
Trigger when a metric crosses a static threshold. Best for well-understood metrics with predictable ranges.
```json
{
  "name": "High Memory Usage",
  "type": "threshold",
  "metric": "memory.used_percent",
  "condition": "above",
  "threshold": 85,
  "duration": "5m",
  "severity": "SEV-2",
  "channels": ["slack-ops", "email-oncall"]
}
```
Anomaly detection alerts
Use machine learning to detect unusual patterns without requiring manual thresholds. Nova AI builds a baseline from historical data and alerts when behavior deviates significantly. Ideal for metrics with seasonal patterns or variable baselines.
Heartbeat alerts
Trigger when an expected signal stops arriving. Use these to detect silent failures — if a service should be reporting metrics every 60 seconds, a heartbeat alert will fire after the configured silence window (e.g., 5 minutes of no data).
Alert conditions
| Condition | Behavior |
|---|---|
| above | Triggers when metric exceeds threshold |
| below | Triggers when metric drops below threshold |
| equal | Triggers when metric equals a specific value |
| absent | Triggers when no data is received within the window |
| anomaly | Triggers on statistical deviation from baseline |
Notification Channels #
Configure where and how alerts are delivered:
- Email — Individual or distribution list. Supports HTML-formatted incident summaries.
- Slack — Posts to a channel with severity-colored attachments and action buttons (Acknowledge, Resolve).
- Microsoft Teams — Adaptive card notifications with direct links to the incident.
- Webhooks — POST JSON payloads to any HTTP endpoint. Use this to integrate with PagerDuty, Opsgenie, or custom tooling.
Each alert rule can target multiple channels. You can also configure escalation policies — if an alert isn't acknowledged within N minutes, it escalates to the next tier (e.g., from Slack to phone call).
Silencing & Maintenance Windows #
Suppress alerts during planned maintenance or known noisy periods:
- Silence by rule — Mute a specific alert rule for a defined duration
- Silence by service — Mute all alerts for a service (useful during deployments)
- Maintenance window — Schedule a recurring window (e.g., every Sunday 02:00-04:00 UTC) that auto-silences matching alerts
Silenced alerts are still evaluated and recorded in the alert history. They simply don't trigger notifications. Review your silences regularly — stale silences can mask real incidents.
Log Explorer #
The Log Explorer at /logs provides full-text search across all ingested log data with sub-second response times.
Search syntax
Log queries support a structured search syntax:
```
# Simple text search
connection refused

# Field-specific search
service:api-gateway level:error

# Wildcard patterns
message:timeout* host:prod-web-*

# Numeric comparisons
status_code:>=500 response_time:>2000

# Boolean operators
(level:error OR level:fatal) AND service:payment-api

# Exclude terms
level:error NOT "health check"
```
Filters
Use the left sidebar filters to narrow by time range, log level (DEBUG, INFO, WARN, ERROR, FATAL), service name, host, or custom tags. Applied filters are reflected in the URL — share the URL with teammates to reproduce exact search context.
Synthetic Monitoring #
Synthetic Monitoring at /synthetic performs automated HTTP checks against your endpoints at regular intervals, detecting outages before your users do.
Monitor statuses
| Status | Color | Meaning |
|---|---|---|
| UP | Green | Endpoint responding within expected latency and returning expected status code |
| DEGRADED | Amber | Endpoint responding but with elevated latency (above P95 threshold) |
| DOWN | Red | Endpoint not responding, returning errors, or exceeding timeout |
Consecutive failure thresholds
To prevent alert fatigue from transient network issues, synthetic monitors use a consecutive failure threshold. An incident is only created after the configured number of consecutive failures (default: 5). A single successful check resets the failure counter to zero.
Key metrics
- Availability — Percentage of successful checks over the selected time window
- P95 Latency — 95th percentile response time, the standard measure for user-perceived performance
- Uptime — Continuous availability duration since last outage
- Certificate Expiry — Days until TLS certificate expiration (if HTTPS)
Cooldown period
After an incident is created for a failing monitor, a 6-hour cooldown prevents duplicate incidents for the same monitor. The monitor continues to be checked during cooldown, but no new incidents are created until the cooldown expires or the monitor recovers and fails again.
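Taken together, the failure threshold and cooldown behave like the state machine below (a Python sketch with assumed names; the defaults follow the values in the docs: 5 consecutive failures, 6-hour cooldown):

```python
class SyntheticMonitorState:
    """Sketch of consecutive-failure counting plus the incident cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 6 * 3600):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.consecutive_failures = 0
        self.cooldown_until = 0.0

    def record_check(self, ok: bool, now: float) -> bool:
        """Return True if this check should create a new incident."""
        if ok:
            self.consecutive_failures = 0  # a single success resets the counter
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures < self.failure_threshold:
            return False  # transient failure, below the threshold
        if now < self.cooldown_until:
            return False  # still in cooldown: keep checking, no duplicate incident
        self.cooldown_until = now + self.cooldown_s
        return True
```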
Session Replay #
Session Replay at /replay records and plays back real user sessions on your web application, allowing you to see exactly what users saw when they encountered an error.
How it works
The replay recorder captures DOM mutations, mouse movements, clicks, scrolls, and console errors — not screenshots. This means recordings are lightweight (typically 50-200 KB per minute) and don't capture sensitive input field values.
Key features
- Error-linked replays — Jump directly to the moment an error occurred
- Console overlay — See JavaScript errors and network failures alongside the visual replay
- Timeline scrubbing — Scrub forward/backward through the session timeline
- Speed control — Play at 1x, 2x, 4x, or 8x speed
Distributed Tracing #
The Tracing page at /tracing visualizes request flows across microservices. Each trace shows the complete journey of a request from ingress to response, with timing breakdowns per service hop.
Trace anatomy
- Trace — The complete end-to-end journey of a single request
- Span — A single unit of work within a trace (e.g., an HTTP call, database query, or cache lookup)
- Parent-child relationships — Spans are nested to show causality (Service A called Service B, which called the database)
Use tracing to identify slow service dependencies, detect N+1 query patterns, and understand cascading failure paths.
Service Catalog #
The Service Catalog at /services is your central registry of all monitored services. Services can be auto-discovered by the agent or manually registered.
Service properties
- Name — Human-readable service name
- Tier — Criticality level (Tier 1: Revenue-critical, Tier 2: User-facing, Tier 3: Internal)
- Owner — Team or individual responsible for the service
- Status — Current operational status (Operational, Degraded, Outage)
- Dependencies — Upstream and downstream service connections
- SLO — Service level objective (e.g., 99.9% availability, P95 latency < 200ms)
Auto-discovery
When the Nova AI Agent detects a new process listening on a network port, it can automatically register it as a service. Auto-discovered services appear with a "Discovered" badge and require manual confirmation to be promoted to the active catalog.
Service Map #
The Service Map at /service-map provides a real-time, interactive visualization of your service dependency graph. Each node represents a service, and edges represent network communication between them.
- Node color — Reflects service health (green = healthy, amber = degraded, red = failing)
- Edge thickness — Proportional to request volume between services
- Edge color — Changes to red when error rate between services exceeds threshold
- Click a node — Opens a detail panel showing metrics, recent incidents, and dependencies for that service
On-Call Management #
Configure on-call rotations and escalation policies to ensure the right person is notified when incidents occur.
Rotation types
- Weekly rotation — Handoff every Monday at a configurable time
- Daily rotation — Handoff every 24 hours
- Custom schedule — Define specific shifts with overlap periods
Escalation policies
Define multi-tier escalation for each severity level:
- Tier 1 — Notify primary on-call via Slack (immediate)
- Tier 2 — If not acknowledged in 10 minutes, notify secondary on-call + email
- Tier 3 — If not acknowledged in 20 minutes, notify engineering manager + phone call
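The tiers above amount to a simple delay schedule (illustrative Python; the tier strings and function name are assumptions, not the platform's API):

```python
# Minutes-since-alert thresholds paired with notification targets,
# matching the three tiers described above.
ESCALATION_TIERS = [
    (0, "primary on-call via Slack"),
    (10, "secondary on-call + email"),
    (20, "engineering manager + phone call"),
]

def active_tier(minutes_since_alert: float, acknowledged: bool):
    """Return the current notification target for an unacknowledged alert,
    or None once it has been acknowledged."""
    if acknowledged:
        return None
    tier = None
    for delay, target in ESCALATION_TIERS:
        if minutes_since_alert >= delay:
            tier = target
    return tier
```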
Certificate Manager Overview #
The Certificate Manager provides a centralized view of all TLS/SSL certificates across your domains. Track expiration dates, health scores, and renewal status from a single dashboard.
Managing Certificates #
Viewing certificates
The certificate list displays each certificate's domain, issuer, expiration date, and health score. Certificates are color-coded by health:
| Health Score | Color | Meaning |
|---|---|---|
| 80 – 100 | Green | Healthy — more than 80 days until expiration |
| 30 – 79 | Amber | Attention — certificate will expire within 80 days |
| 0 – 29 | Red | Critical — certificate expires within 30 days or already expired |
Adding a new certificate
- Click Add Certificate in the top-right corner
- Enter the domain name (e.g., api.example.com)
- Select the certificate type (single domain, wildcard, or multi-domain SAN)
- Choose verification method (DNS TXT record or HTTP file verification)
- Submit and complete the verification challenge
Renewing certificates
Click the Renew button on any certificate card. For certificates with auto-renew enabled, renewal happens automatically 30 days before expiration. Manual renewal is available at any time.
Auto-renew toggle
Enable auto-renew on individual certificates to eliminate manual renewal. When enabled, Nova AI Ops will attempt renewal 30 days before expiry. If DNS verification is configured, the entire process is automatic. You'll receive a notification when auto-renewal succeeds or if it requires manual intervention.
Revoking and deleting
- Revoke — Invalidates the certificate immediately. Use this if the private key is compromised. Revocation is irreversible.
- Delete — Removes the certificate from Nova AI Ops tracking. The certificate itself remains valid until its natural expiration date.
DNS verification
For domain validation, add the provided TXT record to your DNS zone. Nova AI Ops checks for the record automatically and completes verification once detected (typically within 1-5 minutes).
```
# Add this TXT record to your DNS zone
_acme-challenge.api.example.com TXT "dGVzdC12ZXJpZmljYXRpb24..."
```
Certificate Health Score #
The health score (0-100) is a composite metric based on:
- Days until expiration (primary factor) — Certificates expiring sooner score lower
- Certificate chain validity — Broken chains reduce the score
- Key strength — RSA 2048+ or ECC P-256+ required for full score
- Protocol support — TLS 1.2+ required, TLS 1.3 preferred
The aggregate certificate health across all domains contributes to your overall System Status health score.
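As a rough illustration of how such a composite score could combine these factors — the actual weighting is not documented, so the 70/10/10/10 split below is purely an assumption:

```python
def certificate_health_score(days_to_expiry: float, chain_valid: bool,
                             key_strong: bool, tls_modern: bool) -> int:
    """Hypothetical composite: expiry dominates (up to 70 points), with
    10 points each for chain validity, key strength, and TLS version."""
    # Clamp expiry contribution so ~110+ days to expiry earns full marks
    expiry_component = max(0.0, min(1.0, days_to_expiry / 110)) * 70
    score = expiry_component
    score += 10 if chain_valid else 0
    score += 10 if key_strong else 0
    score += 10 if tls_modern else 0
    return round(score)
```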
Integrations Overview #
Nova AI Ops integrates with your existing infrastructure and tooling. Navigate to Settings > Integrations to configure connections.
Infrastructure Integrations #
Grafana
Connect your Grafana instance to import dashboards, sync alert rules, and display Grafana panels natively within Nova AI Ops. Configure at /integrations/grafana.
```json
{
  "grafana_url": "https://grafana.internal.example.com",
  "api_key": "glsa_...",
  "sync_dashboards": true,
  "sync_alerts": true
}
```
AWS
The AWS integration at /integrations/aws pulls CloudWatch metrics, ECS/EKS container health, RDS performance insights, and S3 bucket metrics. Requires an IAM role with read-only access to CloudWatch, ECS, EC2, and RDS.
Docker
The Docker integration at /integrations/docker monitors container lifecycle events, resource utilization per container, image vulnerability status, and network connectivity between containers. The Nova AI Agent detects Docker automatically when installed on a host running the Docker daemon.
Redis
Monitor Redis memory usage, connected clients, keyspace hit/miss ratio, replication lag, and slow log entries.
MongoDB
Track MongoDB operation counters, replication set status, connection pool utilization, and slow query patterns.
PostgreSQL
Monitor active connections, transaction throughput, index hit ratios, table bloat, and replication lag.
Databricks / Splunk
Ingest analytics and log data from Databricks workspaces and Splunk indexes. Configure forwarding rules to stream relevant events into Nova AI Ops for correlation with infrastructure metrics.
AI Model Integrations #
Nova AI Ops can monitor and manage AI model endpoints:
- Anthropic (Claude) — Monitor API latency, token usage, and error rates
- OpenAI (GPT) — Track request volumes, rate limit utilization, and cost per request
- Google Gemini — Monitor multimodal request throughput and response quality
- DeepSeek — Track inference latency and throughput across model versions
AI model integrations provide specialized dashboards showing token consumption trends, cost projections, and performance comparison across providers.
Webhook Gateway #
The Webhook Gateway allows external tools to push alerts and events into Nova AI Ops via HTTP POST. Each organization gets a unique webhook URL:
```bash
curl -X POST https://app.novaaiops.com/api/webhooks/ingest \
  -H "Authorization: Bearer YOUR_WEBHOOK_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "source": "custom-monitor",
    "severity": "warning",
    "title": "Disk space low on db-primary",
    "message": "Disk usage at 92% on /data volume",
    "tags": {"host": "db-primary", "region": "us-east-1"}
  }'
```
Incoming webhooks are processed through the standard incident pipeline — they can trigger alerts, create incidents, and appear on the incident timeline alongside native detections.
Nova AI Copilot #
The Nova AI Copilot is a natural language interface accessible from any page in the platform. Open it by clicking the chat icon in the bottom-right corner or pressing Ctrl + /.
The Copilot can:
- Navigate to any page in the platform
- Create, update, and manage Sprinta tasks
- Query metrics and incident data
- Manage certificates (view, renew, check status)
- Show system stats and service health
- Explain incidents and suggest remediation steps
Copilot Commands #
Navigation
"Navigate to dashboard"
"Go to incident timeline"
"Open the service map"
"Show me the golden signals page"
Certificates
"Show my certificates"
"Which certificates expire this month?"
"Renew the certificate for api.example.com"
"What's the health score of my certificates?"
Tasks (Sprinta)
"Create task Fix login bug on the auth service"
"Show my open tasks"
"Move task SPRINT-42 to In Progress"
"What tasks are in review?"
Metrics and incidents
"What's the current CPU usage?"
"Are there any active incidents?"
"Show me the P95 latency for the API gateway"
"How many incidents did we have this week?"
The Copilot understands context. If you're viewing the Incident Timeline and ask "Tell me about this one," it will describe the most recent or selected incident.
Sprinta Task Management #
Sprinta is Nova AI's built-in task management system, designed for engineering teams that want to track work alongside their operational tooling. Access it from the sidebar or via /sprinta.
Kanban Board #
The Kanban board organizes tasks into five columns:
| Column | Purpose |
|---|---|
| To Do | Tasks that are planned but not started |
| In Progress | Tasks currently being worked on |
| Needs Support | Blocked tasks awaiting input from another team member |
| Review | Tasks completed and awaiting peer review or QA |
| Completed | Done and verified tasks |
Drag and drop tasks between columns. Each task card shows the title, assignee avatar, priority badge, and due date. Click a card to open the detail view with description, comments, attachments, and activity log.
Creating tasks
- Manual — Click "New Task" and fill in the title, description, priority, and assignee
- From incident — Action items from postmortems automatically create Sprinta tasks
- Via AI Copilot — Say "Create task [description]" and the Copilot creates it instantly
- Import from Jira — Bulk import existing Jira tickets via CSV or API integration
Bulk operations
Select multiple tasks using checkboxes and apply bulk actions: assign, change priority, move to column, or archive. Useful for sprint planning and backlog grooming.
Nova AI in Sprinta #
The Sprinta board includes a dedicated AI chat panel. Use it for context-aware assistance:
- Summarize all tasks in a column
- Suggest task breakdown for large features
- Draft task descriptions from a short prompt
- Identify blocked tasks and suggest unblocking steps
- Generate sprint reports and velocity metrics
User Roles #
Nova AI Ops uses a six-role hierarchy. Each role inherits all permissions of the roles below it.
| Role | Level | Capabilities |
|---|---|---|
| Founder | 5 | Full platform access. Manage tenants, users, billing, and all admin pages. Cannot be deleted. |
| Nova Admin | 4 | Platform-wide administration. Manage integrations, agents, and organization settings. |
| Organization Owner | 3 | Full control over their organization. Manage team members, services, and billing. |
| Organization Admin | 3 | Manage team members, configure alerts, and administer organization settings. |
| Engineer | 2 | View all dashboards, manage incidents, configure monitors, and create runbooks. |
| Viewer | 1 | Read-only access to dashboards, incidents, and metrics. Cannot modify configurations. |
Team Invitations #
Invite team members from /team-members:
- Click Invite Member
- Enter the email address and select a role
- The invitee receives an email with a signup link pre-associated with your organization
- Once accepted, they appear in your team roster and can access pages permitted by their role
Pending invitations can be revoked. Invitations expire after 7 days if not accepted.
Two-Factor Authentication #
Enable 2FA from Settings > Security:
- Click Enable Two-Factor Authentication
- Scan the QR code with your authenticator app (Google Authenticator, Authy, 1Password)
- Enter the 6-digit verification code to confirm setup
- Save your recovery codes in a secure location
Recovery codes are shown only once during setup. Store them securely — they are the only way to regain access if you lose your authenticator device. Each recovery code can be used exactly once.
Profile & Preferences #
Access your profile from /settings. Customize:
- Display name — Shown across all platform interactions
- Profile photo — Uploaded photos are used everywhere an avatar is displayed (sidebar, comments, team roster, incident assignments)
- Timezone — All timestamps in the UI are converted to your selected timezone
- Theme — Dark (default) or light mode
- Notification preferences — Configure which events trigger email, in-app, or push notifications
Page Control #
Available to Founder role at /page-control. This admin tool lets you show or hide navigation items for your organization. Use it to:
- Enable new feature pages as they become relevant
- Hide pages that aren't applicable to your team's workflow
- Customize the sidebar navigation for your organization
Hidden pages are only removed from navigation — direct URL access still respects role-based permissions.
Billing & Plans #
| Plan | Price | Includes |
|---|---|---|
| Free | $0/month | 3 users, 5 services, 7-day data retention, community support |
| Team | $29/user/month | Unlimited services, 30-day retention, Slack integration, synthetic monitoring |
| Pro | $59/user/month | Everything in Team + AI Runbooks, Session Replay, 90-day retention, SSO |
| Enterprise | Custom | Unlimited retention, dedicated support, SLA, custom integrations, on-premise option |
Manage your subscription from Settings > Billing. All plans include a 14-day free trial of Pro features. Upgrades take effect immediately; downgrades take effect at the end of the current billing cycle.
API Authentication #
The Nova AI Ops API uses JWT (JSON Web Token) authentication. Tokens are issued on login and included in all subsequent requests.
Session-based (browser)
The web application uses HTTP-only secure cookies for session management. Tokens are automatically refreshed before expiration.
API key (programmatic)
For programmatic access, generate an API key from Settings > API Keys. Include it in the Authorization header:
```bash
curl -H "Authorization: Bearer YOUR_API_KEY" \
  https://app.novaaiops.com/api/services
```
Never expose API keys in client-side code, public repositories, or logs. If a key is compromised, revoke it immediately from the API Keys settings page.
Key Endpoints #
| Endpoint | Method | Description |
|---|---|---|
| /api/health | GET | Platform health check. Returns {"status":"ok"} |
| /api/incidents | GET | List incidents. Supports ?status=active&severity=SEV-1 filters |
| /api/incidents | POST | Create a new incident manually |
| /api/alerts | GET | List configured alert rules |
| /api/alerts | POST | Create a new alert rule |
| /api/services | GET | List all registered services |
| /api/synthetic/monitors | GET | List synthetic monitors with current status |
| /api/synthetic/monitors | POST | Create a new synthetic monitor |
| /api/certificates | GET | List all tracked certificates |
| /api/certificates | POST | Add a new certificate to track |
| /api/metrics/ingest | POST | Ingest metrics from the Nova AI Agent |
Response format
All API responses follow a consistent JSON envelope:
```json
{
  "success": true,
  "data": [ ... ],
  "meta": {
    "total": 42,
    "page": 1,
    "per_page": 20
  }
}
```
Error responses include a human-readable message and machine-parseable error code:
```json
{
  "success": false,
  "error": {
    "code": "RATE_LIMIT_EXCEEDED",
    "message": "Rate limit exceeded. Try again in 30 seconds."
  }
}
```
For the complete API reference with request/response schemas, visit api.novaaiops.com.
Rate Limiting #
| Plan | Rate Limit | Burst |
|---|---|---|
| Free | 60 requests/minute | 10 requests/second |
| Team | 300 requests/minute | 30 requests/second |
| Pro | 1,000 requests/minute | 100 requests/second |
| Enterprise | Custom | Custom |
Rate limit headers are included in every response:
```
X-RateLimit-Limit: 300
X-RateLimit-Remaining: 297
X-RateLimit-Reset: 1679529600
```
When rate limited, the API returns 429 Too Many Requests. Implement exponential backoff in your client code to handle this gracefully.
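A minimal backoff wrapper for 429 responses might look like this (a Python sketch; `call` is any zero-argument function returning a response-like object with a `status_code` attribute — not an official client):

```python
import random
import time

def with_backoff(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry `call` with exponential backoff plus jitter while it returns 429."""
    for attempt in range(max_attempts):
        resp = call()
        if resp.status_code != 429:
            return resp
        # Double the delay each attempt, randomized to avoid thundering herds
        delay = base_delay * (2 ** attempt) * (0.5 + random.random() / 2)
        sleep(delay)
    return resp  # give up and return the last rate-limited response
```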
Nova AI Agent Overview #
The Nova AI Agent is a lightweight, zero-dependency metrics collector that runs on your infrastructure. It is the single source of truth for all metrics in Nova AI Ops — every data point displayed in dashboards originates from the agent.
What it collects
- CPU — Per-core utilization, load averages (1m, 5m, 15m), steal time, iowait
- Memory — Used, available, cached, buffered, swap usage
- Disk — Space utilization per mount, read/write IOPS, throughput, latency
- Network — Bytes in/out per interface, packet errors, connection states
- Processes — Top processes by CPU and memory, process count, zombie detection
- Containers — Docker/Podman container metrics (CPU, memory, network per container)
- Custom metrics — Application-level metrics via collector plugins
Architecture
The agent uses a collector registry pattern. Each metric type has a dedicated collector that runs on a configurable interval (default: 10 seconds). Collected metrics are batched and sent to the Nova AI Ops backend via HTTPS. If the backend is unreachable, metrics are buffered to disk and retried with exponential backoff.
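The collector-registry pattern described above can be sketched as follows (illustrative Python, not the agent's actual code; the point is that collectors are registered by name and one failing collector cannot break the batch):

```python
class CollectorRegistry:
    """Named collectors whose samples are gathered into a single batch."""

    def __init__(self):
        self._collectors = {}

    def register(self, name, collect_fn):
        """Add a collector: a zero-argument callable returning a sample."""
        self._collectors[name] = collect_fn

    def collect_all(self):
        """Run every collector once and batch the results for sending."""
        batch = {}
        for name, collect in self._collectors.items():
            try:
                batch[name] = collect()
            except Exception:
                batch[name] = None  # a failing collector must not break the batch
        return batch
```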
Agent Installation #
Linux (systemd)
```bash
# One-line install
curl -sSL https://get.novaaiops.com/agent | bash

# Set your API key
nova-agent config set api-key YOUR_API_KEY

# Start and enable on boot
sudo systemctl start nova-ai-agent
sudo systemctl enable nova-ai-agent

# Verify it's running
sudo systemctl status nova-ai-agent
```
Docker
```bash
docker run -d \
  --name nova-ai-agent \
  --restart unless-stopped \
  --pid host \
  --net host \
  -v /proc:/host/proc:ro \
  -v /sys:/host/sys:ro \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -e NOVA_API_KEY=YOUR_API_KEY \
  -e NOVA_API_URL=https://app.novaaiops.com/api/metrics/ingest \
  novaaiops/agent:latest
```
Agent Configuration #
The agent reads configuration from multiple sources (highest priority first):
- CLI flags
- Environment variables (NOVA_*)
- Config file (/etc/nova-agent/config.yml)
- Built-in defaults
```yaml
# /etc/nova-agent/config.yml
api_key: "your-api-key-here"
api_url: "https://app.novaaiops.com/api/metrics/ingest"
collect_interval: 10  # seconds
send_interval: 30     # seconds (batch sends)
buffer_dir: "~/.nova-agent/buffer"

collectors:
  cpu: true
  memory: true
  disk: true
  network: true
  process: true
  docker: true  # auto-detected

health_server:
  enabled: true
  port: 9100  # /health, /ready, /metrics

resilience:
  retry_max_attempts: 5
  circuit_breaker_threshold: 5
  buffer_max_size_mb: 100
```
Health endpoints
When the health server is enabled (default port 9100), the agent exposes:
| Endpoint | Purpose |
|---|---|
| /health | Liveness check — returns 200 if the agent process is running |
| /ready | Readiness check — returns 200 if collectors are initialized and backend is reachable |
| /metrics | Prometheus-format metrics about the agent itself (send success/failure counts, buffer size, etc.) |
| /status | JSON status report with uptime, collector states, and connection status |
The agent includes a circuit breaker that prevents flooding the backend during outages, exponential backoff with jitter for retries, and a disk-backed metric buffer (default: 100 MB) that persists unsent metrics across agent restarts.
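The circuit-breaker behavior can be sketched as follows (illustrative Python; the threshold default mirrors the `circuit_breaker_threshold: 5` setting shown earlier, while the recovery timeout value is an assumption):

```python
class CircuitBreaker:
    """Open after N consecutive send failures; allow a half-open probe
    once a recovery timeout has elapsed."""

    def __init__(self, threshold: int = 5, recovery_timeout_s: float = 60.0):
        self.threshold = threshold
        self.recovery_timeout_s = recovery_timeout_s
        self.failures = 0
        self.opened_at = None  # None while the circuit is closed

    def allow_request(self, now: float) -> bool:
        if self.opened_at is None:
            return True  # circuit closed: send normally
        # Circuit open: permit a probe only after the recovery timeout
        return now - self.opened_at >= self.recovery_timeout_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None  # close the circuit again

    def record_failure(self, now: float):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = now  # trip the breaker: stop flooding the backend
```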