Getting Started

What is Nova AI Ops? #

Nova AI Ops is an AI-powered incident management and observability platform built for engineering teams that operate production infrastructure. It unifies metrics, logs, traces, synthetic monitoring, and incident workflows in a single interface, with AI-driven detection, automated runbooks, and real-time collaboration.

Unlike traditional monitoring tools that require manual correlation across dozens of dashboards, Nova AI Ops automatically connects signals across your entire stack and surfaces the most likely root cause within seconds of an anomaly.

Core capabilities

  • Real-time Observability — Live metrics, logs, and traces from your infrastructure via the Nova AI Agent
  • AI-Powered Incident Detection — Machine learning models detect anomalies before they become outages
  • Automated Runbooks — Pre-defined remediation playbooks execute automatically on incident trigger
  • Synthetic Monitoring — Proactive endpoint checks with configurable failure thresholds
  • Certificate Management — Track, renew, and auto-manage TLS certificates across your domains
  • Sprinta Task Management — Built-in Kanban workflow with AI-assisted task creation
  • Service Map — Automatic dependency visualization across your microservices
  • Nova AI Copilot — Natural language interface to navigate, query, and manage your platform

Quick Start Guide #

Get from zero to your first dashboard in under 10 minutes.

  1. Create your account

    Visit app.novaaiops.com and sign up with your work email. You'll receive a verification link. Your organization is created automatically on first login.

  2. Install the Nova AI Agent

    The agent is a lightweight daemon that runs on your infrastructure and streams metrics to Nova AI Ops. Install it on any Linux host:

    bash
    # Download and install the Nova AI Agent
    curl -sSL https://get.novaaiops.com/agent | bash
    
    # Configure with your API key (found in Settings > API Keys)
    nova-agent config set api-key YOUR_API_KEY
    
    # Start the agent
    sudo systemctl start nova-ai-agent
    sudo systemctl enable nova-ai-agent

  3. Connect your integrations

    Navigate to Settings > Integrations and connect your cloud providers, container platforms, and notification channels. Nova AI Ops supports AWS, Docker, Grafana, Slack, and more.

  4. View your first dashboard

    Once the agent is reporting, navigate to /dashboard. You'll see real-time CPU, memory, disk, and network metrics within 30 seconds of agent startup.

Tip

The agent begins collecting metrics immediately upon startup. No additional configuration is required for system-level metrics (CPU, memory, disk, network). Application-level metrics require collector plugins — see the Agent Configuration section.

System Requirements #

Nova AI Agent

Requirement | Minimum | Recommended
Operating System | Linux (kernel 4.14+) | Amazon Linux 2023, Ubuntu 22.04+
CPU | 1 core | 2 cores
Memory | 128 MB | 256 MB
Disk | 50 MB | 200 MB (for metric buffering)
Network | HTTPS outbound to *.novaaiops.com | Same

Supported browsers

Browser | Minimum Version
Chrome / Edge | 90+
Firefox | 90+
Safari | 15+
Mobile Safari / Chrome | iOS 15+ / Android 10+

Real-time Metrics Dashboard #

The main dashboard at /dashboard is your operational command center. It displays live metrics streamed from all connected agents via WebSocket, updating every second without page refresh.

Dashboard panels

  • CPU Utilization — Current usage percentage across all cores, with sparkline history
  • Memory Usage — Used, available, and cached memory with trend indicator
  • Disk I/O — Read/write throughput and IOPS per volume
  • Network Traffic — Inbound/outbound bandwidth with packet error rates
  • Active Incidents — Count badge showing open SEV-1/2/3 incidents, color-coded by highest severity
  • Service Health — Aggregate status of all registered services

All dashboard metrics originate from the Nova AI Agent. The agent pushes data to backend ingestion endpoints, which persist to the database and emit real-time Socket.IO events to connected frontends.

Architecture

The data pipeline is: nova-ai-agent → API ingestion → Database → Socket.IO → Dashboard. There are no synthetic or mocked metrics. Every data point originates from your actual infrastructure.

System Status #

The System Status page at / (the homepage) provides a high-level operational overview using a donut chart and service group cards.

Health score

The central donut displays an aggregate health score from 0 to 100, calculated from service availability, active incident severity, and synthetic monitor pass rates. The donut color reflects overall status:

Score | Color | Meaning
90 – 100 | Green | All systems operational
70 – 89 | Amber | Degraded performance or minor incidents
0 – 69 | Red | Major incident or significant outage
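
The score bands in the table translate directly into a lookup. The following is a minimal sketch in Python; the function name `status_for_score` is hypothetical and not part of any Nova AI Ops API, and the way the score itself is calculated is not shown.

```python
def status_for_score(score: int) -> tuple[str, str]:
    """Return the (color, meaning) band for a 0-100 health score."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score >= 90:
        return ("Green", "All systems operational")
    if score >= 70:
        return ("Amber", "Degraded performance or minor incidents")
    return ("Red", "Major incident or significant outage")
```

For example, a score of 89 falls in the Amber band and a score of 69 in the Red band.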

Service groups

Below the donut, services are grouped by category (API, Database, Cache, Frontend, etc.). Each group shows the number of healthy vs. unhealthy services. Clicking a group navigates to the Service Catalog filtered to that category.

When an incident occurs, the dashboard and overview page colors update dynamically from green to amber to red based on the highest active incident severity.

Performance Trends #

The Performance Trends page at /trends provides historical analysis of key metrics over configurable time ranges.

Time range selection

Use the time picker in the top-right corner to select from preset ranges (Last 1h, 6h, 24h, 7d, 30d) or define a custom range. All charts update simultaneously when the time range changes.

Available trend views

  • Resource Utilization — CPU, memory, and disk usage over time with percentile bands (P50, P95, P99)
  • Request Rate — HTTP requests per second with breakdown by status code class (2xx, 4xx, 5xx)
  • Error Rate — Error percentage with anomaly detection overlay
  • Latency Distribution — Response time percentiles shown as area charts

Golden Signals #

The Golden Signals page at /golden-signals implements the four golden signals of monitoring defined in Google's Site Reliability Engineering book. These are the most critical indicators of system health.

Signal | What It Measures | Key Metric
Latency | Time to serve a request | P95 response time
Traffic | Demand on your system | Requests per second
Errors | Rate of failed requests | 5xx error percentage
Saturation | How full your resources are | CPU/memory/disk utilization

Each signal is displayed as a real-time gauge with sparkline history. When any signal crosses its configured threshold, the gauge changes color and (if alerting is enabled) triggers an alert through your configured notification channels.

SRE Best Practice

Google's SRE methodology recommends alerting on symptoms (the golden signals) rather than causes. If your latency P95 exceeds your SLO, that's an actionable signal — the root cause (CPU, GC, database, network) is secondary to the user impact.

Incident Timeline #

The Incident Timeline at /incidents/timeline displays all active and recently resolved incidents in chronological order. Each incident card shows:

  • Severity — Color-coded badge (SEV-1 red, SEV-2 amber, SEV-3 blue)
  • Title — Auto-generated or manually set description
  • Duration — Time since detection (or total duration if resolved)
  • Affected Services — List of impacted services from the Service Catalog
  • Current Phase — Detection, Triage, Mitigation, or Resolved

Severity levels

Level | Impact | Response Expectation
SEV-1 | Complete service outage, data loss risk, or security breach | Immediate response, all-hands
SEV-2 | Significant degradation affecting many users | Respond within 15 minutes
SEV-3 | Minor issue, limited user impact | Respond within 1 hour

Incident Lifecycle #

Every incident in Nova AI Ops follows a structured lifecycle:

  1. Detection

    An incident is created automatically when an alert fires, a synthetic monitor exceeds its consecutive failure threshold, or an AI anomaly is detected. Incidents can also be created manually.

  2. Triage

    The on-call engineer is notified. They assess the incident scope, assign a severity level, and determine affected services. The AI Copilot suggests related past incidents and relevant runbooks.

  3. Mitigation

    The team works to reduce user impact. This might involve rolling back a deployment, scaling up resources, or activating an automated runbook. All actions are logged in the incident timeline.

  4. Resolution

    The root cause is addressed and services are restored. The incident is marked resolved, which triggers a recovery notification and resets associated monitor states.

  5. Postmortem

    Within 48 hours, the team completes a blameless postmortem documenting the timeline, root cause, impact, and action items. See the Postmortem section.

Incident History & Archive #

The Incident Archive at /incidents/history provides a searchable, filterable record of all past incidents. Use the filters to narrow by severity, date range, affected service, or resolution time. This data feeds into trend analysis — helping you identify recurring failure patterns and track MTTR (Mean Time to Resolution) improvements.

AI Runbooks #

AI Runbooks at /runbooks are automated remediation playbooks that execute predefined steps when triggered by an incident. Unlike static runbooks, Nova's AI Runbooks adapt their execution based on the specific context of each incident.

Creating a runbook

yaml
name: High CPU Remediation
trigger:
  metric: cpu_utilization
  condition: "> 90%"
  duration: 5m
steps:
  - action: identify_top_processes
    params: { limit: 10 }
  - action: check_recent_deployments
    params: { window: "2h" }
  - action: scale_horizontally
    params: { increment: 2 }
    requires_approval: true

Runbooks support approval gates for destructive actions, audit logging for every step executed, and automatic rollback if a step fails.

Postmortem #

The Postmortem page at /postmortem provides a structured, blameless review template. Each postmortem includes:

  • Summary — One-paragraph description of what happened
  • Timeline — Minute-by-minute account of detection, response, and resolution
  • Root Cause — The fundamental reason the incident occurred
  • Impact — Quantified user impact (requests affected, downtime duration, revenue loss)
  • Action Items — Concrete tasks with owners and due dates to prevent recurrence
  • Lessons Learned — What went well and what needs improvement

Best Practice

Nova AI automatically populates the postmortem timeline from incident events, reducing manual effort. Focus your writing on root cause analysis and action items — these are the parts that drive real improvement.

Creating Alert Rules #

Navigate to /alerts to configure alert rules. Nova AI Ops supports three types of alerts:

Threshold alerts

Trigger when a metric crosses a static threshold. Best for well-understood metrics with predictable ranges.

json
{
  "name": "High Memory Usage",
  "type": "threshold",
  "metric": "memory.used_percent",
  "condition": "above",
  "threshold": 85,
  "duration": "5m",
  "severity": "SEV-2",
  "channels": ["slack-ops", "email-oncall"]
}

Anomaly detection alerts

Use machine learning to detect unusual patterns without requiring manual thresholds. Nova AI builds a baseline from historical data and alerts when behavior deviates significantly. Ideal for metrics with seasonal patterns or variable baselines.

Heartbeat alerts

Trigger when an expected signal stops arriving. Use these to detect silent failures — if a service should be reporting metrics every 60 seconds, a heartbeat alert will fire after the configured silence window (e.g., 5 minutes of no data).

Alert conditions

Condition | Behavior
above | Triggers when metric exceeds threshold
below | Triggers when metric drops below threshold
equal | Triggers when metric equals a specific value
absent | Triggers when no data is received within the window
anomaly | Triggers on statistical deviation from baseline
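
The condition table can be read as a small evaluation function. The sketch below is illustrative only: `condition_met` is a hypothetical helper, it checks a single sample rather than a whole `duration` window, and it leaves out the `anomaly` condition because that depends on a learned baseline.

```python
from typing import Optional


def condition_met(
    condition: str,
    value: Optional[float],
    threshold: Optional[float] = None,
) -> bool:
    """Evaluate one alert condition against the latest sample.

    `value` is None when no data arrived within the evaluation window.
    """
    if condition == "absent":
        return value is None
    if condition == "anomaly":
        # Needs a baseline model, which is outside the scope of this sketch.
        raise NotImplementedError("anomaly detection requires a baseline")
    if condition not in ("above", "below", "equal"):
        raise ValueError(f"unknown condition: {condition}")
    if value is None or threshold is None:
        return False  # nothing to compare against
    if condition == "above":
        return value > threshold
    if condition == "below":
        return value < threshold
    return value == threshold
```

With the "High Memory Usage" rule shown earlier, `condition_met("above", 91.0, 85)` is true, while a reading of exactly 85 does not trigger because `above` is a strict comparison in this sketch.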

Notification Channels #

Configure where and how alerts are delivered:

  • Email — Individual or distribution list. Supports HTML-formatted incident summaries.
  • Slack — Posts to a channel with severity-colored attachments and action buttons (Acknowledge, Resolve).
  • Microsoft Teams — Adaptive card notifications with direct links to the incident.
  • Webhooks — POST JSON payloads to any HTTP endpoint. Use this to integrate with PagerDuty, Opsgenie, or custom tooling.

Each alert rule can target multiple channels. You can also configure escalation policies — if an alert isn't acknowledged within N minutes, it escalates to the next tier (e.g., from Slack to phone call).

Silencing & Maintenance Windows #

Suppress alerts during planned maintenance or known noisy periods:

  • Silence by rule — Mute a specific alert rule for a defined duration
  • Silence by service — Mute all alerts for a service (useful during deployments)
  • Maintenance window — Schedule a recurring window (e.g., every Sunday 02:00-04:00 UTC) that auto-silences matching alerts

Warning

Silenced alerts are still evaluated and recorded in the alert history. They simply don't trigger notifications. Review your silences regularly — stale silences can mask real incidents.

Log Explorer #

The Log Explorer at /logs provides full-text search across all ingested log data with sub-second response times.

Search syntax

Log queries support a structured search syntax:

text
# Simple text search
connection refused

# Field-specific search
service:api-gateway level:error

# Wildcard patterns
message:timeout* host:prod-web-*

# Numeric comparisons
status_code:>=500 response_time:>2000

# Boolean operators
(level:error OR level:fatal) AND service:payment-api

# Exclude terms
level:error NOT "health check"

Filters

Use the left sidebar filters to narrow by time range, log level (DEBUG, INFO, WARN, ERROR, FATAL), service name, host, or custom tags. Applied filters are reflected in the URL — share the URL with teammates to reproduce exact search context.

Synthetic Monitoring #

Synthetic Monitoring at /synthetic performs automated HTTP checks against your endpoints at regular intervals, detecting outages before your users do.

Monitor statuses

Status | Color | Meaning
UP | Green | Endpoint responding within expected latency and returning expected status code
DEGRADED | Amber | Endpoint responding but with elevated latency (above P95 threshold)
DOWN | Red | Endpoint not responding, returning errors, or exceeding timeout

Consecutive failure thresholds

To prevent alert fatigue from transient network issues, synthetic monitors use a consecutive failure threshold. An incident is only created after the configured number of consecutive failures (default: 5). A single successful check resets the failure counter to zero.
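
The counter behavior described above can be sketched as follows. The class and method names are hypothetical; only the default threshold of 5 and the reset-on-success rule come from the text.

```python
from dataclasses import dataclass


@dataclass
class FailureCounter:
    threshold: int = 5  # default consecutive failure threshold
    consecutive_failures: int = 0

    def record(self, check_passed: bool) -> bool:
        """Record one check result.

        Returns True only on the check that brings the counter up to the
        threshold, which is the point where an incident would be created.
        """
        if check_passed:
            self.consecutive_failures = 0  # a single success resets the counter
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures == self.threshold
```

Four failures followed by one success leave the counter at zero, so a transient blip never reaches the threshold.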

Key metrics

  • Availability — Percentage of successful checks over the selected time window
  • P95 Latency — 95th percentile response time, the standard measure for user-perceived performance
  • Uptime — Continuous availability duration since last outage
  • Certificate Expiry — Days until TLS certificate expiration (if HTTPS)
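
As an illustration of how the first two metrics can be derived from raw check results, here is a small sketch. The function names are hypothetical, and the nearest-rank percentile method is an assumption; the text does not say which percentile method Nova AI Ops uses.

```python
def availability(results: list[bool]) -> float:
    """Percentage of successful checks (True means the check passed)."""
    if not results:
        raise ValueError("no check results in the selected window")
    return 100.0 * sum(results) / len(results)


def p95_latency(latencies_ms: list[float]) -> float:
    """95th percentile response time using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("no latency samples in the selected window")
    ordered = sorted(latencies_ms)
    # ceil(0.95 * n), computed with integers to avoid float rounding
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

Three passing checks out of four give an availability of 75.0, and for the samples 1 through 100 ms the nearest-rank P95 is 95 ms.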

Cooldown period

After an incident is created for a failing monitor, a 6-hour cooldown prevents duplicate incidents for the same monitor. The monitor continues to be checked during cooldown, but no new incidents are created until the cooldown expires or the monitor recovers and fails again.
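
One way to express the cooldown rule is a single predicate. This is a sketch under assumptions: the function name and parameters are hypothetical, and timestamps are plain epoch seconds.

```python
from typing import Optional

COOLDOWN_SECONDS = 6 * 60 * 60  # the 6-hour cooldown described above


def may_open_incident(
    now: float,
    last_incident_at: Optional[float],
    recovered_since_incident: bool,
) -> bool:
    """Decide whether a failing monitor may create a new incident."""
    if last_incident_at is None:
        return True  # no earlier incident for this monitor
    if recovered_since_incident:
        return True  # the monitor recovered and is failing again
    return now - last_incident_at >= COOLDOWN_SECONDS
```

A monitor that has been failing continuously for two hours since its last incident stays suppressed; the same monitor is allowed a new incident once six hours have passed or once it has recovered in between.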

Session Replay #

Session Replay at /replay records and plays back real user sessions on your web application, allowing you to see exactly what users saw when they encountered an error.

How it works

The replay recorder captures DOM mutations, mouse movements, clicks, scrolls, and console errors rather than screenshots, which keeps recordings lightweight (typically 50-200 KB per minute). Sensitive input field values are not captured.

Key features

  • Error-linked replays — Jump directly to the moment an error occurred
  • Console overlay — See JavaScript errors and network failures alongside the visual replay
  • Timeline scrubbing — Scrub forward/backward through the session timeline
  • Speed control — Play at 1x, 2x, 4x, or 8x speed

Distributed Tracing #

The Tracing page at /tracing visualizes request flows across microservices. Each trace shows the complete journey of a request from ingress to response, with timing breakdowns per service hop.

Trace anatomy

  • Trace — The complete end-to-end journey of a single request
  • Span — A single unit of work within a trace (e.g., an HTTP call, database query, or cache lookup)
  • Parent-child relationships — Spans are nested to show causality (Service A called Service B, which called the database)

Use tracing to identify slow service dependencies, detect N+1 query patterns, and understand cascading failure paths.
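
The trace, span, and parent-child concepts above map naturally onto a small tree structure. The sketch below is a generic illustration, not the Nova AI Ops data model; the `Span` fields and the `render_trace` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    name: str
    duration_ms: float


def render_trace(spans: list[Span]) -> list[str]:
    """Return one line per span, with children indented under their parent."""
    children: dict[Optional[str], list[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in children.get(parent_id, []):
            lines.append(f"{'  ' * depth}{span.name} ({span.duration_ms:g} ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines
```

Rendering a trace this way makes the causality chain visible: a gateway span at the top, the service it called one level in, and the database query that service issued one level further.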

Service Catalog #

The Service Catalog at /services is your central registry of all monitored services. Services can be auto-discovered by the agent or manually registered.

Service properties

  • Name — Human-readable service name
  • Tier — Criticality level (Tier 1: Revenue-critical, Tier 2: User-facing, Tier 3: Internal)
  • Owner — Team or individual responsible for the service
  • Status — Current operational status (Operational, Degraded, Outage)
  • Dependencies — Upstream and downstream service connections
  • SLO — Service level objective (e.g., 99.9% availability, P95 latency < 200ms)

Auto-discovery

When the Nova AI Agent detects a new process listening on a network port, it can automatically register it as a service. Auto-discovered services appear with a "Discovered" badge and require manual confirmation to be promoted to the active catalog.

Service Map #

The Service Map at /service-map provides a real-time, interactive visualization of your service dependency graph. Each node represents a service, and edges represent network communication between them.

  • Node color — Reflects service health (green = healthy, amber = degraded, red = failing)
  • Edge thickness — Proportional to request volume between services
  • Edge color — Changes to red when error rate between services exceeds threshold
  • Click a node — Opens a detail panel showing metrics, recent incidents, and dependencies for that service

On-Call Management #

Configure on-call rotations and escalation policies to ensure the right person is notified when incidents occur.

Rotation types

  • Weekly rotation — Handoff every Monday at a configurable time
  • Daily rotation — Handoff every 24 hours
  • Custom schedule — Define specific shifts with overlap periods

Escalation policies

Define multi-tier escalation for each severity level:

  1. Tier 1 — Notify primary on-call via Slack (immediate)
  2. Tier 2 — If not acknowledged in 10 minutes, notify secondary on-call + email
  3. Tier 3 — If not acknowledged in 20 minutes, notify engineering manager + phone call
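
The three tiers above amount to a list of delays and targets. A minimal sketch, assuming the delays are measured in minutes since the alert fired; `ESCALATION_TIERS` and `tiers_notified` are hypothetical names used only for illustration.

```python
# (minutes without acknowledgement, who gets notified), taken from the tiers above
ESCALATION_TIERS = [
    (0, "primary on-call via Slack"),
    (10, "secondary on-call + email"),
    (20, "engineering manager + phone call"),
]


def tiers_notified(minutes_unacknowledged: float) -> list[str]:
    """Return every target that has been notified after the given time."""
    return [
        target
        for delay, target in ESCALATION_TIERS
        if minutes_unacknowledged >= delay
    ]
```

After 12 minutes without acknowledgement, both the primary and the secondary on-call have been notified; the engineering manager is added at the 20-minute mark.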

Certificate Manager Overview #

The Certificate Manager provides a centralized view of all TLS/SSL certificates across your domains. Track expiration dates, health scores, and renewal status from a single dashboard.

Managing Certificates #

Viewing certificates

The certificate list displays each certificate's domain, issuer, expiration date, and health score. Certificates are color-coded by health:

Health Score | Color | Meaning
80 – 100 | Green | Healthy: more than 80 days until expiration
30 – 79 | Amber | Attention: certificate will expire within 80 days
0 – 29 | Red | Critical: certificate expires within 30 days or already expired

Adding a new certificate

  1. Click Add Certificate in the top-right corner
  2. Enter the domain name (e.g., api.example.com)
  3. Select the certificate type (single domain, wildcard, or multi-domain SAN)
  4. Choose verification method (DNS TXT record or HTTP file verification)
  5. Submit and complete the verification challenge

Renewing certificates

Click the Renew button on any certificate card. For certificates with auto-renew enabled, renewal happens automatically 30 days before expiration. Manual renewal is available at any time.

Auto-renew toggle

Enable auto-renew on individual certificates to eliminate manual renewal. When enabled, Nova AI Ops will attempt renewal 30 days before expiry. If DNS verification is configured, the entire process is automatic. You'll receive a notification when auto-renewal succeeds or if it requires manual intervention.

Revoking and deleting

  • Revoke — Invalidates the certificate immediately. Use this if the private key is compromised. Revocation is irreversible.
  • Delete — Removes the certificate from Nova AI Ops tracking. The certificate itself remains valid until its natural expiration date.

DNS verification

For domain validation, add the provided TXT record to your DNS zone. Nova AI Ops checks for the record automatically and completes verification once detected (typically within 1-5 minutes).

text
# Add this TXT record to your DNS zone
_acme-challenge.api.example.com  TXT  "dGVzdC12ZXJpZmljYXRpb24..."

Certificate Health Score #

The health score (0-100) is a composite metric based on:

  • Days until expiration (primary factor) — Certificates expiring sooner score lower
  • Certificate chain validity — Broken chains reduce the score
  • Key strength — RSA 2048+ or ECC P-256+ required for full score
  • Protocol support — TLS 1.2+ required, TLS 1.3 preferred

The aggregate certificate health across all domains contributes to your overall System Status health score.

Integrations Overview #

Nova AI Ops integrates with your existing infrastructure and tooling. Navigate to Settings > Integrations to configure connections.

Infrastructure Integrations #

Grafana

Connect your Grafana instance to import dashboards, sync alert rules, and display Grafana panels natively within Nova AI Ops. Configure at /integrations/grafana.

json
{
  "grafana_url": "https://grafana.internal.example.com",
  "api_key": "glsa_...",
  "sync_dashboards": true,
  "sync_alerts": true
}

AWS

The AWS integration at /integrations/aws pulls CloudWatch metrics, ECS/EKS container health, RDS performance insights, and S3 bucket metrics. Requires an IAM role with read-only access to CloudWatch, ECS, EC2, and RDS.

Docker

The Docker integration at /integrations/docker monitors container lifecycle events, resource utilization per container, image vulnerability status, and network connectivity between containers. The Nova AI Agent detects Docker automatically when installed on a host running the Docker daemon.

Redis

Monitor Redis memory usage, connected clients, keyspace hit/miss ratio, replication lag, and slow log entries.

MongoDB

Track MongoDB operation counters, replication set status, connection pool utilization, and slow query patterns.

PostgreSQL

Monitor active connections, transaction throughput, index hit ratios, table bloat, and replication lag.

Databricks / Splunk

Ingest analytics and log data from Databricks workspaces and Splunk indexes. Configure forwarding rules to stream relevant events into Nova AI Ops for correlation with infrastructure metrics.

AI Model Integrations #

Nova AI Ops can monitor and manage AI model endpoints:

  • Anthropic (Claude) — Monitor API latency, token usage, and error rates
  • OpenAI (GPT) — Track request volumes, rate limit utilization, and cost per request
  • Google Gemini — Monitor multimodal request throughput and response quality
  • DeepSeek — Track inference latency and throughput across model versions

AI model integrations provide specialized dashboards showing token consumption trends, cost projections, and performance comparison across providers.

Webhook Gateway #

The Webhook Gateway allows external tools to push alerts and events into Nova AI Ops via HTTP POST. Each organization gets a unique webhook URL:

bash
curl -X POST https://app.novaaiops.com/api/webhooks/ingest \
  -H "Authorization: Bearer YOUR_WEBHOOK_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "source": "custom-monitor",
    "severity": "warning",
    "title": "Disk space low on db-primary",
    "message": "Disk usage at 92% on /data volume",
    "tags": {"host": "db-primary", "region": "us-east-1"}
  }'

Incoming webhooks are processed through the standard incident pipeline — they can trigger alerts, create incidents, and appear on the incident timeline alongside native detections.

Nova AI Copilot #

The Nova AI Copilot is a natural language interface accessible from any page in the platform. Open it by clicking the chat icon in the bottom-right corner or pressing Ctrl + /.

The Copilot can:

  • Navigate to any page in the platform
  • Create, update, and manage Sprinta tasks
  • Query metrics and incident data
  • Manage certificates (view, renew, check status)
  • Show system stats and service health
  • Explain incidents and suggest remediation steps

Copilot Commands #

Navigation

text
"Navigate to dashboard"
"Go to incident timeline"
"Open the service map"
"Show me the golden signals page"

Certificates

text
"Show my certificates"
"Which certificates expire this month?"
"Renew the certificate for api.example.com"
"What's the health score of my certificates?"

Tasks (Sprinta)

text
"Create task Fix login bug on the auth service"
"Show my open tasks"
"Move task SPRINT-42 to In Progress"
"What tasks are in review?"

Metrics and incidents

text
"What's the current CPU usage?"
"Are there any active incidents?"
"Show me the P95 latency for the API gateway"
"How many incidents did we have this week?"

Tip

The Copilot understands context. If you're viewing the Incident Timeline and ask "Tell me about this one," it will describe the most recent or selected incident.

Sprinta Task Management #

Sprinta is Nova AI's built-in task management system, designed for engineering teams that want to track work alongside their operational tooling. Access it from the sidebar or via /sprinta.

Kanban Board #

The Kanban board organizes tasks into five columns:

Column | Purpose
To Do | Tasks that are planned but not started
In Progress | Tasks currently being worked on
Needs Support | Blocked tasks awaiting input from another team member
Review | Tasks completed and awaiting peer review or QA
Completed | Done and verified tasks

Drag and drop tasks between columns. Each task card shows the title, assignee avatar, priority badge, and due date. Click a card to open the detail view with description, comments, attachments, and activity log.

Creating tasks

  • Manual — Click "New Task" and fill in the title, description, priority, and assignee
  • From incident — Action items from postmortems automatically create Sprinta tasks
  • Via AI Copilot — Say "Create task [description]" and the Copilot creates it instantly
  • Import from Jira — Bulk import existing Jira tickets via CSV or API integration

Bulk operations

Select multiple tasks using checkboxes and apply bulk actions: assign, change priority, move to column, or archive. Useful for sprint planning and backlog grooming.

Nova AI in Sprinta #

The Sprinta board includes a dedicated AI chat panel. Use it for context-aware assistance:

  • Summarize all tasks in a column
  • Suggest task breakdown for large features
  • Draft task descriptions from a short prompt
  • Identify blocked tasks and suggest unblocking steps
  • Generate sprint reports and velocity metrics

User Roles #

Nova AI Ops uses a six-role hierarchy. Each role inherits all permissions of the roles below it.

Role | Level | Capabilities
Founder | 5 | Full platform access. Manage tenants, users, billing, and all admin pages. Cannot be deleted.
Nova Admin | 4 | Platform-wide administration. Manage integrations, agents, and organization settings.
Organization Owner | 3 | Full control over their organization. Manage team members, services, and billing.
Organization Admin | 3 | Manage team members, configure alerts, and administer organization settings.
Engineer | 2 | View all dashboards, manage incidents, configure monitors, and create runbooks.
Viewer | 1 | Read-only access to dashboards, incidents, and metrics. Cannot modify configurations.
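
A level-based hierarchy like this is commonly checked by comparing numeric levels. The sketch below uses the levels from the table; the `has_access` helper is hypothetical, and real permission checks in Nova AI Ops may be more fine-grained.

```python
# Role levels as listed in the table above
ROLE_LEVELS = {
    "Founder": 5,
    "Nova Admin": 4,
    "Organization Owner": 3,
    "Organization Admin": 3,
    "Engineer": 2,
    "Viewer": 1,
}


def has_access(role: str, required_role: str) -> bool:
    """True when `role` sits at or above the level of `required_role`."""
    return ROLE_LEVELS[role] >= ROLE_LEVELS[required_role]
```

Note that Organization Owner and Organization Admin share level 3, so a pure level comparison treats them as equivalent; distinguishing them (for example for billing) would need an additional per-role rule.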

Team Invitations #

Invite team members from /team-members:

  1. Click Invite Member
  2. Enter the email address and select a role
  3. The invitee receives an email with a signup link pre-associated with your organization
  4. Once accepted, they appear in your team roster and can access pages permitted by their role

Pending invitations can be revoked. Invitations expire after 7 days if not accepted.

Two-Factor Authentication #

Enable 2FA from Settings > Security:

  1. Click Enable Two-Factor Authentication
  2. Scan the QR code with your authenticator app (Google Authenticator, Authy, 1Password)
  3. Enter the 6-digit verification code to confirm setup
  4. Save your recovery codes in a secure location

Important

Recovery codes are shown only once during setup. Store them securely — they are the only way to regain access if you lose your authenticator device. Each recovery code can be used exactly once.

Profile & Preferences #

Access your profile from /settings. Customize:

  • Display name — Shown across all platform interactions
  • Profile photo — Uploaded photos are used everywhere an avatar is displayed (sidebar, comments, team roster, incident assignments)
  • Timezone — All timestamps in the UI are converted to your selected timezone
  • Theme — Dark (default) or light mode
  • Notification preferences — Configure which events trigger email, in-app, or push notifications

Page Control #

Available to the Founder role at /page-control. This admin tool lets you show or hide navigation items for your organization. Use it to:

  • Enable new feature pages as they become relevant
  • Hide pages that aren't applicable to your team's workflow
  • Customize the sidebar navigation for your organization

Hidden pages are only removed from navigation — direct URL access still respects role-based permissions.

Billing & Plans #

Plan | Price | Includes
Free | $0/month | 3 users, 5 services, 7-day data retention, community support
Team | $29/user/month | Unlimited services, 30-day retention, Slack integration, synthetic monitoring
Pro | $59/user/month | Everything in Team + AI Runbooks, Session Replay, 90-day retention, SSO
Enterprise | Custom | Unlimited retention, dedicated support, SLA, custom integrations, on-premise option

Manage your subscription from Settings > Billing. All plans include a 14-day free trial of Pro features. Upgrades take effect immediately; downgrades take effect at the end of the current billing cycle.

API Authentication #

The Nova AI Ops API uses JWT (JSON Web Token) authentication. Tokens are issued on login and included in all subsequent requests.

Session-based (browser)

The web application uses HTTP-only secure cookies for session management. Tokens are automatically refreshed before expiration.

API key (programmatic)

For programmatic access, generate an API key from Settings > API Keys. Include it in the Authorization header:

bash
curl -H "Authorization: Bearer YOUR_API_KEY" \
  https://app.novaaiops.com/api/services

Security

Never expose API keys in client-side code, public repositories, or logs. If a key is compromised, revoke it immediately from the API Keys settings page.

Key Endpoints #

Endpoint | Method | Description
/api/health | GET | Platform health check. Returns {"status":"ok"}
/api/incidents | GET | List incidents. Supports ?status=active&severity=SEV-1 filters
/api/incidents | POST | Create a new incident manually
/api/alerts | GET | List configured alert rules
/api/alerts | POST | Create a new alert rule
/api/services | GET | List all registered services
/api/synthetic/monitors | GET | List synthetic monitors with current status
/api/synthetic/monitors | POST | Create a new synthetic monitor
/api/certificates | GET | List all tracked certificates
/api/certificates | POST | Add a new certificate to track
/api/metrics/ingest | POST | Ingest metrics from the Nova AI Agent

Response format

All API responses follow a consistent JSON envelope:

json
{
  "success": true,
  "data": [ ... ],
  "meta": {
    "total": 42,
    "page": 1,
    "per_page": 20
  }
}

Error responses include a human-readable message and machine-parseable error code:

json
{
  "success": false,
  "error": {
    "code": "RATE_LIMIT_EXCEEDED",
    "message": "Rate limit exceeded. Try again in 30 seconds."
  }
}
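
A client can handle both envelope shapes in one place. The sketch below is one possible approach, not an official client: `NovaApiError`, `unwrap`, and `page_count` are hypothetical names, and the field names are taken from the two examples above.

```python
import json
import math


class NovaApiError(Exception):
    """Raised when an envelope reports "success": false (hypothetical class)."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def unwrap(body: str):
    """Return (data, meta) from a success envelope, or raise NovaApiError."""
    envelope = json.loads(body)
    if not envelope.get("success"):
        error = envelope.get("error", {})
        raise NovaApiError(error.get("code", "UNKNOWN"), error.get("message", ""))
    return envelope.get("data"), envelope.get("meta", {})


def page_count(meta: dict) -> int:
    """Number of pages implied by the meta block: ceil(total / per_page)."""
    return math.ceil(meta["total"] / meta["per_page"])
```

With the sample `meta` block shown earlier (`total` 42, `per_page` 20), `page_count` yields 3. Branching on `error.code` rather than on the message text keeps client logic stable if the wording changes.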

For the complete API reference with request/response schemas, visit api.novaaiops.com.

Rate Limiting #

| Plan | Rate Limit | Burst |
| --- | --- | --- |
| Free | 60 requests/minute | 10 requests/second |
| Team | 300 requests/minute | 30 requests/second |
| Pro | 1,000 requests/minute | 100 requests/second |
| Enterprise | Custom | Custom |

Rate limit headers are included in every response:

http
X-RateLimit-Limit: 300
X-RateLimit-Remaining: 297
X-RateLimit-Reset: 1679529600

When you exceed the limit, the API returns 429 Too Many Requests. Retry with exponential backoff in your client code instead of resending immediately.
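
One way to implement the retry behavior is sketched below in Python. It assumes that `X-RateLimit-Reset` is a Unix timestamp in seconds, which the sample value suggests but the text does not state, and it falls back to capped exponential backoff with full jitter. The function names, the `send` callback shape, and the default delays are all illustrative choices.

```python
import random
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str]]  # (status code, response headers)


def backoff_delay(attempt: int, headers: Mapping[str, str], now: float,
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Uses X-RateLimit-Reset when it points to a moment within `cap` seconds
    (ASSUMED to be a Unix timestamp); otherwise picks a random delay in
    [0, min(cap, base * 2**attempt)].
    """
    reset = headers.get("X-RateLimit-Reset")
    if reset is not None:
        wait = float(reset) - now
        if 0 < wait <= cap:
            return wait
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_backoff(send: Callable[[], Response], max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call `send` until it stops returning 429 or attempts run out."""
    status, headers = send()
    for attempt in range(max_attempts - 1):
        if status != 429:
            break
        sleep(backoff_delay(attempt, headers, now=time.time()))
        status, headers = send()
    return status, headers
```

The header lookup here is case-sensitive, so a real client would normalize header names first. Injecting `sleep` makes the retry loop testable without waiting.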

Nova AI Agent Overview #

The Nova AI Agent is a lightweight, zero-dependency metrics collector that runs on your infrastructure. It is the single source of truth for all metrics in Nova AI Ops — every data point displayed in dashboards originates from the agent.

What it collects

  • CPU — Per-core utilization, load averages (1m, 5m, 15m), steal time, iowait
  • Memory — Used, available, cached, buffered, swap usage
  • Disk — Space utilization per mount, read/write IOPS, throughput, latency
  • Network — Bytes in/out per interface, packet errors, connection states
  • Processes — Top processes by CPU and memory, process count, zombie detection
  • Containers — Docker/Podman container metrics (CPU, memory, network per container)
  • Custom metrics — Application-level metrics via collector plugins

Architecture

The agent uses a collector registry pattern. Each metric type has a dedicated collector that runs on a configurable interval (default: 10 seconds). Collected metrics are batched and sent to the Nova AI Ops backend via HTTPS. If the backend is unreachable, metrics are buffered to disk and retried with exponential backoff.
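
The collector registry pattern can be illustrated with a short Python sketch. This is not the agent's actual code: the class, method, and field names are hypothetical, and the `memory` collector returns placeholder values instead of reading the host.

```python
from typing import Callable, Dict, List

Collector = Callable[[], Dict[str, float]]


class CollectorRegistry:
    """Maps a collector name to its function and collection interval."""

    def __init__(self) -> None:
        self._collectors: Dict[str, Collector] = {}
        self._intervals: Dict[str, float] = {}
        self._next_run: Dict[str, float] = {}

    def register(self, name: str, interval: float = 10.0):
        """Decorator that adds a collector; 10 s mirrors the documented default."""
        def decorator(func: Collector) -> Collector:
            self._collectors[name] = func
            self._intervals[name] = interval
            self._next_run[name] = 0.0  # due on the first pass
            return func
        return decorator

    def collect_due(self, now: float) -> List[dict]:
        """Run every collector whose interval has elapsed and return one batch."""
        batch = []
        for name, func in self._collectors.items():
            if now >= self._next_run[name]:
                batch.append({"collector": name, "ts": now, "values": func()})
                self._next_run[name] = now + self._intervals[name]
        return batch


registry = CollectorRegistry()


@registry.register("memory")
def memory() -> Dict[str, float]:
    # Placeholder value; a real collector would read /proc/meminfo.
    return {"used_mb": 512.0}
```

A scheduler loop would call `collect_due` periodically, accumulate the returned batches, and send them over HTTPS at the send interval. Adding a new metric type then only requires registering another function.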

Agent Installation #

Linux (systemd)

bash
# One-line install
curl -sSL https://get.novaaiops.com/agent | bash

# Set your API key
nova-agent config set api-key YOUR_API_KEY

# Start and enable on boot
sudo systemctl start nova-ai-agent
sudo systemctl enable nova-ai-agent

# Verify it's running
sudo systemctl status nova-ai-agent

Docker

bash
docker run -d \
  --name nova-ai-agent \
  --restart unless-stopped \
  --pid host \
  --net host \
  -v /proc:/host/proc:ro \
  -v /sys:/host/sys:ro \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -e NOVA_API_KEY=YOUR_API_KEY \
  -e NOVA_API_URL=https://app.novaaiops.com/api/metrics/ingest \
  novaaiops/agent:latest

Agent Configuration #

The agent reads configuration from multiple sources (highest priority first):

  1. CLI flags
  2. Environment variables (NOVA_*)
  3. Config file (/etc/nova-agent/config.yml)
  4. Built-in defaults

yaml
# /etc/nova-agent/config.yml
api_key: "your-api-key-here"
api_url: "https://app.novaaiops.com/api/metrics/ingest"
collect_interval: 10    # seconds
send_interval: 30       # seconds (batch sends)
buffer_dir: "~/.nova-agent/buffer"

collectors:
  cpu: true
  memory: true
  disk: true
  network: true
  process: true
  docker: true       # auto-detected

health_server:
  enabled: true
  port: 9100         # serves /health, /ready, /metrics, /status

resilience:
  retry_max_attempts: 5
  circuit_breaker_threshold: 5
  buffer_max_size_mb: 100
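
The priority order of the four configuration sources can be modeled with a layered lookup. The Python sketch below is an illustration of the precedence rule only, not the agent's loader. It assumes that an environment variable such as `NOVA_API_KEY` maps to the `api_key` setting by dropping the prefix and lowercasing; variables other than `NOVA_API_KEY` and `NOVA_API_URL` are hypothetical, and values are kept as strings for brevity.

```python
from collections import ChainMap
from typing import Mapping

# Built-in defaults, copied from the sample config file.
DEFAULTS = {
    "api_url": "https://app.novaaiops.com/api/metrics/ingest",
    "collect_interval": "10",
    "send_interval": "30",
}


def env_layer(environ: Mapping[str, str], prefix: str = "NOVA_") -> dict:
    """Turn NOVA_API_KEY into api_key, and so on (assumed mapping)."""
    return {
        key[len(prefix):].lower(): value
        for key, value in environ.items()
        if key.startswith(prefix)
    }


def resolve(cli: Mapping[str, str], environ: Mapping[str, str],
            file_cfg: Mapping[str, str]) -> ChainMap:
    """Earlier layers win: CLI flags, environment, config file, defaults."""
    return ChainMap(dict(cli), env_layer(environ), dict(file_cfg), DEFAULTS)
```

`ChainMap` searches its maps in order, so a setting given as a CLI flag shadows the same setting in the environment, which in turn shadows the config file.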

Health endpoints

When the health server is enabled (default port 9100), the agent exposes:

| Endpoint | Purpose |
| --- | --- |
| /health | Liveness check: returns 200 if the agent process is running |
| /ready | Readiness check: returns 200 if collectors are initialized and backend is reachable |
| /metrics | Prometheus-format metrics about the agent itself (send success/failure counts, buffer size, etc.) |
| /status | JSON status report with uptime, collector states, and connection status |

Resilience

The agent includes a circuit breaker that prevents flooding the backend during outages, exponential backoff with jitter for retries, and a disk-backed metric buffer (default: 100 MB) that persists unsent metrics across agent restarts.
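
To make the circuit breaker idea concrete, here is a minimal Python sketch of a consecutive-failure breaker. It is not the agent's implementation: the class and method names are hypothetical, the default threshold mirrors `circuit_breaker_threshold: 5` from the sample config, and the cooldown value is an assumption because the documentation does not state one.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after `threshold` consecutive failures, then permits a trial
    send once `cooldown` seconds have passed."""

    def __init__(self, threshold: int = 5, cooldown: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold  # mirrors circuit_breaker_threshold
        self.cooldown = cooldown    # assumed value, not documented
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True if a send may be attempted now."""
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        # Open: only allow a trial once the cooldown has elapsed.
        return self._clock() - self._opened_at >= self.cooldown

    def record_success(self) -> None:
        """Close the breaker and reset the failure count."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        """Count a failure and open (or re-open) the breaker at the threshold."""
        self._failures += 1
        if self._failures >= self.threshold:
            self._opened_at = self._clock()
```

While `allow()` returns False, a sender would write metrics to the disk buffer instead of contacting the backend, which is how the breaker avoids flooding a service that is already struggling.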
