Component C132 – NATS

By Raj Marni. March 27, 2025. Revised. Version: 0.0.09

1. Overview

NATS (C132) is a lightweight, high-performance publish/subscribe messaging system that k8or Orbit uses for asynchronous, decoupled communication among microservices and components (Portal, Manifestor, SyncMaster, etc.) and cluster-plane agents. With subject-based routing, a component publishes events on a named “subject” and any component subscribed to that subject receives them, which supports an event-driven architecture that scales with minimal coupling.

[Figure: Event-Driven Architecture Diagram]

2. Internal Modules & Responsibilities

2.1 NATS Server/Broker

  • Core Routing Engine:

    • Maintains subject-based subscriptions and delivers each published message to all matching subscribers.

    • Implements the NATS wire protocol over TCP, optionally secured with TLS, for client connections.

  • Connection Manager:

    • Tracks active client sessions from orbit-plane services or cluster-plane agents.

    • Manages heartbeats (PING/PONG keep-alives) and disconnects clients that go silent or fail to authenticate.

  • Message Buffer & Delivery:

    • Holds messages only briefly in memory while delivering them to connected subscribers. Core NATS is best-effort (at-most-once), so a subscriber that is offline when a message is published does not receive it; guaranteed delivery requires a persistence mode such as NATS JetStream.
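The subject matching done by the routing engine can be illustrated with a minimal sketch. It is not the server's implementation; it only models the standard NATS wildcard rules, where subjects are dot-separated tokens, “*” matches exactly one token, and “>” matches one or more trailing tokens. The function name subject_matches is hypothetical.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a published subject matches a subscription pattern."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # ">" is only valid as the last token and needs at least
            # one remaining subject token to match.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False  # pattern is longer than the subject
        if p != "*" and p != s_tokens[i]:
            return False  # literal token mismatch
    # Without ">", both sides must have the same number of tokens.
    return len(p_tokens) == len(s_tokens)
```

Under these rules “deploy.*” matches “deploy.prod” but not “deploy.prod.eu”, while “deploy.>” matches both.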

2.2 Security & Policy Layer

  • Authentication:

    • May rely on token-based or username/password credentials for each client.

    • Could integrate with orbit-plane’s IAM or AccessPoint for dynamic credential provisioning.

  • Authorization:

    • Subject-level authorization ensures only specific microservices can publish or subscribe to certain subjects (e.g., “deploy.*”, “transfer.completed”).

    • This layer might check an external config or policy store to grant or deny pub/sub actions.

  • Encryption in Transit:

    • Typically uses TLS so that messages in transit remain confidential and tamper-evident.

2.3 Orbit-Plane and Cluster-Plane Integrations

  • Orbit-Plane Services:

    • Each microservice/component (Portal Deploy Logic, Manifestor, etc.) includes a NATS client.

    • Publishes events (e.g., “image.uploaded”) or subscribes to subjects (e.g., “deployment.status”) to react in real time.

  • Cluster-Plane Agents:

    • Agents or sidecars in the K3s clusters can publish operational events (“node.scaled”, “pod.crashloop”) or logs, which orbit-plane subscribers can act upon.

    • They may also subscribe to commands or configuration updates from orbit-plane components.


3. Data Flow & Process IDs

Below is a generic example of how messages might flow:

  1. Publishing

    • A microservice (Portal Transfer Logic) finishes transferring an image. It publishes a message on subject “transfer.completed” with relevant metadata.

    • This call might be labeled with a PID like c8bmsXX-c132bus-e20, indicating the message was sent from the Portal back-end (c8bmsXX) to NATS (C132).

  2. Routing & Delivery

    • The NATS server finds that “transfer.completed” has multiple subscribers (for example, a logging service, a metrics aggregator, and a Slack integration).

    • It delivers the message to each subscribed client. The message is held only in a short-lived in-memory buffer during delivery, or is persisted if NATS JetStream is in use.

  3. Consumption

    • Each subscriber processes the event in its own way (logging it, updating the UI, etc.).

    • If an error occurs, the subscriber can handle it locally or publish a new message (like “transfer.error”) that other components might watch.

  4. Cluster-Plane Interaction

    • If a cluster-plane agent wants to signal a new node addition, it publishes “cluster.dev.nodeAdded”. The orbit-plane’s management microservice, subscribed to “cluster.*.nodeAdded”, receives it and updates the UI or triggers a new environment config.
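The four steps above can be sketched with a toy in-memory bus. This is an illustration only: it has no network, authentication, or persistence, and it is not the NATS client API. The names InMemoryBus and subject_matches, the payload fields, and the values “app:1.0” and “node-7” are assumptions made for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[str, dict], None]


def subject_matches(pattern: str, subject: str) -> bool:
    """NATS-style match: "*" is one token, ">" is one or more trailing tokens."""
    p_tokens, s_tokens = pattern.split("."), subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens) or (p != "*" and p != s_tokens[i]):
            return False
    return len(p_tokens) == len(s_tokens)


class InMemoryBus:
    """Toy stand-in for the NATS server (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._subs: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, pattern: str, handler: Handler) -> None:
        self._subs[pattern].append(handler)

    def publish(self, subject: str, payload: dict) -> int:
        """Deliver to every matching handler; return the delivery count."""
        delivered = 0
        for pattern, handlers in self._subs.items():
            if subject_matches(pattern, subject):
                for handler in handlers:
                    handler(subject, payload)
                    delivered += 1
        return delivered


bus = InMemoryBus()
logged: List[str] = []
metrics: List[str] = []
nodes: List[str] = []

# Two independent subscribers on the same subject (step 2: fan-out).
bus.subscribe("transfer.completed", lambda subj, msg: logged.append(msg["image"]))
bus.subscribe("transfer.completed", lambda subj, msg: metrics.append(subj))
# Wildcard subscriber for node additions in any cluster (step 4).
bus.subscribe("cluster.*.nodeAdded", lambda subj, msg: nodes.append(msg["node"]))

sent_transfer = bus.publish("transfer.completed", {"image": "app:1.0"})
sent_node = bus.publish("cluster.dev.nodeAdded", {"node": "node-7"})
# A misnamed subject is accepted but reaches nobody.
sent_typo = bus.publish("transfer.complete", {"image": "app:1.0"})
```

The publisher never references its subscribers; adding a third consumer of “transfer.completed” would require only another subscribe call.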


4. Error Handling & Observability

  1. Client Connection Failures

    • If a microservice loses connectivity or fails to authenticate, NATS logs the event and the microservice might attempt reconnection.

    • The orbit-plane monitoring stack can watch for high disconnection rates or failed auth attempts.

  2. Subject Overlaps or Collisions

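    • With subject-based routing, a well-defined naming convention helps avoid confusion and collisions, for example environment-scoped prefixes such as “deploy.prod.” and “deploy.dev.”.

    • If a publisher misnames a subject, no subscriber receives the message and no error is raised. If a subject falls outside a client's permissions, the server rejects the publish or subscribe request with a permissions violation.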

  3. Performance Monitoring

    • NATS exposes metrics (message rates, latencies, queue sizes) that can feed into the orbit-plane’s observability stack (Prometheus).

    • High message volume or slow consumers can cause messages to back up in server and client buffers; core NATS may drop messages for, or disconnect, a consumer that falls too far behind. Sustained load might require additional NATS servers or a cluster for scaling.

  4. Message Persistence (as required by the use case)

    • If ephemeral messages suffice, core NATS (without persistence) is used.

    • If guaranteed delivery is needed, JetStream or another persistence layer can store messages until consumed or for replay.
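The difference between ephemeral and persisted delivery can be shown with a small sketch. It is a conceptual model only and does not use or imitate the JetStream API; a real persistence layer would also track acknowledgements and retention. The class names EphemeralSubject and PersistedSubject are hypothetical.

```python
from typing import Callable, List

Handler = Callable[[str], None]


class EphemeralSubject:
    """At-most-once: a message reaches only handlers attached right now."""

    def __init__(self) -> None:
        self._handlers: List[Handler] = []

    def subscribe(self, handler: Handler) -> None:
        self._handlers.append(handler)

    def publish(self, msg: str) -> None:
        for handler in self._handlers:
            handler(msg)


class PersistedSubject:
    """Append-only log: a new subscriber first replays what it missed."""

    def __init__(self) -> None:
        self._log: List[str] = []
        self._handlers: List[Handler] = []

    def subscribe(self, handler: Handler) -> None:
        for msg in self._log:  # replay stored messages in publish order
            handler(msg)
        self._handlers.append(handler)

    def publish(self, msg: str) -> None:
        self._log.append(msg)  # store before delivering
        for handler in self._handlers:
            handler(msg)


def late_subscriber_view(subject) -> List[str]:
    """Publish "a", subscribe late, publish "b"; return what was received."""
    received: List[str] = []
    subject.publish("a")
    subject.subscribe(received.append)
    subject.publish("b")
    return received


ephemeral_view = late_subscriber_view(EphemeralSubject())
persisted_view = late_subscriber_view(PersistedSubject())
```

Here the late subscriber on the ephemeral subject sees only “b”, while the one on the persisted subject sees “a” and then “b”.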


5. Security & Policy Enforcement

  • Subject-Level ACL:

    • Administrators define which microservices can publish or subscribe to each subject. For example, only the Portal back-end can publish “deploy.request” while certain cluster-plane agents subscribe to it.

  • Integration with AccessPoint:

    • If used, AccessPoint might handle or distribute short-lived credentials for NATS connections, ensuring that only authorized microservices get valid tokens to connect.

  • Auditing:

    • Audit logs can record who published which message, from which IP address, and on which subject. These records can be stored for compliance or forensic analysis.
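A subject-level ACL check combined with an audit record might look like the following sketch. The client names (“portal-backend”, “cluster-agent”), the ACL contents, the IP address, and the record fields are all assumptions for illustration; in practice the NATS server's own authorization configuration would enforce these rules. The check is also simplified: it matches the requested subject against the allowed patterns and does not handle a client subscribing with its own wildcard.

```python
import json
from dataclasses import dataclass
from typing import Dict, Tuple


def subject_matches(pattern: str, subject: str) -> bool:
    """NATS-style match: "*" is one token, ">" is one or more trailing tokens."""
    p_tokens, s_tokens = pattern.split("."), subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens) or (p != "*" and p != s_tokens[i]):
            return False
    return len(p_tokens) == len(s_tokens)


@dataclass(frozen=True)
class Permissions:
    publish: Tuple[str, ...] = ()
    subscribe: Tuple[str, ...] = ()


# Hypothetical policy: only the Portal back-end may publish deploy requests,
# and cluster-plane agents may subscribe to them.
ACL: Dict[str, Permissions] = {
    "portal-backend": Permissions(publish=("deploy.request",),
                                  subscribe=("deployment.status",)),
    "cluster-agent": Permissions(publish=("cluster.*.nodeAdded",),
                                 subscribe=("deploy.request",)),
}


def is_allowed(client: str, action: str, subject: str) -> bool:
    """Deny by default; allow only if a pattern for this action matches."""
    if action not in ("publish", "subscribe"):
        raise ValueError(f"unknown action: {action}")
    perms = ACL.get(client)
    if perms is None:
        return False  # unknown clients get no access
    patterns = perms.publish if action == "publish" else perms.subscribe
    return any(subject_matches(p, subject) for p in patterns)


def audit_record(client: str, ip: str, action: str, subject: str) -> str:
    """Return one JSON line describing the attempt and its outcome."""
    return json.dumps({
        "client": client,
        "ip": ip,
        "action": action,
        "subject": subject,
        "allowed": is_allowed(client, action, subject),
    }, sort_keys=True)


denied = audit_record("cluster-agent", "10.0.0.12", "publish", "deploy.request")
```

Recording denied attempts alongside allowed ones gives the monitoring stack a direct signal for misconfigured or misbehaving clients.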


6. Outcomes & Benefits

  1. Decoupled Event-Driven Architecture

    • Encourages each component to act on events it cares about, without direct coupling or synchronous calls.

  2. Scalability & Resilience

    • As new services come online, they simply subscribe to existing subjects or create new ones. The messaging layer can scale horizontally if needed.

  3. Faster Development

    • Teams can add features (like logging or analytics) that just subscribe to relevant events, with minimal changes to the original publisher code.

  4. Real-Time Updates

    • The entire orbit-plane or cluster-plane can respond in near real-time to events, enabling dynamic scaling, immediate logging, or user notifications.
