Websocket Gateway Cluster SDD

1. Document Control

  • Document Version: 1.0
  • Date: August 18, 2025
  • Author(s): TikTuzki
  • Reviewed by: [Reviewer Name]
  • Approved by: [Approver Name]
  • Status: Draft

2. Introduction

2.1 Purpose

This document specifies a high-performance WebSocket service that streams real-time events to end users of a Centralized Exchange (CEX) system, built on a robust, scalable, low-latency architecture.

2.2 Scope

This System Design Document (SDD) covers the architecture, design, and implementation of a high-performance WebSocket service for streaming real-time events to end users in a Centralized Exchange (CEX) system.

In Scope:

  • Integration with external event sources (Kafka, NATS, Redis Streams) to deliver event streams (depth data, trading-pair real-time data, trading-pair kline data, and BBO (Best Bid and Offer) data) to clients via WebSocket.
  • Support for massive concurrency (100K+ connections), low-latency delivery (<100ms), horizontal scalability, and optional multi-region deployment.
  • Authentication (API-KEY/ListenKey), routing logic, transformation/mapping layers, fault tolerance, and observability (metrics, logging, dashboards).
  • Management layer (admin dashboard for monitoring/configuration), deployment strategies (Kubernetes orchestration), and integration with external systems (authentication service).
  • Functional and non-functional requirements, including security, reliability, and monitoring.
  • Detailed use cases: handshake/authentication, topic subscription/unsubscription, trade action via websocket, event streaming, backpressure handling, private streams, and resume support on reconnect.

2.3 References

HashKey Websocket API Documentation
Binance Websocket API Documentation

3. System Overview

3.1 System Context

  • System Context diagram

3.2 Objectives & Success Criteria

  • Concurrent Connections: >= 100,000 connections.
  • Latency: P95 <= 100ms.
  • Throughput: >= 50,000 msg/s per gateway node.
  • Availability: 99.99% uptime.
  • Scalability: Scale horizontally to add capacity with zero downtime; support sharing a domain across multiple clusters.
  • Monitoring Coverage: All critical metrics (connections, latency, errors) are monitored with alerting configured.
  • Compliance: Pass all security and compliance audits (e.g., GDPR, PCI DSS) for data protection.

4. Architectural Design

4.1 Technology Stack

| Layer | Technology | Purpose |
| --- | --- | --- |
| Frontend (Client-Facing) | HAProxy | TCP/HTTP load balancing, connection distribution, health checks. |
| Core WebSocket Cluster | Rust (tokio, tungstenite, axum) | High-performance async runtime for WebSocket servers. |
| | async-rate-limiter | Rate limiting for abuse prevention. |
| Dynamic Config Storage | Redis | Dynamic config, peer join/leave detection, negotiation communication. |
| Replication / Fanout Messaging | NATS | Cross-region replication, replay capability. |
| Admin & Management | Next.js (UI) | Web-based admin dashboard. |

4.2 Architecture Overview

  • HAProxy Cluster (Load Balancers)

    • Provides TCP load balancing across the WebSocket server cluster.
    • Distributes client connections evenly across available nodes.
    • Can perform health checks and automatic failover if a WebSocket node goes down.
  • WebSocket Cluster

    • The heart of the system — handles client WebSocket connections, subscription management, and event delivery.

    • Key responsibilities:

      1. Consume from upstream event sources (Kafka, NATS, ...).
      2. Publish/subscribe internally via NATS replication layer to ensure all nodes are in sync.
      3. Route events to the correct connected clients based on subscription rules.
      4. Apply mapping logic (format transformation, filtering).
      5. Scale horizontally — each node is stateless with respect to client routing.
  • Admin Dashboard

    • Provides an operational view of the WebSocket cluster.

    • Functions:

      • Monitor connection counts, per-stream latency, event throughput.
      • Configure routing/mapping rules dynamically.
      • Control replication and scaling.
    • Connects directly with the WebSocket cluster for both monitoring and configuration changes.

4.3 Subsystems & Components

Break down into major components/modules:

  • Name
  • Responsibility
  • Interfaces
  • Dependencies

4.4 Data Design

Summary Table: Data Flow Elements

| Component | Description |
| --- | --- |
| Connection | WebSocket setup with endpoint (with listenKey when connecting to the user data stream) |
| Ping-Pong | Keep-alive frames: the client sends a ping to the server every 10s |
| Subscription | JSON subscribe/unsubscribe/list commands for streams |
| Market Data Streams | Trade, kline, ticker, depth, liquidation, mark price, etc. |
| User Data | Separate connection for account/order/position updates via listenKey |
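
As an illustration, the subscription commands in the table above might be framed as follows (Binance-style message shape; the exact method and field names are assumptions, not a finalized protocol for this gateway):

```json
{ "method": "SUBSCRIBE", "params": ["btcusdt@trade", "btcusdt@depth"], "id": 1 }
{ "method": "UNSUBSCRIBE", "params": ["btcusdt@depth"], "id": 2 }
{ "method": "LIST_SUBSCRIPTIONS", "id": 3 }
```

Each command carries a client-chosen `id` that the server echoes back in its acknowledgement, so responses can be correlated on a multiplexed connection.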

Connection Flow Diagram

  • Connecting
  • Connecting with listenKey

Ping-Pong Flow Diagram

Market Data Stream Flow Diagram

Apply route & mapping rule

Subscription Flow Diagram

User Data Private Stream Flow Diagram

Config update event handling

When a WS Node consumes a config_update event from the Config Store, it hot-reloads only the affected components (data sources, topics, routes, mappings) without interrupting live traffic.
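
The hot-reload described above can be driven by per-component versions: on a config_update event, the node diffs the incoming versions against the ones it is running and reloads only what changed. A minimal sketch, assuming each component (datasource, topics, routes, mappings) carries a version number as in the config samples below; the function name is illustrative:

```rust
use std::collections::HashMap;

/// Compare running component versions against an incoming config_update and
/// return the names of components that must be hot-reloaded (sorted for
/// deterministic processing). Components absent from `incoming` are untouched.
fn diff_components(
    current: &HashMap<String, u64>,
    incoming: &HashMap<String, u64>,
) -> Vec<String> {
    let mut changed: Vec<String> = incoming
        .iter()
        .filter(|(name, ver)| current.get(*name) != Some(*ver))
        .map(|(name, _)| name.clone())
        .collect();
    changed.sort();
    changed
}
```

Only the returned components are torn down and rebuilt; live connections and unaffected routes keep serving traffic throughout.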

5. Detailed Design

5.1 UML: WebSocket Node Internal Structure

5.2 Routing, Mapping

Sample of routing and mapping configuration:

{
  "datasource": [
    {
      "version": 1,
      "type": "kafka",
      "name": "default_kafka",
      "properties": {
        "bootstrap.servers": "kafka:9092",
        "group.id": "websocket_gateway_group",
        "auto.offset.reset": "earliest"
      }
    },
    {
      "version": 1,
      "type": "nats",
      "name": "default_nats",
      "properties": {
        "servers": ["nats://nats:4222"],
        "max_reconnect_attempts": 10,
        "reconnect_time_wait": 2
      }
    }
  ],
  "topics": [
    { "version": 1, "topic": "btcusdt@trade", "rate": 20, "burst": 60, "auth": "public" },
    { "version": 1, "topic": "btcusdt@depth", "auth": "public" },
    { "version": 1, "topic": "balance", "rate": 20, "burst": 60, "auth": "private" }
  ],
  "routes": [
    {
      "id": "uuid4",
      "version": 1,
      "peer_id": "6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b",
      "data_source": "default_kafka",
      "mappings": [
        { "mapper": "trade_v1", "fanout_topic": "fanout.trade.btcusdt", "ws_topic": "btcusdt@trade" },
        { "mapper": "depth_v2", "fanout_topic": "fanout.depth.btcusdt", "ws_topic": "btcusdt@depth" },
        { "mapper": "balance", "fanout_topic": "fanout.balance", "ws_topic": "balance" }
      ]
    }
  ]
}
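
At runtime, a node turns the "mappings" entries above into a lookup from the client-facing ws_topic to the internal fanout subject. A minimal sketch of that resolution (struct and function names here are illustrative assumptions, not the gateway's actual types):

```rust
/// One entry from the "mappings" array of a route.
#[allow(dead_code)]
struct Mapping {
    mapper: &'static str,
    fanout_topic: &'static str,
    ws_topic: &'static str,
}

/// Resolve a client subscription topic to the internal fanout subject it is
/// fed from; returns None for topics the current route config does not serve.
fn resolve_fanout<'a>(mappings: &'a [Mapping], ws_topic: &str) -> Option<&'a str> {
    mappings
        .iter()
        .find(|m| m.ws_topic == ws_topic)
        .map(|m| m.fanout_topic)
}
```

In production this would be a hash map rebuilt on config hot-reload rather than a linear scan, but the mapping direction is the same.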

5.2.1 Negotiation algorithm

Update single config

When a peer leaves (re-balance config algorithm):
  1. PeerA detects that PeerC has left the cluster.
  2. PeerA attempts to acquire the lock "peer-left:peer_id" on a Redis key:
     • On success, publish a "peer_left:peer_id" event to the Redis stream "peer_update_queue".
     • Otherwise, abort (another peer already holds the lock).
  3. PeerA or PeerB consumes the "peer_update_queue" event, rebalances the data sources, updates the routing config via the "Update single config" logic, and removes "peer-left:peer_id".

When a peer joins (re-balance config algorithm):
  1. PeerC comes online and publishes a "peer_join:peer_id" event to the Redis stream "peer_update_queue".
  2. PeerA or PeerB consumes the "peer_update_queue" event, rebalances the data sources, and updates the routing config via the "Update single config" logic.
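
The rebalance step itself must be deterministic so that whichever peer performs it produces the same assignment. A minimal sketch using modulo assignment over sorted peer ids; this simple scheme is an assumption for illustration, not the final algorithm:

```rust
/// Deterministically assign data-source partitions to the current set of peers.
/// Sorting the peer ids first guarantees every node that runs the rebalance
/// computes the identical (partition, peer) assignment.
fn rebalance(mut peers: Vec<String>, partitions: &[String]) -> Vec<(String, String)> {
    peers.sort(); // identical ordering on every node
    partitions
        .iter()
        .enumerate()
        .map(|(i, p)| (p.clone(), peers[i % peers.len()].clone()))
        .collect()
}
```

A consistent-hashing variant would reduce reassignment churn when a single peer joins or leaves, at the cost of a slightly less even spread.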

5.3 Listen key

Sample structure:

{
  "userId": "12345",
  "iat": 1713543000,   // issued at
  "exp": 1713546600,   // expires in 60 min
  "scope": "user:stream"
}
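
On each private-stream connection the gateway would validate these claims before accepting the socket. A minimal sketch of that check, using the field names from the sample above (the validation policy shown is an assumption):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Claims carried by a listenKey, mirroring the sample structure above.
#[allow(dead_code)]
struct ListenKeyClaims {
    user_id: String,
    iat: u64,      // issued at (unix seconds)
    exp: u64,      // expiry (unix seconds)
    scope: String, // e.g. "user:stream"
}

/// A listenKey is accepted only if the scope matches and `now` falls inside
/// the [iat, exp) validity window.
fn is_valid(claims: &ListenKeyClaims, now: u64) -> bool {
    claims.scope == "user:stream" && claims.iat <= now && now < claims.exp
}

/// Current unix time in seconds.
fn now_unix() -> u64 {
    SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs()
}
```

Revoking a listenKey then amounts to deleting its server-side record, and "reset validity" extends `exp` without issuing a new key.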

Create then connect with listenKey

Revoke listenKey

Reset validity listenKey

6. Integration & Interfaces

Config data source.

Datasource Config

Config websocket topics.

Topic Config

Config a routing & mapping per data source.

Route Config

7. Security Design

Updating...

8. Performance & Scalability

See Section 3.2, Objectives & Success Criteria.

9. Reliability & Availability

See Section 3.2, Objectives & Success Criteria.

9.1 Rate limiter

9.1.1 Rate Limit

  • Defines the average speed at which tokens are added into the bucket.
  • Example: 100 tokens/sec → the system allows 100 actions per second on average.

9.1.2 Burst (Bucket Capacity)

  • Defines the maximum number of tokens the bucket can hold.
  • Allows short bursts of traffic above the average rate — as long as the bucket has accumulated tokens.

Example:

  • Rate = 100 tokens/sec
  • Burst = 200
  • A client can send 200 messages instantly (if tokens have accumulated), but is then throttled to ~100/sec.
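
The example above maps directly onto a token bucket. A minimal sketch (the timestamp is passed in explicitly rather than read from a clock, which keeps the limiter testable; struct and method names are illustrative):

```rust
/// Token bucket: `rate` tokens/sec refill, `burst` bucket capacity.
struct TokenBucket {
    capacity: f64,    // burst
    tokens: f64,      // currently available tokens
    rate: f64,        // refill rate, tokens per second
    last_refill: f64, // monotonic timestamp (seconds) of last refill
}

impl TokenBucket {
    fn new(rate: f64, burst: f64, now: f64) -> Self {
        Self { capacity: burst, tokens: burst, rate, last_refill: now }
    }

    /// Try to consume one token at time `now`; returns true if the action
    /// is allowed, false if the client is over its limit.
    fn allow(&mut self, now: f64) -> bool {
        // Refill proportionally to elapsed time, capped at bucket capacity.
        let elapsed = (now - self.last_refill).max(0.0);
        self.tokens = (self.tokens + elapsed * self.rate).min(self.capacity);
        self.last_refill = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

With rate = 100 and burst = 200, a full bucket admits 200 messages at once, after which admissions are paced at roughly one per 10ms.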

9.1.3 Outbound Rate Limiting (server → client)

  • Each connection has a send queue.
  • Apply per-connection limiter before pushing into queue.
  • If over limit → drop message or coalesce (e.g., compress order book deltas).
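
The drop branch above can be implemented with a bounded per-connection queue and a non-blocking send, so a slow client never stalls the fanout path. A sketch using a bounded channel (the helper name is an assumption; in the real gateway this would sit behind a tokio channel rather than std's):

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

/// Try to enqueue a message on a connection's bounded send queue.
/// Returns false (and drops the message) when the queue is full or the
/// client has disconnected; coalescing of order-book deltas would happen
/// before this point.
fn push_or_drop<T>(tx: &SyncSender<T>, msg: T) -> bool {
    match tx.try_send(msg) {
        Ok(()) => true,
        Err(TrySendError::Full(_)) => false,         // backpressure: drop
        Err(TrySendError::Disconnected(_)) => false, // client gone
    }
}
```

Critical frames (e.g., order fills on a private stream) would bypass this drop policy or use a larger dedicated queue.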

10. Deployment & Infrastructure

Target deployment architecture:

Future migration:

11. Non-Functional Requirements

Updating...

12. Risks & Mitigations

12.1 Infrastructure Risks

  • Redis Single Point of Failure (SPOF)

    • Risk: Redis is central for dynamic configuration, listenKey lifecycle, and peer coordination. Outage or partition leads to stalled config updates and failed authentication.

    • Mitigation:

      • Deploy Redis in Cluster mode or with Sentinel for automatic failover.
      • Enable persistence (AOF) for recovery.
      • Use Redis connection pooling and retry logic in clients.
  • NATS Partitioning / Message Loss

    • Risk: Network partitions or broker node failure may cause missed replication messages or inconsistent state across WebSocket nodes.

    • Mitigation:

      • Deploy NATS in clustered mode with JetStream (persistence + replay).
      • Enable ack-based fanout for critical messages.
      • Monitor replication lag and set alerts.
  • HAProxy Overload

    • Risk: Load balancers may become a bottleneck under sudden traffic spikes.

    • Mitigation:

      • Use multiple HAProxy instances with DNS/GSLB distribution.
      • Enable autoscaling on CPU/memory thresholds.
      • Apply rate limiting/DDoS protection at the LB layer.

12.2 Application-Level Risks

  • ListenKey Abuse

    • Risk: Attackers may generate or brute-force listenKeys, leading to unauthorized access.

    • Mitigation:

      • Generate listenKeys using cryptographically strong random values.
      • Enforce short TTL (e.g., 60m) with refresh required.
      • Bind listenKeys to userId + IP/device fingerprint.
      • Rate-limit API endpoints (POST /listenKey, PUT /listenKey).
  • Config Corruption / Version Skew

    • Risk: Incorrect or conflicting config updates can cause routing errors or dropped messages.

    • Mitigation:

      • Apply schema + version validation before accepting updates.
      • Use transactional updates with Redis WATCH/MULTI/EXEC.
      • Maintain config audit logs for rollback.
  • Unbounded Fanout / Backpressure

    • Risk: Sudden spikes in subscription counts or large depth updates can overload WS nodes.

    • Mitigation:

      • Apply per-topic and per-connection rate limits.
      • Use message coalescing/delta compression for order books.
      • Drop non-critical updates under backpressure.

12.3 Security Risks

  • Unauthorized Access

    • Risk: Clients may bypass authentication or hijack valid sessions.

    • Mitigation:

      • Enforce TLS everywhere.
      • Validate all requests with API key / listenKey authentication.
      • Implement WebSocket close codes for unauthorized access attempts.
  • DDoS Attacks

    • Risk: Massive connection floods or malformed payloads overwhelm servers.

    • Mitigation:

      • Apply connection rate limiting at HAProxy + WebSocket layer.
      • Implement per-IP throttling + IP reputation blocking.
      • Deploy WAF/CDN (e.g., Cloudflare) for edge mitigation.

13. Appendix

Updating...