Go-based network automation with YANG models, gRPC, Ansible, Terraform, and Kubernetes integration. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
206 lines
11 KiB
Markdown
206 lines
11 KiB
Markdown
# CLAUDE.md — Kedge
|
|
|
|
> See the full architecture specification in the initial project prompt. This file captures conventions and development workflow.
|
|
|
|
## Quick Reference
|
|
|
|
- **Language**: Go (CNI plugin + DaemonSet), Python (YANG compiler)
|
|
- **Module**: `github.com/guildhouse-co/kedge`
|
|
- **Binaries**: `cmd/kedge-cni/` (CNI plugin), `cmd/kedge-daemon/` (DaemonSet)
|
|
- **Build**: `make container-build` (production), `make dev && make build` (development)
|
|
|
|
## Conventions
|
|
|
|
- Go: Standard project layout. `cmd/` for entrypoints, `internal/` for non-exported.
|
|
- Explicit error returns, no panics.
|
|
- CNI plugin follows CNI spec exactly.
|
|
- Python: Type hints required. `pyang` for YANG validation.
|
|
- YAML: 2-space indent.
|
|
- Secrets: Never in repo. `.gitignore` excludes `*.key`, `*.pem`, `vault_pass*`.
|
|
- Testing: Go `_test.go` files. Python `pytest`. Ansible `molecule/`.
|
|
- Container engine: podman (dev), k3s (prod).
|
|
|
|
---
|
|
|
|
## Development Workflow
|
|
|
|
### Container-Based Development (podman-compose)
|
|
|
|
```bash
|
|
# Build the dev container
|
|
make dev-build
|
|
|
|
# Start dev environment
|
|
make dev
|
|
|
|
# Open a shell in the dev container
|
|
make dev-exec
|
|
|
|
# Build binaries (in container)
|
|
make build
|
|
|
|
# Run tests (in container)
|
|
make test
|
|
|
|
# Run linter (in container)
|
|
make lint
|
|
|
|
# Generate proto code (in container)
|
|
make proto-gen
|
|
|
|
# Resolve Go dependencies (in container)
|
|
make mod-tidy
|
|
|
|
# Tear down dev environment
|
|
make dev-down
|
|
```
|
|
|
|
### Production Container Images
|
|
|
|
```bash
|
|
# Build production images (CNI + DaemonSet)
|
|
make container-build
|
|
```
|
|
|
|
### YANG Compiler
|
|
|
|
```bash
|
|
# Validate YANG models
|
|
make yang-validate
|
|
|
|
# Run Python compiler tests
|
|
make python-test
|
|
```
|
|
|
|
Important: The dev environment uses podman-compose. Go, protoc, buf, golangci-lint, and wireguard-tools are all available inside the dev container. No local Go installation is required.
|
|
|
|
---
|
|
|
|
## Future: Dynamic Posture Framework
|
|
|
|
The Guildhouse ecosystem supports a dynamic governance posture system with five levels (5 = normal operations, 1 = lockdown). Each level has associated accords that define allowed capabilities, modes, and operational scope. Kedge enforces posture at the network layer.
|
|
|
|
**Kedge's role is network enforcement and mesh management during posture transitions.**
|
|
|
|
### DaemonSet Posture Behavior
|
|
|
|
The Kedge DaemonSet reads the current posture level from the same `posture_level` BPF map that the shell eBPF programs use (or receives it via notification from the Bascule governance daemon). On posture change, the DaemonSet adjusts:
|
|
|
|
**Overlay mode (WireGuard mesh):**
|
|
|
|
| Level | Mesh Behavior |
|
|
|-------|---------------|
|
|
| 5 | Full mesh. All tunnels active. Standard keepalive. |
|
|
| 4 | Full mesh. Increased health check frequency (detect degradation faster). |
|
|
| 3 | Full mesh maintained, but new Shellstream sessions require fresh SAT re-verification at handshake. DaemonSet rejects cached SAT validations. |
|
|
| 2 | Mesh maintained for management traffic only. Non-critical Shellstream sessions terminated. DaemonSet sends DRAINING signal to active non-essential sessions. |
|
|
| 1 | All remote overlay tunnels torn down. Cloud anchor disconnected. Only local loopback Shellstream permitted. The appliance is network-isolated except for physical console. |
|
|
|
|
**Underlay mode (physical infrastructure programming):**
|
|
|
|
| Level | Underlay Behavior |
|
|
|-------|-------------------|
|
|
| 5 | Normal operations. All authorized tiers can trigger infrastructure mutations via Bascule SDK dispatch. NetworkMutationArtifacts notarized per accord. |
|
|
| 4 | Normal operations. YANG drift detection frequency increased. |
|
|
| 3 | Underlay mutations restricted to master + site_owner tiers. Lower-tier SATs denied underlay mode at Shellstream handshake. Mutation proposals still accepted from all tiers but execution restricted. |
|
|
| 2 | Underlay mutations restricted to site_owner only. YANG compiler frozen — no new compilations. Only pre-approved emergency changes permitted. |
|
|
| 1 | All underlay operations frozen. No infrastructure mutations. Device configs frozen in place. VLAN interfaces maintained for monitoring but no changes dispatched. |
|
|
|
|
**Shellstream handshake behavior:**
|
|
|
|
The DaemonSet's Shellstream handshake termination adjusts per level:
|
|
|
|
```
|
|
Level 5: Standard 3-way handshake. SAT validated from cache. Full capability grant.
|
|
|
|
Level 4: Standard handshake. SAT validated from cache. Full capability grant.
|
|
Additional telemetry: log all new sessions (not just sampled).
|
|
|
|
Level 3: Strict handshake. SAT re-verified against SPIRE/Vigil (no cache).
|
|
Capability grant narrowed per posture accord (PROPOSE revoked for
|
|
lower tiers). Handshake latency increases (~50-100ms for re-verification)
|
|
but security posture tightens.
|
|
|
|
Level 2: Restricted handshake. SAT re-verified. Only critical session types
|
|
permitted (management, incident response, health monitoring).
|
|
Non-critical session types rejected at ATTEST-VERIFY step with
|
|
posture-level denial reason. Existing non-critical sessions receive
|
|
DRAIN signal.
|
|
|
|
Level 1: No remote handshakes. All inbound ATTEST-INIT rejected.
|
|
All outbound sessions blocked (eBPF shell also enforces this).
|
|
Only local Shellstream (loopback) permitted for console management.
|
|
```
|
|
|
|
**CNI plugin behavior:**
|
|
|
|
The CNI plugin's route programming adjusts on posture transitions:
|
|
|
|
| Level | Route Changes |
|
|
|-------|---------------|
|
|
| 5 | Full routes per `shell_state.allowed_targets` and `allowed_mode`. |
|
|
| 4 | Same as 5. No route changes. |
|
|
| 3 | Remove underlay routes for pods whose `shell_state` shows tiers below master. Overlay routes maintained. |
|
|
| 2 | Remove all non-critical routes. Only routes to management endpoints (Quartermaster, SPIRE, Prometheus) and incident response targets retained. |
|
|
| 1 | Remove all overlay routes (tunnels down). Remove all underlay routes except monitoring-only (read-only SNMP/health endpoints). |
|
|
|
|
Route changes on posture transition are triggered by the DaemonSet (which watches `posture_level` changes) notifying the CNI plugin's route manager. The CNI plugin does NOT poll the posture level — it receives route update commands from the DaemonSet.
|
|
|
|
### Quartermaster Notarization Fidelity
|
|
|
|
Kedge's DaemonSet adjusts `SessionTransitArtifact` and `NetworkMutationArtifact` submission behavior per level:
|
|
|
|
| Level | Notarization |
|
|
|-------|-------------|
|
|
| 5 | Sampled — 10% of routine session transits. All mutations. |
|
|
| 4 | Increased — 50% of session transits. All mutations. |
|
|
| 3 | Full — 100% of session transits. 100% of mutations. |
|
|
| 2 | Full + real-time — 100% with immediate anchor flushing (no batching). |
|
|
| 1 | Forensic — 100% with full payload hashing. Anchor flush on every record. Session metadata includes full Shellstream frame headers. |
|
|
|
|
At level 1, Kedge's DaemonSet also emits a `PostureForensicSnapshot` to Quartermaster containing the current mesh topology, all active tunnel states, all programmed routes, and all pending session state. This is the network-layer forensic evidence.
|
|
|
|
### VPP Governance Plugin Coordination (Appliance Deployments)
|
|
|
|
On Guildhouse SD-WAN appliance deployments where VPP handles the data plane, the Kedge DaemonSet coordinates posture changes with the VPP governance plugin:
|
|
|
|
```
|
|
Posture transition signal received
|
|
│
|
|
├─ DaemonSet updates its own behavior (overlay/underlay/handshake)
|
|
│
|
|
├─ DaemonSet notifies VPP governance plugin via shared memory
|
|
│ governance_state.posture_level = new_level
|
|
│
|
|
│ VPP plugin adjusts on next poll cycle (~microseconds):
|
|
│ Level 5: standard flow rules
|
|
│ Level 4: deep packet classification enabled
|
|
│ Level 3: non-essential inter-zone traffic deprioritized
|
|
│ Level 2: emergency zone policy (critical flows only)
|
|
│ Level 1: allowlist-only (everything not pre-approved dropped)
|
|
│
|
|
└─ Both DaemonSet and VPP plugin emit posture change telemetry
|
|
to their respective ring buffers → Quartermaster
|
|
```
|
|
|
|
The DaemonSet and VPP plugin must reach the new posture level within the same transition window. The DaemonSet updates VPP shared memory BEFORE adjusting its own Shellstream handshake behavior — this ensures the data plane tightens before (or simultaneously with) the session layer.
|
|
|
|
### Mesh-Wide Posture Visibility
|
|
|
|
In a multi-cluster deployment, each cluster's Kedge DaemonSet maintains its own posture level. Cross-cluster posture is communicated via Shellstream:
|
|
|
|
- Each DaemonSet includes the local posture level in Shellstream handshake metadata
|
|
- When a cloud anchor DaemonSet receives an ATTEST-INIT from a homelab DaemonSet at level 3, the cloud anchor knows to apply level 3 session restrictions even if the cloud anchor itself is at level 5
|
|
- The more restrictive posture wins at any boundary crossing: if the source is at level 3 and the destination is at level 5, the session is governed by level 3 rules
|
|
- This is enforced per-session, not per-cluster. A cluster at level 5 can host sessions from a level 3 peer — it just applies level 3 restrictions to those specific sessions
|
|
|
|
---
|
|
|
|
## Working Notes
|
|
|
|
- **Posture transition latency budget**: Kedge's network-level posture transition should complete within 5 seconds for the overlay path (tunnel adjustments, Shellstream session draining) and 10 seconds for the underlay path (route removal, VLAN interface adjustments). The VPP plugin path should complete within 1ms (shared memory update + next poll cycle). These targets ensure that a DEFCON 5→3 escalation restricts network access before an attacker can exploit the window.
|
|
- **Graceful session draining on escalation**: When escalating from level 5 to level 3, existing Shellstream sessions from lower-tier operators should not be abruptly terminated. The DaemonSet sends a DRAIN signal, allowing in-flight operations to complete (with a timeout), then tears down the session. Only level 2+ escalation does abrupt termination of non-critical sessions.
|
|
- **Posture-aware `SessionTransitArtifact`**: The `SessionTransitArtifact` should include a `posture_level_at_transit` field recording the posture level at the time of the session boundary crossing. This allows Quartermaster queries like "show all session transits that occurred while the cluster was at DEFCON 3" — critical for post-incident forensics.
|
|
- **De-escalation route restoration**: When de-escalating (e.g., level 3 → level 4 → level 5), the DaemonSet must restore routes that were removed during escalation. This requires the DaemonSet to maintain a "pre-escalation route table" snapshot so it can cleanly restore. The snapshot is taken at escalation time and stored locally (not in the BPF map — too large).
|
|
- **Level 1 cloud anchor behavior**: At level 1, the homelab cluster tears down overlay tunnels to the cloud anchor. The cloud anchor's Kedge DaemonSet should detect this as a posture-driven disconnection (not a failure) based on the DRAIN signal preceding the tunnel teardown. The cloud anchor should NOT attempt reconnection until it receives a de-escalation signal (via an out-of-band channel — physical console, SMS API, or pre-arranged re-attestation ceremony).
|
|
- **Testing posture transitions**: Add `posture-transition-test.yml` to Ansible playbooks. Should simulate escalation from 5→1 and back, verifying at each level: correct routes programmed, correct Shellstream handshake behavior, correct notarization fidelity, correct VPP plugin state (if applicable). This is a critical system-level integration test.
|