guildhall/DEPLOY-EXPLORATORY-2026-04-21.md
Tyler J King 115bd178a2 docs(deploy): capture exploratory reports for Talos + Forgejo registry
DEPLOY-EXPLORATORY documents the cluster state that shaped deployment
decisions (Keycloak as template, Hetzner LB + Cloudflare pattern, no
Postgres operator so sibling-Deployment pattern).

FORGEJO-REGISTRY-INVESTIGATION documents that the registry was already
operational in Forgejo 9.0.3 (packages enabled by default) and the
storage/credential path forward.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler J King <tking@guildhouse.dev>
2026-04-22 09:01:20 -04:00


# Guildhall deploy exploratory — Talos/Hetzner cluster state
**Date:** 2026-04-21
**Scope:** Read-only audit of the Talos/Hetzner Kubernetes cluster to inform Guildhall's initial deployment.
**Method:** `kubectl` against the cluster via `~/projects/substrate-project/guildhouse-talos-bootstrap/kubeconfig`. No mutations.
**Takeaway (synthesis at end):** Guildhall fits cleanly into the existing Keycloak/Forgejo deployment pattern: plain `Deployment` + `Deployment`-backed Postgres + Longhorn PVC + Hetzner LoadBalancer + Cloudflare-terminated TLS. No new infrastructure components required. The v1 substrate foundation (bascule / quartermaster / spire / chronicle / substrate-operator) is Flux-manifested but broken and not running; governance integration is explicitly follow-up work, not blocking.

---
## 1. Cluster basics
| | |
|---|---|
| Control plane endpoint | `https://178.104.100.159:6443` |
| kubectl client | v1.32.2 |
| kubectl server | v1.32.3 |
| Nodes | 5 (3 control-plane + 2 workers), all Ready |
| OS | Talos v1.9.5 |
| Kernel | 6.12.18-talos |
| Container runtime | containerd 2.0.3 |
| Cluster age | 10 days |
```
gsh-cp-01 control-plane 10.0.1.10
gsh-cp-02 control-plane 10.0.1.20
gsh-cp-03 control-plane 10.0.1.21
gsh-worker-01 worker 10.0.1.22
gsh-worker-02 worker 10.0.1.30
```
Matches the memory-carried description (Hetzner Talos cluster 2026-04-11: Talos 1.9.5, 5 nodes, 3 CP + 2 worker). No drift.
## 2. Namespace inventory
| Namespace | Purpose | Workloads |
|---|---|---|
| `cert-manager` | cert-manager (controller, cainjector, webhook) | 3 Deployments |
| `flux-system` | Flux GitOps | 4 Deployments (source / kustomize / helm / notification controllers) |
| `forgejo` | Forgejo git (self-hosted) | Deployment + Postgres Deployment + Runner Deployment (0/1, stuck) |
| `keycloak` | Keycloak OIDC IdP | Deployment + Postgres Deployment |
| `longhorn-system` | Longhorn CSI storage | 5 DaemonSets + 6 Deployments + UI |
| `kube-system`, `kube-public`, `kube-node-lease` | K8s system | — |
| `default` | Empty | — |
**Application workloads:** `forgejo`, `keycloak`. These are the reference patterns for Guildhall.
## 3. Ingress / gateway state
**No traditional ingress controller, no Gateway API:**
- `kubectl get ingressclasses` → No resources found
- `kubectl get gatewayclasses` → server doesn't have the resource type (Gateway API CRDs not installed)
- No HAProxy / nginx / traefik / istio / kong pods anywhere
**Traffic reaches services via `type: LoadBalancer` directly**, backed by the **Hetzner Cloud Controller Manager** (`hcloud-cloud-controller-manager` running in `kube-system`). Each LoadBalancer Service provisions a real Hetzner Cloud Load Balancer via annotations:
```yaml
annotations:
load-balancer.hetzner.cloud/location: nbg1
load-balancer.hetzner.cloud/name: keycloak-lb-v2
load-balancer.hetzner.cloud/type: lb11
load-balancer.hetzner.cloud/use-private-ip: "false"
```
Cilium Envoy DaemonSet pods exist on every node (`cilium-envoy` for L7 proxy features — CiliumNetworkPolicy L7 filtering, not Gateway API). `enable-l7-proxy: true` is set in the Cilium config.
**Existing LoadBalancer services and their public addresses:**
| Service | Namespace | IPv4 | IPv6 | Ports |
|---|---|---|---|---|
| `forgejo-http` | forgejo | `46.225.47.75` | `2a01:4f8:1c1f:65bb::1` | 80 → 30000, 22 → 30022 |
| `keycloak` | keycloak | `162.55.157.168` | `2a01:4f8:1c1d:1109::1` | 80 → 30080 |
Both expose HTTP on port 80 only; `forgejo-http` additionally exposes SSH on port 22. There is no in-cluster TLS termination. TLS is terminated **upstream at Cloudflare** (`git.guildhouse.dev` and `auth.guildhouse.dev` resolve to Cloudflare IPs; Cloudflare proxies to the Hetzner LB IPs).
## 4. cert-manager / TLS
cert-manager is installed and healthy but **no Certificate resources exist anywhere on the cluster**.
| ClusterIssuer | Status |
|---|---|
| `letsencrypt-prod` | Ready |
| `letsencrypt-staging` | Ready |
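Both ClusterIssuers are provisioned and ready to issue; they just aren't being used yet. TLS is currently handled by Cloudflare's Universal SSL at the edge, with plain HTTP between Cloudflare and the Hetzner LBs. Plain-HTTP origin traffic corresponds to Cloudflare's Flexible SSL mode; Full mode would require HTTPS on the origin, which the port-80-only LBs do not offer.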
**Implication for Guildhall:** can choose between the existing Cloudflare-termination pattern (simplest, matches forgejo/keycloak) or start using cert-manager (more work, cluster-integrated certs). The former is tonight's path; the latter is a hygiene follow-up.
## 5. Database patterns
**No Postgres operator** installed. CRDs checked (all absent):
- CloudNativePG (`clusters.postgresql.cnpg.io`)
- Zalando postgres-operator (`postgresqls.acid.zalan.do`)
- Crunchy PGO (`postgresclusters.postgres-operator.crunchydata.com`)
**Existing pattern is plain Deployment + PVC:**
```
forgejo-postgres Deployment postgres:16 PVC: forgejo-db 10Gi longhorn
keycloak-postgres Deployment postgres:16 PVC: keycloak-db 5Gi longhorn
```
**Storage:** Longhorn 1.x, single StorageClass `longhorn` (default). All PVCs use it. The longhorn-manager DaemonSet runs 5 pods, one per node, which indicates storage is available across all nodes.
**Current PVCs:**
| PVC | Namespace | Size | StorageClass |
|---|---|---|---|
| forgejo-data | forgejo | 20Gi | longhorn |
| forgejo-db | forgejo | 10Gi | longhorn |
| runner-cache | forgejo | 5Gi | longhorn |
| keycloak-db | keycloak | 5Gi | longhorn |
## 6. Secrets management
**None of the common secret managers are installed:**
- External Secrets Operator: absent
- Sealed Secrets: absent
- SPIRE/SPIFFE: absent (Flux has a Kustomization for it but the `spire-system` namespace doesn't exist; see §8 for the failing dependency chain and §10 for Flux state)
- Vault: absent
**Secrets are plain Opaque `Secret` resources.** Examples:
- `forgejo/forgejo-secrets` (3 keys)
- `keycloak/keycloak-secrets` (2 keys)
Managed out-of-band (likely committed to the private Flux source repo or applied via kubectl during bootstrap). No rotation mechanism visible.
## 7. Existing workload patterns
**Reference: `keycloak` Deployment** (cleanest example — the only Flux Kustomization that's `Ready`):
- **Image:** `quay.io/keycloak/keycloak:26.0` (public registry)
- **Env composition:** mix of literal `value:` (DB host, DB port, realm name) and `valueFrom.secretKeyRef` (admin password, DB password)
- **Labels:** `app.kubernetes.io/name=keycloak`, `app.kubernetes.io/part-of=guildhouse`
- **Config files:** ConfigMap-mounted realm import (`keycloak-realm-import`)
- **Resources:** resource requests/limits not aggressively set (defaults mostly)
- **Service:** `type: LoadBalancer` with Hetzner annotations, exposes port 80 only
- **TLS:** none in-cluster; Cloudflare upstream
**Reference: `forgejo-postgres` Deployment:**
- **Image:** `postgres:16` (public Docker Hub)
- **Env:** `POSTGRES_USER`, `POSTGRES_DB` literal; `POSTGRES_PASSWORD` from Secret
- **PGDATA:** `/var/lib/postgresql/data/pgdata` (standard subdirectory to avoid lost+found issues)
- **Volume:** PVC mounted at `/var/lib/postgresql/data`
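The sibling-Postgres shape described above can be sketched as a manifest. This is an illustrative reconstruction from the audit notes, not the manifest actually applied to the cluster: the label key, the literal `forgejo` user/database values, the secret key name, and the `Recreate` strategy are all assumptions.

```yaml
# Sketch only: reconstructed from the audit notes, not copied from the cluster.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: forgejo-postgres
  namespace: forgejo
spec:
  replicas: 1
  strategy:
    type: Recreate                 # assumption: avoids two pods contending for one volume
  selector:
    matchLabels:
      app.kubernetes.io/name: forgejo-postgres   # assumed label
  template:
    metadata:
      labels:
        app.kubernetes.io/name: forgejo-postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16
          env:
            - name: POSTGRES_USER
              value: forgejo                     # assumed literal value
            - name: POSTGRES_DB
              value: forgejo                     # assumed literal value
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: forgejo-secrets
                  key: POSTGRES_PASSWORD         # assumed key name
            # Data lives in a subdirectory so the volume root's lost+found
            # does not make initdb refuse a "non-empty" directory.
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: forgejo-db
```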
**No existing Elixir/Phoenix deployment** to reference. Guildhall will be the first. The pattern will follow the Keycloak/Forgejo shape applied to Phoenix's runtime requirements.
## 8. Guildhouse-specific components (v1 foundation)
**Currently running: none.**
- No pod matching `substrate`, `bascule`, `chronicle`, `quartermaster`, or `spire` across all namespaces. The v1 substrate foundation is absent from the cluster's running state.
- Flux has Kustomizations for `bascule`, `quartermaster`, `spire`, `automation`, `governance-talos`, `gitops-controller` — all **failing** on a dependency chain:
```
spire → fails: namespace "spire-system" does not exist
quartermaster → fails: dependency flux-system/spire is not ready
bascule → fails: dependency flux-system/quartermaster is not ready
automation → fails: dependency flux-system/quartermaster is not ready
gitops-controller → fails: dependency flux-system/quartermaster is not ready
governance-talos → fails: dependency flux-system/cluster-infra is not ready
cluster-infra → SUSPENDED + YAML decode error on 10-cilium-values.yaml
```
This chain needs to be unblocked for the v1 substrate foundation to reach the cluster, but **this is explicitly NOT Guildhall's blocker**. Guildhall is the standalone orchestration/presentation layer; it composes with substrate via CRD watches once substrate is running, but doesn't require substrate present to stand up and serve its web UI.
## 9. Networking specifics
**Cilium version:** `v1.16.5` (1.16 series; recent, but not the newer 1.17 line)
**Key Cilium config** (from `kube-system/cilium-config`):
| Flag | Value | Notes |
|---|---|---|
| `kube-proxy-replacement` | `true` | Cilium replaces kube-proxy (full eBPF mode) |
| `enable-ipv4` | `true` | IPv4 on pod network |
| `enable-ipv6` | `false` | IPv6 NOT enabled at pod network (LBs get Hetzner-assigned v6 externally) |
| `enable-l7-proxy` | `true` | Envoy DaemonSet for L7 filtering |
| `enable-hubble` | `true` | Hubble observability |
| `ipam` | `kubernetes` | Kubernetes host-scope IPAM (pod IPs come from each node's PodCIDR), not cluster-pool |
**Not enabled / not present:**
- BGP control plane (`ciliumbgppeeringpolicies` CRD absent)
- L2 announcements (`ciliuml2announcementpolicies` CRD present but zero resources)
- LoadBalancerIPPool (CRD present but zero resources — Hetzner CCM handles LB IPs instead)
- Gateway API (`gatewayclasses` CRD absent)
- ClusterMesh (single-cluster)
**NetworkPolicies in place** (only 3, all in `flux-system`):
- `allow-egress`
- `allow-scraping`
- `allow-webhooks` (scoped to `app=notification-controller`)
**CiliumNetworkPolicies:** none. Workloads rely on default-allow between pods. Guildhall deployment can proceed without adding policies; adding them is hardening follow-up.
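As an illustration of what the hardening follow-up might look like, the sketch below uses a standard Kubernetes `NetworkPolicy` (which Cilium enforces) to restrict the future Guildhall database to traffic from the Guildhall pods. The policy name and the pod labels are hypothetical; they assume Guildhall adopts the `app.kubernetes.io/name` labelling seen on Keycloak.

```yaml
# Sketch only: hypothetical hardening policy, not part of the v1 deploy.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: guildhall-postgres-ingress     # hypothetical name
  namespace: guildhall
spec:
  # Applies to the Postgres pods (assumed label).
  podSelector:
    matchLabels:
      app.kubernetes.io/name: guildhall-postgres
  policyTypes:
    - Ingress
  ingress:
    # Only the Guildhall app pods may reach Postgres, and only on 5432.
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: guildhall
      ports:
        - protocol: TCP
          port: 5432
```

Once any policy selects a pod, traffic not explicitly allowed to that pod is dropped, so a policy like this should only be applied after the label scheme is settled.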
## 10. Deployment automation
**GitOps: Flux** is the sole mechanism. Running components:
- `source-controller`, `kustomize-controller`, `helm-controller`, `notification-controller` — all 1/1 Ready
**Sources:** one `GitRepository` registered:
```
flux-system / guildhouse-deploy
URL: https://github.com/gh-tking/guildhouse-deploy-talos-mirror
STATUS: Ready (artifact stored for main@169e077f)
```
**Kustomizations:** 9 total, summary:
| Name | Status |
|---|---|
| `keycloak` | ✅ Ready (applied revision `169e077f`) |
| `forgejo` | ❌ health check failed (forgejo-runner Deployment stuck InProgress) |
| `cluster-infra` | ❌ SUSPENDED + YAML decode error |
| `spire` | ❌ `spire-system` namespace not found |
| `quartermaster` | ❌ depends on spire (not ready) |
| `bascule` | ❌ depends on quartermaster (not ready) |
| `automation` | ❌ depends on quartermaster (not ready) |
| `gitops-controller` | ❌ depends on quartermaster (not ready) |
| `governance-talos` | ❌ depends on cluster-infra (not ready) |
**Key observation:** only `keycloak` flows through Flux successfully. Everything else is failing a health check (`forgejo`), suspended with a YAML decode error in the source repo (`cluster-infra`), or blocked on an upstream dependency that is not ready.
**HelmRepositories and HelmReleases:** none.
**How changes land on the cluster:** currently via Flux against the GitHub-hosted source repo for the one working Kustomization (keycloak), otherwise via direct `kubectl apply` (given the broken Flux chain).

---
## Synthesis
### What Guildhall can leverage
- **Longhorn StorageClass** — works out of the box for Postgres PVC. 5Gi is ample for initial Guildhall DB (matches keycloak-db sizing).
- **Hetzner CCM LoadBalancer** — a LoadBalancer Service with `load-balancer.hetzner.cloud/*` annotations provisions a new Hetzner LB automatically. Cost is ~€5/mo for an `lb11` tier. Matches forgejo / keycloak exactly.
- **Cloudflare-at-the-edge TLS** — DNS at `guildhall.guildhouse.dev` points at the Hetzner LB IPv4, Cloudflare terminates TLS, origin is plain HTTP on port 80. Zero cert-manager work required for v1.
- **Keycloak as OIDC IdP** — already running at `auth.guildhouse.dev`. When Guildhall wires its OIDC config (currently commented out in `config/runtime.exs`), the endpoint is ready. Not blocking tonight.
- **cert-manager ClusterIssuers** — `letsencrypt-prod` and `letsencrypt-staging` are ready, available as upgrade-path from Cloudflare-edge TLS to cluster-terminated TLS if/when that hygiene pass happens.
- **Reference deployment pattern** — keycloak's Deployment shape (public image, env-from-secret, ConfigMap for data, Service type=LoadBalancer, Postgres sibling Deployment + PVC) maps directly to Guildhall. Apply the same template.
- **Flux GitOps pipeline exists** (if desired) — a new Kustomization in `guildhouse-deploy-talos-mirror` for Guildhall would auto-deploy. BUT the Flux state is currently messy — most Kustomizations are broken — so a direct `kubectl apply` path is cleaner for the v1 Guildhall deploy, with a follow-up Flux migration once the broader chain is healed.
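For the eventual Flux migration mentioned in the last bullet, a Guildhall Kustomization might look roughly like the sketch below. The `path` is hypothetical (the layout of the source repo was not inspected), and the `v1` API version assumes a Flux 2.x install recent enough to serve it. It deliberately declares no `dependsOn`, so it would not inherit the broken spire/quartermaster chain.

```yaml
# Sketch only: a possible follow-up, not part of the v1 kubectl-apply path.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: guildhall
  namespace: flux-system
spec:
  interval: 10m
  path: ./apps/guildhall          # hypothetical path inside the source repo
  prune: true
  sourceRef:
    kind: GitRepository
    name: guildhouse-deploy       # the existing GitRepository from section 10
  targetNamespace: guildhall
  wait: true                      # report Ready only once workloads are healthy
  timeout: 5m
```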
### What Guildhall needs that the cluster doesn't have yet
- **Guildhall container image.** Must be built locally via `mix release` + Dockerfile and pushed to a registry the cluster can pull from. Registry options:
  - `ghcr.io/gh-tking/guildhall:<tag>`: public GitHub Container Registry (requires publishing via GitHub Actions or a manual `docker push`)
  - Docker Hub under a personal account
  - **Forgejo container registry** at `git.guildhouse.dev/tking/guildhall:<tag>`: Forgejo 1.19+ supports an OCI registry; this is the choice most consistent with the rest of the Guildhouse tooling
  - A private Hetzner-region ghcr mirror
- **Secrets:** `guildhall-secrets` Opaque Secret with at minimum `SECRET_KEY_BASE` (Phoenix's signing/encryption secret; `mix phx.gen.secret` generates a 64-character value by default) and `DATABASE_URL` (or a discrete `DB_PASSWORD` from which the URL is constructed at runtime).
- **Namespace:** `guildhall` (new).
- **DNS record:** `guildhall.guildhouse.dev` → Hetzner LB IPv4 (via Cloudflare). Can be created after LB is provisioned, once the LB IP is known.
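The namespace and secret from the list above could be created from a manifest along these lines. The values are placeholders, and the key set follows the `SECRET_KEY_BASE` + `DB_PASSWORD` variant; a file like this with real values should not be committed in plain text.

```yaml
# Sketch only: placeholder values, replace before applying.
apiVersion: v1
kind: Namespace
metadata:
  name: guildhall
---
apiVersion: v1
kind: Secret
metadata:
  name: guildhall-secrets
  namespace: guildhall
type: Opaque
# stringData accepts plain strings; the API server stores them base64-encoded.
stringData:
  SECRET_KEY_BASE: "<output of mix phx.gen.secret>"
  DB_PASSWORD: "<generated database password>"
```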
### Likely shape of the deployment
Based on the keycloak/forgejo pattern:
```
Namespace: guildhall
├── Deployment: guildhall-postgres (postgres:16, env POSTGRES_* from guildhall-secrets)
├── PVC: guildhall-db (longhorn, 5-10Gi)
├── Service: guildhall-postgres (ClusterIP, 5432)
├── Secret: guildhall-secrets (SECRET_KEY_BASE, DB_PASSWORD)
├── Deployment: guildhall (image from ghcr / forgejo registry / etc, envs DATABASE_URL + SECRET_KEY_BASE + PHX_HOST=guildhall.guildhouse.dev + PHX_SERVER=true + PORT=4000)
└── Service: guildhall (type=LoadBalancer, Hetzner annotations, port 80 → 4000)
```
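The application half of that tree might be written as follows. This is a sketch under several assumptions: the image tag is the one proposed later in this report, the database user and name (`guildhall`) are invented for illustration, the load balancer name is hypothetical, and `DATABASE_URL` is assembled with Kubernetes' `$(VAR)` expansion, which requires `DB_PASSWORD` to be declared earlier in the same `env` list and to contain no characters that need URL-encoding.

```yaml
# Sketch only: modeled on the keycloak pattern, not a tested manifest.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: guildhall
  namespace: guildhall
  labels:
    app.kubernetes.io/name: guildhall
    app.kubernetes.io/part-of: guildhouse
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: guildhall
  template:
    metadata:
      labels:
        app.kubernetes.io/name: guildhall
        app.kubernetes.io/part-of: guildhouse
    spec:
      containers:
        - name: guildhall
          image: git.guildhouse.dev/tking/guildhall:v0.1.0
          ports:
            - containerPort: 4000
          env:
            - name: PHX_SERVER
              value: "true"
            - name: PHX_HOST
              value: guildhall.guildhouse.dev
            - name: PORT
              value: "4000"
            - name: SECRET_KEY_BASE
              valueFrom:
                secretKeyRef:
                  name: guildhall-secrets
                  key: SECRET_KEY_BASE
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: guildhall-secrets
                  key: DB_PASSWORD
            # Must come after DB_PASSWORD so $(DB_PASSWORD) expands.
            - name: DATABASE_URL
              value: ecto://guildhall:$(DB_PASSWORD)@guildhall-postgres:5432/guildhall
---
apiVersion: v1
kind: Service
metadata:
  name: guildhall
  namespace: guildhall
  annotations:
    load-balancer.hetzner.cloud/location: nbg1
    load-balancer.hetzner.cloud/name: guildhall-lb      # hypothetical LB name
    load-balancer.hetzner.cloud/type: lb11
    load-balancer.hetzner.cloud/use-private-ip: "false"
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: guildhall
  ports:
    - name: http
      port: 80          # Cloudflare connects here over plain HTTP
      targetPort: 4000  # Phoenix listens on PORT=4000
```

If the Forgejo registry requires authentication for pulls, the Deployment would also need an `imagePullSecrets` entry, which is omitted here.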
Release build discipline:
- `mix release` in Docker multi-stage build (Elixir 1.17.3 / OTP 27 builder stage, debian-slim runtime stage)
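- Database migrations on container start (or via a Job, or a custom release step). Mix is not included in a `mix release` build, so `mix ecto.migrate` is unavailable inside the image; migrations run through the release instead, for example `bin/guildhall eval "Guildhall.Release.migrate()"` if a release module like the one `mix phx.gen.release` generates is added (the module and release names here are assumptions)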
- `PHX_SERVER=true` to start the HTTP server (per `config/runtime.exs`)
- Health check endpoint: Phoenix does not generate one by default, so either probe an existing route or add a custom `/health` route
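If the Job option is chosen for migrations, it could take roughly this shape. The sketch assumes the release is installed at `/app` in the image, that the release is named `guildhall`, and that a `Guildhall.Release` module with a `migrate/0` function exists (as `mix phx.gen.release` would generate); none of these were verified against the Guildhall repository. `SECRET_KEY_BASE` is passed as well because a stock `config/runtime.exs` is evaluated by `eval` and raises in production when it is missing.

```yaml
# Sketch only: one-shot migration Job, run before or alongside the first rollout.
apiVersion: batch/v1
kind: Job
metadata:
  name: guildhall-migrate          # hypothetical name
  namespace: guildhall
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: git.guildhouse.dev/tking/guildhall:v0.1.0
          # Assumed release path and module; adjust to the real release layout.
          command: ["/app/bin/guildhall", "eval", "Guildhall.Release.migrate()"]
          env:
            - name: SECRET_KEY_BASE
              valueFrom:
                secretKeyRef:
                  name: guildhall-secrets
                  key: SECRET_KEY_BASE
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: guildhall-secrets
                  key: DB_PASSWORD
            # Declared after DB_PASSWORD so $(DB_PASSWORD) expands.
            - name: DATABASE_URL
              value: ecto://guildhall:$(DB_PASSWORD)@guildhall-postgres:5432/guildhall
```

A Job keeps migrations out of the app container's startup path, at the cost of one more object to apply in the right order.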
### Surprises
**What's present that wasn't expected:**
- **Keycloak is already serving at `auth.guildhouse.dev`.** The OIDC substrate Guildhall will eventually integrate with is live. Zero setup needed for that dependency when the time comes.
- **cert-manager is installed but unused.** Suggests a deliberate deferral in favor of Cloudflare-edge TLS; the ClusterIssuers are staged and ready for when in-cluster TLS is adopted.
- **Cilium Envoy DaemonSet is running on every node** but with no Gateway API / CiliumEnvoyConfig / L7 policies currently in play. Present for future L7 use, not actively load-bearing yet.
**What's expected but absent:**
- **No HAProxy.** Previous K3s-era cluster used HAProxy as ingress; this cluster doesn't. Hetzner LBs took its role.
- **v1 substrate foundation is entirely absent from the running cluster.** bascule, substrate-operator, chronicle, quartermaster, SPIRE — none running. Flux manifests exist (in the `guildhouse-deploy-talos-mirror` repo) but are blocked on a dependency chain rooted at missing `spire-system` namespace and a YAML decode error in `cluster-infra/10-cilium-values.yaml`. Unblocking this is real work that is NOT on the Red Hat path — governance integration is follow-up.
- **No existing Elixir/Phoenix deployment** to copy. Guildhall will be the first Phoenix app on this cluster.
- **Flux source is on GitHub (`guildhouse-deploy-talos-mirror`), not Forgejo.** Follows the same pattern as the substrate-project umbrella migration just completed — another GitHub→Forgejo item on the cleanup list, not blocking.
### Minimum path to Guildhall running at `guildhall.guildhouse.dev`
1. Dockerfile in `~/projects/substrate-project/guildhall/` — multi-stage with OTP 27, `mix release`
2. Build and push image to a registry (Forgejo container registry at `git.guildhouse.dev/tking/guildhall:v0.1.0` recommended for consistency)
3. Generate `SECRET_KEY_BASE` via `mix phx.gen.secret`
4. Create `guildhall` namespace; create `guildhall-secrets` Secret
5. Apply Deployment + Service + Postgres + PVC manifest (template from keycloak)
6. Wait for Hetzner LB to provision; note IPv4
7. Create Cloudflare DNS record `guildhall.guildhouse.dev` → LB IPv4 (proxied, so Cloudflare handles TLS)
8. Verify; run any first-time ecto migration
No cluster infrastructure changes. No cert-manager Certificates. No Flux reconfiguration. No governance-stack dependency. Just the same Deployment-shaped pattern that Keycloak and Forgejo already use, applied to Guildhall.
Governance integration (CRD watchers, SPIFFE identity, Chronicle wiring, Accord enforcement) is explicitly follow-up work for after Guildhall is reachable and the Red Hat submission is in.