# Guildhall deploy exploratory: Talos/Hetzner cluster state

**Date:** 2026-04-21

**Scope:** Read-only audit of the Talos/Hetzner Kubernetes cluster to inform Guildhall's initial deployment.

**Method:** `kubectl` against the cluster via `~/projects/substrate-project/guildhouse-talos-bootstrap/kubeconfig`. No mutations.

**Takeaway (synthesis at end):** Guildhall fits cleanly into the existing Keycloak/Forgejo deployment pattern: plain `Deployment` + `Deployment`-backed Postgres + Longhorn PVC + Hetzner LoadBalancer + Cloudflare-terminated TLS. No new infrastructure components required. The v1 substrate foundation (bascule / quartermaster / spire / chronicle / substrate-operator) is Flux-manifested but broken and not running; governance integration is explicitly follow-up work, not blocking.

---

## 1. Cluster basics

| | |
|---|---|
| Control plane endpoint | `https://178.104.100.159:6443` |
| kubectl client | v1.32.2 |
| kubectl server | v1.32.3 |
| Nodes | 5 (3 control-plane + 2 workers), all Ready |
| OS | Talos v1.9.5 |
| Kernel | 6.12.18-talos |
| Container runtime | containerd 2.0.3 |
| Cluster age | 10 days |

```
gsh-cp-01       control-plane   10.0.1.10
gsh-cp-02       control-plane   10.0.1.20
gsh-cp-03       control-plane   10.0.1.21
gsh-worker-01   worker          10.0.1.22
gsh-worker-02   worker          10.0.1.30
```

Matches the memory-carried description (Hetzner Talos cluster 2026-04-11: Talos 1.9.5, 5 nodes, 3 CP + 2 worker). No drift.

## 2. Namespace inventory

| Namespace | Purpose | Workloads |
|---|---|---|
| `cert-manager` | cert-manager 3 controllers (cert-manager, cainjector, webhook) | 3 Deployments |
| `flux-system` | Flux GitOps | 4 Deployments (source / kustomize / helm / notification controllers) |
| `forgejo` | Forgejo git (self-hosted) | Deployment + Postgres Deployment + Runner Deployment (0/1, stuck) |
| `keycloak` | Keycloak OIDC IdP | Deployment + Postgres Deployment |
| `longhorn-system` | Longhorn CSI storage | 5 DaemonSets + 6 Deployments + UI |
| `kube-system`, `kube-public`, `kube-node-lease` | K8s system | - |
| `default` | Empty | - |

**Application workloads:** `forgejo`, `keycloak`. These are the reference patterns for Guildhall.

## 3. Ingress / gateway state

**No traditional ingress controller, no Gateway API:**

- `kubectl get ingressclasses` → No resources found
- `kubectl get gatewayclasses` → server doesn't have the resource type (Gateway API CRDs not installed)
- No HAProxy / nginx / traefik / istio / kong pods anywhere

**Traffic reaches services via `type: LoadBalancer` directly**, backed by the **Hetzner Cloud Controller Manager** (`hcloud-cloud-controller-manager` running in `kube-system`). Each LoadBalancer Service provisions a real Hetzner Cloud Load Balancer via annotations:

```yaml
annotations:
  load-balancer.hetzner.cloud/location: nbg1
  load-balancer.hetzner.cloud/name: keycloak-lb-v2
  load-balancer.hetzner.cloud/type: lb11
  load-balancer.hetzner.cloud/use-private-ip: "false"
```

Cilium Envoy DaemonSet pods exist on every node (`cilium-envoy`, for L7 proxy features: CiliumNetworkPolicy L7 filtering, not Gateway API). `enable-l7-proxy: true` is set in the Cilium config.
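
As a rough illustration of how those annotations sit on a full Service, here is a minimal sketch for Guildhall. The annotation keys and the `nbg1` / `lb11` values are copied from the keycloak Service above. The LB name `guildhall-lb`, the selector label, and the `4000` target port are assumptions, not values read from the cluster.

```yaml
# Sketch only: a LoadBalancer Service in the keycloak style.
apiVersion: v1
kind: Service
metadata:
  name: guildhall
  namespace: guildhall
  annotations:
    load-balancer.hetzner.cloud/location: nbg1
    load-balancer.hetzner.cloud/name: guildhall-lb   # hypothetical LB name
    load-balancer.hetzner.cloud/type: lb11
    load-balancer.hetzner.cloud/use-private-ip: "false"
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: guildhall   # assumed pod label
  ports:
    - name: http
      port: 80          # plain HTTP; TLS is terminated upstream
      targetPort: 4000  # assumed Phoenix listen port
```

Once applied, the Hetzner CCM should populate the Service's external IP, which `kubectl get svc -n guildhall` would show.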

**Existing LoadBalancer services and their public addresses:**

| Service | Namespace | IPv4 | IPv6 | Ports |
|---|---|---|---|---|
| `forgejo-http` | forgejo | `46.225.47.75` | `2a01:4f8:1c1f:65bb::1` | 80 → 30000, 22 → 30022 |
| `keycloak` | keycloak | `162.55.157.168` | `2a01:4f8:1c1d:1109::1` | 80 → 30080 |

Both serve plain HTTP on port 80 (`forgejo-http` additionally exposes port 22). No in-cluster TLS termination. TLS is terminated **upstream at Cloudflare** (`git.guildhouse.dev` and `auth.guildhouse.dev` resolve to Cloudflare IPs; Cloudflare proxies to the Hetzner LB IPs).

## 4. cert-manager / TLS

cert-manager is installed and healthy but **no Certificate resources exist anywhere on the cluster**.

| ClusterIssuer | Status |
|---|---|
| `letsencrypt-prod` | Ready |
| `letsencrypt-staging` | Ready |

Both ClusterIssuers are provisioned and ready to issue. They just aren't being used yet: TLS is currently handled via Cloudflare's Universal SSL / Full mode at the edge, with HTTP between Cloudflare and the Hetzner LBs.

**Implication for Guildhall:** can choose between the existing Cloudflare-termination pattern (simplest, matches forgejo/keycloak) or start using cert-manager (more work, cluster-integrated certs). The former is tonight's path; the latter is a hygiene follow-up.

## 5. Database patterns

**No Postgres operator** installed. CRDs checked (all absent):

- CloudNativePG (`clusters.postgresql.cnpg.io`)
- Zalando postgres-operator (`postgresqls.acid.zalan.do`)
- Crunchy PGO (`postgresclusters.postgres-operator.crunchydata.com`)

**Existing pattern is plain Deployment + PVC:**

```
forgejo-postgres    Deployment   postgres:16   PVC: forgejo-db    10Gi   longhorn
keycloak-postgres   Deployment   postgres:16   PVC: keycloak-db    5Gi   longhorn
```

**Storage:** Longhorn 1.x, single StorageClass `longhorn` (default). All PVCs use it. The longhorn-manager DaemonSet runs 5 replicas, one per node, which suggests storage is healthy across all nodes.
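
To make the Deployment + PVC pattern concrete, the sketch below adapts it for a Guildhall database. The image, StorageClass, size, and `PGDATA` subdirectory follow the existing Postgres workloads. The resource names, labels, database user and name, and the `DB_PASSWORD` Secret key are assumptions. The `Recreate` strategy is also an assumption, chosen because a single ReadWriteOnce volume cannot be shared between old and new pods during a rolling update.

```yaml
# Sketch only: Postgres as a plain Deployment backed by a Longhorn PVC.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: guildhall-db
  namespace: guildhall
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn
  resources:
    requests:
      storage: 5Gi   # same size as keycloak-db
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: guildhall-postgres
  namespace: guildhall
spec:
  replicas: 1
  strategy:
    type: Recreate   # assumption: avoid two pods contending for one RWO volume
  selector:
    matchLabels:
      app.kubernetes.io/name: guildhall-postgres
  template:
    metadata:
      labels:
        app.kubernetes.io/name: guildhall-postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_USER
              value: guildhall          # hypothetical
            - name: POSTGRES_DB
              value: guildhall          # hypothetical
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: guildhall-secrets
                  key: DB_PASSWORD      # assumed key name
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: guildhall-db
```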

**Current PVCs:**

| PVC | Namespace | Size | StorageClass |
|---|---|---|---|
| forgejo-data | forgejo | 20Gi | longhorn |
| forgejo-db | forgejo | 10Gi | longhorn |
| runner-cache | forgejo | 5Gi | longhorn |
| keycloak-db | keycloak | 5Gi | longhorn |

## 6. Secrets management

**None of the common secret managers are installed:**

- External Secrets Operator: absent
- Sealed Secrets: absent
- SPIRE/SPIFFE: absent (Flux has a Kustomization for it but the `spire-system` namespace doesn't exist; see §10 Flux state)
- Vault: absent

**Secrets are plain Opaque `Secret` resources.** Examples:

- `forgejo/forgejo-secrets` (3 keys)
- `keycloak/keycloak-secrets` (2 keys)

Managed out-of-band (likely committed to the private Flux source repo or applied via kubectl during bootstrap). No rotation mechanism visible.

## 7. Existing workload patterns

**Reference: `keycloak` Deployment** (cleanest example, and the only Flux Kustomization that's `Ready`):

- **Image:** `quay.io/keycloak/keycloak:26.0` (public registry)
- **Env composition:** mix of literal `value:` (DB host, DB port, realm name) and `valueFrom.secretKeyRef` (admin password, DB password)
- **Labels:** `app.kubernetes.io/name=keycloak`, `app.kubernetes.io/part-of=guildhouse`
- **Config files:** ConfigMap-mounted realm import (`keycloak-realm-import`)
- **Resources:** resource requests/limits not aggressively set (defaults mostly)
- **Service:** `type: LoadBalancer` with Hetzner annotations, exposes port 80 only
- **TLS:** none in-cluster; Cloudflare upstream

**Reference: `forgejo-postgres` Deployment:**

- **Image:** `postgres:16` (public Docker Hub)
- **Env:** `POSTGRES_USER`, `POSTGRES_DB` literal; `POSTGRES_PASSWORD` from Secret
- **PGDATA:** `/var/lib/postgresql/data/pgdata` (standard subdirectory to avoid lost+found issues)
- **Volume:** PVC mounted at `/var/lib/postgresql/data`

**No existing Elixir/Phoenix deployment** to reference. Guildhall will be the first.
The pattern will follow the Keycloak/Forgejo shape applied to Phoenix's runtime requirements.

## 8. Guildhouse-specific components (v1 foundation)

**Currently running: none.**

- No pod matching `substrate`, `bascule`, `chronicle`, `quartermaster`, or `spire` across all namespaces. The v1 substrate foundation is absent from the cluster's running state.
- Flux has Kustomizations for `bascule`, `quartermaster`, `spire`, `automation`, `governance-talos`, `gitops-controller`, all **failing** on a dependency chain:

```
spire              → fails: namespace "spire-system" does not exist
quartermaster      → fails: dependency flux-system/spire is not ready
bascule            → fails: dependency flux-system/quartermaster is not ready
automation         → fails: dependency flux-system/quartermaster is not ready
gitops-controller  → fails: dependency flux-system/quartermaster is not ready
governance-talos   → fails: dependency flux-system/cluster-infra is not ready
cluster-infra      → SUSPENDED + YAML decode error on 10-cilium-values.yaml
```

This chain needs to be unblocked for the v1 substrate foundation to reach the cluster, but **this is explicitly NOT Guildhall's blocker**. Guildhall is the standalone orchestration/presentation layer; it composes with substrate via CRD watches once substrate is running, but doesn't require substrate present to stand up and serve its web UI.

## 9. Networking specifics

**Cilium version:** `v1.16.5` (Cilium 1.16 series, recent but not 1.17-cutting-edge)

**Key Cilium config** (from `kube-system/cilium-config`):

| Flag | Value | Notes |
|---|---|---|
| `kube-proxy-replacement` | `true` | Cilium replaces kube-proxy (full eBPF mode) |
| `enable-ipv4` | `true` | IPv4 on pod network |
| `enable-ipv6` | `false` | IPv6 NOT enabled at pod network (LBs get Hetzner-assigned v6 externally) |
| `enable-l7-proxy` | `true` | Envoy DaemonSet for L7 filtering |
| `enable-hubble` | `true` | Hubble observability |
| `ipam` | `kubernetes` | Kubernetes host-scope IPAM (per-node pod CIDRs), not cluster-pool |

**Not enabled / not present:**

- BGP control plane (`ciliumbgppeeringpolicies` CRD absent)
- L2 announcements (`ciliuml2announcementpolicies` CRD present but zero resources)
- LoadBalancerIPPool (CRD present but zero resources; Hetzner CCM handles LB IPs instead)
- Gateway API (`gatewayclasses` CRD absent)
- ClusterMesh (single-cluster)

**NetworkPolicies in place** (only 3, all in `flux-system`):

- `allow-egress`
- `allow-scraping`
- `allow-webhooks` (scoped to `app=notification-controller`)

**CiliumNetworkPolicies:** none. Workloads rely on default-allow between pods. Guildhall deployment can proceed without adding policies; adding them is hardening follow-up.

## 10. Deployment automation

**GitOps: Flux** is the sole mechanism.
Running components:

- `source-controller`, `kustomize-controller`, `helm-controller`, `notification-controller` (all 1/1 Ready)

**Sources:** one `GitRepository` registered:

```
flux-system / guildhouse-deploy
  URL:    https://github.com/gh-tking/guildhouse-deploy-talos-mirror
  STATUS: Ready (artifact stored for main@169e077f)
```

**Kustomizations:** 9 total, summary:

| Name | Status |
|---|---|
| `keycloak` | ✅ Ready (applied revision `169e077f`) |
| `forgejo` | ❌ health check failed (forgejo-runner Deployment stuck InProgress) |
| `cluster-infra` | ❌ SUSPENDED + YAML decode error |
| `spire` | ❌ `spire-system` namespace not found |
| `quartermaster` | ❌ depends on spire (not ready) |
| `bascule` | ❌ depends on quartermaster (not ready) |
| `automation` | ❌ depends on quartermaster (not ready) |
| `gitops-controller` | ❌ depends on quartermaster (not ready) |
| `governance-talos` | ❌ depends on cluster-infra (not ready) |

**Key observation:** only `keycloak` flows through Flux successfully. Everything else is either suspended, blocked on missing upstream dependencies, or has a YAML error in the source repo.

**HelmRepositories and HelmReleases:** none.

**Changes land on the cluster:** currently via Flux against the GitHub-hosted source repo for the one working Kustomization (keycloak), otherwise via direct `kubectl apply` (given the broken Flux chain).

---

## Synthesis

### What Guildhall can leverage

- **Longhorn StorageClass:** works out of the box for Postgres PVC. 5Gi is ample for initial Guildhall DB (matches keycloak-db sizing).
- **Hetzner CCM LoadBalancer:** a LoadBalancer Service with `load-balancer.hetzner.cloud/*` annotations provisions a new Hetzner LB automatically. Cost is ~€5/mo for an `lb11` tier. Matches forgejo / keycloak exactly.
- **Cloudflare-at-the-edge TLS:** DNS at `guildhall.guildhouse.dev` points at the Hetzner LB IPv4, Cloudflare terminates TLS, origin is plain HTTP on port 80. Zero cert-manager work required for v1.
- **Keycloak as OIDC IdP:** already running at `auth.guildhouse.dev`. When Guildhall wires its OIDC config (currently commented out in `config/runtime.exs`), the endpoint is ready. Not blocking tonight.
- **cert-manager ClusterIssuers:** `letsencrypt-prod` and `letsencrypt-staging` are ready, available as an upgrade path from Cloudflare-edge TLS to cluster-terminated TLS if/when that hygiene pass happens.
- **Reference deployment pattern:** keycloak's Deployment shape (public image, env-from-secret, ConfigMap for data, Service type=LoadBalancer, Postgres sibling Deployment + PVC) maps directly to Guildhall. Apply the same template.
- **Flux GitOps pipeline exists** (if desired): a new Kustomization in `guildhouse-deploy-talos-mirror` for Guildhall would auto-deploy. BUT the Flux state is currently messy (most Kustomizations are broken), so a direct `kubectl apply` path is cleaner for the v1 Guildhall deploy, with a follow-up Flux migration once the broader chain is healed.

### What Guildhall needs that the cluster doesn't have yet

- **Guildhall container image.** Must be built locally via `mix release` + Dockerfile and pushed to a registry the cluster can pull from. Registry options:
  - `ghcr.io/gh-tking/guildhall:` (public GitHub Container Registry; requires packaging via GitHub Actions or a manual `docker push`)
  - Docker Hub under a personal account
  - **Forgejo container registry** at `git.guildhouse.dev/tking/guildhall:` (Forgejo 1.19+ supports OCI registry; this is the most consistent choice with the rest of the Guildhouse tooling)
  - A private Hetzner-region ghcr mirror
- **Secrets:** `guildhall-secrets` Opaque Secret with at minimum `SECRET_KEY_BASE` (64-character Phoenix secret as generated by `mix phx.gen.secret`) and `DATABASE_URL` (or a discrete `DB_PASSWORD`, with the URL constructed at runtime).
- **Namespace:** `guildhall` (new).
- **DNS record:** `guildhall.guildhouse.dev` → Hetzner LB IPv4 (via Cloudflare). Can be created after the LB is provisioned, once the LB IP is known.
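
The namespace and Secret items above could be expressed as the following sketch. The placeholder values are deliberately not real, and carrying both `DATABASE_URL` and `DB_PASSWORD` is only one of the two options the list describes. The database user, database name, and the `guildhall-postgres` hostname are assumptions.

```yaml
# Sketch only: namespace plus an Opaque Secret. Replace every placeholder
# before applying, and avoid committing real values to a repo.
apiVersion: v1
kind: Namespace
metadata:
  name: guildhall
---
apiVersion: v1
kind: Secret
metadata:
  name: guildhall-secrets
  namespace: guildhall
type: Opaque
stringData:
  # Paste the output of `mix phx.gen.secret` here.
  SECRET_KEY_BASE: "REPLACE_WITH_GENERATED_SECRET"
  # Used by the Postgres container (assumed key name).
  DB_PASSWORD: "REPLACE_ME"
  # Ecto-style URL; user, host, and database name are hypothetical.
  DATABASE_URL: "ecto://guildhall:REPLACE_ME@guildhall-postgres:5432/guildhall"
```

`stringData` lets the values be written as plain strings; the API server stores them base64-encoded under `data`.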
### Likely shape of the deployment

Based on the keycloak/forgejo pattern:

```
Namespace: guildhall
├── Deployment: guildhall-postgres   (postgres:16, env POSTGRES_* from guildhall-secrets)
├── PVC:        guildhall-db         (longhorn, 5-10Gi)
├── Service:    guildhall-postgres   (ClusterIP, 5432)
├── Secret:     guildhall-secrets    (SECRET_KEY_BASE, DB_PASSWORD)
├── Deployment: guildhall            (image from ghcr / forgejo registry / etc,
│                                     envs DATABASE_URL + SECRET_KEY_BASE +
│                                     PHX_HOST=guildhall.guildhouse.dev +
│                                     PHX_SERVER=true + PORT=4000)
└── Service:    guildhall            (type=LoadBalancer, Hetzner annotations, port 80 → 4000)
```

Release build discipline:

- `mix release` in Docker multi-stage build (Elixir 1.17.3 / OTP 27 builder stage, debian-slim runtime stage)
- `mix ecto.migrate` on container start (or a Job, or mix release custom step)
- `PHX_SERVER=true` to start the HTTP server (per `config/runtime.exs`)
- Health check endpoint (Phoenix default or custom `/health`)

### Surprises

**What's present that wasn't expected:**

- **Keycloak is already serving at `auth.guildhouse.dev`.** The OIDC substrate Guildhall will eventually integrate with is live. Zero setup needed for that dependency when the time comes.
- **cert-manager is installed but unused.** Suggests a deliberate deferral in favor of Cloudflare-edge TLS; the ClusterIssuers are staged and ready for when in-cluster TLS is adopted.
- **Cilium Envoy DaemonSet is running on every node** but with no Gateway API / CiliumEnvoyConfig / L7 policies currently in play. Present for future L7 use, not actively load-bearing yet.

**What's expected but absent:**

- **No HAProxy.** Previous K3s-era cluster used HAProxy as ingress; this cluster doesn't. Hetzner LBs took its role.
- **v1 substrate foundation is entirely absent from the running cluster.** bascule, substrate-operator, chronicle, quartermaster, SPIRE: none running.
  Flux manifests exist (in the `guildhouse-deploy-talos-mirror` repo) but are blocked on a dependency chain rooted at the missing `spire-system` namespace and a YAML decode error in `cluster-infra/10-cilium-values.yaml`. Unblocking this is real work that is NOT on the Red Hat path; governance integration is follow-up.
- **No existing Elixir/Phoenix deployment** to copy. Guildhall will be the first Phoenix app on this cluster.
- **Flux source is on GitHub (`guildhouse-deploy-talos-mirror`), not Forgejo.** Follows the same pattern as the substrate-project umbrella migration just completed: another GitHub→Forgejo item on the cleanup list, not blocking.

### Minimum path to Guildhall running at `guildhall.guildhouse.dev`

1. Dockerfile in `~/projects/substrate-project/guildhall/`: multi-stage with OTP 27, `mix release`
2. Build and push image to a registry (Forgejo container registry at `git.guildhouse.dev/tking/guildhall:v0.1.0` recommended for consistency)
3. Generate `SECRET_KEY_BASE` via `mix phx.gen.secret`
4. Create `guildhall` namespace; create `guildhall-secrets` Secret
5. Apply Deployment + Service + Postgres + PVC manifest (template from keycloak)
6. Wait for Hetzner LB to provision; note IPv4
7. Create Cloudflare DNS record `guildhall.guildhouse.dev` → LB IPv4 (proxied, so Cloudflare handles TLS)
8. Verify; run any first-time ecto migration

No cluster infrastructure changes. No cert-manager Certificates. No Flux reconfiguration. No governance-stack dependency. Just the same Deployment-shaped pattern that Keycloak and Forgejo already use, applied to Guildhall.

Governance integration (CRD watchers, SPIFFE identity, Chronicle wiring, Accord enforcement) is explicitly follow-up work for after Guildhall is reachable and the Red Hat submission is in.
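
For step 5, the application Deployment might look roughly like the sketch below. The image reference, environment variable names, and port come from this document. The labels mirror keycloak's. The Secret key names and the TCP readiness probe are assumptions; the probe is used here only because no health route has been confirmed.

```yaml
# Sketch only: the Guildhall app Deployment in the keycloak shape.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: guildhall
  namespace: guildhall
  labels:
    app.kubernetes.io/name: guildhall
    app.kubernetes.io/part-of: guildhouse
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: guildhall
  template:
    metadata:
      labels:
        app.kubernetes.io/name: guildhall
        app.kubernetes.io/part-of: guildhouse
    spec:
      containers:
        - name: guildhall
          image: git.guildhouse.dev/tking/guildhall:v0.1.0
          ports:
            - name: http
              containerPort: 4000
          env:
            - name: PHX_SERVER
              value: "true"
            - name: PHX_HOST
              value: guildhall.guildhouse.dev
            - name: PORT
              value: "4000"
            - name: SECRET_KEY_BASE
              valueFrom:
                secretKeyRef:
                  name: guildhall-secrets
                  key: SECRET_KEY_BASE
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: guildhall-secrets
                  key: DATABASE_URL   # assumed key name
          readinessProbe:
            tcpSocket:
              port: http   # assumption: swap for httpGet once a health route exists
            initialDelaySeconds: 5
            periodSeconds: 10
```

If the registry requires authentication, an `imagePullSecrets` entry would also be needed; that is not shown because the registry's visibility is not established in this audit.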