guildhall/DEPLOY-EXPLORATORY-2026-04-21.md
Tyler J King 115bd178a2 docs(deploy): capture exploratory reports for Talos + Forgejo registry
DEPLOY-EXPLORATORY documents the cluster state that shaped deployment
decisions (Keycloak as template, Hetzner LB + Cloudflare pattern, no
Postgres operator so sibling-Deployment pattern).

FORGEJO-REGISTRY-INVESTIGATION documents that the registry was already
operational in Forgejo 9.0.3 (packages enabled by default) and the
storage/credential path forward.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler J King <tking@guildhouse.dev>
2026-04-22 09:01:20 -04:00

Guildhall deploy exploratory — Talos/Hetzner cluster state

Date: 2026-04-21

Scope: Read-only audit of the Talos/Hetzner Kubernetes cluster to inform Guildhall's initial deployment.

Method: kubectl against the cluster via ~/projects/substrate-project/guildhouse-talos-bootstrap/kubeconfig. No mutations.

Takeaway (synthesis at end): Guildhall fits cleanly into the existing Keycloak/Forgejo deployment pattern: plain Deployment + Deployment-backed Postgres + Longhorn PVC + Hetzner LoadBalancer + Cloudflare-terminated TLS. No new infrastructure components required. The v1 substrate foundation (bascule / quartermaster / spire / chronicle / substrate-operator) is Flux-manifested but broken and not running; governance integration is explicitly follow-up work, not blocking.


1. Cluster basics

| Property | Value |
| --- | --- |
| Control plane endpoint | https://178.104.100.159:6443 |
| kubectl client | v1.32.2 |
| kubectl server | v1.32.3 |
| Nodes | 5 (3 control-plane + 2 workers), all Ready |
| OS | Talos v1.9.5 |
| Kernel | 6.12.18-talos |
| Container runtime | containerd 2.0.3 |
| Cluster age | 10 days |

gsh-cp-01        control-plane   10.0.1.10
gsh-cp-02        control-plane   10.0.1.20
gsh-cp-03        control-plane   10.0.1.21
gsh-worker-01    worker          10.0.1.22
gsh-worker-02    worker          10.0.1.30

Matches the memory-carried description (Hetzner Talos cluster 2026-04-11: Talos 1.9.5, 5 nodes, 3 CP + 2 worker). No drift.

2. Namespace inventory

| Namespace | Purpose | Workloads |
| --- | --- | --- |
| cert-manager | cert-manager | 3 controllers (cert-manager, cainjector, webhook), 3 Deployments |
| flux-system | Flux GitOps | 4 Deployments (source / kustomize / helm / notification controllers) |
| forgejo | Forgejo git (self-hosted) | Deployment + Postgres Deployment + Runner Deployment (0/1, stuck) |
| keycloak | Keycloak OIDC IdP | Deployment + Postgres Deployment |
| longhorn-system | Longhorn CSI storage | 5 DaemonSets + 6 Deployments + UI |
| kube-system, kube-public, kube-node-lease | K8s system | |
| default | Empty | |

Application workloads: forgejo, keycloak. These are the reference patterns for Guildhall.

3. Ingress / gateway state

No traditional ingress controller, no Gateway API:

  • kubectl get ingressclasses → No resources found
  • kubectl get gatewayclasses → server doesn't have the resource type (Gateway API CRDs not installed)
  • No HAProxy / nginx / traefik / istio / kong pods anywhere

Traffic reaches services via type: LoadBalancer directly, backed by the Hetzner Cloud Controller Manager (hcloud-cloud-controller-manager running in kube-system). Each LoadBalancer Service provisions a real Hetzner Cloud Load Balancer via annotations:

annotations:
  load-balancer.hetzner.cloud/location: nbg1
  load-balancer.hetzner.cloud/name: keycloak-lb-v2
  load-balancer.hetzner.cloud/type: lb11
  load-balancer.hetzner.cloud/use-private-ip: "false"

Cilium Envoy DaemonSet pods exist on every node (cilium-envoy for L7 proxy features — CiliumNetworkPolicy L7 filtering, not Gateway API). enable-l7-proxy: true is set in the Cilium config.

Existing LoadBalancer services and their public addresses:

| Service | Namespace | IPv4 | IPv6 | Ports |
| --- | --- | --- | --- | --- |
| forgejo-http | forgejo | 46.225.47.75 | 2a01:4f8:1c1f:65bb::1 | 80 → 30000, 22 → 30022 |
| keycloak | keycloak | 162.55.157.168 | 2a01:4f8:1c1d:1109::1 | 80 → 30080 |

Both expose port 80 only. No in-cluster TLS termination. TLS is terminated upstream at Cloudflare (git.guildhouse.dev and auth.guildhouse.dev resolve to Cloudflare IPs; Cloudflare proxies to the Hetzner LB IPs).

4. cert-manager / TLS

cert-manager is installed and healthy but no Certificate resources exist anywhere on the cluster.

| ClusterIssuer | Status |
| --- | --- |
| letsencrypt-prod | Ready |
| letsencrypt-staging | Ready |

Both ClusterIssuers are provisioned and ready to issue. They just aren't being used yet — TLS is currently handled via Cloudflare's Universal SSL / Full mode at the edge, with HTTP between Cloudflare and the Hetzner LBs.

Implication for Guildhall: can choose between the existing Cloudflare-termination pattern (simplest, matches forgejo/keycloak) or start using cert-manager (more work, cluster-integrated certs). The former is tonight's path; the latter is a hygiene follow-up.
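For the cert-manager follow-up, a Certificate along these lines would be the starting point. This is only a sketch: the resource name, Secret name, and namespace are assumptions, and only the ClusterIssuer names come from this audit. The audit did not record which ACME solver the ClusterIssuers use; because the cluster has no ingress controller or Gateway API, an HTTP-01 solver would likely have nothing to route the challenge through, so a DNS-01 solver is probably what this path implies.

```yaml
# Hypothetical Certificate for the cluster-terminated TLS upgrade path.
# guildhall-tls and the guildhall namespace are assumed names.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: guildhall-tls
  namespace: guildhall
spec:
  secretName: guildhall-tls          # Secret that cert-manager writes the key pair into
  dnsNames:
    - guildhall.guildhouse.dev
  issuerRef:
    name: letsencrypt-staging        # switch to letsencrypt-prod once issuance is proven
    kind: ClusterIssuer
```

Something in the cluster would still have to serve the resulting Secret on port 443, which is part of why this stays a follow-up.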

5. Database patterns

No Postgres operator installed. CRDs checked (all absent):

  • CloudNativePG (clusters.postgresql.cnpg.io)
  • Zalando postgres-operator (postgresqls.acid.zalan.do)
  • Crunchy PGO (postgresclusters.postgres-operator.crunchydata.com)

Existing pattern is plain Deployment + PVC:

forgejo-postgres    Deployment    postgres:16    PVC: forgejo-db  10Gi   longhorn
keycloak-postgres   Deployment    postgres:16    PVC: keycloak-db  5Gi   longhorn

Storage: Longhorn 1.x, single StorageClass longhorn (default). All PVCs use it. 5 DaemonSet replicas of longhorn-manager confirm storage is healthy across all nodes.

Current PVCs:

| PVC | Namespace | Size | StorageClass |
| --- | --- | --- | --- |
| forgejo-data | forgejo | 20Gi | longhorn |
| forgejo-db | forgejo | 10Gi | longhorn |
| runner-cache | forgejo | 5Gi | longhorn |
| keycloak-db | keycloak | 5Gi | longhorn |

6. Secrets management

None of the common secret managers are installed:

  • External Secrets Operator: absent
  • Sealed Secrets: absent
  • SPIRE/SPIFFE: absent (Flux has a Kustomization for it but the spire-system namespace doesn't exist — see §10 Flux state)
  • Vault: absent

Secrets are plain Opaque Secret resources. Examples:

  • forgejo/forgejo-secrets (3 keys)
  • keycloak/keycloak-secrets (2 keys)

Managed out-of-band (likely committed to the private Flux source repo or applied via kubectl during bootstrap). No rotation mechanism visible.

7. Existing workload patterns

Reference: keycloak Deployment (cleanest example — the only Flux Kustomization that's Ready):

  • Image: quay.io/keycloak/keycloak:26.0 (public registry)
  • Env composition: mix of literal value: (DB host, DB port, realm name) and valueFrom.secretKeyRef (admin password, DB password)
  • Labels: app.kubernetes.io/name=keycloak, app.kubernetes.io/part-of=guildhouse
  • Config files: ConfigMap-mounted realm import (keycloak-realm-import)
  • Resources: resource requests/limits not aggressively set (defaults mostly)
  • Service: type: LoadBalancer with Hetzner annotations, exposes port 80 only
  • TLS: none in-cluster; Cloudflare upstream

Reference: forgejo-postgres Deployment:

  • Image: postgres:16 (public Docker Hub)
  • Env: POSTGRES_USER, POSTGRES_DB literal; POSTGRES_PASSWORD from Secret
  • PGDATA: /var/lib/postgresql/data/pgdata (standard subdirectory to avoid lost+found issues)
  • Volume: PVC mounted at /var/lib/postgresql/data

No existing Elixir/Phoenix deployment to reference. Guildhall will be the first. The pattern will follow the Keycloak/Forgejo shape applied to Phoenix's runtime requirements.

8. Guildhouse-specific components (v1 foundation)

Currently running: none.

  • No pod matching substrate, bascule, chronicle, quartermaster, or spire across all namespaces. The v1 substrate foundation is absent from the cluster's running state.
  • Flux has Kustomizations for bascule, quartermaster, spire, automation, governance-talos, gitops-controller. All are failing, along two dependency chains (one rooted at spire, one at cluster-infra):
spire           → fails: namespace "spire-system" does not exist
quartermaster   → fails: dependency flux-system/spire is not ready
bascule         → fails: dependency flux-system/quartermaster is not ready
automation      → fails: dependency flux-system/quartermaster is not ready
gitops-controller → fails: dependency flux-system/quartermaster is not ready
governance-talos  → fails: dependency flux-system/cluster-infra is not ready
cluster-infra   → SUSPENDED + YAML decode error on 10-cilium-values.yaml

This chain needs to be unblocked for the v1 substrate foundation to reach the cluster, but this is explicitly NOT Guildhall's blocker. Guildhall is the standalone orchestration/presentation layer; it composes with substrate via CRD watches once substrate is running, but doesn't require substrate present to stand up and serve its web UI.

9. Networking specifics

Cilium version: v1.16.5 (1.16 series; recent, but not the newer 1.17 line)

Key Cilium config (from kube-system/cilium-config):

| Flag | Value | Notes |
| --- | --- | --- |
| kube-proxy-replacement | true | Cilium replaces kube-proxy (full eBPF mode) |
| enable-ipv4 | true | IPv4 on pod network |
| enable-ipv6 | false | IPv6 NOT enabled at pod network (LBs get Hetzner-assigned v6 externally) |
| enable-l7-proxy | true | Envoy DaemonSet for L7 filtering |
| enable-hubble | true | Hubble observability |
| ipam | kubernetes | Host-IPAM, not cluster-pool |

Not enabled / not present:

  • BGP control plane (ciliumbgppeeringpolicies CRD absent)
  • L2 announcements (ciliuml2announcementpolicies CRD present but zero resources)
  • LoadBalancerIPPool (CRD present but zero resources — Hetzner CCM handles LB IPs instead)
  • Gateway API (gatewayclasses CRD absent)
  • ClusterMesh (single-cluster)

NetworkPolicies in place (only 3, all in flux-system):

  • allow-egress
  • allow-scraping
  • allow-webhooks (scoped to app=notification-controller)

CiliumNetworkPolicies: none. Workloads rely on default-allow between pods. Guildhall deployment can proceed without adding policies; adding them is hardening follow-up.
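As an illustration of what that hardening follow-up could look like, the sketch below restricts ingress to a Guildhall Postgres pod so that only Guildhall application pods can reach port 5432. The namespace, policy name, and both label values are assumptions modeled on the app.kubernetes.io/name labelling seen on keycloak; nothing like this exists on the cluster today.

```yaml
# Hypothetical hardening policy: only pods labelled as the Guildhall app
# may open TCP 5432 on the Guildhall Postgres pod.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: guildhall-postgres-ingress
  namespace: guildhall
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: guildhall-postgres   # assumed label
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: guildhall    # assumed label
      ports:
        - protocol: TCP
          port: 5432
```

A standard NetworkPolicy is used here rather than a CiliumNetworkPolicy because plain L3/L4 filtering is enough for this case; Cilium enforces both kinds.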

10. Deployment automation

GitOps: Flux is the sole mechanism. Running components:

  • source-controller, kustomize-controller, helm-controller, notification-controller — all 1/1 Ready

Sources: one GitRepository registered:

flux-system / guildhouse-deploy
  URL:    https://github.com/gh-tking/guildhouse-deploy-talos-mirror
  STATUS: Ready (artifact stored for main@169e077f)

Kustomizations: 9 total, summary:

| Name | Status |
| --- | --- |
| keycloak | Ready (applied revision 169e077f) |
| forgejo | health check failed (forgejo-runner Deployment stuck InProgress) |
| cluster-infra | SUSPENDED + YAML decode error |
| spire | spire-system namespace not found |
| quartermaster | depends on spire (not ready) |
| bascule | depends on quartermaster (not ready) |
| automation | depends on quartermaster (not ready) |
| gitops-controller | depends on quartermaster (not ready) |
| governance-talos | depends on cluster-infra (not ready) |

Key observation: only keycloak flows through Flux successfully. Everything else is either suspended, blocked on missing upstream dependencies, or has a YAML error in the source repo.

HelmRepositories and HelmReleases: none.

How changes land on the cluster: via Flux against the GitHub-hosted source repo for the one working Kustomization (keycloak); for everything else, given the broken Flux chain, via direct kubectl apply.


Synthesis

What Guildhall can leverage

  • Longhorn StorageClass — works out of the box for Postgres PVC. 5Gi is ample for initial Guildhall DB (matches keycloak-db sizing).
  • Hetzner CCM LoadBalancer — a LoadBalancer Service with load-balancer.hetzner.cloud/* annotations provisions a new Hetzner LB automatically. Cost is ~€5/mo for an lb11 tier. Matches forgejo / keycloak exactly.
  • Cloudflare-at-the-edge TLS — DNS at guildhall.guildhouse.dev points at the Hetzner LB IPv4, Cloudflare terminates TLS, origin is plain HTTP on port 80. Zero cert-manager work required for v1.
  • Keycloak as OIDC IdP — already running at auth.guildhouse.dev. When Guildhall wires its OIDC config (currently commented out in config/runtime.exs), the endpoint is ready. Not blocking tonight.
  • cert-manager ClusterIssuers: letsencrypt-prod and letsencrypt-staging are ready, available as an upgrade path from Cloudflare-edge TLS to cluster-terminated TLS if/when that hygiene pass happens.
  • Reference deployment pattern — keycloak's Deployment shape (public image, env-from-secret, ConfigMap for data, Service type=LoadBalancer, Postgres sibling Deployment + PVC) maps directly to Guildhall. Apply the same template.
  • Flux GitOps pipeline exists (if desired) — a new Kustomization in guildhouse-deploy-talos-mirror for Guildhall would auto-deploy. BUT the Flux state is currently messy — most Kustomizations are broken — so a direct kubectl apply path is cleaner for the v1 Guildhall deploy, with a follow-up Flux migration once the broader chain is healed.
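For the eventual Flux migration mentioned in the last item, a Guildhall Kustomization might look roughly like this. The path inside the repo, the interval, and the API version are assumptions (the audit did not record the Flux version; older installs use v1beta2 rather than v1). Only the GitRepository name and namespace come from this report.

```yaml
# Hypothetical Flux Kustomization for Guildhall.
# Deliberately has no dependsOn, so it is not chained to the broken
# spire / quartermaster / cluster-infra Kustomizations.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: guildhall
  namespace: flux-system
spec:
  interval: 10m
  path: ./apps/guildhall             # assumed path within the source repo
  prune: true
  wait: true                         # report Ready only once workloads are healthy
  targetNamespace: guildhall
  sourceRef:
    kind: GitRepository
    name: guildhouse-deploy
```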

What Guildhall needs that the cluster doesn't have yet

  • Guildhall container image. Must be built locally via mix release + Dockerfile and pushed to a registry the cluster can pull from. Registry options:
    • ghcr.io/gh-tking/guildhall:<tag>, the public GitHub Container Registry (requires publishing via GitHub Actions or a manual docker push)
    • Docker Hub under a personal account
    • Forgejo container registry at git.guildhouse.dev/tking/guildhall:<tag> — Forgejo 1.19+ supports OCI registry; this is the most consistent choice with the rest of the Guildhouse tooling
    • A private Hetzner-region ghcr mirror
  • Secrets: guildhall-secrets Opaque Secret with at minimum SECRET_KEY_BASE (the Phoenix secret key base; mix phx.gen.secret generates a 64-character value by default) and DATABASE_URL (or a discrete DB_PASSWORD, with the URL constructed at runtime).
  • Namespace: guildhall (new).
  • DNS record: guildhall.guildhouse.dev → Hetzner LB IPv4 (via Cloudflare). Can be created after LB is provisioned, once the LB IP is known.
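The namespace and Secret from the list above can be sketched as follows. The key names follow the discrete-DB_PASSWORD variant; the placeholder values are deliberately not real secrets and must be replaced before applying. Like the existing forgejo and keycloak Secrets, this would be managed out-of-band rather than committed anywhere public.

```yaml
# Sketch: Guildhall namespace plus its Opaque Secret.
apiVersion: v1
kind: Namespace
metadata:
  name: guildhall
---
apiVersion: v1
kind: Secret
metadata:
  name: guildhall-secrets
  namespace: guildhall
type: Opaque
stringData:                          # stringData takes plain text; the API server base64-encodes it
  SECRET_KEY_BASE: "REPLACE_WITH_OUTPUT_OF_mix_phx.gen.secret"
  DB_PASSWORD: "REPLACE_WITH_GENERATED_PASSWORD"
```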

Likely shape of the deployment

Based on the keycloak/forgejo pattern:

Namespace:  guildhall
├── Deployment: guildhall-postgres (postgres:16, env POSTGRES_* from guildhall-secrets)
├── PVC:        guildhall-db (longhorn, 5-10Gi)
├── Service:    guildhall-postgres (ClusterIP, 5432)
├── Secret:     guildhall-secrets (SECRET_KEY_BASE, DB_PASSWORD)
├── Deployment: guildhall (image from ghcr / forgejo registry / etc, envs DATABASE_URL + SECRET_KEY_BASE + PHX_HOST=guildhall.guildhouse.dev + PHX_SERVER=true + PORT=4000)
└── Service:    guildhall (type=LoadBalancer, Hetzner annotations, port 80 → 4000)
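The application half of that tree might be written as below. This is a sketch under several assumptions: the image tag is the one suggested in the minimum path, the load balancer name guildhall-lb and the labels are invented, the database user and database name are placeholders, and the readiness probe is a plain TCP check because no health endpoint has been confirmed. DATABASE_URL is assembled from DB_PASSWORD using Kubernetes' $(VAR) expansion, which only works for variables defined earlier in the same env list.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: guildhall
  namespace: guildhall
  labels:
    app.kubernetes.io/name: guildhall
    app.kubernetes.io/part-of: guildhouse
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: guildhall
  template:
    metadata:
      labels:
        app.kubernetes.io/name: guildhall
    spec:
      containers:
        - name: guildhall
          image: git.guildhouse.dev/tking/guildhall:v0.1.0   # assumed registry path and tag
          ports:
            - containerPort: 4000
          env:
            - name: PHX_SERVER
              value: "true"
            - name: PHX_HOST
              value: guildhall.guildhouse.dev
            - name: PORT
              value: "4000"
            - name: SECRET_KEY_BASE
              valueFrom:
                secretKeyRef:
                  name: guildhall-secrets
                  key: SECRET_KEY_BASE
            - name: DB_PASSWORD      # must precede DATABASE_URL for $(DB_PASSWORD) to expand
              valueFrom:
                secretKeyRef:
                  name: guildhall-secrets
                  key: DB_PASSWORD
            - name: DATABASE_URL     # user and database names are placeholders
              value: ecto://guildhall:$(DB_PASSWORD)@guildhall-postgres:5432/guildhall
          readinessProbe:
            tcpSocket:
              port: 4000             # TCP check only; no HTTP health path assumed
            initialDelaySeconds: 5
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: guildhall
  namespace: guildhall
  annotations:
    load-balancer.hetzner.cloud/location: nbg1
    load-balancer.hetzner.cloud/name: guildhall-lb           # assumed LB name
    load-balancer.hetzner.cloud/type: lb11
    load-balancer.hetzner.cloud/use-private-ip: "false"
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: guildhall
  ports:
    - name: http
      port: 80
      targetPort: 4000
```

If the generated DB_PASSWORD contains characters that are special in URLs, it would need percent-encoding before being embedded this way; an alphanumeric password sidesteps that. A private registry would additionally need an imagePullSecrets entry.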

Release build discipline:

  • mix release in Docker multi-stage build (Elixir 1.17.3 / OTP 27 builder stage, debian-slim runtime stage)
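  • Ecto migrations on container start (or in a Job, or via a custom release step). Mix is not included in a release, so mix ecto.migrate itself cannot run there; the usual substitute is a release command such as bin/guildhall eval calling a migration function.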
  • PHX_SERVER=true to start the HTTP server (per config/runtime.exs)
  • Health check endpoint (Phoenix default or custom /health)
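If migrations go the Job route, something like the following could work. It assumes the release binary lives at /app/bin/guildhall and that the app defines a Guildhall.Release.migrate/0 function (the module that mix phx.gen.release generates by convention); neither has been verified against the Guildhall codebase, and the image tag and database names are placeholders.

```yaml
# Hypothetical one-shot migration Job for Guildhall.
apiVersion: batch/v1
kind: Job
metadata:
  name: guildhall-migrate
  namespace: guildhall
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: git.guildhouse.dev/tking/guildhall:v0.1.0   # assumed image
          # Assumed release path and module; adjust to the actual release layout.
          command: ["/app/bin/guildhall", "eval", "Guildhall.Release.migrate()"]
          env:
            - name: SECRET_KEY_BASE  # included in case config/runtime.exs requires it at boot
              valueFrom:
                secretKeyRef:
                  name: guildhall-secrets
                  key: SECRET_KEY_BASE
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: guildhall-secrets
                  key: DB_PASSWORD
            - name: DATABASE_URL
              value: ecto://guildhall:$(DB_PASSWORD)@guildhall-postgres:5432/guildhall
```

A Job keeps migrations out of the web pod's startup path, at the cost of one more manifest to apply before each release that changes the schema.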

Surprises

What's present that wasn't expected:

  • Keycloak is already serving at auth.guildhouse.dev. The OIDC substrate Guildhall will eventually integrate with is live. Zero setup needed for that dependency when the time comes.
  • cert-manager is installed but unused. Suggests a deliberate deferral in favor of Cloudflare-edge TLS; the ClusterIssuers are staged and ready for when in-cluster TLS is adopted.
  • Cilium Envoy DaemonSet is running on every node but with no Gateway API / CiliumEnvoyConfig / L7 policies currently in play. Present for future L7 use, not actively load-bearing yet.

What's expected but absent:

  • No HAProxy. Previous K3s-era cluster used HAProxy as ingress; this cluster doesn't. Hetzner LBs took its role.
  • v1 substrate foundation is entirely absent from the running cluster. bascule, substrate-operator, chronicle, quartermaster, SPIRE — none running. Flux manifests exist (in the guildhouse-deploy-talos-mirror repo) but are blocked on a dependency chain rooted at missing spire-system namespace and a YAML decode error in cluster-infra/10-cilium-values.yaml. Unblocking this is real work that is NOT on the Red Hat path — governance integration is follow-up.
  • No existing Elixir/Phoenix deployment to copy. Guildhall will be the first Phoenix app on this cluster.
  • Flux source is on GitHub (guildhouse-deploy-talos-mirror), not Forgejo. Follows the same pattern as the substrate-project umbrella migration just completed — another GitHub→Forgejo item on the cleanup list, not blocking.

Minimum path to Guildhall running at guildhall.guildhouse.dev

  1. Dockerfile in ~/projects/substrate-project/guildhall/ — multi-stage with OTP 27, mix release
  2. Build and push image to a registry (Forgejo container registry at git.guildhouse.dev/tking/guildhall:v0.1.0 recommended for consistency)
  3. Generate SECRET_KEY_BASE via mix phx.gen.secret
  4. Create guildhall namespace; create guildhall-secrets Secret
  5. Apply Deployment + Service + Postgres + PVC manifest (template from keycloak)
  6. Wait for Hetzner LB to provision; note IPv4
  7. Create Cloudflare DNS record guildhall.guildhouse.dev → LB IPv4 (proxied, so Cloudflare handles TLS)
  8. Verify; run any first-time ecto migration

No cluster infrastructure changes. No cert-manager Certificates. No Flux reconfiguration. No governance-stack dependency. Just the same Deployment-shaped pattern that Keycloak and Forgejo already use, applied to Guildhall.

Governance integration (CRD watchers, SPIFFE identity, Chronicle wiring, Accord enforcement) is explicitly follow-up work for after Guildhall is reachable and the Red Hat submission is in.