DEPLOY-EXPLORATORY documents the cluster state that shaped deployment decisions (Keycloak as template, Hetzner LB + Cloudflare pattern, no Postgres operator so sibling-Deployment pattern). FORGEJO-REGISTRY-INVESTIGATION documents that the registry was already operational in Forgejo 9.0.3 (packages enabled by default) and the storage/credential path forward. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tyler J King <tking@guildhouse.dev>
# Guildhall deploy exploratory: Talos/Hetzner cluster state

**Date:** 2026-04-21
**Scope:** Read-only audit of the Talos/Hetzner Kubernetes cluster to inform Guildhall's initial deployment.
**Method:** `kubectl` against the cluster via `~/projects/substrate-project/guildhouse-talos-bootstrap/kubeconfig`. No mutations.
**Takeaway (synthesis at end):** Guildhall fits cleanly into the existing Keycloak/Forgejo deployment pattern: plain `Deployment` + `Deployment`-backed Postgres + Longhorn PVC + Hetzner LoadBalancer + Cloudflare-terminated TLS. No new infrastructure components required. The v1 substrate foundation (bascule / quartermaster / spire / chronicle / substrate-operator) is Flux-manifested but broken and not running; governance integration is explicitly follow-up work, not blocking.

---
## 1. Cluster basics

| Property | Value |
|---|---|
| Control plane endpoint | `https://178.104.100.159:6443` |
| kubectl client | v1.32.2 |
| kubectl server | v1.32.3 |
| Nodes | 5 (3 control-plane + 2 workers), all Ready |
| OS | Talos v1.9.5 |
| Kernel | 6.12.18-talos |
| Container runtime | containerd 2.0.3 |
| Cluster age | 10 days |

```
gsh-cp-01       control-plane   10.0.1.10
gsh-cp-02       control-plane   10.0.1.20
gsh-cp-03       control-plane   10.0.1.21
gsh-worker-01   worker          10.0.1.22
gsh-worker-02   worker          10.0.1.30
```

Matches the memory-carried description (Hetzner Talos cluster 2026-04-11: Talos 1.9.5, 5 nodes, 3 CP + 2 worker). No drift.

## 2. Namespace inventory

| Namespace | Purpose | Workloads |
|---|---|---|
| `cert-manager` | cert-manager (controller, cainjector, webhook) | 3 Deployments |
| `flux-system` | Flux GitOps | 4 Deployments (source / kustomize / helm / notification controllers) |
| `forgejo` | Forgejo git (self-hosted) | Deployment + Postgres Deployment + Runner Deployment (0/1, stuck) |
| `keycloak` | Keycloak OIDC IdP | Deployment + Postgres Deployment |
| `longhorn-system` | Longhorn CSI storage | 5 DaemonSets + 6 Deployments + UI |
| `kube-system`, `kube-public`, `kube-node-lease` | K8s system | - |
| `default` | Empty | - |

**Application workloads:** `forgejo`, `keycloak`. These are the reference patterns for Guildhall.

## 3. Ingress / gateway state

**No traditional ingress controller, no Gateway API:**

- `kubectl get ingressclasses` → No resources found
- `kubectl get gatewayclasses` → server doesn't have the resource type (Gateway API CRDs not installed)
- No HAProxy / nginx / traefik / istio / kong pods anywhere

**Traffic reaches services via `type: LoadBalancer` directly**, backed by the **Hetzner Cloud Controller Manager** (`hcloud-cloud-controller-manager` running in `kube-system`). Each LoadBalancer Service provisions a real Hetzner Cloud Load Balancer via annotations:

```yaml
annotations:
  load-balancer.hetzner.cloud/location: nbg1
  load-balancer.hetzner.cloud/name: keycloak-lb-v2
  load-balancer.hetzner.cloud/type: lb11
  load-balancer.hetzner.cloud/use-private-ip: "false"
```

Cilium Envoy DaemonSet pods exist on every node (`cilium-envoy`, used for L7 proxy features such as CiliumNetworkPolicy L7 filtering, not for Gateway API). `enable-l7-proxy: true` is set in the Cilium config.

**Existing LoadBalancer services and their public addresses:**

| Service | Namespace | IPv4 | IPv6 | Ports |
|---|---|---|---|---|
| `forgejo-http` | forgejo | `46.225.47.75` | `2a01:4f8:1c1f:65bb::1` | 80 → 30000, 22 → 30022 |
| `keycloak` | keycloak | `162.55.157.168` | `2a01:4f8:1c1d:1109::1` | 80 → 30080 |

Both expose HTTP on port 80 only (Forgejo additionally exposes SSH on port 22). No in-cluster TLS termination. TLS is terminated **upstream at Cloudflare** (`git.guildhouse.dev` and `auth.guildhouse.dev` resolve to Cloudflare IPs; Cloudflare proxies to the Hetzner LB IPs).

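As a sketch of how the same LoadBalancer pattern might look for Guildhall, the Service below reuses the four annotations shown above. The LB name `guildhall-lb`, the `guildhall` namespace, and the selector label are assumptions for illustration; the 80 → 4000 port mapping follows the deployment shape proposed in the Synthesis.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: guildhall
  namespace: guildhall            # assumed namespace
  annotations:
    load-balancer.hetzner.cloud/location: nbg1
    load-balancer.hetzner.cloud/name: guildhall-lb   # hypothetical LB name
    load-balancer.hetzner.cloud/type: lb11
    load-balancer.hetzner.cloud/use-private-ip: "false"
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: guildhall   # assumed label, mirroring keycloak's convention
  ports:
    - name: http
      protocol: TCP
      port: 80          # port exposed on the Hetzner LB
      targetPort: 4000  # Phoenix listener (PORT=4000)
```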
## 4. cert-manager / TLS

cert-manager is installed and healthy but **no Certificate resources exist anywhere on the cluster**.

| ClusterIssuer | Status |
|---|---|
| `letsencrypt-prod` | Ready |
| `letsencrypt-staging` | Ready |

Both ClusterIssuers are provisioned and ready to issue. They just aren't being used yet: TLS is currently handled by Cloudflare's Universal SSL at the edge, with plain HTTP between Cloudflare and the Hetzner LBs. (HTTP to the origin corresponds to Cloudflare's Flexible mode; Full mode would require HTTPS on the origin, which neither LB serves.)

**Implication for Guildhall:** can choose between the existing Cloudflare-termination pattern (simplest, matches forgejo/keycloak) or start using cert-manager (more work, cluster-integrated certs). The former is tonight's path; the latter is a hygiene follow-up.

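For the follow-up path, a cert-manager `Certificate` might look roughly like the sketch below. The resource and Secret names are hypothetical, and whether issuance would succeed depends on the solvers configured on the ClusterIssuers, which this document does not record. Something in the cluster would also have to terminate TLS with the resulting Secret, which nothing does today.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: guildhall-tls            # hypothetical name
  namespace: guildhall           # assumed namespace
spec:
  secretName: guildhall-tls      # Secret that cert-manager would populate
  dnsNames:
    - guildhall.guildhouse.dev
  issuerRef:
    kind: ClusterIssuer
    name: letsencrypt-staging    # switch to letsencrypt-prod once staging issuance works
```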
## 5. Database patterns

**No Postgres operator** installed. CRDs checked (all absent):
- CloudNativePG (`clusters.postgresql.cnpg.io`)
- Zalando postgres-operator (`postgresqls.acid.zalan.do`)
- Crunchy PGO (`postgresclusters.postgres-operator.crunchydata.com`)

**Existing pattern is plain Deployment + PVC:**

```
forgejo-postgres    Deployment   postgres:16   PVC: forgejo-db    10Gi   longhorn
keycloak-postgres   Deployment   postgres:16   PVC: keycloak-db   5Gi    longhorn
```

**Storage:** Longhorn 1.x, single StorageClass `longhorn` (default). All PVCs use it. The `longhorn-manager` DaemonSet has 5 replicas, one per node, which indicates storage is healthy across all nodes.

**Current PVCs:**

| PVC | Namespace | Size | StorageClass |
|---|---|---|---|
| forgejo-data | forgejo | 20Gi | longhorn |
| forgejo-db | forgejo | 10Gi | longhorn |
| runner-cache | forgejo | 5Gi | longhorn |
| keycloak-db | keycloak | 5Gi | longhorn |

## 6. Secrets management

**None of the common secret managers are installed:**

- External Secrets Operator: absent
- Sealed Secrets: absent
- SPIRE/SPIFFE: absent (Flux has a Kustomization for it but the `spire-system` namespace doesn't exist; see §10 Flux state)
- Vault: absent

**Secrets are plain Opaque `Secret` resources.** Examples:
- `forgejo/forgejo-secrets` (3 keys)
- `keycloak/keycloak-secrets` (2 keys)

Managed out-of-band (likely committed to the private Flux source repo or applied via kubectl during bootstrap). No rotation mechanism visible.

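Following the same plain-Opaque pattern, a Guildhall namespace and secret could be sketched as below. The names `guildhall` and `guildhall-secrets` and the two keys come from the Synthesis later in this document; the placeholder values are deliberately not real and must be replaced before applying.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: guildhall
---
apiVersion: v1
kind: Secret
metadata:
  name: guildhall-secrets
  namespace: guildhall
type: Opaque
stringData:                       # stringData avoids manual base64 encoding
  SECRET_KEY_BASE: "REPLACE_WITH_OUTPUT_OF_mix_phx.gen.secret"
  DB_PASSWORD: "REPLACE_WITH_GENERATED_PASSWORD"
```

A file like this should not be committed with real values; applying it once with `kubectl apply -f` and keeping the source out of Git matches the out-of-band handling described above.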
## 7. Existing workload patterns

**Reference: `keycloak` Deployment** (cleanest example, and the only Flux Kustomization that's `Ready`):

- **Image:** `quay.io/keycloak/keycloak:26.0` (public registry)
- **Env composition:** mix of literal `value:` (DB host, DB port, realm name) and `valueFrom.secretKeyRef` (admin password, DB password)
- **Labels:** `app.kubernetes.io/name=keycloak`, `app.kubernetes.io/part-of=guildhouse`
- **Config files:** ConfigMap-mounted realm import (`keycloak-realm-import`)
- **Resources:** requests/limits mostly left at defaults
- **Service:** `type: LoadBalancer` with Hetzner annotations, exposes port 80 only
- **TLS:** none in-cluster; Cloudflare upstream

**Reference: `forgejo-postgres` Deployment:**

- **Image:** `postgres:16` (public Docker Hub)
- **Env:** `POSTGRES_USER`, `POSTGRES_DB` literal; `POSTGRES_PASSWORD` from Secret
- **PGDATA:** `/var/lib/postgresql/data/pgdata` (standard subdirectory to avoid lost+found issues)
- **Volume:** PVC mounted at `/var/lib/postgresql/data`

**No existing Elixir/Phoenix deployment** to reference. Guildhall will be the first. The pattern will follow the Keycloak/Forgejo shape applied to Phoenix's runtime requirements.

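Applying the `forgejo-postgres` shape to Guildhall might look like the sketch below. The resource names follow the Synthesis; the database user/name `guildhall`, the labels, the 5Gi size, and the `Recreate` strategy (chosen here so two Postgres pods never share the same ReadWriteOnce volume during a rollout) are assumptions, not values read from the cluster.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: guildhall-db
  namespace: guildhall
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn
  resources:
    requests:
      storage: 5Gi                # assumed; matches keycloak-db sizing
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: guildhall-postgres
  namespace: guildhall
spec:
  replicas: 1
  strategy:
    type: Recreate                # assumption: avoid overlapping pods on one RWO volume
  selector:
    matchLabels:
      app.kubernetes.io/name: guildhall-postgres
  template:
    metadata:
      labels:
        app.kubernetes.io/name: guildhall-postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_USER
              value: guildhall    # assumed user name
            - name: POSTGRES_DB
              value: guildhall    # assumed database name
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: guildhall-secrets
                  key: DB_PASSWORD
            - name: PGDATA        # subdirectory, as in forgejo-postgres
              value: /var/lib/postgresql/data/pgdata
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: guildhall-db
```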
## 8. Guildhouse-specific components (v1 foundation)

**Currently running: none.**

- No pod matching `substrate`, `bascule`, `chronicle`, `quartermaster`, or `spire` across all namespaces. The v1 substrate foundation is absent from the cluster's running state.
- Flux has Kustomizations for `bascule`, `quartermaster`, `spire`, `automation`, `governance-talos`, `gitops-controller`, all **failing** on a dependency chain:

```
spire             → fails: namespace "spire-system" does not exist
quartermaster     → fails: dependency flux-system/spire is not ready
bascule           → fails: dependency flux-system/quartermaster is not ready
automation        → fails: dependency flux-system/quartermaster is not ready
gitops-controller → fails: dependency flux-system/quartermaster is not ready
governance-talos  → fails: dependency flux-system/cluster-infra is not ready
cluster-infra     → SUSPENDED + YAML decode error on 10-cilium-values.yaml
```

This chain needs to be unblocked for the v1 substrate foundation to reach the cluster, but **this is explicitly NOT Guildhall's blocker**. Guildhall is the standalone orchestration/presentation layer; it composes with substrate via CRD watches once substrate is running, but doesn't require substrate present to stand up and serve its web UI.

## 9. Networking specifics

**Cilium version:** `v1.16.5` (1.16 series; recent, but not the newer 1.17 line)

**Key Cilium config** (from `kube-system/cilium-config`):

| Flag | Value | Notes |
|---|---|---|
| `kube-proxy-replacement` | `true` | Cilium replaces kube-proxy (full eBPF mode) |
| `enable-ipv4` | `true` | IPv4 on pod network |
| `enable-ipv6` | `false` | IPv6 NOT enabled at pod network (LBs get Hetzner-assigned v6 externally) |
| `enable-l7-proxy` | `true` | Envoy DaemonSet for L7 filtering |
| `enable-hubble` | `true` | Hubble observability |
| `ipam` | `kubernetes` | Kubernetes host-scope IPAM (per-node PodCIDR), not cluster-pool |

**Not enabled / not present:**
- BGP control plane (`ciliumbgppeeringpolicies` CRD absent)
- L2 announcements (`ciliuml2announcementpolicies` CRD present but zero resources)
- LoadBalancerIPPool (CRD present but zero resources; Hetzner CCM handles LB IPs instead)
- Gateway API (`gatewayclasses` CRD absent)
- ClusterMesh (single-cluster)

**NetworkPolicies in place** (only 3, all in `flux-system`):
- `allow-egress`
- `allow-scraping`
- `allow-webhooks` (scoped to `app=notification-controller`)

**CiliumNetworkPolicies:** none. Workloads rely on default-allow between pods. Guildhall deployment can proceed without adding policies; adding them is hardening follow-up.

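As an illustration of what that hardening follow-up could look like, the standard `NetworkPolicy` below restricts ingress to a Guildhall Postgres pod so that only Guildhall application pods may connect on 5432. The namespace and both pod labels are assumptions; nothing like this exists on the cluster today.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: guildhall-postgres-ingress   # hypothetical name
  namespace: guildhall
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: guildhall-postgres   # assumed label on the DB pod
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: guildhall    # assumed label on the app pod
      ports:
        - protocol: TCP
          port: 5432
```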
## 10. Deployment automation

**GitOps: Flux** is the sole mechanism. Running components:
- `source-controller`, `kustomize-controller`, `helm-controller`, `notification-controller`: all 1/1 Ready

**Sources:** one `GitRepository` registered:

```
flux-system / guildhouse-deploy
  URL:    https://github.com/gh-tking/guildhouse-deploy-talos-mirror
  STATUS: Ready (artifact stored for main@169e077f)
```

**Kustomizations:** 9 total, summary:

| Name | Status |
|---|---|
| `keycloak` | ✅ Ready (applied revision `169e077f`) |
| `forgejo` | ❌ health check failed (forgejo-runner Deployment stuck InProgress) |
| `cluster-infra` | ❌ SUSPENDED + YAML decode error |
| `spire` | ❌ `spire-system` namespace not found |
| `quartermaster` | ❌ depends on spire (not ready) |
| `bascule` | ❌ depends on quartermaster (not ready) |
| `automation` | ❌ depends on quartermaster (not ready) |
| `gitops-controller` | ❌ depends on quartermaster (not ready) |
| `governance-talos` | ❌ depends on cluster-infra (not ready) |

**Key observation:** only `keycloak` flows through Flux successfully. Everything else is either suspended, blocked on missing upstream dependencies, or has a YAML error in the source repo.

**HelmRepositories and HelmReleases:** none.

**How changes land on the cluster:** currently via Flux against the GitHub-hosted source repo for the one working Kustomization (keycloak), otherwise via direct `kubectl apply` (given the broken Flux chain).

---
## Synthesis

### What Guildhall can leverage

- **Longhorn StorageClass:** works out of the box for Postgres PVC. 5Gi is ample for initial Guildhall DB (matches keycloak-db sizing).
- **Hetzner CCM LoadBalancer:** a LoadBalancer Service with `load-balancer.hetzner.cloud/*` annotations provisions a new Hetzner LB automatically. Cost is ~€5/mo for an `lb11` tier. Matches forgejo / keycloak exactly.
- **Cloudflare-at-the-edge TLS:** DNS at `guildhall.guildhouse.dev` points at the Hetzner LB IPv4, Cloudflare terminates TLS, origin is plain HTTP on port 80. Zero cert-manager work required for v1.
- **Keycloak as OIDC IdP:** already running at `auth.guildhouse.dev`. When Guildhall wires its OIDC config (currently commented out in `config/runtime.exs`), the endpoint is ready. Not blocking tonight.
- **cert-manager ClusterIssuers:** `letsencrypt-prod` and `letsencrypt-staging` are ready, available as an upgrade path from Cloudflare-edge TLS to cluster-terminated TLS if/when that hygiene pass happens.
- **Reference deployment pattern:** keycloak's Deployment shape (public image, env-from-secret, ConfigMap for data, Service type=LoadBalancer, Postgres sibling Deployment + PVC) maps directly to Guildhall. Apply the same template.
- **Flux GitOps pipeline exists** (if desired): a new Kustomization in `guildhouse-deploy-talos-mirror` for Guildhall would auto-deploy. However, the Flux state is currently messy (most Kustomizations are broken), so a direct `kubectl apply` path is cleaner for the v1 Guildhall deploy, with a follow-up Flux migration once the broader chain is healed.

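The Flux follow-up mentioned in the last bullet could eventually take roughly the form below. This is a sketch only: the `path` inside the repo is hypothetical, and the interval and timeout values are arbitrary choices rather than settings copied from the existing Kustomizations. The `sourceRef` points at the `guildhouse-deploy` GitRepository listed in §10.

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: guildhall
  namespace: flux-system
spec:
  interval: 10m                  # assumed reconcile interval
  sourceRef:
    kind: GitRepository
    name: guildhouse-deploy
  path: ./apps/guildhall         # hypothetical directory in the source repo
  prune: true                    # remove cluster objects deleted from Git
  wait: true                     # report Ready only once workloads are healthy
  timeout: 5m
```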
### What Guildhall needs that the cluster doesn't have yet

- **Guildhall container image.** Must be built locally via `mix release` + Dockerfile and pushed to a registry the cluster can pull from. Registry options:
  - `ghcr.io/gh-tking/guildhall:<tag>`: public GitHub Container Registry (requires publishing via GitHub Actions or a manual `docker push`)
  - Docker Hub under a personal account
  - **Forgejo container registry** at `git.guildhouse.dev/tking/guildhall:<tag>`: Forgejo 1.19+ supports OCI registry; this is the most consistent choice with the rest of the Guildhouse tooling
  - A private Hetzner-region ghcr mirror
- **Secrets:** `guildhall-secrets` Opaque Secret with at minimum `SECRET_KEY_BASE` (64-character Phoenix secret, generated with `mix phx.gen.secret`) and `DATABASE_URL` (or a discrete `DB_PASSWORD`, with the URL constructed at runtime).
- **Namespace:** `guildhall` (new).
- **DNS record:** `guildhall.guildhouse.dev` → Hetzner LB IPv4 (via Cloudflare). Can only be created once the LB is provisioned and its IP is known.

### Likely shape of the deployment

Based on the keycloak/forgejo pattern:

```
Namespace: guildhall
├── Deployment: guildhall-postgres (postgres:16, env POSTGRES_* from guildhall-secrets)
├── PVC: guildhall-db (longhorn, 5-10Gi)
├── Service: guildhall-postgres (ClusterIP, 5432)
├── Secret: guildhall-secrets (SECRET_KEY_BASE, DB_PASSWORD)
├── Deployment: guildhall (image from ghcr / forgejo registry / etc, envs DATABASE_URL + SECRET_KEY_BASE + PHX_HOST=guildhall.guildhouse.dev + PHX_SERVER=true + PORT=4000)
└── Service: guildhall (type=LoadBalancer, Hetzner annotations, port 80 → 4000)
```

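The `guildhall` Deployment entry in the tree above could be sketched as follows. The image tag is the one recommended later in this document; the database user/name `guildhall`, the labels, and the `/health` probe path are assumptions. `DATABASE_URL` is assembled with Kubernetes' `$(VAR)` expansion, which only works because `DB_PASSWORD` is declared earlier in the same `env` list, and it assumes the password contains no characters that need URL escaping.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: guildhall
  namespace: guildhall
  labels:
    app.kubernetes.io/name: guildhall
    app.kubernetes.io/part-of: guildhouse
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: guildhall
  template:
    metadata:
      labels:
        app.kubernetes.io/name: guildhall
        app.kubernetes.io/part-of: guildhouse
    spec:
      containers:
        - name: guildhall
          image: git.guildhouse.dev/tking/guildhall:v0.1.0
          ports:
            - name: http
              containerPort: 4000
          env:
            - name: PHX_SERVER
              value: "true"
            - name: PHX_HOST
              value: guildhall.guildhouse.dev
            - name: PORT
              value: "4000"
            - name: SECRET_KEY_BASE
              valueFrom:
                secretKeyRef:
                  name: guildhall-secrets
                  key: SECRET_KEY_BASE
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: guildhall-secrets
                  key: DB_PASSWORD
            # Must come after DB_PASSWORD so $(DB_PASSWORD) is expanded.
            - name: DATABASE_URL
              value: ecto://guildhall:$(DB_PASSWORD)@guildhall-postgres:5432/guildhall
          readinessProbe:
            httpGet:
              path: /health      # assumed custom endpoint; adjust to what the app serves
              port: http
```

If the Forgejo registry requires authentication for pulls, an `imagePullSecrets` entry would also be needed; this document does not establish whether that is the case.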
Release build discipline:
- `mix release` in Docker multi-stage build (Elixir 1.17.3 / OTP 27 builder stage, debian-slim runtime stage)
- Ecto migrations on container start (or via a Job, or a custom release step). Mix is not available inside a `mix release` artifact, so `mix ecto.migrate` itself needs a release-side equivalent, such as an `eval` call into a migration module
- `PHX_SERVER=true` to start the HTTP server (per `config/runtime.exs`)
- Health check endpoint (Phoenix default or custom `/health`)

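If the Job option is chosen for migrations, it might look like the sketch below. It assumes the release binary lives at `/app/bin/guildhall` in the image and that the application defines a `Guildhall.Release.migrate/0` function in the style of the Phoenix release guide; both are hypothetical here and would need to match the actual Dockerfile and codebase.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: guildhall-migrate         # hypothetical name
  namespace: guildhall
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: git.guildhouse.dev/tking/guildhall:v0.1.0
          # Assumed binary path and module; Mix is not present in a release image.
          command: ["/app/bin/guildhall", "eval", "Guildhall.Release.migrate()"]
          env:
            - name: SECRET_KEY_BASE
              valueFrom:
                secretKeyRef:
                  name: guildhall-secrets
                  key: SECRET_KEY_BASE
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: guildhall-secrets
                  key: DB_PASSWORD
            # Declared after DB_PASSWORD so $(DB_PASSWORD) is expanded.
            - name: DATABASE_URL
              value: ecto://guildhall:$(DB_PASSWORD)@guildhall-postgres:5432/guildhall
```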
### Surprises

**What's present that wasn't expected:**

- **Keycloak is already serving at `auth.guildhouse.dev`.** The OIDC substrate Guildhall will eventually integrate with is live. Zero setup needed for that dependency when the time comes.
- **cert-manager is installed but unused.** Suggests a deliberate deferral in favor of Cloudflare-edge TLS; the ClusterIssuers are staged and ready for when in-cluster TLS is adopted.
- **Cilium Envoy DaemonSet is running on every node** but with no Gateway API / CiliumEnvoyConfig / L7 policies currently in play. Present for future L7 use, not actively load-bearing yet.

**What's expected but absent:**

- **No HAProxy.** Previous K3s-era cluster used HAProxy as ingress; this cluster doesn't. Hetzner LBs took its role.
- **v1 substrate foundation is entirely absent from the running cluster.** bascule, substrate-operator, chronicle, quartermaster, SPIRE: none running. Flux manifests exist (in the `guildhouse-deploy-talos-mirror` repo) but are blocked on a dependency chain rooted at the missing `spire-system` namespace and a YAML decode error in `cluster-infra/10-cilium-values.yaml`. Unblocking this is real work that is NOT on the Red Hat path; governance integration is follow-up.
- **No existing Elixir/Phoenix deployment** to copy. Guildhall will be the first Phoenix app on this cluster.
- **Flux source is on GitHub (`guildhouse-deploy-talos-mirror`), not Forgejo.** Follows the same pattern as the substrate-project umbrella migration just completed: another GitHub→Forgejo item on the cleanup list, not blocking.

### Minimum path to Guildhall running at `guildhall.guildhouse.dev`

1. Dockerfile in `~/projects/substrate-project/guildhall/`: multi-stage with OTP 27, `mix release`
2. Build and push image to a registry (Forgejo container registry at `git.guildhouse.dev/tking/guildhall:v0.1.0` recommended for consistency)
3. Generate `SECRET_KEY_BASE` via `mix phx.gen.secret`
4. Create `guildhall` namespace; create `guildhall-secrets` Secret
5. Apply Deployment + Service + Postgres + PVC manifest (template from keycloak)
6. Wait for Hetzner LB to provision; note IPv4
7. Create Cloudflare DNS record `guildhall.guildhouse.dev` → LB IPv4 (proxied, so Cloudflare handles TLS)
8. Verify; run any first-time Ecto migration

No cluster infrastructure changes. No cert-manager Certificates. No Flux reconfiguration. No governance-stack dependency. Just the same Deployment-shaped pattern that Keycloak and Forgejo already use, applied to Guildhall.

Governance integration (CRD watchers, SPIFFE identity, Chronicle wiring, Accord enforcement) is explicitly follow-up work for after Guildhall is reachable and the Red Hat submission is in.