From 115bd178a21e63f16d7b53e6bebd9c53ccb481c789220100c8c17cc221761afc Mon Sep 17 00:00:00 2001
From: Tyler J King
Date: Wed, 22 Apr 2026 09:01:20 -0400
Subject: [PATCH] docs(deploy): capture exploratory reports for Talos + Forgejo registry

DEPLOY-EXPLORATORY documents the cluster state that shaped deployment decisions (Keycloak as template, Hetzner LB + Cloudflare pattern, no Postgres operator so sibling-Deployment pattern).

FORGEJO-REGISTRY-INVESTIGATION documents that the registry was already operational in Forgejo 9.0.3 (packages enabled by default) and the storage/credential path forward.

Co-Authored-By: Claude Opus 4.7 (1M context)
Signed-off-by: Tyler J King
---
 DEPLOY-EXPLORATORY-2026-04-21.md             | 304 +++++++++++++++++++
 FORGEJO-REGISTRY-INVESTIGATION-2026-04-21.md | 222 ++++++++++++++
 2 files changed, 526 insertions(+)
 create mode 100644 DEPLOY-EXPLORATORY-2026-04-21.md
 create mode 100644 FORGEJO-REGISTRY-INVESTIGATION-2026-04-21.md

diff --git a/DEPLOY-EXPLORATORY-2026-04-21.md b/DEPLOY-EXPLORATORY-2026-04-21.md
new file mode 100644
index 0000000..207036e
--- /dev/null
+++ b/DEPLOY-EXPLORATORY-2026-04-21.md
@@ -0,0 +1,304 @@
+# Guildhall deploy exploratory: Talos/Hetzner cluster state
+
+**Date:** 2026-04-21
+**Scope:** Read-only audit of the Talos/Hetzner Kubernetes cluster to inform Guildhall's initial deployment.
+**Method:** `kubectl` against the cluster via `~/projects/substrate-project/guildhouse-talos-bootstrap/kubeconfig`. No mutations.
+**Takeaway (synthesis at end):** Guildhall fits cleanly into the existing Keycloak/Forgejo deployment pattern: plain `Deployment` + `Deployment`-backed Postgres + Longhorn PVC + Hetzner LoadBalancer + Cloudflare-terminated TLS. No new infrastructure components required. The v1 substrate foundation (bascule / quartermaster / spire / chronicle / substrate-operator) is Flux-manifested but broken and not running; governance integration is explicitly follow-up work, not blocking.
+
+---
+
+## 1. Cluster basics
+
+| Property | Value |
+|---|---|
+| Control plane endpoint | `https://178.104.100.159:6443` |
+| kubectl client | v1.32.2 |
+| kubectl server | v1.32.3 |
+| Nodes | 5 (3 control-plane + 2 workers), all Ready |
+| OS | Talos v1.9.5 |
+| Kernel | 6.12.18-talos |
+| Container runtime | containerd 2.0.3 |
+| Cluster age | 10 days |
+
+```
+gsh-cp-01      control-plane  10.0.1.10
+gsh-cp-02      control-plane  10.0.1.20
+gsh-cp-03      control-plane  10.0.1.21
+gsh-worker-01  worker         10.0.1.22
+gsh-worker-02  worker         10.0.1.30
+```
+
+Matches the previously recorded description of the cluster (Hetzner Talos cluster, 2026-04-11: Talos 1.9.5, 5 nodes, 3 CP + 2 workers). No drift.
+
+## 2. Namespace inventory
+
+| Namespace | Purpose | Workloads |
+|---|---|---|
+| `cert-manager` | cert-manager (3 controllers: cert-manager, cainjector, webhook) | 3 Deployments |
+| `flux-system` | Flux GitOps | 4 Deployments (source / kustomize / helm / notification controllers) |
+| `forgejo` | Forgejo git (self-hosted) | Deployment + Postgres Deployment + Runner Deployment (0/1 ready, stuck) |
+| `keycloak` | Keycloak OIDC IdP | Deployment + Postgres Deployment |
+| `longhorn-system` | Longhorn CSI storage | 5 DaemonSets + 6 Deployments + UI |
+| `kube-system`, `kube-public`, `kube-node-lease` | K8s system | - |
+| `default` | Empty | - |
+
+**Application workloads:** `forgejo`, `keycloak`. These are the reference patterns for Guildhall.
+
+## 3. Ingress / gateway state
+
+**No traditional ingress controller, no Gateway API:**
+
+- `kubectl get ingressclasses` → No resources found
+- `kubectl get gatewayclasses` → server doesn't have the resource type (Gateway API CRDs not installed)
+- No HAProxy / nginx / traefik / istio / kong pods anywhere
+
+**Traffic reaches services via `type: LoadBalancer` directly**, backed by the **Hetzner Cloud Controller Manager** (`hcloud-cloud-controller-manager` running in `kube-system`). For each LoadBalancer Service, the CCM provisions a real Hetzner Cloud Load Balancer, configured through annotations (keycloak's Service shown):
+
+```yaml
+annotations:
+  load-balancer.hetzner.cloud/location: nbg1
+  load-balancer.hetzner.cloud/name: keycloak-lb-v2
+  load-balancer.hetzner.cloud/type: lb11
+  load-balancer.hetzner.cloud/use-private-ip: "false"
+```
+
+Cilium Envoy DaemonSet pods (`cilium-envoy`) run on every node. They serve Cilium's L7 proxy features (L7 filtering in CiliumNetworkPolicy), not Gateway API. `enable-l7-proxy: true` is set in the Cilium config.
+
+**Existing LoadBalancer services and their public addresses:**
+
+| Service | Namespace | IPv4 | IPv6 | Ports (port → NodePort) |
+|---|---|---|---|---|
+| `forgejo-http` | forgejo | `46.225.47.75` | `2a01:4f8:1c1f:65bb::1` | 80 → 30000, 22 → 30022 |
+| `keycloak` | keycloak | `162.55.157.168` | `2a01:4f8:1c1d:1109::1` | 80 → 30080 |
+
+Both serve HTTP on port 80 only (Forgejo additionally exposes SSH on 22); neither listens on 443, and there is no in-cluster TLS termination. TLS is terminated **upstream at Cloudflare** (`git.guildhouse.dev` and `auth.guildhouse.dev` resolve to Cloudflare IPs; Cloudflare proxies to the Hetzner LB IPs).
+
+## 4. cert-manager / TLS
+
+cert-manager is installed and healthy, but **no Certificate resources exist anywhere on the cluster**.
+
+| ClusterIssuer | Status |
+|---|---|
+| `letsencrypt-prod` | Ready |
+| `letsencrypt-staging` | Ready |
+
+Both ClusterIssuers are provisioned and ready to issue. They just aren't being used yet: TLS is currently handled at the Cloudflare edge (Universal SSL), with plain HTTP between Cloudflare and the Hetzner LBs. An HTTP-only origin implies Cloudflare's Flexible SSL mode; Full mode would require the origin to serve HTTPS. The Cloudflare-to-origin hop is therefore unencrypted today.
+
+**Implication for Guildhall:** it can either follow the existing Cloudflare-termination pattern (simplest, matches forgejo/keycloak) or start using cert-manager (more work, cluster-integrated certs). The former is tonight's path; the latter is a hygiene follow-up.
+
+## 5. Database patterns
+
+**No Postgres operator** installed. CRDs checked (all absent):
+- CloudNativePG (`clusters.postgresql.cnpg.io`)
+- Zalando postgres-operator (`postgresqls.acid.zalan.do`)
+- Crunchy PGO (`postgresclusters.postgres-operator.crunchydata.com`)
+
+**Existing pattern is plain Deployment + PVC:**
+
+```
+forgejo-postgres   Deployment  postgres:16  PVC: forgejo-db   10Gi  longhorn
+keycloak-postgres  Deployment  postgres:16  PVC: keycloak-db   5Gi  longhorn
+```
+
+**Storage:** Longhorn 1.x, single StorageClass `longhorn` (default). All PVCs use it. The longhorn-manager DaemonSet runs 5 pods (one per node), so the storage layer is up on every node.
+
+**Current PVCs:**
+
+| PVC | Namespace | Size | StorageClass |
+|---|---|---|---|
+| forgejo-data | forgejo | 20Gi | longhorn |
+| forgejo-db | forgejo | 10Gi | longhorn |
+| runner-cache | forgejo | 5Gi | longhorn |
+| keycloak-db | keycloak | 5Gi | longhorn |
+
+## 6. Secrets management
+
+**None of the common secret managers are installed:**
+
+- External Secrets Operator: absent
+- Sealed Secrets: absent
+- SPIRE/SPIFFE: absent (Flux has a Kustomization for it, but the `spire-system` namespace doesn't exist; see the Flux state in §8 and §10)
+- Vault: absent
+
+**Secrets are plain Opaque `Secret` resources.** Examples:
+- `forgejo/forgejo-secrets` (3 keys)
+- `keycloak/keycloak-secrets` (2 keys)
+
+They are managed without any secret manager: likely committed to the private Flux source repo, or applied via kubectl during bootstrap. No rotation mechanism is visible.
+
+## 7. Existing workload patterns
+
+**Reference: `keycloak` Deployment** (cleanest example; the only Flux Kustomization that's `Ready`):
+
+- **Image:** `quay.io/keycloak/keycloak:26.0` (public registry)
+- **Env composition:** mix of literal `value:` (DB host, DB port, realm name) and `valueFrom.secretKeyRef` (admin password, DB password)
+- **Labels:** `app.kubernetes.io/name=keycloak`, `app.kubernetes.io/part-of=guildhouse`
+- **Config files:** ConfigMap-mounted realm import (`keycloak-realm-import`)
+- **Resources:** requests/limits mostly left unset
+- **Service:** `type: LoadBalancer` with Hetzner annotations, exposes port 80 only
+- **TLS:** none in-cluster; Cloudflare upstream
+
+**Reference: `forgejo-postgres` Deployment:**
+
+- **Image:** `postgres:16` (public Docker Hub)
+- **Env:** `POSTGRES_USER`, `POSTGRES_DB` literal; `POSTGRES_PASSWORD` from Secret
+- **PGDATA:** `/var/lib/postgresql/data/pgdata` (a subdirectory of the mount, so `initdb` doesn't fail on the volume's `lost+found` directory)
+- **Volume:** PVC mounted at `/var/lib/postgresql/data`
+
+**No existing Elixir/Phoenix deployment** to reference. Guildhall will be the first. It will follow the Keycloak/Forgejo shape, adapted to Phoenix's runtime requirements.
+
+## 8. Guildhouse-specific components (v1 foundation)
+
+**Currently running: none.**
+
+- No pod matching `substrate`, `bascule`, `chronicle`, `quartermaster`, or `spire` across all namespaces. The v1 substrate foundation is absent from the cluster's running state.
+- Flux has Kustomizations for `bascule`, `quartermaster`, `spire`, `automation`, `governance-talos`, and `gitops-controller`, all **failing** on a dependency chain:
+
+```
+spire              → fails: namespace "spire-system" does not exist
+quartermaster      → fails: dependency flux-system/spire is not ready
+bascule            → fails: dependency flux-system/quartermaster is not ready
+automation         → fails: dependency flux-system/quartermaster is not ready
+gitops-controller  → fails: dependency flux-system/quartermaster is not ready
+governance-talos   → fails: dependency flux-system/cluster-infra is not ready
+cluster-infra      → SUSPENDED + YAML decode error on 10-cilium-values.yaml
+```
+
+This chain needs to be unblocked for the v1 substrate foundation to reach the cluster, but **this is explicitly NOT Guildhall's blocker**. Guildhall is the standalone orchestration/presentation layer; it composes with substrate via CRD watches once substrate is running, but doesn't need substrate present to stand up and serve its web UI.
+
+## 9. Networking specifics
+
+**Cilium version:** `v1.16.5` (1.16 series)
+
+**Key Cilium config** (from `kube-system/cilium-config`):
+
+| Flag | Value | Notes |
+|---|---|---|
+| `kube-proxy-replacement` | `true` | Cilium replaces kube-proxy (full eBPF mode) |
+| `enable-ipv4` | `true` | IPv4 on pod network |
+| `enable-ipv6` | `false` | IPv6 NOT enabled on the pod network (the LBs get Hetzner-assigned IPv6 externally) |
+| `enable-l7-proxy` | `true` | Envoy DaemonSet for L7 filtering |
+| `enable-hubble` | `true` | Hubble observability |
+| `ipam` | `kubernetes` | Kubernetes host-scope IPAM (pod CIDRs taken from each Node's `spec.podCIDR`), not cluster-pool |
+
+**Not enabled / not present:**
+- BGP control plane (`ciliumbgppeeringpolicies` CRD absent)
+- L2 announcements (`ciliuml2announcementpolicies` CRD present but zero resources)
+- LoadBalancerIPPool (CRD present but zero resources; Hetzner CCM handles LB IPs instead)
+- Gateway API (`gatewayclasses` CRD absent)
+- ClusterMesh (single-cluster)
+
+**NetworkPolicies in place** (only 3, all in `flux-system`):
+- `allow-egress`
+- `allow-scraping`
+- `allow-webhooks` (scoped to `app=notification-controller`)
+
+**CiliumNetworkPolicies:** none. Workloads rely on default-allow between pods. Guildhall can deploy without adding policies; adding them is a hardening follow-up.
+
+## 10. Deployment automation
+
+**GitOps: Flux** is the only GitOps tooling installed. Running components:
+- `source-controller`, `kustomize-controller`, `helm-controller`, `notification-controller`, all 1/1 Ready
+
+**Sources:** one `GitRepository` registered:
+
+```
+flux-system / guildhouse-deploy
+  URL:    https://github.com/gh-tking/guildhouse-deploy-talos-mirror
+  STATUS: Ready (artifact stored for main@169e077f)
+```
+
+**Kustomizations:** 9 total, summary:
+
+| Name | Status |
+|---|---|
+| `keycloak` | ✅ Ready (applied revision `169e077f`) |
+| `forgejo` | ❌ health check failed (forgejo-runner Deployment stuck InProgress) |
+| `cluster-infra` | ❌ SUSPENDED + YAML decode error |
+| `spire` | ❌ `spire-system` namespace not found |
+| `quartermaster` | ❌ depends on spire (not ready) |
+| `bascule` | ❌ depends on quartermaster (not ready) |
+| `automation` | ❌ depends on quartermaster (not ready) |
+| `gitops-controller` | ❌ depends on quartermaster (not ready) |
+| `governance-talos` | ❌ depends on cluster-infra (not ready) |
+
+**Key observation:** only `keycloak` flows through Flux successfully. Everything else is suspended, blocked on missing upstream dependencies, failing a health check (`forgejo`), or hitting a YAML error in the source repo.
+
+**HelmRepositories and HelmReleases:** none.
+
+**How changes reach the cluster:** via Flux from the GitHub-hosted source repo for the one working Kustomization (keycloak); otherwise via direct `kubectl apply`, given the broken Flux chain.
+
+---
+
+## Synthesis
+
+### What Guildhall can leverage
+
+- **Longhorn StorageClass:** works out of the box for the Postgres PVC. 5Gi is ample for the initial Guildhall DB (matches keycloak-db sizing).
+- **Hetzner CCM LoadBalancer:** a LoadBalancer Service with `load-balancer.hetzner.cloud/*` annotations provisions a new Hetzner LB automatically. Cost is ~€5/mo at the `lb11` tier. Matches forgejo / keycloak exactly.
+- **Cloudflare-at-the-edge TLS:** DNS for `guildhall.guildhouse.dev` points at the Hetzner LB IPv4, Cloudflare terminates TLS, and the origin is plain HTTP on port 80. Zero cert-manager work required for v1.
+- **Keycloak as OIDC IdP:** already running at `auth.guildhouse.dev`. When Guildhall wires up its OIDC config (currently commented out in `config/runtime.exs`), the endpoint is ready. Not blocking tonight.
+- **cert-manager ClusterIssuers:** `letsencrypt-prod` and `letsencrypt-staging` are ready, available as an upgrade path from Cloudflare-edge TLS to cluster-terminated TLS if/when that hygiene pass happens.
+- **Reference deployment pattern:** keycloak's Deployment shape (public image, env-from-secret, ConfigMap for data, Service type=LoadBalancer, Postgres sibling Deployment + PVC) maps directly to Guildhall. Apply the same template.
+- **Flux GitOps pipeline exists** (if desired): a Guildhall path in `guildhouse-deploy-talos-mirror` plus a Flux Kustomization pointing at it would auto-deploy. BUT the Flux state is currently messy (most Kustomizations are broken), so a direct `kubectl apply` path is cleaner for the v1 Guildhall deploy, with a follow-up Flux migration once the broader chain is healed.
+
+### What Guildhall needs that the cluster doesn't have yet
+
+- **Guildhall container image.** Must be built locally (`mix release` inside a Dockerfile) and pushed to a registry the cluster can pull from. Registry options:
+  - `ghcr.io/gh-tking/guildhall:<tag>`: GitHub Container Registry (pushed from GitHub Actions or with a manual `docker push`)
+  - Docker Hub under a personal account
+  - **Forgejo container registry** at `git.guildhouse.dev/tking/guildhall:<tag>`: Forgejo inherited an OCI container registry from Gitea (added in Gitea 1.17, before the fork); this is the choice most consistent with the rest of the Guildhouse tooling
+  - A private registry mirror hosted in the Hetzner region
+- **Secrets:** a `guildhall-secrets` Opaque Secret with at minimum `SECRET_KEY_BASE` (Phoenix's cookie/session signing secret, at least 64 bytes; `mix phx.gen.secret` produces a suitable 64-character value) and `DATABASE_URL` (or a discrete `DB_PASSWORD`, with the URL assembled at runtime).
+- **Namespace:** `guildhall` (new).
+- **DNS record:** `guildhall.guildhouse.dev` → Hetzner LB IPv4 (via Cloudflare). It can only be created once the LB is provisioned and its IP is known.
+
+### Likely shape of the deployment
+
+Based on the keycloak/forgejo pattern:
+
+```
+Namespace: guildhall
+├── Deployment: guildhall-postgres  (postgres:16, env POSTGRES_* from guildhall-secrets)
+├── PVC: guildhall-db               (longhorn, 5-10Gi)
+├── Service: guildhall-postgres     (ClusterIP, 5432)
+├── Secret: guildhall-secrets       (SECRET_KEY_BASE, DB_PASSWORD)
+├── Deployment: guildhall           (image from ghcr / forgejo registry / etc, envs DATABASE_URL + SECRET_KEY_BASE + PHX_HOST=guildhall.guildhouse.dev + PHX_SERVER=true + PORT=4000)
+└── Service: guildhall              (type=LoadBalancer, Hetzner annotations, port 80 → 4000)
+```
+
+Release build discipline:
+- `mix release` in a Docker multi-stage build (Elixir 1.17.3 / OTP 27 builder stage, debian-slim runtime stage)
+- Migrations on container start (or as a Job) through the release itself, e.g. the `bin/migrate` script that `mix phx.gen.release` generates; `mix ecto.migrate` is unavailable inside a release because Mix isn't bundled
+- `PHX_SERVER=true` to start the HTTP server (per `config/runtime.exs`)
+- A health check endpoint: Phoenix doesn't ship one by default, so add a lightweight route such as `/health`
+
+### Surprises
+
+**What's present that wasn't expected:**
+
+- **Keycloak is already serving at `auth.guildhouse.dev`.** The OIDC provider Guildhall will eventually integrate with is live, so that dependency needs no setup when the time comes.
+- **cert-manager is installed but unused.** This suggests a deliberate deferral in favor of Cloudflare-edge TLS; the ClusterIssuers are staged for when in-cluster TLS is adopted.
+- **Cilium Envoy DaemonSet runs on every node** but with no Gateway API, CiliumEnvoyConfig, or L7 policies currently in play. It's present for future L7 use, not load-bearing yet.
+
+**What's expected but absent:**
+
+- **No HAProxy.** The previous K3s-era cluster used HAProxy as ingress; this cluster doesn't. Hetzner LBs took its role.
+- **v1 substrate foundation is entirely absent from the running cluster.** bascule, substrate-operator, chronicle, quartermaster, SPIRE: none running. Flux manifests exist (in the `guildhouse-deploy-talos-mirror` repo) but are blocked on a dependency chain rooted at the missing `spire-system` namespace and a YAML decode error in `cluster-infra/10-cilium-values.yaml`. Unblocking this is real work that is NOT on the Red Hat path; governance integration is follow-up.
+- **No existing Elixir/Phoenix deployment** to copy. Guildhall will be the first Phoenix app on this cluster.
+- **Flux source is on GitHub (`guildhouse-deploy-talos-mirror`), not Forgejo.** Like the substrate-project umbrella repo, whose GitHub→Forgejo migration just completed, it's a candidate to move: another item on the cleanup list, not blocking.
+
+### Minimum path to Guildhall running at `guildhall.guildhouse.dev`
+
+1. Dockerfile in `~/projects/substrate-project/guildhall/`: multi-stage with OTP 27, `mix release`
+2. Build and push the image to a registry (Forgejo container registry at `git.guildhouse.dev/tking/guildhall:v0.1.0` recommended for consistency)
+3. Generate `SECRET_KEY_BASE` via `mix phx.gen.secret`
+4. Create the `guildhall` namespace; create the `guildhall-secrets` Secret
+5. Apply the Deployment + Service + Postgres + PVC manifests (templated from keycloak)
+6. Wait for the Hetzner LB to provision; note its IPv4
+7. Create the Cloudflare DNS record `guildhall.guildhouse.dev` → LB IPv4 (proxied, so Cloudflare handles TLS)
+8. Verify; run the first-time Ecto migration if it didn't already run on start
+
+No cluster infrastructure changes. No cert-manager Certificates. No Flux reconfiguration. No governance-stack dependency. Just the same Deployment-shaped pattern that Keycloak and Forgejo already use, applied to Guildhall.
+
+Governance integration (CRD watchers, SPIFFE identity, Chronicle wiring, Accord enforcement) is explicitly follow-up work for after Guildhall is reachable and the Red Hat submission is in.
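+
+The Deployment + Service from step 5 of the minimum path can be sketched as manifests. This is a minimal, untested sketch that reuses the keycloak Service's Hetzner annotations from §3 and leaves out the Postgres Deployment, PVC, and ClusterIP Service. Assumptions not taken from the cluster: the `guildhall-lb` load balancer name, the `guildhall` database user and database name, and a hypothetical `/health` route.
+
+```yaml
+# Sketch only: values marked "assumed" are illustrative, not from the cluster.
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: guildhall
+  namespace: guildhall
+  labels:
+    app.kubernetes.io/name: guildhall
+    app.kubernetes.io/part-of: guildhouse
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: guildhall
+  template:
+    metadata:
+      labels:
+        app.kubernetes.io/name: guildhall
+        app.kubernetes.io/part-of: guildhouse
+    spec:
+      containers:
+        - name: guildhall
+          image: git.guildhouse.dev/tking/guildhall:v0.1.0
+          ports:
+            - name: http
+              containerPort: 4000
+          env:
+            - name: PHX_SERVER
+              value: "true"
+            - name: PHX_HOST
+              value: guildhall.guildhouse.dev
+            - name: PORT
+              value: "4000"
+            - name: SECRET_KEY_BASE
+              valueFrom:
+                secretKeyRef:
+                  name: guildhall-secrets
+                  key: SECRET_KEY_BASE
+            - name: DB_PASSWORD
+              valueFrom:
+                secretKeyRef:
+                  name: guildhall-secrets
+                  key: DB_PASSWORD
+            # $(DB_PASSWORD) is expanded by Kubernetes because it is defined above.
+            # User/database "guildhall" are assumed; the password must be URL-safe.
+            - name: DATABASE_URL
+              value: "ecto://guildhall:$(DB_PASSWORD)@guildhall-postgres:5432/guildhall"
+          readinessProbe:
+            httpGet:
+              path: /health  # hypothetical route; Phoenix has none by default
+              port: http
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: guildhall
+  namespace: guildhall
+  annotations:
+    load-balancer.hetzner.cloud/location: nbg1
+    load-balancer.hetzner.cloud/name: guildhall-lb  # assumed name
+    load-balancer.hetzner.cloud/type: lb11
+    load-balancer.hetzner.cloud/use-private-ip: "false"
+spec:
+  type: LoadBalancer
+  selector:
+    app.kubernetes.io/name: guildhall
+  ports:
+    - name: http
+      port: 80          # Cloudflare proxies to this
+      targetPort: http  # container port 4000
+```
+
+Assembling `DATABASE_URL` from `DB_PASSWORD` keeps the Secret limited to the two keys in the shape above. Once applied, the Hetzner LB address appears in the Service's `EXTERNAL-IP` column (`kubectl get svc -n guildhall guildhall`), which is the IP step 7's DNS record needs.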
diff --git a/FORGEJO-REGISTRY-INVESTIGATION-2026-04-21.md b/FORGEJO-REGISTRY-INVESTIGATION-2026-04-21.md
new file mode 100644
index 0000000..8cadc62
--- /dev/null
+++ b/FORGEJO-REGISTRY-INVESTIGATION-2026-04-21.md
@@ -0,0 +1,222 @@
+# Forgejo container registry: pre-enablement investigation
+
+**Date:** 2026-04-21
+**Scope:** Read-only audit of Forgejo's running state + registry configuration to determine what enablement work (if any) is needed before Guildhall's image push.
+**Method:** `kubectl` + `curl` against `https://git.guildhouse.dev`. No mutations.
+**Headline:** **The container registry is already enabled.** `/v2/` returns a standard OCI 401, storage headroom is ample (19.4 GB free on the 20 Gi PVC), and no Forgejo config change is required. Enablement work collapses to credential setup + `docker push`. Estimated time to a first Guildhall image in the registry: **~30-45 minutes**, mostly Dockerfile work.
+
+---
+
+## 1. Forgejo deployment details
+
+| Property | Value |
+|---|---|
+| Namespace | `forgejo` |
+| Workload | `Deployment/forgejo` (1 replica, Running) |
+| Image | `codeberg.org/forgejo/forgejo:9` |
+| Running version | **9.0.3 (Gitea 1.22.0 base)**, confirmed via `GET /api/v1/version` |
+| Scheduled node | `gsh-cp-01` (control-plane node, workloads permitted) |
+| Companion | `Deployment/forgejo-postgres` (`postgres:16`, 1/1 Running) |
+| Init container | `init-config` (renders `/data/gitea/conf/app.ini` from ConfigMap) |
+| Runner | `Deployment/forgejo-runner` (0/1 Ready: one replica desired, none ready; the source of the Flux health-check failure) |
+
+**Volume mounts on the forgejo container:** one PVC, mounted as `data` at `/data`. That is where the standard (root) Forgejo image keeps all of its state; the rootless image variant uses `/var/lib/gitea` instead.
+
+**PVCs in the namespace:**
+
+| PVC | Size | StorageClass | Mount |
+|---|---|---|---|
+| `forgejo-data` | 20 Gi | longhorn | `/data` on forgejo |
+| `forgejo-db` | 10 Gi | longhorn | Postgres data |
+| `runner-cache` | 5 Gi | longhorn | forgejo-runner (not Ready) |
+
+## 2. Forgejo version and config state
+
+### Version
+
+`GET https://git.guildhouse.dev/api/v1/version` → `{"version":"9.0.3+gitea-1.22.0"}`
+
+Forgejo inherited container registry (OCI Distribution API) support from Gitea, which added it in 1.17, before the fork. Forgejo 9.0.3 fully supports the container package type.
+
+### Configuration
+
+The `forgejo-config` ConfigMap contains the full `app.ini` (40 lines, managed by Flux at path `./k8s/forgejo` in the `guildhouse-deploy-talos-mirror` source repo). Notable sections:
+
+- `[server]`: `DOMAIN=git.guildhouse.dev`, `ROOT_URL=https://git.guildhouse.dev/`, `HTTP_PORT=3000`, `SSH_PORT=22`, `SSH_LISTEN_PORT=2222`, `LFS_START_SERVER=true`
+- `[service]`: `DISABLE_REGISTRATION=true` (no self-service signup)
+- `[lfs]`: `STORAGE_TYPE=local`
+- `[repository]`, `[actions]`: the file's only `ENABLED = true` sits under `[actions]` and turns on Forgejo Actions, not Packages
+- **No explicit `[packages]` section.** This is normal: Forgejo 9.x enables packages (including the container registry) by default, with no config-level opt-in required.
+
+### Verification that the container registry is live
+
+The decisive probe is the root of the OCI Distribution API:
+
+```
+$ curl -sS -w '%{http_code}\n' https://git.guildhouse.dev/v2/
+{"errors":[{"code":"UNAUTHORIZED","message":""}]}
+401
+```
+
+This is **a standards-compliant OCI registry response** to an unauthenticated request. If the registry were disabled, Forgejo would serve 404 (the route would not be registered). A 401 with a well-formed `errors` object means the registry is routing correctly and simply requires authentication, which is the default and expected behavior. Registries answer anonymous clients this way even when anonymous pulls are allowed; the client then obtains a token through the `WWW-Authenticate` challenge, so this 401 says nothing about package visibility.
+
+The equivalent probe against `/v2/_catalog` returns the same 401 shape.
+
+The API-layer probe `GET /api/v1/packages/tking` also returns 401 (`token is required`), consistent with packages being enabled but requiring auth.
+
+### Storage backend
+
+There is no `[packages.storage]` override in app.ini, so packages use the default local filesystem path under the Forgejo data volume: `/data/gitea/packages/` (or a similar Forgejo 9.x path). That lives on `forgejo-data` (the 20 Gi Longhorn PVC), the same volume as git repositories, LFS objects, and Forgejo's own state.
+
+## 3. How Forgejo is managed
+
+Forgejo is managed by **Flux**. The `Kustomization` `flux-system/forgejo` reconciles the manifests from:
+
+- **Source:** `GitRepository/flux-system/guildhouse-deploy`
+- **URL:** `https://github.com/gh-tking/guildhouse-deploy-talos-mirror`
+- **Branch:** `main`
+- **Path:** `./k8s/forgejo`
+- **Current revision:** `main@169e077f`
+- **Interval:** 1 minute
+
+**Kustomization inventory** (what Flux claims to own in this path):
+
+```
+_forgejo__Namespace
+forgejo_forgejo-config__ConfigMap          ← this is where app.ini lives
+forgejo_runner-config__ConfigMap
+forgejo_forgejo-secrets__Secret
+forgejo_forgejo-http__Service
+forgejo_forgejo-postgres__Service
+forgejo_forgejo_apps_Deployment
+forgejo_forgejo-postgres_apps_Deployment
+forgejo_forgejo-runner_apps_Deployment
+forgejo_forgejo-data__PersistentVolumeClaim
+forgejo_forgejo-db__PersistentVolumeClaim
+```
+
+**Status:** `Ready: False` / `Healthy: False` because the health check times out on `forgejo-runner`, a separate Deployment that never becomes Ready (0/1). That is not a problem with core Forgejo: the Forgejo Deployment is Ready, the registry is live, and the Kustomization IS applying new commits successfully; only its health condition is stuck on the runner.
+
+**Consequence:** if we ever needed to change Forgejo's `app.ini` (we don't, for registry work), the mechanism is to edit `k8s/forgejo/forgejo-config.yaml` in the `gh-tking/guildhouse-deploy-talos-mirror` GitHub repo, push to `main`, and wait for Flux to reconcile (1-minute interval). This path works today despite the runner health warning.
+
+## 4. The cluster-infra Flux error
+
+`kubectl describe kustomization cluster-infra -n flux-system`:
+
+- **Suspend:** `true` (explicitly suspended by an operator earlier)
+- **Source:** `guildhouse-deploy` GitRepository, path `./talos/manifests/cluster-infra`
+- **Error message:**
+
+```
+failed to decode Kubernetes YAML from /tmp/kustomization-.../talos/manifests/cluster-infra/
+10-cilium-values.yaml: missing Resource metadata
+```
+
+**Diagnosis:** `10-cilium-values.yaml` is a Helm values file being handed to kustomize-controller as if it were a Kubernetes manifest. It has no `kind` or `metadata`; it's a values document meant for `helm install --values`, not a standalone Kubernetes resource. Kustomize chokes because every manifest file it picks up from the Kustomization's path must be Resource-shaped (when a path has no `kustomization.yaml`, Flux generates one that includes the YAML files it finds there).
+
+**Fix severity:** trivial. One of:
+- Move `10-cilium-values.yaml` out of the Kustomization's path
+- Rename the file so it isn't picked up as a manifest (e.g., `10-cilium-values.yaml.hold`)
+- Add a `kustomization.yaml` whose explicit `resources:` list excludes it
+- Replace the file with a proper `HelmRelease` CR that supplies the values itself (inline `values`, or `valuesFrom` a ConfigMap)
+
+Any of these is a single-file source edit; Flux picks it up once `cluster-infra` is resumed.
+
+**Time estimate:** ~30-60 minutes including the commit, push, reconcile, and verify cycle. The main complication is that `cluster-infra` has `Suspend: true`: whoever suspended it did so deliberately (likely because the error was cascading to blocked downstream Kustomizations). Un-suspending should wait until the underlying YAML is fixed; otherwise the same error reappears.
+
+**Crucially: this error does NOT block Forgejo registry work or Guildhall deployment.** The `forgejo` and `cluster-infra` Kustomizations are independent. Guildhall deployment can proceed entirely outside the Flux chain (direct `kubectl apply`, or a new Guildhall-specific Kustomization once registry and deploy are working). The cluster-infra/spire/quartermaster/bascule chain is substrate-foundation work that's explicitly follow-up.
+
+## 5. Cluster image pull pattern
+
+**No existing pattern for private-registry pulls.** The entire cluster currently pulls only from public registries:
+- `quay.io/keycloak/keycloak:26.0`
+- `codeberg.org/forgejo/forgejo:9`
+- `postgres:16` (Docker Hub)
+- `quay.io/cilium/cilium:v1.16.5` and `quay.io/cilium/cilium-envoy`
+- Longhorn and Flux images (all public)
+
+Specifically:
+
+```
+$ kubectl get secrets -A --field-selector type=kubernetes.io/dockerconfigjson
+No resources found
+```
+
+Zero `dockerconfigjson` secrets cluster-wide. Zero `imagePullSecrets` referenced on any Deployment.
+
+**Guildhall will be the first workload pulling from the self-hosted Forgejo registry.** Whatever it does becomes the template for subsequent workloads. Two options:
+
+1. **Keep the package publicly pullable.** Forgejo has no per-package visibility switch: a package inherits the visibility of its owning user or organization. A container package under a public `tking` account can therefore be pulled anonymously (unless the instance requires sign-in to view), and no pull secret is needed. This matches the rest of the cluster's zero-pull-secret state. Appropriate if there's nothing sensitive in the image itself.
+2. **Keep the package private and add a `dockerconfigjson` Secret.** Here "private" means a private owner (e.g. a private organization). Standard pattern: `kubectl create secret docker-registry guildhall-registry -n guildhall --docker-server=git.guildhouse.dev --docker-username=<user> --docker-password=<token>`, then reference it in the Deployment via `imagePullSecrets: [{name: guildhall-registry}]`.
+
+Option 1 is simplest for v0.1. Option 2 is better hygiene long-term.
+
+## 6. Storage headroom on Forgejo's volume
+
+`kubectl exec -n forgejo deployment/forgejo -- df -h` (inside the forgejo container):
+
+```
+/dev/longhorn/pvc-683ec33a-...   19.5G   137.2M   19.4G   1%   /data
+```
+
+**Headroom is ample:** 19.4 GB free on the 20 Gi PVC. Current Forgejo usage after 10 days is 137 MB (git repos + LFS + internal state).
+
+A Guildhall container image (an Elixir release on debian-slim, estimated at 100-300 MB compressed per tag, with OCI layer deduplication across tags) would add perhaps 1-3 GB of package storage over dozens of iterations. No pressure on the volume for the foreseeable future.
+
+**No resize required.** If long-term registry growth becomes an issue (multiple applications all pushing many tags, or large binary releases), Longhorn supports online expansion of the PVC, but that's a much-later concern.
+
+---
+
+## Synthesis
+
+### Is the registry already enabled?
+
+**Yes.** The `/v2/` and `/v2/_catalog` endpoints return proper OCI Distribution API responses (401 unauthenticated with well-formed `errors` objects). Forgejo 9.x enables packages by default; no `[packages]` config section is needed, and none is present. The registry is live and waiting for an authenticated client.
+
+### What enablement work is required?
+
+**None at the Forgejo-config layer.** The only work is client-side:
+
+1. **Create a Forgejo Personal Access Token** (scope: `write:package`) via the Forgejo UI at `https://git.guildhouse.dev/-/user/settings/applications`
+2. **Docker login from the build machine:** `docker login git.guildhouse.dev -u tking --password-stdin`, piping the token on stdin (passing it with `-p` leaves it in shell history)
+3. **Build + push** the Guildhall image: `docker build -t git.guildhouse.dev/tking/guildhall:v0.1.0 . && docker push …`
+4. **Settle pull access:** public (the package inherits its owner's visibility; anonymous pull, no imagePullSecret needed) or private (create a `dockerconfigjson` Secret in the `guildhall` namespace and reference it in the Deployment)
+
+No Flux source edits. No Kustomization changes. No ConfigMap changes. No `cluster-infra` unblock required.
+
+### Is the `cluster-infra` Flux error a blocker?
+
+**No.** The Forgejo registry operates entirely outside the cluster-infra / spire / quartermaster / bascule Flux chain. Forgejo is managed by its own independent Kustomization (`flux-system/forgejo`), which is successfully applying source revisions even though its Ready condition is False because of the unrelated forgejo-runner health check.
+
+The `cluster-infra` error is real and worth fixing separately (a trivial single-file fix in the GitHub source repo), but it has zero coupling to registry enablement or Guildhall deployment. Treat it as a cleanup backlog item, not a prerequisite.
+
+### Estimated time to registry operational
+
+| Step | Time |
+|---|---|
+| Create Forgejo PAT (Forgejo UI) | 2 min |
+| `docker login git.guildhouse.dev` | <1 min |
+| Dockerfile + `mix release` setup in Guildhall repo | 15-20 min (real work) |
+| `docker build` (cold build for Elixir + OTP + mix deps + assets) | 5-10 min |
+| `docker push` | 1-3 min (single tag, ~200 MB compressed) |
+| Settle pull access (public owner, or private + pull secret) | 2-5 min |
+| **Total to first successful image in the registry** | **~30-45 min** |
+
+Most of the time goes into the Dockerfile + release-build setup, not the registry interaction itself.
+
+### Recommended next step
+
+**Build the Guildhall Dockerfile and push a first image.** Sequencing:
+
+1. Author `Dockerfile` in `~/projects/substrate-project/guildhall/`: multi-stage (Elixir 1.17.3/OTP 27 builder → debian-slim runtime, `mix release`, non-root user, expose 4000, healthcheck endpoint)
+2. Author a `.dockerignore` that excludes `_build/`, `deps/`, `.git/`, and `priv/static/` (if built separately), matching Phoenix release conventions
+3. Create a Forgejo PAT with `write:package` scope
+4. `docker login git.guildhouse.dev` from the desktop
+5. `docker build -t git.guildhouse.dev/tking/guildhall:v0.1.0 .`
+6. `docker push git.guildhouse.dev/tking/guildhall:v0.1.0`
+7. Verify via the Forgejo UI at `https://git.guildhouse.dev/tking/-/packages/container/guildhall` and via an authenticated `curl` to `/v2/tking/guildhall/manifests/v0.1.0`
+8. Decide pull access, and if private, create the `guildhall-registry` Secret in the `guildhall` namespace (the namespace doesn't exist yet; create it at deploy time)
+
+The Kubernetes-side deploy (Deployment + Service + Postgres + PVC + Secret) proceeds in parallel with or immediately after the image build, following the Keycloak pattern captured in the earlier `DEPLOY-EXPLORATORY-2026-04-21.md`.
+
+No pre-work needed on Forgejo itself. The registry is ready.
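+
+For the private path in step 8 (option 2 in §5), the only Deployment change is a pull-secret reference in the pod template. A minimal fragment, assuming the `guildhall-registry` Secret from §5 already exists in the `guildhall` namespace; every other field of the Deployment is omitted:
+
+```yaml
+# Partial Deployment: only the fields relevant to the pull secret are shown.
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: guildhall
+  namespace: guildhall
+spec:
+  template:
+    spec:
+      imagePullSecrets:
+        - name: guildhall-registry  # Secret of type kubernetes.io/dockerconfigjson
+      containers:
+        - name: guildhall
+          image: git.guildhouse.dev/tking/guildhall:v0.1.0
+```
+
+On the public path the `imagePullSecrets` entry is simply left out; nothing else in the manifest changes.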