docs(deploy): capture exploratory reports for Talos + Forgejo registry
DEPLOY-EXPLORATORY documents the cluster state that shaped deployment decisions (Keycloak as template, Hetzner LB + Cloudflare pattern, no Postgres operator so sibling-Deployment pattern). FORGEJO-REGISTRY-INVESTIGATION documents that the registry was already operational in Forgejo 9.0.3 (packages enabled by default) and the storage/credential path forward.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler J King <tking@guildhouse.dev>
parent
c6f1d07ed9
commit
115bd178a2
2 changed files with 526 additions and 0 deletions
304
DEPLOY-EXPLORATORY-2026-04-21.md
Normal file
|
|
@ -0,0 +1,304 @@
|
||||||
|
# Guildhall deploy exploratory — Talos/Hetzner cluster state
|
||||||
|
|
||||||
|
**Date:** 2026-04-21
|
||||||
|
**Scope:** Read-only audit of the Talos/Hetzner Kubernetes cluster to inform Guildhall's initial deployment.
|
||||||
|
**Method:** `kubectl` against the cluster via `~/projects/substrate-project/guildhouse-talos-bootstrap/kubeconfig`. No mutations.
|
||||||
|
**Takeaway (synthesis at end):** Guildhall fits cleanly into the existing Keycloak/Forgejo deployment pattern: plain `Deployment` + `Deployment`-backed Postgres + Longhorn PVC + Hetzner LoadBalancer + Cloudflare-terminated TLS. No new infrastructure components required. The v1 substrate foundation (bascule / quartermaster / spire / chronicle / substrate-operator) is Flux-manifested but broken and not running; governance integration is explicitly follow-up work, not blocking.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. Cluster basics
|
||||||
|
|
||||||
|
| Property | Value |
|---|---|
| Control plane endpoint | `https://178.104.100.159:6443` |
| kubectl client | v1.32.2 |
| kubectl server | v1.32.3 |
| Nodes | 5 (3 control-plane + 2 workers), all Ready |
| OS | Talos v1.9.5 |
| Kernel | 6.12.18-talos |
| Container runtime | containerd 2.0.3 |
| Cluster age | 10 days |
|
||||||
|
|
||||||
|
```
gsh-cp-01      control-plane  10.0.1.10
gsh-cp-02      control-plane  10.0.1.20
gsh-cp-03      control-plane  10.0.1.21
gsh-worker-01  worker         10.0.1.22
gsh-worker-02  worker         10.0.1.30
```
|
||||||
|
|
||||||
|
Matches the memory-carried description (Hetzner Talos cluster 2026-04-11: Talos 1.9.5, 5 nodes, 3 CP + 2 worker). No drift.
|
||||||
|
|
||||||
|
## 2. Namespace inventory
|
||||||
|
|
||||||
|
| Namespace | Purpose | Workloads |
|---|---|---|
| `cert-manager` | cert-manager 3 controllers (cert-manager, cainjector, webhook) | 3 Deployments |
| `flux-system` | Flux GitOps | 4 Deployments (source / kustomize / helm / notification controllers) |
| `forgejo` | Forgejo git (self-hosted) | Deployment + Postgres Deployment + Runner Deployment (0/1, stuck) |
| `keycloak` | Keycloak OIDC IdP | Deployment + Postgres Deployment |
| `longhorn-system` | Longhorn CSI storage | 5 DaemonSets + 6 Deployments + UI |
| `kube-system`, `kube-public`, `kube-node-lease` | K8s system | - |
| `default` | Empty | - |
|
||||||
|
|
||||||
|
**Application workloads:** `forgejo`, `keycloak`. These are the reference patterns for Guildhall.
|
||||||
|
|
||||||
|
## 3. Ingress / gateway state
|
||||||
|
|
||||||
|
**No traditional ingress controller, no Gateway API:**
|
||||||
|
|
||||||
|
- `kubectl get ingressclasses` → No resources found
|
||||||
|
- `kubectl get gatewayclasses` → server doesn't have the resource type (Gateway API CRDs not installed)
|
||||||
|
- No HAProxy / nginx / traefik / istio / kong pods anywhere
|
||||||
|
|
||||||
|
**Traffic reaches services via `type: LoadBalancer` directly**, backed by the **Hetzner Cloud Controller Manager** (`hcloud-cloud-controller-manager` running in `kube-system`). Each LoadBalancer Service provisions a real Hetzner Cloud Load Balancer via annotations:
|
||||||
|
|
||||||
|
```yaml
annotations:
  load-balancer.hetzner.cloud/location: nbg1
  load-balancer.hetzner.cloud/name: keycloak-lb-v2
  load-balancer.hetzner.cloud/type: lb11
  load-balancer.hetzner.cloud/use-private-ip: "false"
```
|
||||||
|
|
||||||
|
Cilium Envoy DaemonSet pods exist on every node (`cilium-envoy` for L7 proxy features — CiliumNetworkPolicy L7 filtering, not Gateway API). `enable-l7-proxy: true` is set in the Cilium config.
|
||||||
|
|
||||||
|
**Existing LoadBalancer services and their public addresses:**
|
||||||
|
|
||||||
|
| Service | Namespace | IPv4 | IPv6 | Ports |
|---|---|---|---|---|
| `forgejo-http` | forgejo | `46.225.47.75` | `2a01:4f8:1c1f:65bb::1` | 80 → 30000, 22 → 30022 |
| `keycloak` | keycloak | `162.55.157.168` | `2a01:4f8:1c1d:1109::1` | 80 → 30080 |
|
||||||
|
|
||||||
|
Both expose port 80 only. No in-cluster TLS termination. TLS is terminated **upstream at Cloudflare** (`git.guildhouse.dev` and `auth.guildhouse.dev` resolve to Cloudflare IPs; Cloudflare proxies to the Hetzner LB IPs).
|
||||||
|
|
||||||
|
## 4. cert-manager / TLS
|
||||||
|
|
||||||
|
cert-manager is installed and healthy but **no Certificate resources exist anywhere on the cluster**.
|
||||||
|
|
||||||
|
| ClusterIssuer | Status |
|---|---|
| `letsencrypt-prod` | Ready |
| `letsencrypt-staging` | Ready |
|
||||||
|
|
||||||
|
Both ClusterIssuers are provisioned and ready to issue. They just aren't being used yet: TLS is currently handled at the Cloudflare edge (Universal SSL), with plain HTTP between Cloudflare and the Hetzner LBs. An HTTP-only origin leg corresponds to Cloudflare's Flexible SSL mode, not Full, which requires HTTPS at the origin.
|
||||||
|
|
||||||
|
**Implication for Guildhall:** can choose between the existing Cloudflare-termination pattern (simplest, matches forgejo/keycloak) or start using cert-manager (more work, cluster-integrated certs). The former is tonight's path; the latter is a hygiene follow-up.
|
||||||
|
|
||||||
|
## 5. Database patterns
|
||||||
|
|
||||||
|
**No Postgres operator** installed. CRDs checked (all absent):
|
||||||
|
- CloudNativePG (`clusters.postgresql.cnpg.io`)
|
||||||
|
- Zalando postgres-operator (`postgresqls.acid.zalan.do`)
|
||||||
|
- Crunchy PGO (`postgresclusters.postgres-operator.crunchydata.com`)
|
||||||
|
|
||||||
|
**Existing pattern is plain Deployment + PVC:**
|
||||||
|
|
||||||
|
```
forgejo-postgres   Deployment  postgres:16  PVC: forgejo-db   10Gi  longhorn
keycloak-postgres  Deployment  postgres:16  PVC: keycloak-db  5Gi   longhorn
```
|
||||||
|
|
||||||
|
**Storage:** Longhorn 1.x, single StorageClass `longhorn` (default). All PVCs use it. 5 DaemonSet replicas of longhorn-manager confirm storage is healthy across all nodes.
|
||||||
|
|
||||||
|
**Current PVCs:**
|
||||||
|
|
||||||
|
| PVC | Namespace | Size | StorageClass |
|---|---|---|---|
| forgejo-data | forgejo | 20Gi | longhorn |
| forgejo-db | forgejo | 10Gi | longhorn |
| runner-cache | forgejo | 5Gi | longhorn |
| keycloak-db | keycloak | 5Gi | longhorn |
|
||||||
|
|
||||||
|
## 6. Secrets management
|
||||||
|
|
||||||
|
**None of the common secret managers are installed:**
|
||||||
|
|
||||||
|
- External Secrets Operator: absent
|
||||||
|
- Sealed Secrets: absent
|
||||||
|
- SPIRE/SPIFFE: absent (Flux has a Kustomization for it but the `spire-system` namespace doesn't exist — see §10 Flux state)
|
||||||
|
- Vault: absent
|
||||||
|
|
||||||
|
**Secrets are plain Opaque `Secret` resources.** Examples:
|
||||||
|
- `forgejo/forgejo-secrets` (3 keys)
|
||||||
|
- `keycloak/keycloak-secrets` (2 keys)
|
||||||
|
|
||||||
|
Managed out-of-band (likely committed to the private Flux source repo or applied via kubectl during bootstrap). No rotation mechanism visible.
|
||||||
|
|
||||||
|
## 7. Existing workload patterns
|
||||||
|
|
||||||
|
**Reference: `keycloak` Deployment** (cleanest example — the only Flux Kustomization that's `Ready`):
|
||||||
|
|
||||||
|
- **Image:** `quay.io/keycloak/keycloak:26.0` (public registry)
|
||||||
|
- **Env composition:** mix of literal `value:` (DB host, DB port, realm name) and `valueFrom.secretKeyRef` (admin password, DB password)
|
||||||
|
- **Labels:** `app.kubernetes.io/name=keycloak`, `app.kubernetes.io/part-of=guildhouse`
|
||||||
|
- **Config files:** ConfigMap-mounted realm import (`keycloak-realm-import`)
|
||||||
|
- **Resources:** resource requests/limits not aggressively set (defaults mostly)
|
||||||
|
- **Service:** `type: LoadBalancer` with Hetzner annotations, exposes port 80 only
|
||||||
|
- **TLS:** none in-cluster; Cloudflare upstream
|
||||||
|
|
||||||
|
**Reference: `forgejo-postgres` Deployment:**
|
||||||
|
|
||||||
|
- **Image:** `postgres:16` (public Docker Hub)
|
||||||
|
- **Env:** `POSTGRES_USER`, `POSTGRES_DB` literal; `POSTGRES_PASSWORD` from Secret
|
||||||
|
- **PGDATA:** `/var/lib/postgresql/data/pgdata` (standard subdirectory to avoid lost+found issues)
|
||||||
|
- **Volume:** PVC mounted at `/var/lib/postgresql/data`
|
||||||
|
|
||||||
|
**No existing Elixir/Phoenix deployment** to reference. Guildhall will be the first. The pattern will follow the Keycloak/Forgejo shape applied to Phoenix's runtime requirements.
|
||||||
|
|
||||||
|
## 8. Guildhouse-specific components (v1 foundation)
|
||||||
|
|
||||||
|
**Currently running: none.**
|
||||||
|
|
||||||
|
- No pod matching `substrate`, `bascule`, `chronicle`, `quartermaster`, or `spire` across all namespaces. The v1 substrate foundation is absent from the cluster's running state.
|
||||||
|
- Flux has Kustomizations for `bascule`, `quartermaster`, `spire`, `automation`, `governance-talos`, `gitops-controller` — all **failing** on a dependency chain:
|
||||||
|
|
||||||
|
```
spire             → fails: namespace "spire-system" does not exist
quartermaster     → fails: dependency flux-system/spire is not ready
bascule           → fails: dependency flux-system/quartermaster is not ready
automation        → fails: dependency flux-system/quartermaster is not ready
gitops-controller → fails: dependency flux-system/quartermaster is not ready
governance-talos  → fails: dependency flux-system/cluster-infra is not ready
cluster-infra     → SUSPENDED + YAML decode error on 10-cilium-values.yaml
```
|
||||||
|
|
||||||
|
This chain needs to be unblocked for the v1 substrate foundation to reach the cluster, but **this is explicitly NOT Guildhall's blocker**. Guildhall is the standalone orchestration/presentation layer; it composes with substrate via CRD watches once substrate is running, but doesn't require substrate present to stand up and serve its web UI.
|
||||||
|
|
||||||
|
## 9. Networking specifics
|
||||||
|
|
||||||
|
**Cilium version:** `v1.16.5` (1.16 series; recent, but not the newer 1.17 line)
|
||||||
|
|
||||||
|
**Key Cilium config** (from `kube-system/cilium-config`):
|
||||||
|
|
||||||
|
| Flag | Value | Notes |
|---|---|---|
| `kube-proxy-replacement` | `true` | Cilium replaces kube-proxy (full eBPF mode) |
| `enable-ipv4` | `true` | IPv4 on pod network |
| `enable-ipv6` | `false` | IPv6 NOT enabled at pod network (LBs get Hetzner-assigned v6 externally) |
| `enable-l7-proxy` | `true` | Envoy DaemonSet for L7 filtering |
| `enable-hubble` | `true` | Hubble observability |
| `ipam` | `kubernetes` | Host-IPAM, not cluster-pool |
|
||||||
|
|
||||||
|
**Not enabled / not present:**
|
||||||
|
- BGP control plane (`ciliumbgppeeringpolicies` CRD absent)
|
||||||
|
- L2 announcements (`ciliuml2announcementpolicies` CRD present but zero resources)
|
||||||
|
- LoadBalancerIPPool (CRD present but zero resources — Hetzner CCM handles LB IPs instead)
|
||||||
|
- Gateway API (`gatewayclasses` CRD absent)
|
||||||
|
- ClusterMesh (single-cluster)
|
||||||
|
|
||||||
|
**NetworkPolicies in place** (only 3, all in `flux-system`):
|
||||||
|
- `allow-egress`
|
||||||
|
- `allow-scraping`
|
||||||
|
- `allow-webhooks` (scoped to `app=notification-controller`)
|
||||||
|
|
||||||
|
**CiliumNetworkPolicies:** none. Workloads rely on default-allow between pods. Guildhall deployment can proceed without adding policies; adding them is hardening follow-up.
|
||||||
|
|
||||||
|
## 10. Deployment automation
|
||||||
|
|
||||||
|
**GitOps: Flux** is the sole mechanism. Running components:
|
||||||
|
- `source-controller`, `kustomize-controller`, `helm-controller`, `notification-controller` — all 1/1 Ready
|
||||||
|
|
||||||
|
**Sources:** one `GitRepository` registered:
|
||||||
|
|
||||||
|
```
flux-system / guildhouse-deploy
  URL:    https://github.com/gh-tking/guildhouse-deploy-talos-mirror
  STATUS: Ready (artifact stored for main@169e077f)
```
|
||||||
|
|
||||||
|
**Kustomizations:** 9 total, summary:
|
||||||
|
|
||||||
|
| Name | Status |
|---|---|
| `keycloak` | ✅ Ready (applied revision `169e077f`) |
| `forgejo` | ❌ health check failed (forgejo-runner Deployment stuck InProgress) |
| `cluster-infra` | ❌ SUSPENDED + YAML decode error |
| `spire` | ❌ `spire-system` namespace not found |
| `quartermaster` | ❌ depends on spire (not ready) |
| `bascule` | ❌ depends on quartermaster (not ready) |
| `automation` | ❌ depends on quartermaster (not ready) |
| `gitops-controller` | ❌ depends on quartermaster (not ready) |
| `governance-talos` | ❌ depends on cluster-infra (not ready) |
|
||||||
|
|
||||||
|
**Key observation:** only `keycloak` flows through Flux successfully. Everything else is either suspended, blocked on missing upstream dependencies, or has a YAML error in the source repo.
|
||||||
|
|
||||||
|
**HelmRepositories and HelmReleases:** none.
|
||||||
|
|
||||||
|
**How changes land on the cluster:** currently via Flux against the GitHub-hosted source repo for the one working Kustomization (keycloak), otherwise via direct `kubectl apply` (given the broken Flux chain).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Synthesis
|
||||||
|
|
||||||
|
### What Guildhall can leverage
|
||||||
|
|
||||||
|
- **Longhorn StorageClass** — works out of the box for Postgres PVC. 5Gi is ample for initial Guildhall DB (matches keycloak-db sizing).
|
||||||
|
- **Hetzner CCM LoadBalancer** — a LoadBalancer Service with `load-balancer.hetzner.cloud/*` annotations provisions a new Hetzner LB automatically. Cost is ~€5/mo for an `lb11` tier. Matches forgejo / keycloak exactly.
|
||||||
|
- **Cloudflare-at-the-edge TLS** — DNS at `guildhall.guildhouse.dev` points at the Hetzner LB IPv4, Cloudflare terminates TLS, origin is plain HTTP on port 80. Zero cert-manager work required for v1.
|
||||||
|
- **Keycloak as OIDC IdP** — already running at `auth.guildhouse.dev`. When Guildhall wires its OIDC config (currently commented out in `config/runtime.exs`), the endpoint is ready. Not blocking tonight.
|
||||||
|
- **cert-manager ClusterIssuers** — `letsencrypt-prod` and `letsencrypt-staging` are ready, available as upgrade-path from Cloudflare-edge TLS to cluster-terminated TLS if/when that hygiene pass happens.
|
||||||
|
- **Reference deployment pattern** — keycloak's Deployment shape (public image, env-from-secret, ConfigMap for data, Service type=LoadBalancer, Postgres sibling Deployment + PVC) maps directly to Guildhall. Apply the same template.
|
||||||
|
- **Flux GitOps pipeline exists** (if desired) — a new Kustomization in `guildhouse-deploy-talos-mirror` for Guildhall would auto-deploy. BUT the Flux state is currently messy — most Kustomizations are broken — so a direct `kubectl apply` path is cleaner for the v1 Guildhall deploy, with a follow-up Flux migration once the broader chain is healed.
|
||||||
|
|
||||||
|
### What Guildhall needs that the cluster doesn't have yet
|
||||||
|
|
||||||
|
- **Guildhall container image.** Must be built locally via `mix release` + Dockerfile and pushed to a registry the cluster can pull from. Registry options:
|
||||||
|
- `ghcr.io/gh-tking/guildhall:<tag>`: GitHub Container Registry as a public package (pushed from GitHub Actions or with a manual `docker push`)
|
||||||
|
- Docker Hub under a personal account
|
||||||
|
- **Forgejo container registry** at `git.guildhouse.dev/tking/guildhall:<tag>`: Forgejo has shipped an OCI registry since its fork from Gitea (the package registry arrived in Gitea 1.17); this is the most consistent choice with the rest of the Guildhouse tooling
|
||||||
|
- A private Hetzner-region ghcr mirror
|
||||||
|
- **Secrets:** `guildhall-secrets` Opaque Secret with at minimum `SECRET_KEY_BASE` (Phoenix signing/session key; `mix phx.gen.secret` generates a 64-character value by default) and `DATABASE_URL` (or a discrete `DB_PASSWORD`, with the URL constructed at runtime).
|
||||||
|
- **Namespace:** `guildhall` (new).
|
||||||
|
- **DNS record:** `guildhall.guildhouse.dev` → Hetzner LB IPv4 (via Cloudflare). Can be created after LB is provisioned, once the LB IP is known.
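As a rough illustration of the Secret item in the list above, the sketch below renders an Opaque `Secret` manifest on stdout. The key names `SECRET_KEY_BASE` and `DB_PASSWORD` follow this report; the function names and the example values are hypothetical, and the real Guildhall runtime config may expect different keys.

```shell
#!/usr/bin/env sh
# Sketch only: render the guildhall-secrets manifest without touching the cluster.

# Base64-encode stdin on a single line (Secret `data:` values must be base64).
b64() {
  base64 | tr -d '\n'
}

# render_guildhall_secret <secret_key_base> <db_password>
# Assumption: these two keys are all the application needs at first boot.
render_guildhall_secret() {
  secret_key_base=$1
  db_password=$2
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: guildhall-secrets
  namespace: guildhall
type: Opaque
data:
  SECRET_KEY_BASE: $(printf '%s' "$secret_key_base" | b64)
  DB_PASSWORD: $(printf '%s' "$db_password" | b64)
EOF
}
```

The output could then be piped to `kubectl apply -f -` once the `guildhall` namespace exists; rendering first keeps the step reviewable before anything reaches the cluster.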
|
||||||
|
|
||||||
|
### Likely shape of the deployment
|
||||||
|
|
||||||
|
Based on the keycloak/forgejo pattern:
|
||||||
|
|
||||||
|
```
Namespace: guildhall
├── Deployment: guildhall-postgres (postgres:16, env POSTGRES_* from guildhall-secrets)
├── PVC: guildhall-db (longhorn, 5-10Gi)
├── Service: guildhall-postgres (ClusterIP, 5432)
├── Secret: guildhall-secrets (SECRET_KEY_BASE, DB_PASSWORD)
├── Deployment: guildhall (image from ghcr / forgejo registry / etc, envs DATABASE_URL + SECRET_KEY_BASE + PHX_HOST=guildhall.guildhouse.dev + PHX_SERVER=true + PORT=4000)
└── Service: guildhall (type=LoadBalancer, Hetzner annotations, port 80 → 4000)
```
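The last entry in the tree above, the `LoadBalancer` Service, might look roughly like the manifest this sketch renders. The annotation keys and the `lb11` / `nbg1` values are copied from the Keycloak example in §3; the load balancer name `guildhall-lb` and the function name are assumptions for illustration.

```shell
#!/usr/bin/env sh
# Sketch only: render a Hetzner-annotated LoadBalancer Service for Guildhall.

# render_guildhall_service [lb_name] [location]
# Defaults are illustrative; `guildhall-lb` is a hypothetical name.
render_guildhall_service() {
  lb_name=${1:-guildhall-lb}
  location=${2:-nbg1}
  cat <<EOF
apiVersion: v1
kind: Service
metadata:
  name: guildhall
  namespace: guildhall
  labels:
    app.kubernetes.io/name: guildhall
    app.kubernetes.io/part-of: guildhouse
  annotations:
    load-balancer.hetzner.cloud/location: ${location}
    load-balancer.hetzner.cloud/name: ${lb_name}
    load-balancer.hetzner.cloud/type: lb11
    load-balancer.hetzner.cloud/use-private-ip: "false"
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: guildhall
  ports:
    - name: http
      port: 80          # what Cloudflare connects to
      targetPort: 4000  # Phoenix PORT inside the pod
      protocol: TCP
EOF
}
```

The selector assumes the Guildhall pods carry the same `app.kubernetes.io/name` label convention that Keycloak uses.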
|
||||||
|
|
||||||
|
Release build discipline:
|
||||||
|
- `mix release` in Docker multi-stage build (Elixir 1.17.3 / OTP 27 builder stage, debian-slim runtime stage)
|
||||||
|
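- Ecto migrations on container start (or from a Job) through a release task such as `bin/guildhall eval "Guildhall.Release.migrate()"`; `mix` is not shipped inside a release, so `mix ecto.migrate` cannot run there. The `Guildhall.Release` module name is an assumption following the `mix phx.gen.release` convention.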
|
||||||
|
- `PHX_SERVER=true` to start the HTTP server (per `config/runtime.exs`)
|
||||||
|
- Health check endpoint (Phoenix default or custom `/health`)
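For the migrate-on-start item in the list above, a container entrypoint could take roughly this shape. Because Mix is not part of a release, the sketch calls the release binary's `eval` command; the release name `guildhall` and the `Guildhall.Release.migrate/0` function are assumptions (the shape `mix phx.gen.release` generates), not something confirmed in the Guildhall repo.

```shell
#!/usr/bin/env sh
# Sketch only: run migrations through the release, then start the server.

# start_with_migrations <path-to-release-binary>
start_with_migrations() {
  release_bin=$1
  # Assumed helper module; fails fast so a broken migration stops the rollout.
  "$release_bin" eval "Guildhall.Release.migrate()" || return 1
  # exec replaces the shell so the BEAM receives SIGTERM from Kubernetes directly.
  exec "$release_bin" start
}
```

In a Dockerfile this would typically be invoked as the last line of an entrypoint script, for example `start_with_migrations /app/bin/guildhall`, where `/app` is a hypothetical install path. `PHX_SERVER=true` is expected to come from the Deployment env, as listed in the tree earlier.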
|
||||||
|
|
||||||
|
### Surprises
|
||||||
|
|
||||||
|
**What's present that wasn't expected:**
|
||||||
|
|
||||||
|
- **Keycloak is already serving at `auth.guildhouse.dev`.** The OIDC substrate Guildhall will eventually integrate with is live. Zero setup needed for that dependency when the time comes.
|
||||||
|
- **cert-manager is installed but unused.** Suggests a deliberate deferral in favor of Cloudflare-edge TLS; the ClusterIssuers are staged and ready for when in-cluster TLS is adopted.
|
||||||
|
- **Cilium Envoy DaemonSet is running on every node** but with no Gateway API / CiliumEnvoyConfig / L7 policies currently in play. Present for future L7 use, not actively load-bearing yet.
|
||||||
|
|
||||||
|
**What's expected but absent:**
|
||||||
|
|
||||||
|
- **No HAProxy.** Previous K3s-era cluster used HAProxy as ingress; this cluster doesn't. Hetzner LBs took its role.
|
||||||
|
- **v1 substrate foundation is entirely absent from the running cluster.** bascule, substrate-operator, chronicle, quartermaster, SPIRE — none running. Flux manifests exist (in the `guildhouse-deploy-talos-mirror` repo) but are blocked on a dependency chain rooted at missing `spire-system` namespace and a YAML decode error in `cluster-infra/10-cilium-values.yaml`. Unblocking this is real work that is NOT on the Red Hat path — governance integration is follow-up.
|
||||||
|
- **No existing Elixir/Phoenix deployment** to copy. Guildhall will be the first Phoenix app on this cluster.
|
||||||
|
- **Flux source is on GitHub (`guildhouse-deploy-talos-mirror`), not Forgejo.** Follows the same pattern as the substrate-project umbrella migration just completed — another GitHub→Forgejo item on the cleanup list, not blocking.
|
||||||
|
|
||||||
|
### Minimum path to Guildhall running at `guildhall.guildhouse.dev`
|
||||||
|
|
||||||
|
1. Dockerfile in `~/projects/substrate-project/guildhall/` — multi-stage with OTP 27, `mix release`
|
||||||
|
2. Build and push image to a registry (Forgejo container registry at `git.guildhouse.dev/tking/guildhall:v0.1.0` recommended for consistency)
|
||||||
|
3. Generate `SECRET_KEY_BASE` via `mix phx.gen.secret`
|
||||||
|
4. Create `guildhall` namespace; create `guildhall-secrets` Secret
|
||||||
|
5. Apply Deployment + Service + Postgres + PVC manifest (template from keycloak)
|
||||||
|
6. Wait for Hetzner LB to provision; note IPv4
|
||||||
|
7. Create Cloudflare DNS record `guildhall.guildhouse.dev` → LB IPv4 (proxied, so Cloudflare handles TLS)
|
||||||
|
8. Verify; run any first-time Ecto migration (through the release binary, since `mix` is not available in the image)
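Step 6 in the list above (waiting for the Hetzner LB and noting its IPv4) can be scripted. The following is a minimal sketch: it assumes the Service is named `guildhall` in the `guildhall` namespace and that the first `status.loadBalancer.ingress` entry is the IPv4 address, which may not hold if the IPv6 address is listed first. The function names and the `KUBECTL` override are illustrative.

```shell
#!/usr/bin/env sh
# Sketch only: poll a LoadBalancer Service until it reports an ingress IP.

# KUBECTL can be overridden, e.g. to a wrapper that adds --kubeconfig.
KUBECTL=${KUBECTL:-kubectl}

# lb_ipv4 <namespace> <service>: print the first ingress IP, or nothing.
lb_ipv4() {
  $KUBECTL get service "$2" -n "$1" \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}

# wait_for_lb_ipv4 <namespace> <service> [attempts] [sleep_seconds]
wait_for_lb_ipv4() {
  attempts=${3:-30}
  delay=${4:-10}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    ip=$(lb_ipv4 "$1" "$2") || ip=""
    if [ -n "$ip" ]; then
      printf '%s\n' "$ip"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "no LoadBalancer IP for $1/$2 after $attempts attempts" >&2
  return 1
}
```

The printed address is what the Cloudflare DNS record in step 7 would point at.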
|
||||||
|
|
||||||
|
No cluster infrastructure changes. No cert-manager Certificates. No Flux reconfiguration. No governance-stack dependency. Just the same Deployment-shaped pattern that Keycloak and Forgejo already use, applied to Guildhall.
|
||||||
|
|
||||||
|
Governance integration (CRD watchers, SPIFFE identity, Chronicle wiring, Accord enforcement) is explicitly follow-up work for after Guildhall is reachable and the Red Hat submission is in.
|
||||||
222
FORGEJO-REGISTRY-INVESTIGATION-2026-04-21.md
Normal file
|
|
@ -0,0 +1,222 @@
|
||||||
|
# Forgejo container registry — pre-enablement investigation
|
||||||
|
|
||||||
|
**Date:** 2026-04-21
|
||||||
|
**Scope:** Read-only audit of Forgejo's running state + registry configuration to determine what enablement work (if any) is needed before Guildhall's image push.
|
||||||
|
**Method:** `kubectl` + `curl` against `https://git.guildhouse.dev`. No mutations.
|
||||||
|
**Headline:** **The container registry is already enabled.** `/v2/` returns a standard OCI 401, storage headroom is ample (19.4 GB free on 20 GB PVC), and no Forgejo config change is required. Enablement work collapses to credential setup + `docker push`. Estimated time to operational registry for Guildhall: **~30 minutes.**
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. Forgejo deployment details
|
||||||
|
|
||||||
|
| Property | Value |
|---|---|
| Namespace | `forgejo` |
| Workload | `Deployment/forgejo` (1 replica, Running) |
| Image | `codeberg.org/forgejo/forgejo:9` |
| Running version | **9.0.3 (Gitea 1.22.0 base)**; confirmed via `GET /api/v1/version` |
| Scheduled node | `gsh-cp-01` (control-plane node, workloads permitted) |
| Companion | `Deployment/forgejo-postgres` (`postgres:16`, 1/1 Running) |
| Init container | `init-config` (renders `/data/gitea/conf/app.ini` from ConfigMap) |
| Runner | `Deployment/forgejo-runner` (0/1; scaled to zero, source of the Flux health-check warning) |
|
||||||
|
|
||||||
|
**Volume mounts on the forgejo container:** one PVC, `data: /data` (the root Forgejo data path; Forgejo 9.x uses `/data` internally, not `/var/lib/gitea` as older Gitea installs did).
|
||||||
|
|
||||||
|
**PVCs in the namespace:**
|
||||||
|
|
||||||
|
| PVC | Size | StorageClass | Mount |
|---|---|---|---|
| `forgejo-data` | 20 Gi | longhorn | `/data` on forgejo |
| `forgejo-db` | 10 Gi | longhorn | Postgres data |
| `runner-cache` | 5 Gi | longhorn | forgejo-runner (scaled to zero) |
|
||||||
|
|
||||||
|
## 2. Forgejo version and config state
|
||||||
|
|
||||||
|
### Version
|
||||||
|
|
||||||
|
`GET https://git.guildhouse.dev/api/v1/version` → `{"version":"9.0.3+gitea-1.22.0"}`
|
||||||
|
|
||||||
|
Forgejo 9.0.3 is a recent release. Container registry / OCI Distribution API support has been GA in Forgejo since the project forked from Gitea (Gitea 1.17+); this version fully supports the container package type.
|
||||||
|
|
||||||
|
### Configuration
|
||||||
|
|
||||||
|
`forgejo-config` ConfigMap contains the full `app.ini` (40 lines, managed by Flux at path `./k8s/forgejo` in the `guildhouse-deploy-talos-mirror` source repo). Notable sections:
|
||||||
|
|
||||||
|
- `[server]` — `DOMAIN=git.guildhouse.dev`, `ROOT_URL=https://git.guildhouse.dev/`, `HTTP_PORT=3000`, `SSH_PORT=22`, `SSH_LISTEN_PORT=2222`, `LFS_START_SERVER=true`
|
||||||
|
- `[service]` — `DISABLE_REGISTRATION=true` (invite-only signup)
|
||||||
|
- `[lfs]` — `STORAGE_TYPE=local`
|
||||||
|
- `[repository]`, `[actions]` — with an `ENABLED = true` that belongs to Actions, not Packages
|
||||||
|
- **No explicit `[packages]` section.** This is normal for Forgejo 9.x because packages (including container registry) are enabled by default without requiring config-level opt-in.
|
||||||
|
|
||||||
|
### Verification that container registry is live
|
||||||
|
|
||||||
|
The decisive probe is the OCI Distribution API endpoint root:
|
||||||
|
|
||||||
|
```
$ curl -sS -w '%{http_code}\n' https://git.guildhouse.dev/v2/
{"errors":[{"code":"UNAUTHORIZED","message":""}]}
401
```
|
||||||
|
|
||||||
|
This is **a standards-compliant OCI registry response** to an unauthenticated request. If the registry were disabled, Forgejo would serve 404 (the endpoint would not be registered). The 401 with a well-formed `errors` object means the registry is routing correctly and simply requires authentication — the default and correct behavior.
|
||||||
|
|
||||||
|
Equivalent probe against `/v2/_catalog` returns the same 401 shape.
|
||||||
|
|
||||||
|
API-layer probe `GET /api/v1/packages/tking` also returns 401 (`token is required`), consistent with packages being enabled but requiring auth.
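The reasoning behind these probes can be captured in a small helper: a `200` or `401` from `/v2/` indicates an OCI registry is answering, while a `404` suggests the endpoint is not served. This is a sketch under that interpretation; the function names and the wording of the verdicts are illustrative.

```shell
#!/usr/bin/env sh
# Sketch only: turn the HTTP status of GET /v2/ into a registry verdict.

# classify_registry_status <http_status>
classify_registry_status() {
  case "$1" in
    200) echo "enabled, anonymous access allowed" ;;
    401) echo "enabled, authentication required" ;;
    404) echo "not served at this path" ;;
    *)   echo "inconclusive (HTTP $1)" ;;
  esac
}

# probe_registry <base_url>: fetch only the status code of the /v2/ root.
probe_registry() {
  status=$(curl -sS -o /dev/null -w '%{http_code}' "$1/v2/") || return 1
  classify_registry_status "$status"
}
```

Running `probe_registry https://git.guildhouse.dev` against the state described here would be expected to report the authentication-required case.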
|
||||||
|
|
||||||
|
### Storage backend
|
||||||
|
|
||||||
|
No overridden `[packages.storage]` in app.ini, which means packages use the default local filesystem path under the Forgejo data volume: `/data/gitea/packages/` (or similar Forgejo 9.x path). This lives on `forgejo-data` (the Longhorn 20 Gi PVC), same volume as git repositories, LFS objects, and Forgejo's own state.
|
||||||
|
|
||||||
|
## 3. How Forgejo is managed
|
||||||
|
|
||||||
|
Forgejo is managed by **Flux**. A `Kustomization` `flux-system/forgejo` reconciles the manifests from:
|
||||||
|
|
||||||
|
- **Source:** `GitRepository/flux-system/guildhouse-deploy`
|
||||||
|
- **URL:** `https://github.com/gh-tking/guildhouse-deploy-talos-mirror`
|
||||||
|
- **Branch:** `main`
|
||||||
|
- **Path:** `./k8s/forgejo`
|
||||||
|
- **Current revision:** `main@169e077f`
|
||||||
|
- **Interval:** 1 minute
|
||||||
|
|
||||||
|
**Kustomization inventory** (what Flux claims to own in this path):
|
||||||
|
|
||||||
|
```
_forgejo__Namespace
forgejo_forgejo-config__ConfigMap  ← this is where app.ini lives
forgejo_runner-config__ConfigMap
forgejo_forgejo-secrets__Secret
forgejo_forgejo-http__Service
forgejo_forgejo-postgres__Service
forgejo_forgejo_apps_Deployment
forgejo_forgejo-postgres_apps_Deployment
forgejo_forgejo-runner_apps_Deployment
forgejo_forgejo-data__PersistentVolumeClaim
forgejo_forgejo-db__PersistentVolumeClaim
```
|
||||||
|
|
||||||
|
**Status:** `Ready: False` / `Healthy: False` because of a health-check timeout on `forgejo-runner`. The runner is a separate companion Deployment (not a sidecar of the Forgejo pod) with no ready replicas, so this is not a problem with core Forgejo. The core Forgejo Deployment is Ready, the registry is live, and the Kustomization is still applying new commits; only its health condition is stuck on the runner.
|
||||||
|
|
||||||
|
**Consequence:** if we ever needed to change Forgejo's `app.ini` (we don't, for registry work), the mechanism is to edit `k8s/forgejo/forgejo-config.yaml` in the `gh-tking/guildhouse-deploy-talos-mirror` GitHub repo, push to `main`, and wait for Flux to reconcile (1-minute interval). This path is functional today despite the runner health warning.
|
||||||
|
|
||||||
|
## 4. The cluster-infra Flux error
|
||||||
|
|
||||||
|
`kubectl describe kustomization cluster-infra -n flux-system`:
|
||||||
|
|
||||||
|
- **Suspend: true** (explicitly suspended by an operator earlier)
|
||||||
|
- **Source:** `guildhouse-deploy` GitRepository, path `./talos/manifests/cluster-infra`
|
||||||
|
- **Error message:**
|
||||||
|
|
||||||
|
```
failed to decode Kubernetes YAML from /tmp/kustomization-.../talos/manifests/cluster-infra/
10-cilium-values.yaml: missing Resource metadata <nil>
```
|
||||||
|
|
||||||
|
**Diagnosis:** `10-cilium-values.yaml` is a Helm values file being handed to kustomize-controller as if it were a raw Kubernetes manifest. The file doesn't have a `kind` or `metadata` — it's a values document intended to be consumed by `helm install --values`, not a standalone Kubernetes resource. Kustomize chokes because every file in a Kustomization source path is expected to be Resource-shaped.
|
||||||
|
|
||||||
|
**Fix severity:** trivial. One of:
|
||||||
|
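- Move `10-cilium-values.yaml` into a `values/` subdirectory that isn't referenced by `kustomization.yaml` (this only helps if the path has a `kustomization.yaml` with explicit `resources:`; without one, Flux generates a kustomization covering every manifest in the path and its subdirectories)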
|
||||||
|
- Rename the file so it doesn't get picked up (e.g., `10-cilium-values.yaml.hold`)
|
||||||
|
- Add a `kustomization.yaml` with explicit `resources:` that excludes it
|
||||||
|
- Replace the file with a proper `HelmRelease` CR that references the values externally
|
||||||
|
|
||||||
|
Any of these is a single-file source edit; Flux reconciles on the next push.
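The explicit-`resources:` option can be sketched as a small generator that lists every `*.yaml` in a directory except Helm values files. The `*-values.yaml` naming rule is an assumption drawn from the one file named in the error, and it presumes the remaining files in `cluster-infra` are genuine Kubernetes manifests, which this audit did not check.

```shell
#!/usr/bin/env sh
# Sketch only: generate a kustomization.yaml that excludes Helm values files.

# render_kustomization <dir>
# Lists each *.yaml in <dir> as a resource, skipping kustomization.yaml itself
# and anything matching *-values.yaml (assumed to be Helm values, not manifests).
render_kustomization() {
  dir=$1
  printf 'apiVersion: kustomize.config.k8s.io/v1beta1\n'
  printf 'kind: Kustomization\n'
  printf 'resources:\n'
  for path in "$dir"/*.yaml; do
    [ -e "$path" ] || continue
    file=$(basename "$path")
    case "$file" in
      kustomization.yaml|*-values.yaml) continue ;;
    esac
    printf '  - %s\n' "$file"
  done
}
```

Redirecting the output to `kustomization.yaml` in the same directory and committing it would be the single-file edit described above.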
|
||||||
|
|
||||||
|
**Time estimate:** ~30–60 minutes including the commit+push+reconcile+verify cycle. The main complication is that `cluster-infra` has `Suspend: true` — whoever suspended it did so deliberately (likely because the error was cascading to blocked downstream Kustomizations). Un-suspending should probably wait until the underlying YAML is fixed, otherwise the same error re-appears.
|
||||||
|
|
||||||
|
**Crucially: this error does NOT block Forgejo registry work or Guildhall deployment.** The two Kustomizations are independent. Guildhall deployment can proceed entirely outside the Flux chain (direct `kubectl apply` or a new Guildhall-specific Kustomization once registry+deploy are working). The cluster-infra/spire/quartermaster/bascule chain is substrate-foundation work that's explicitly follow-up.
|
||||||
|
|
||||||
|
## 5. Cluster image pull pattern
|
||||||
|
|
||||||
|
**No existing pattern for private-registry pulls.** The entire cluster currently pulls only from public registries:
|
||||||
|
- `quay.io/keycloak/keycloak:26.0`
|
||||||
|
- `codeberg.org/forgejo/forgejo:9`
|
||||||
|
- `postgres:16` (Docker Hub)
|
||||||
|
- `quay.io/cilium/cilium:v1.16.5` and `quay.io/cilium/cilium-envoy`
|
||||||
|
- Longhorn and Flux images (all public)
|
||||||
|
|
||||||
|
Specifically:
|
||||||
|
|
||||||
|
```
$ kubectl get secrets -A --field-selector type=kubernetes.io/dockerconfigjson
No resources found
```
|
||||||
|
|
||||||
|
Zero `dockerconfigjson` secrets cluster-wide. Zero `imagePullSecrets` referenced on any Deployment.
|
||||||
|
|
||||||
|
**Guildhall will be the first workload pulling from a private Forgejo registry.** It introduces the pattern, which then becomes the template for subsequent workloads. Two options:
|
||||||
|
|
||||||
|
1. **Make the `tking/guildhall` Forgejo package public.** Forgejo packages can be scoped public or private; a public container package allows anonymous pulls and no pull secret is needed. This matches the rest of the cluster's zero-pull-secret state. Appropriate if there's nothing sensitive in the image itself.
|
||||||
|
2. **Keep the package private and add a `dockerconfigjson` Secret.** Standard pattern: `kubectl create secret docker-registry guildhall-registry --docker-server=git.guildhouse.dev --docker-username=<user> --docker-password=<token>` in the `guildhall` namespace, then reference it in the Deployment's pod spec via `imagePullSecrets: [{name: guildhall-registry}]`.
|
||||||
|
|
||||||
|
Option 1 is simplest for v0.1. Option 2 is better hygiene long-term.
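
For option 2, the Secret that `kubectl create secret docker-registry` produces wraps a base64-encoded `.dockerconfigjson` document. The sketch below builds an equivalent payload by hand, which may be useful if the Secret is kept as a manifest rather than created imperatively. The helper names and any credentials passed in are illustrative assumptions, not values from the cluster, and the code assumes the inputs contain no characters that need JSON escaping.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; inputs must not need JSON escaping.

# Emit the JSON stored under the ".dockerconfigjson" key of a
# kubernetes.io/dockerconfigjson Secret.
make_dockerconfigjson() {
  server="$1"; user="$2"; token="$3"
  # "auth" is base64("user:token"); tr removes line wraps some base64 builds add.
  auth=$(printf '%s:%s' "$user" "$token" | base64 | tr -d '\n')
  printf '{"auths":{"%s":{"username":"%s","password":"%s","auth":"%s"}}}' \
    "$server" "$user" "$token" "$auth"
}

# Render a Secret manifest named guildhall-registry in the guildhall namespace.
render_pull_secret() {
  payload=$(make_dockerconfigjson "$1" "$2" "$3" | base64 | tr -d '\n')
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: guildhall-registry
  namespace: guildhall
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: ${payload}
EOF
}
```

Piping `render_pull_secret git.guildhouse.dev <user> <token>` into `kubectl apply -f -` would create the Secret once the `guildhall` namespace exists.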
## 6. Storage headroom on Forgejo's volume

`kubectl exec -n forgejo deployment/forgejo -- df -h` (inside the forgejo container):

```
/dev/longhorn/pvc-683ec33a-... 19.5G 137.2M 19.4G 1% /data
```

**Headroom is ample.** 19.4 GB free on a 20 GB PVC. Current Forgejo usage after 10 days is 137 MB (git repos + LFS + internal state).

A Guildhall container image (an Elixir release on debian-slim, typically 100-300 MB compressed per tag, with OCI layer deduplication across tags) would add roughly 1-3 GB of package storage over dozens of iterations. No pressure on the volume for the foreseeable future.
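
As a rough cross-check of that estimate, the worst case is easy to bound: assume no layer deduplication at all and divide the free space by the largest expected tag size. A minimal sketch in integer megabytes, with a hypothetical helper name:

```shell
#!/bin/sh
# Sketch only: tags_until_full is a hypothetical helper, not an existing tool.
# Worst case: every tag stores its full compressed size (no shared layers).
tags_until_full() {
  free_mb="$1"      # free space on the volume, in MB
  per_tag_mb="$2"   # compressed size of one image tag, in MB
  echo $(( free_mb / per_tag_mb ))
}

# 19.4 GB free (from the df output) at the upper end of 300 MB per tag.
tags_until_full 19400 300   # prints 64
```

Even under that pessimistic assumption the volume holds about 64 tags; shared base layers would push the real figure higher.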

**No resize required.** If long-term registry growth becomes an issue (multiple applications all pushing many tags, or large binary releases), Longhorn supports online expansion of the PVC, but that's a much-later concern.

---

## Synthesis

### Is the registry already enabled?

**Yes.** The `/v2/` and `/v2/_catalog` endpoints return proper OCI Distribution API responses (401 unauthenticated with well-formed `errors` objects). Forgejo 9.x enables packages by default; no `[packages]` config section is needed, and none is present. The registry is live and waiting for an authenticated client.
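
The probe behind this conclusion can be reproduced with a single unauthenticated request. The sketch below separates the interpretation from the network call. `classify_v2_status` is a hypothetical helper, and the mapping reflects how a `/v2/` response is commonly read (401 means the API is present but wants credentials, 404 suggests it is not served at all) rather than anything Forgejo-specific.

```shell
#!/bin/sh
# Sketch only: classify_v2_status is a hypothetical helper.
# Maps the HTTP status of GET /v2/ to a human-readable verdict.
classify_v2_status() {
  case "$1" in
    200) echo "registry enabled, anonymous access allowed" ;;
    401) echo "registry enabled, authentication required" ;;
    404) echo "registry API not served at this host" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}

# Usage against the live host (not executed here):
#   code=$(curl -s -o /dev/null -w '%{http_code}' https://git.guildhouse.dev/v2/)
#   classify_v2_status "$code"
```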
### What enablement work is required?

**None at the Forgejo-config layer.** The only work is client-side:

1. **Create a Forgejo Personal Access Token** (scope: `write:package`) via the Forgejo UI at `https://git.guildhouse.dev/-/user/settings/applications`
2. **Docker login from the build machine:** `docker login git.guildhouse.dev -u tking --password-stdin`, with the PAT supplied on stdin so it stays out of shell history and the process list
3. **Build + push** the Guildhall image: `docker build -t git.guildhouse.dev/tking/guildhall:v0.1.0 . && docker push …`
4. **Set package visibility** in Forgejo: public (anon-pull, no imagePullSecret needed) or private (create a `dockerconfigjson` Secret in the `guildhall` namespace, reference it in the Deployment)

No Flux source edits. No Kustomization changes. No ConfigMap changes. No `cluster-infra` unblock required.
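
The client-side steps above can be collected into one script. This is a sketch under assumptions: the `FORGEJO_PAT` and `DRY_RUN` environment variables and the `run` and `push_guildhall` helpers are illustrative names, not part of any existing tooling.

```shell
#!/bin/sh
# Sketch only: variable and function names are hypothetical.
IMAGE="git.guildhouse.dev/tking/guildhall:v0.1.0"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

push_guildhall() {
  # Login reads the token from stdin so it never appears in argv.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ docker login git.guildhouse.dev -u tking --password-stdin"
  else
    printf '%s' "$FORGEJO_PAT" |
      docker login git.guildhouse.dev -u tking --password-stdin || return 1
  fi
  run docker build -t "$IMAGE" . || return 1
  run docker push "$IMAGE"
}
```

Setting `DRY_RUN=1` before calling `push_guildhall` prints the three commands without touching Docker or the registry, which is a cheap way to review the sequence first.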
### Is the `cluster-infra` Flux error a blocker?

**No.** The Forgejo registry operates entirely outside the cluster-infra / spire / quartermaster / bascule Flux chain. Forgejo is managed by its own independent Kustomization (`flux-system/forgejo`), which is successfully reconciling against source revisions even though its Ready condition is flagged False by the unrelated forgejo-runner health check.

The `cluster-infra` error is real and worth fixing separately (a trivial single-file fix in the GitHub source repo), but it has zero coupling to registry enablement or Guildhall deployment. Treat it as a cleanup backlog item, not a prerequisite.
### Estimated time to registry operational

| Step | Time |
|---|---|
| Create Forgejo PAT (Forgejo UI) | 2 min |
| `docker login git.guildhouse.dev` | <1 min |
| Dockerfile + `mix release` setup in Guildhall repo | 15-20 min (real work) |
| `docker build` (cold build for Elixir + OTP + mix deps + assets) | 5-10 min |
| `docker push` | 1-3 min (single tag, ~200 MB compressed) |
| Set package visibility (public or private + pull secret) | 2-5 min |
| **Total to first successful image in the registry** | **~30-45 min** |

Most of the time is the Dockerfile + release-build setup, not the registry interaction itself.
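
Summing the per-step ranges in the table gives a quick sanity check on the total. In this sketch the `<1 min` login step is counted as 0 to 1 minute, which is an assumption, and `sum_range` is a hypothetical helper.

```shell
#!/bin/sh
# Sketch only: sum_range is a hypothetical helper.
# Each argument is a "low-high" range in minutes, one per table row.
sum_range() {
  low=0; high=0
  for r in "$@"; do
    low=$(( low + ${r%-*} ))    # part before the dash
    high=$(( high + ${r#*-} ))  # part after the dash
  done
  echo "${low}-${high}"
}

# PAT, login, Dockerfile setup, build, push, visibility
sum_range 2-2 0-1 15-20 5-10 1-3 2-5   # prints 25-41
```

The raw sum is 25 to 41 minutes, so the table's ~30-45 min total reads as a slightly padded rounding of the same figures.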
### Recommended next step

**Build the Guildhall Dockerfile and push a first image.** Sequencing:

1. Author `Dockerfile` in `~/projects/substrate-project/guildhall/`: multi-stage (Elixir 1.17.3/OTP 27 builder → debian-slim runtime, `mix release`, non-root user, expose 4000, healthcheck endpoint)
2. Author `.dockerignore` that excludes `_build/`, `deps/`, `.git/`, and `priv/static/` (if built separately), matching Phoenix release conventions
3. Create Forgejo PAT with `write:package` scope
4. `docker login git.guildhouse.dev` from the desktop
5. `docker build -t git.guildhouse.dev/tking/guildhall:v0.1.0 .`
6. `docker push git.guildhouse.dev/tking/guildhall:v0.1.0`
7. Verify via Forgejo UI at `https://git.guildhouse.dev/tking/-/packages/container/guildhall` and via `curl` to `/v2/tking/guildhall/manifests/v0.1.0` (authenticated)
8. Decide package visibility, and if private, create the `guildhall-registry` Secret in the `guildhall` namespace (the namespace doesn't exist yet; create it at deploy time)

The Kubernetes-side deploy (Deployment + Service + Postgres + PVC + Secret) proceeds in parallel with or immediately after the image build, following the Keycloak pattern captured in the earlier `DEPLOY-EXPLORATORY-2026-04-21.md`.

No pre-work needed on Forgejo itself. The registry is ready.
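
The `curl` check in step 7 needs the manifest URL for the pushed tag. A minimal sketch of deriving it from the image reference with POSIX parameter expansion follows; `manifest_url` is a hypothetical helper, and it assumes a reference of the form `host/owner/name:tag` with no port in the host part.

```shell
#!/bin/sh
# Sketch only: manifest_url is a hypothetical helper.
# Assumes "host/path:tag" with no ":port" in the host part.
manifest_url() {
  ref="$1"
  tag="${ref##*:}"     # text after the last colon  -> v0.1.0
  repo="${ref%:*}"     # text before the last colon -> git.guildhouse.dev/tking/guildhall
  host="${repo%%/*}"   # first path segment         -> git.guildhouse.dev
  path="${repo#*/}"    # everything after the host  -> tking/guildhall
  echo "https://${host}/v2/${path}/manifests/${tag}"
}

manifest_url git.guildhouse.dev/tking/guildhall:v0.1.0
```

The printed URL can then be passed to an authenticated `curl` call to confirm the tag resolves.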