guildhall/DEPLOY-RUNBOOK.md
Tyler J King c6f1d07ed9 feat(deploy): Dockerfile + k8s manifests for Talos deployment
Multi-stage Elixir/OTP Dockerfile, Kubernetes manifests following
Keycloak pattern, mix release migration module, and deploy runbook.

Target: guildhall.guildhouse.dev via Hetzner LB + Cloudflare (orange
cloud). Forgejo container registry at git.guildhouse.dev/tking/guildhall.

Not yet deployed; artifacts only. See DEPLOY-RUNBOOK.md for execution.

Artifacts produced:

- Dockerfile — multi-stage, Elixir 1.17.3 / OTP 27.1.2, debian-bookworm
  builder + debian-bookworm-slim runtime. Dep-layer caching via
  explicit apps/*/mix.exs copy before source. Asset pipeline runs
  mix assets.setup + mix assets.deploy (tailwind + esbuild + phx.digest).
  Non-root uid 1000, tini as pid-1, HEALTHCHECK against /health.
- .dockerignore — excludes _build/, deps/, k8s/, .git/, test artifacts,
  and apps/guildhall_web/priv/static/assets/ (regenerated by phx.digest
  inside the builder).
- apps/guildhall_web/.../router.ex — adds `/health` route under :api
  pipeline. Unauthenticated by design (Kubernetes probes + LB target).
- apps/guildhall_web/.../controllers/health_controller.ex — shallow
  health: Phoenix up + Ecto pool can `SELECT 1`. Returns 200 ok or 503
  degraded with reason.
- apps/guildhall_ops_db/lib/guildhall/ops_db/release.ex — Release
  module for migrations. `Guildhall.OpsDb.Release.migrate/0` and
  `rollback/2`. Called from the migration Job via
  `bin/guildhall eval`. Module path reflects actual repo location
  (repo is `Guildhall.OpsDb.Repo` in `:guildhall_ops_db`, not the
  prompt's suggested `Guildhall.Repo`).

Kubernetes manifests in k8s/ (numbered for apply order):
  00-namespace.yaml                  — guildhall namespace w/ guildhouse labels
  10-registry-secret-template.yaml   — doc-only template for dockerconfigjson
  20-postgres-pvc.yaml               — 5Gi longhorn RWO
  30-postgres-deployment.yaml        — postgres:16, keycloak-matched resources
                                       + pg_isready probes, PGDATA subpath
  40-postgres-service.yaml           — ClusterIP :5432
  50-guildhall-secrets-template.yaml — doc-only template for app + DB secrets
  60-migration-job.yaml              — ecto migration Job, name includes tag
                                       for per-deploy uniqueness, TTL 24h
  70-guildhall-deployment.yaml       — RollingUpdate maxSurge 1 maxUnavailable 0,
                                       /health probes, 200m/256Mi requests
                                       and 1/1Gi limits, 5s preStop sleep
  80-guildhall-service.yaml          — LoadBalancer with exact Keycloak-
                                       matched Hetzner annotations
                                       (location nbg1, type lb11, name
                                       guildhall, use-private-ip false),
                                       port 80 origin (Cloudflare TLS)

- DEPLOY-RUNBOOK.md — 6-phase deploy sequence (build + push, cluster
  prep, DB, migrate, app rollout, DNS + smoke), iteration helper with
  sed-based tag-bump, rollback procedure (image rollback, schema
  rollback via Release.rollback, full teardown), and v0.1 limitations
  (Cloudflare-edge TLS not cluster-terminated; no Flux integration;
  no OIDC wiring; no substrate CRD integration; single replica).

Decisions made during artifact production that weren't explicit in
the prompt:

- Release module name is `Guildhall.OpsDb.Release` (not
  `Guildhall.Release`) matching the actual repo namespace. Migration
  Job command adjusted to `Guildhall.OpsDb.Release.migrate()`.
- Dockerfile uses `-slim` builder variant (not the full bookworm
  builder) to keep the builder stage closer to the runtime image
  size, reducing multi-stage layer transfer during build.
- Asset compilation runs `mix assets.setup` before `mix assets.deploy`
  so tailwind + esbuild binaries install cleanly inside the container
  (the dev-only :runtime flag on those deps means they need explicit
  install in a prod builder).
- tini added as pid-1 in the runtime stage. Not in the prompt, but
  standard-practice for OTP containers to ensure signal propagation
  and zombie reaping under Kubernetes.
- Rolling update strategy: maxSurge 1 / maxUnavailable 0 (zero-
  downtime rollout at replicas=1; the new pod comes up alongside the
  old, health-checks, then the old is terminated). Matches typical
  single-replica LiveView pattern.
- preStop `sleep 5` — gives in-flight HTTP + LiveView connections a
  grace window before termination.
- Hetzner LB annotations: verified exact set from cluster keycloak
  service — location=nbg1, name=guildhall, type=lb11,
  use-private-ip=false. The prompt asked about uses-proxyprotocol
  and algorithm-type; neither is set on Keycloak's service and both
  are omitted here for consistency.
- Migration Job name includes the tag (`guildhall-migrate-v0-1-0`) so
  multiple deploys don't collide on Job name reuse. Runbook documents
  the sed helper to bump both the image tag and the Job name for
  subsequent deploys.
- Both exploratory docs (`DEPLOY-EXPLORATORY-2026-04-21.md`,
  `FORGEJO-REGISTRY-INVESTIGATION-2026-04-21.md`) are currently
  untracked in the repo. They're left out of this commit per the
  prompt's explicit `git add` list. They can be committed separately
  (or ignored) at Tyler's discretion.

Not done tonight (per prompt's NOT PERMITTED list):
- docker build / docker push
- kubectl apply of any manifest
- Forgejo PAT creation
- Cloudflare DNS changes
- git push (this commit is local-only pending review)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler J King <tking@guildhouse.dev>
2026-04-22 04:00:40 -04:00


Guildhall deploy runbook

Target: guildhall.guildhouse.dev on the Hetzner Talos cluster, via the Forgejo container registry at git.guildhouse.dev/tking/guildhall.

Pattern: direct kubectl apply against the cluster; Flux integration deferred. TLS terminates at Cloudflare (orange cloud); origin is plain HTTP on the Hetzner LB.

Required reference docs: DEPLOY-EXPLORATORY-2026-04-21.md (cluster state), FORGEJO-REGISTRY-INVESTIGATION-2026-04-21.md (registry state).

Tag referenced throughout this runbook: v0.1.0. When deploying a subsequent tag, substitute it throughout or use the sed helper under "Iterating on subsequent tags".


Prerequisites

  • kubectl configured against the Talos cluster (KUBECONFIG=~/projects/substrate-project/guildhouse-talos-bootstrap/kubeconfig)
  • docker available on the build host with enough disk for an Elixir build image (~2 GB)
  • Cloudflare account access for guildhouse.dev DNS
  • Forgejo account tking at git.guildhouse.dev

Phase 1 — Build and push the image

1.1 Create a Forgejo Personal Access Token

Navigate to https://git.guildhouse.dev/-/user/settings/applications. Generate a new token:

  • Token name: guildhall-registry-push (or similar)
  • Scopes: write:package (write access includes read, so this token can both push and pull; scope down to read:package for a separate in-cluster-pull token if splitting)
  • Expiry: operator's choice; 30-90 days is reasonable for the push token

Copy the token value immediately (Forgejo won't show it again). Save it in your password manager.

1.2 Docker login

docker login git.guildhouse.dev -u tking
# paste PAT when prompted

Verify with cat ~/.docker/config.json | jq '.auths | keys'; git.guildhouse.dev should appear in the output.

1.3 Build the image

cd /home/tking/projects/substrate-project/guildhall
docker build -t git.guildhouse.dev/tking/guildhall:v0.1.0 .

Cold build takes ~5-10 minutes (mix deps + erlang compile + tailwind + esbuild + phx.digest + mix release). Subsequent builds hit Docker layer cache and are much faster.

Verify the image runs before pushing:

docker run --rm -it --entrypoint /bin/sh \
  git.guildhouse.dev/tking/guildhall:v0.1.0 \
  -c 'ls -la /app/bin && /app/bin/guildhall version'

Expected: the guildhall release binary is present and version returns the release version without error.

1.4 Push to Forgejo registry

docker push git.guildhouse.dev/tking/guildhall:v0.1.0

1.5 Verify image is in the registry

Via Forgejo UI: https://git.guildhouse.dev/tking/-/packages → should list guildhall with a v0.1.0 tag.

Via registry API (authenticated):

curl -sS -u tking:<PAT> https://git.guildhouse.dev/v2/tking/guildhall/tags/list
# → {"name":"tking/guildhall","tags":["v0.1.0"]}
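The registry check can also be scripted so that a deploy aborts early when the tag is missing. The following is a minimal sketch, assuming the response has the compact shape shown above (no whitespace inside the JSON). `tag_in_list` is a hypothetical helper name, and it uses plain text matching rather than a real JSON parser.

```shell
# tag_in_list JSON TAG
# Succeeds if TAG appears as a quoted entry in the "tags" array of a
# /v2/<name>/tags/list response. Text matching only; assumes compact JSON.
tag_in_list() {
  json=$1
  tag=$2
  # Isolate the contents of the tags array, e.g. "v0.1.0","v0.1.1"
  tags=$(printf '%s' "$json" | sed -n 's/.*"tags":\[\([^]]*\)].*/\1/p')
  case ",$tags," in
    *",\"$tag\","*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (PAT is assumed to hold the token from Phase 1.1):
#   resp=$(curl -sS -u "tking:$PAT" https://git.guildhouse.dev/v2/tking/guildhall/tags/list)
#   tag_in_list "$resp" v0.1.0 || echo "v0.1.0 not found in registry" >&2
```

Because the match includes the surrounding quotes, a partial tag such as v0.1 does not match v0.1.0.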

1.6 Decide package visibility

In the Forgejo UI, for the new guildhall container package:

  • Private (default, recommended for tonight): cluster needs guildhall-registry pull secret (Phase 2.2 below creates it)
  • Public: anonymous pulls work; skip Phase 2.2 and remove imagePullSecrets from k8s/60-migration-job.yaml and k8s/70-guildhall-deployment.yaml before applying

Phase 2 — Cluster-side preparation

2.1 Create the namespace

kubectl apply -f k8s/00-namespace.yaml

Verify: kubectl get ns guildhall should show STATUS Active.

2.2 Create the registry pull secret (if package is private)

kubectl create secret docker-registry guildhall-registry \
  --docker-server=git.guildhouse.dev \
  --docker-username=tking \
  --docker-password='<PAT-with-read:package>' \
  --namespace=guildhall

Optionally use a read-only PAT here instead of the push PAT from Phase 1.1. Skip this step entirely if the package is public.

2.3 Create the database credentials secret

Generate a strong password and save it to your password manager before running:

DB_PASSWORD="$(openssl rand -base64 32 | tr -d '/+=' | head -c 32)"
echo "Save this: $DB_PASSWORD"

kubectl create secret generic guildhall-db-credentials \
  --from-literal=POSTGRES_DB=guildhall \
  --from-literal=POSTGRES_USER=guildhall \
  --from-literal=POSTGRES_PASSWORD="$DB_PASSWORD" \
  --namespace=guildhall

2.4 Create the application secrets

SECRET_KEY_BASE="$(cd /home/tking/projects/substrate-project/guildhall && mix phx.gen.secret)"

kubectl create secret generic guildhall-app-secrets \
  --from-literal=SECRET_KEY_BASE="$SECRET_KEY_BASE" \
  --from-literal=DATABASE_URL="ecto://guildhall:$DB_PASSWORD@guildhall-postgres:5432/guildhall" \
  --namespace=guildhall

Verify secrets exist:

kubectl get secrets -n guildhall
# expect: guildhall-registry, guildhall-db-credentials, guildhall-app-secrets
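Phase 2.4 embeds the password from Phase 2.3 directly in DATABASE_URL, so it has to still be set in the same shell and must not contain characters that would need URL-encoding (presumably why the generator strips `/`, `+`, and `=`). A minimal sketch of a guard for this follows; `build_database_url` is a hypothetical helper and is not part of the repo.

```shell
# build_database_url USER PASSWORD HOST DB
# Prints an ecto:// URL, or fails if the password is empty or contains
# anything other than letters and digits (which would need URL-encoding).
build_database_url() {
  user=$1 pass=$2 host=$3 db=$4
  if [ -z "$pass" ]; then
    echo "password is empty (was DB_PASSWORD set in this shell?)" >&2
    return 1
  fi
  case $pass in
    *[!A-Za-z0-9]*)
      echo "password contains characters that need URL-encoding" >&2
      return 1 ;;
  esac
  printf 'ecto://%s:%s@%s:5432/%s\n' "$user" "$pass" "$host" "$db"
}

# Usage:
#   DATABASE_URL=$(build_database_url guildhall "$DB_PASSWORD" guildhall-postgres guildhall) || exit 1
```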

Phase 3 — Database provisioning

3.1 Apply Postgres PVC, Deployment, Service

kubectl apply -f k8s/20-postgres-pvc.yaml
kubectl apply -f k8s/30-postgres-deployment.yaml
kubectl apply -f k8s/40-postgres-service.yaml

3.2 Wait for Postgres Ready

kubectl rollout status deployment/guildhall-postgres -n guildhall --timeout=5m
kubectl wait --for=condition=Ready pod \
  -l app=guildhall-postgres -n guildhall --timeout=3m

Verify it accepts connections:

kubectl exec -n guildhall deployment/guildhall-postgres -- \
  pg_isready -U guildhall
# → /var/run/postgresql:5432 - accepting connections

Phase 4 — Schema migration

4.1 Run the migration Job

kubectl apply -f k8s/60-migration-job.yaml

4.2 Wait for Job completion

kubectl wait --for=condition=complete job/guildhall-migrate-v0-1-0 \
  -n guildhall --timeout=3m

4.3 Verify migration output

kubectl logs job/guildhall-migrate-v0-1-0 -n guildhall

Look for Migrations already up (no-op if Guildhall has no migrations yet) or a list of == Running 20xx... / == Migrated entries.

If the Job fails, inspect events + logs:

kubectl describe job guildhall-migrate-v0-1-0 -n guildhall
kubectl logs job/guildhall-migrate-v0-1-0 -n guildhall

Common failures and remediation: DATABASE_URL pointing at a wrong host (check guildhall-app-secrets); Postgres not yet accepting auth (wait longer); migration SQL error (fix in source, rebuild image, re-push, re-apply Job).
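kubectl wait --for=condition=complete returns early only on success; if the Job fails, it keeps waiting until the timeout expires. The sketch below polls for either outcome. `wait_for_job` and the `POLL_INTERVAL` variable are hypothetical names, and the default of 36 polls at 5 seconds is chosen only to match the 3-minute timeout used above.

```shell
# wait_for_job JOB NAMESPACE [TRIES]
# Returns 0 when the Job reports Complete, 1 when it reports Failed,
# and 2 if neither condition appears within TRIES polls.
wait_for_job() {
  job=$1 ns=$2 tries=${3:-36}
  while [ "$tries" -gt 0 ]; do
    complete=$(kubectl get job "$job" -n "$ns" \
      -o 'jsonpath={.status.conditions[?(@.type=="Complete")].status}')
    failed=$(kubectl get job "$job" -n "$ns" \
      -o 'jsonpath={.status.conditions[?(@.type=="Failed")].status}')
    [ "$complete" = "True" ] && return 0
    [ "$failed" = "True" ] && return 1
    tries=$((tries - 1))
    sleep "${POLL_INTERVAL:-5}"
  done
  return 2
}

# Usage:
#   wait_for_job guildhall-migrate-v0-1-0 guildhall || kubectl logs job/guildhall-migrate-v0-1-0 -n guildhall
```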


Phase 5 — Application deployment

5.1 Apply Guildhall Deployment + Service

kubectl apply -f k8s/70-guildhall-deployment.yaml
kubectl apply -f k8s/80-guildhall-service.yaml

5.2 Wait for Deployment rollout

kubectl rollout status deployment/guildhall -n guildhall --timeout=5m

If this hangs, check pod events + logs:

kubectl get pods -n guildhall
kubectl describe pod -n guildhall -l app=guildhall
kubectl logs -n guildhall -l app=guildhall --tail=100

5.3 Obtain the LoadBalancer IP

Hetzner CCM provisions a new LB; allow 30-90 seconds after the Service is applied.

kubectl get svc guildhall -n guildhall -w
# ^C once EXTERNAL-IP transitions from <pending> to a public address

Record the IPv4 in EXTERNAL-IP. IPv6 will also be assigned; note both.
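Once the addresses are assigned, they can be read without watching the table output. A minimal sketch, assuming the Hetzner CCM publishes both address families as entries under status.loadBalancer.ingress; the helper names are hypothetical.

```shell
# lb_addresses SERVICE NAMESPACE
# Prints every ingress IP assigned to the Service, one per line.
lb_addresses() {
  kubectl get svc "$1" -n "$2" \
    -o 'jsonpath={.status.loadBalancer.ingress[*].ip}' | tr ' ' '\n'
}

# First address without a colon is treated as IPv4.
lb_ipv4() { lb_addresses "$1" "$2" | grep -v ':' | head -n 1; }

# First address containing a colon is treated as IPv6.
lb_ipv6() { lb_addresses "$1" "$2" | grep ':' | head -n 1; }

# Usage:
#   echo "A    -> $(lb_ipv4 guildhall guildhall)"
#   echo "AAAA -> $(lb_ipv6 guildhall guildhall)"
```

Both helpers print nothing while EXTERNAL-IP is still pending, so an empty result means the LB is not ready yet.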


Phase 6 — DNS + end-to-end verification

6.1 Create Cloudflare DNS records

In the Cloudflare dashboard for guildhouse.dev (or via flarectl / terraform if automated), create:

  • A record: guildhall → <Hetzner-LB-IPv4>, proxied (orange cloud)
  • AAAA record (optional, recommended): guildhall → <Hetzner-LB-IPv6>, proxied

Proxied is load-bearing: it's what provides TLS. Do NOT grey-cloud this record.

6.2 Smoke test

Allow Cloudflare's edge to pick up the record (1-2 minutes).

# Health endpoint — unauthenticated, should return 200
curl -sS -w '\n-- HTTP %{http_code} --\n' https://guildhall.guildhouse.dev/health

# Root: headers only (-I), should return 200
curl -sS -w '\n-- HTTP %{http_code} --\n' -I https://guildhall.guildhouse.dev/

Expected: /health returns 200 with {"status":"ok","checks":{"db":"ok"}}; the HEAD request to / returns 200 (drop -I to see Phoenix's rendered HTML).
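For repeated deploys the health check can be turned into a pass/fail command. This is a sketch only: `health_ok` and `check_health` are hypothetical helpers, the expected body shape is taken from the output quoted above, and the body is matched as text rather than parsed as JSON.

```shell
# health_ok HTTP_CODE BODY
# Succeeds only for a 200 response whose body reports "status":"ok".
health_ok() {
  [ "$1" = "200" ] || return 1
  case $2 in
    *'"status":"ok"'*) return 0 ;;
    *) return 1 ;;
  esac
}

# check_health URL
# Fetches URL; curl appends the status code as a final line via -w.
check_health() {
  out=$(curl -sS -w '\n%{http_code}' "$1") || return 1
  code=$(printf '%s\n' "$out" | tail -n 1)   # last line: status code
  body=$(printf '%s\n' "$out" | sed '$d')    # everything before it: body
  health_ok "$code" "$body"
}

# Usage:
#   check_health https://guildhall.guildhouse.dev/health && echo "healthy"
```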

6.3 Manual walkthrough

In a browser, visit https://guildhall.guildhouse.dev/:

  • Dashboard LiveView should render
  • /ceremonies and /artifacts should render (will be empty — no data yet)
  • No certificate warnings (Cloudflare-terminated TLS)

Iterating on subsequent tags

For v0.1.1, v0.1.2, etc.:

  1. Build + push the new image
  2. Update the image: tag in k8s/60-migration-job.yaml and k8s/70-guildhall-deployment.yaml
  3. Update the Job name in k8s/60-migration-job.yaml (e.g. guildhall-migrate-v0-1-1)
  4. kubectl apply -f k8s/60-migration-job.yaml — run the new migration Job
  5. kubectl apply -f k8s/70-guildhall-deployment.yaml — rolling update of Guildhall

A sed helper to bump everything at once:

OLD=v0.1.0; NEW=v0.1.1
sed -i "s|guildhall:${OLD}|guildhall:${NEW}|g" \
    k8s/60-migration-job.yaml k8s/70-guildhall-deployment.yaml
sed -i "s|guildhall-migrate-${OLD//./-}|guildhall-migrate-${NEW//./-}|g" \
    k8s/60-migration-job.yaml
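The helper above relies on bash's `${VAR//./-}` expansion and treats the dots in the old tag as regex wildcards. A POSIX-sh variant that escapes the dots and avoids `sed -i` might look like the following; `migrate_job_name` and `bump_tag` are hypothetical names, and the Job naming scheme is assumed to be the guildhall-migrate-<tag> form used in this runbook.

```shell
# migrate_job_name TAG: v0.1.1 -> guildhall-migrate-v0-1-1
migrate_job_name() {
  printf 'guildhall-migrate-%s\n' "$(printf '%s' "$1" | tr '.' '-')"
}

# bump_tag OLD NEW FILE
# Rewrites the image tag and the migration Job name in FILE.
bump_tag() {
  old=$1 new=$2 file=$3
  old_re=$(printf '%s' "$old" | sed 's/\./\\./g')   # match dots literally
  old_job=$(migrate_job_name "$old")
  new_job=$(migrate_job_name "$new")
  tmp=$(mktemp) || return 1
  if sed -e "s|guildhall:${old_re}|guildhall:${new}|g" \
         -e "s|${old_job}|${new_job}|g" "$file" > "$tmp"; then
    cat "$tmp" > "$file"   # overwrite in place, keeping file permissions
    rc=$?
  else
    rc=1
  fi
  rm -f "$tmp"
  return "$rc"
}

# Usage:
#   for f in k8s/60-migration-job.yaml k8s/70-guildhall-deployment.yaml; do
#     bump_tag v0.1.0 v0.1.1 "$f"
#   done
```

Running git diff on k8s/ afterwards is a cheap way to confirm that only the tag and the Job name changed.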

Rollback

Back out the current deployment

Rolling back to a prior image tag (assuming the prior tag is still in the registry):

kubectl set image -n guildhall deployment/guildhall \
  guildhall=git.guildhouse.dev/tking/guildhall:<prior-tag>
kubectl rollout status -n guildhall deployment/guildhall

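Schema rollback (only if the current deploy introduced migrations that need to be reverted). A one-off kubectl run pod does not inherit the Deployment's environment, so the values from guildhall-app-secrets are passed explicitly; this assumes the release reads DATABASE_URL and SECRET_KEY_BASE at boot, as the secret created in Phase 2.4 suggests. Add any other variables config/runtime.exs requires.

DATABASE_URL="$(kubectl get secret guildhall-app-secrets -n guildhall -o jsonpath='{.data.DATABASE_URL}' | base64 -d)"
SECRET_KEY_BASE="$(kubectl get secret guildhall-app-secrets -n guildhall -o jsonpath='{.data.SECRET_KEY_BASE}' | base64 -d)"

kubectl run guildhall-rollback --rm -it --restart=Never \
  --image=git.guildhouse.dev/tking/guildhall:<current-tag> \
  --env="DATABASE_URL=$DATABASE_URL" \
  --env="SECRET_KEY_BASE=$SECRET_KEY_BASE" \
  --overrides='{"spec":{"imagePullSecrets":[{"name":"guildhall-registry"}]}}' \
  -n guildhall -- \
  /app/bin/guildhall eval "Guildhall.OpsDb.Release.rollback(Guildhall.OpsDb.Repo, <migration_version>)"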

Tear down the whole deployment

# Delete in reverse apply order. Deleting the namespace cascades to
# everything in it (Deployments, Services, Pods, Secrets, and the PVC),
# so it ALSO destroys the database. For a non-destructive teardown,
# keep the namespace and the PVC and delete the other resources below.

kubectl delete svc guildhall -n guildhall                # triggers Hetzner LB deprovision
kubectl delete deployment guildhall -n guildhall
kubectl delete job -l app.kubernetes.io/name=guildhall,app.kubernetes.io/component=migration -n guildhall
kubectl delete deployment guildhall-postgres -n guildhall
kubectl delete svc guildhall-postgres -n guildhall

# PVC delete is destructive (Longhorn reclaim policy is Delete).
# Uncomment only if the database state should be destroyed:
# kubectl delete pvc guildhall-db -n guildhall

kubectl delete secret guildhall-registry guildhall-db-credentials guildhall-app-secrets -n guildhall

# Finally the namespace itself (retained if you want to keep PVC):
# kubectl delete namespace guildhall

Remove the Cloudflare DNS record for guildhall.guildhouse.dev if fully tearing down.
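If the database volume should survive even a later PVC or namespace deletion, one option is to switch the bound PersistentVolume to the Retain reclaim policy before tearing anything down. A minimal sketch, assuming the PVC is named guildhall-db as in the commands above; `retain_pv_for_pvc` is a hypothetical helper.

```shell
# retain_pv_for_pvc PVC NAMESPACE
# Looks up the PersistentVolume bound to PVC and sets its reclaim policy
# to Retain, so deleting the PVC no longer deletes the underlying volume.
retain_pv_for_pvc() {
  pvc=$1 ns=$2
  pv=$(kubectl get pvc "$pvc" -n "$ns" -o 'jsonpath={.spec.volumeName}') || return 1
  if [ -z "$pv" ]; then
    echo "PVC $pvc in namespace $ns is not bound to a volume" >&2
    return 1
  fi
  kubectl patch pv "$pv" -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
}

# Usage:
#   retain_pv_for_pvc guildhall-db guildhall
```

With Retain set, deleting the PVC leaves the PersistentVolume in the Released state rather than removing it; reattaching it to a new claim is a separate manual step.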


Known v0.1 limitations

  • Cloudflare-edge TLS, not cluster-terminated. Upgrading to a cert-manager Certificate with in-cluster TLS is a hygiene follow-up once the first deploy stabilizes. The letsencrypt-prod ClusterIssuer is already in place.
  • No Flux integration. Direct kubectl apply is the deploy mechanism for v0.1. A Flux Kustomization for Guildhall is a follow-up, especially once the broader Flux chain (cluster-infra, spire, quartermaster) is healed.
  • No OIDC / Keycloak integration. Guildhall's config/runtime.exs has commented-out OIDC env vars; wiring them to the existing auth.guildhouse.dev Keycloak is a follow-up.
  • No substrate CRD integration. The CeremonyOrchestrator and ChronicleConsumer stubs are not yet watching real substrate CRDs — those integrations land after the substrate foundation is reconciling on this cluster.
  • Single replica. Safe for LiveView (no cluster sticky-session concerns at replicas=1). Scale once DNS cluster / horizontal-pod-autoscaler is configured.