Phase-0 hazard: floating staging/production OCI tags are reapable while HRs are suspended, can freeze go-live deploy #3

Closed
opened 2026-06-02 04:55:52 +03:00 by oleks · 2 comments

Problem

The cms-plugins deploy pins the pod image by digest, but that digest is resolved by Flux through a floating <branch> OCI tag (staging / production). During Phase-0 the HelmReleases are suspend: true with zero running pods, so those floating tags are not referenced by any live workload. Because gitea-oci-cleanup is fleet-aware and only auto-pins tags referenced by LIVE k8s workloads, the floating staging/production tags are currently reapable. If the cleaner reaps the floating tag before a deliberate flux resume, the ImagePolicy will have no tag to resolve at go-live → ImageUpdateAutomation cannot write a digest → the HelmRelease cannot roll a real image. Result: a frozen/failed first deploy that looks mysterious because the immutable 0.1.<N> audit tags still exist in the registry.
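The reapability window can be modelled in a few lines. This is a minimal sketch, not the real gitea-oci-cleanup logic: `FLOATING_TAGS` and `reapable_floating_tags` are hypothetical names, and the only rule assumed is the one stated above, that a tag is protected only while a live workload references it.

```python
# Hypothetical model of the hazard; not taken from gitea-oci-cleanup.
FLOATING_TAGS = ("staging", "production")


def reapable_floating_tags(live_refs):
    """Return the floating tags that no live workload references."""
    return [tag for tag in FLOATING_TAGS if tag not in live_refs]


# Phase-0: HelmReleases suspended, zero pods, so nothing is referenced.
phase0 = reapable_floating_tags(live_refs=set())

# After a resume, running pods reference both channels (assumed state).
resumed = reapable_floating_tags(live_refs={"staging", "production"})

print(phase0)   # both floating tags are unprotected
print(resumed)  # nothing left to reap
```

In this model the hazard is simply that the set of live references is empty for the whole of Phase-0.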

Evidence

  • CI publishes the floating tag every build:
    • .woodpecker/container.yaml:74-82: TAGS="-t $IMAGE:$VERSION -t $IMAGE:$BRANCH -t $IMAGE:$BRANCH-latest", then docker buildx build ... $TAGS --push.
    • .woodpecker/container.yaml:61-73: the comment confirms $BRANCH is "the floating channel pointer ... what Flux's ImagePolicy tracks".
  • Flux tracks the floating tag and reflects its digest:
    • deploy/fleet-overlay/cms-plugins-staging/image-automation.yaml:29-37: filterTags.pattern: '^staging$', digestReflectionPolicy: Always.
    • deploy/fleet-overlay/cms-plugins-production/image-automation.yaml:29-37: same, with ^production$.
  • Digest is the actual pin:
    • deploy/fleet-overlay/cms-plugins-staging/helmrelease.yaml:33-34: tag: staging / digest: "" # {"$imagepolicy": "kotkan:cms-plugins-staging:digest"}.
  • Phase-0 / suspended state: both HRs suspend: true, 0 pods in kotkan → floating tags unreferenced.
  • Docs describe the mechanism but never warn about reapability while suspended: ARCHITECTURE.md:78-101, DEPLOYMENT.md:22, DEPLOYMENT.md:116.
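To illustrate why the anchored filterTags pattern leaves Flux with nothing to resolve once the floating tag is gone, here is a small sketch. It uses Python's `re` module as a stand-in for Flux's regex matching, and the tag list (including the `0.1.41` and `0.1.42` versions) is made up for the example.

```python
import re


def select_tags(pattern, tags):
    """Return the tags matching the pattern, as a tag filter would."""
    rx = re.compile(pattern)
    return [tag for tag in tags if rx.search(tag)]


# Illustrative registry contents; the version numbers are hypothetical.
before_reap = ["0.1.41", "0.1.42", "staging", "staging-latest", "production"]
after_reap = ["0.1.41", "0.1.42", "staging-latest", "production"]

print(select_tags(r"^staging$", before_reap))  # ['staging']
print(select_tags(r"^staging$", after_reap))   # []
```

The immutable version tags and `staging-latest` never match `^staging$`, so their presence in the registry does not help the policy.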

Options

  1. Allowlist the floating tags in gitea-oci-cleanup (another repo — armer/cluster config): explicitly pin/keep git.oleks.space/oleks/cms-plugins:{staging,production} so they survive even while no live workload references them. Most robust; survives indefinite Phase-0.
  2. Make the cleaner aware of suspended HRs: teach it to treat tags referenced by suspended HelmReleases as live (cross-repo change to the cleaner's reference-scan logic).
  3. Pin by immutable tag for first go-live: temporarily point the HelmRelease at a concrete 0.1.<N> tag/digest for the very first resume, then switch to floating-tag tracking once a live reference exists and the cleaner auto-pins it. Repo-file-only but changes deploy-digest resolution semantics → owner design decision.
  4. Re-run CI immediately before flux resume so the floating tags are freshly (re)published right before they become live-referenced — operational mitigation only; narrow race window remains.

Recommendation

Option 1 (allowlist cms-plugins:{staging,production} in gitea-oci-cleanup) as the durable fix, plus a go-live runbook note in DEPLOYMENT.md that the floating tags must exist before flux resume. Keeps the floating-tag→digest design intact and removes the Phase-0 reapability window without touching deploy-digest semantics. Cross-repo + live-cluster verification required, so it cannot be done as a cms-plugins repo edit.
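The runbook prerequisite (floating tags must exist before flux resume) could be checked mechanically. The sketch below assumes the registry exposes the standard OCI tags-list endpoint (`/v2/<name>/tags/list`) and that its JSON body has already been fetched. Fetching, authentication and pagination are left out, and `missing_floating_tags` is a hypothetical helper, not part of any existing tooling.

```python
import json

REQUIRED_TAGS = ("staging", "production")


def missing_floating_tags(tags_list_json, required=REQUIRED_TAGS):
    """Return required tags absent from a tags-list response body."""
    payload = json.loads(tags_list_json)
    # Some registries return "tags": null for an empty repository.
    present = set(payload.get("tags") or [])
    return [tag for tag in required if tag not in present]


# Example response body; the repository name and version tag are illustrative.
body = '{"name": "oleks/cms-plugins", "tags": ["0.1.42", "staging"]}'
missing = missing_floating_tags(body)
if missing:
    print("do not resume yet; missing floating tags:", ", ".join(missing))
```

A non-empty result would be the signal to re-run CI for that branch before resuming the HelmRelease.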


Surfaced by the deploy-hardening review pass; deferred from auto-fix because it spans the gitea-oci-cleanup config and needs operational verification.


Durable fix applied (Option 1) — pending deploy

Pinned the floating tags explicitly in the cleaner's allowlist so they're never reaped, plus a go-live runbook note.

  • servers/armer/scripts/registry-pins.txt — added container/cms-plugins==staging and container/cms-plugins==production (commit a8b8fd8, pushed to oleks/armer@main). The cleaner always keeps pinned type/name==version tuples (gitea-oci-cleanup.py load_pins), independent of live-workload references — so the Phase-0 reapability window is closed.
  • DEPLOYMENT.md — first-deploy checklist now carries a go-live prerequisite: confirm both floating tags exist before flux resume, with the pins mitigation documented (commit 0b1f2eb on develop).
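For readers unfamiliar with the pin format, the following is a rough sketch of how `type/name==version` lines could be parsed and combined with live references. It is not the actual `load_pins` from gitea-oci-cleanup.py: `parse_pins` and `is_kept` are hypothetical names, and skipping blank lines and `#` comments is an assumption.

```python
def parse_pins(text):
    """Parse 'type/name==version' lines into (type, name, version) tuples."""
    pins = set()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # assumed comment syntax
            continue
        ref, sep, version = line.partition("==")
        pkg_type, slash, name = ref.partition("/")
        if not (sep and slash and pkg_type and name and version):
            raise ValueError(f"malformed pin: {raw!r}")
        pins.add((pkg_type, name, version))
    return pins


def is_kept(entry, pins, live_refs):
    """A tag survives if it is pinned or referenced by a live workload."""
    return entry in pins or entry in live_refs


pins = parse_pins(
    "container/cms-plugins==staging\n"
    "container/cms-plugins==production\n"
)
# Phase-0: no live references, yet the pinned tag is still kept.
print(is_kept(("container", "cms-plugins", "staging"), pins, live_refs=set()))
```

The point of the sketch is the `or`: a pinned tuple is kept even when the live-reference set is empty, which is the Phase-0 case.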

Remaining to close: the pins change only takes effect after a deploy-rs deploy to armer (cd ~/projects/servers/armer && nix run .#deploy). The commit only edits the declarative config; the running gitea-oci-cleanup keeps using the old /etc/registry-pins.txt until redeployed. Not deploying from here (deploy timing is yours). Once deployed, this can be closed.

Note: once the releases are eventually resumed, the fleet-aware auto-pin also covers these tags; the explicit pins are harmless to leave and protect the still-suspended window.


Deployed & verified — closing

nix run .#deploy to armer completed (activation succeeded, magic-rollback confirmed). The running gitea-oci-cleanup now loads the pins from /nix/store/23glbn9snjkv2zjsviwish2yd2r0skvv-registry-pins/etc/registry-pins.txt, which contains:

container/cms-plugins==staging
container/cms-plugins==production

So both floating tags are now in the cleaner's always-keep allowlist regardless of live-workload references — the Phase-0 reapability window is closed. Go-live note is on develop (DEPLOYMENT.md). Resolved.

oleks closed this issue 2026-06-02 05:41:12 +03:00

Reference: oleks/cms-plugins#3