Production Readiness Checklist

A go-live decision should not rest on a demo that looked good. It should be made when security, observability, rollback, ownership, and support processes are ready. This checklist helps assess whether a BriqMind deployment is manageable in production.

01 Areas That Must Be Ready

Security and access

Identity, permissions, secrets, and audit scope must be inspectable in the live environment.

RBAC roles must be defined separately for production users, service accounts, and admin access.

API keys must be separated by environment; production keys must not be used in test environments.

Secrets must be stored outside the code repository, and access to them must be logged and traceable.
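
The environment separation rule for API keys can be enforced with a small startup check. The sketch below assumes a hypothetical convention in which each key carries an environment prefix; the prefixes and the function name are illustrative and are not BriqMind's actual key format.

```python
# Hypothetical convention: every key starts with a prefix naming its environment.
# The real key format is not specified in this checklist.
KEY_PREFIXES = {"production": "prod_", "staging": "stg_", "test": "test_"}


def check_key_matches_environment(environment: str, api_key: str) -> None:
    """Raise if the key's prefix does not belong to the given environment."""
    expected = KEY_PREFIXES.get(environment)
    if expected is None:
        raise ValueError(f"unknown environment: {environment!r}")
    if not api_key.startswith(expected):
        # Never include the key itself in the error message or in logs.
        raise PermissionError(
            f"API key does not belong to the {environment!r} environment"
        )
```

Calling this once at service start makes a production key in a test environment fail loudly instead of working silently.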

Observability

The team must receive signals not only when an error occurs, but also before degradation becomes visible to users.

Dashboards must cover latency, error rate, queue length, and model response quality.

Alert thresholds and notification channels must be clear for P1/P2 incidents.

Connector, model, and pipeline logs must be correlatable in the same event timeline.
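
One way to make alert thresholds unambiguous is to keep them in a single table that maps each metric to a P2 and a P1 level. The sketch below is illustrative: the metric names and every number are assumptions, and real values depend on the baseline of the deployment.

```python
from typing import Optional

# Illustrative thresholds only; none of these numbers come from the checklist.
THRESHOLDS = {
    "error_rate": {"P2": 0.02, "P1": 0.10},
    "p95_latency_seconds": {"P2": 3.0, "P1": 10.0},
    "queue_length": {"P2": 500, "P1": 5000},
}


def classify(metric: str, value: float) -> Optional[str]:
    """Return "P1", "P2", or None for a single metric reading."""
    levels = THRESHOLDS[metric]
    if value >= levels["P1"]:
        return "P1"
    if value >= levels["P2"]:
        return "P2"
    return None
```

Keeping the table in one place lets the team review thresholds as part of a release instead of discovering them during an incident.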

Change management

Model, connector, prompt, and pipeline changes must follow a controlled release discipline.

The rollback plan must not only be documented; it must be rehearsed at least once.

The maintenance window, expected user impact, and communication channel must be announced in advance.

Every production change must record an owner, scope, risk assessment, and rollback note.
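
The four items attached to each production change can be captured as a small record that refuses to pass review while any item is empty. This is a minimal sketch; the class and field names are hypothetical and only mirror the wording of the checklist.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ChangeRecord:
    # Hypothetical structure: one field per item named in the checklist.
    owner: str
    scope: str
    risk: str
    rollback_note: str

    def missing_fields(self) -> list[str]:
        """Names of fields that are empty or contain only whitespace."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]
```

A release pipeline could block any change whose `missing_fields()` result is not empty.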

Operational ownership

Ownership of the live system and the first response process must not be ambiguous.

Incident owner, technical owner, and customer communication owner must be defined separately.

The runbook must include first checks, isolation, escalation, and closure steps.

The support team must interpret P1/P2/P3 incident levels consistently.

02 Go-Live Gates

Data boundary

Where production data is processed, logged, and stored must be clearly documented.

Permission boundary

User, agent, connector, and service account permissions must be limited to the minimum required access.

Rollback

A rollback path must exist for incorrect agent actions, faulty releases, or broken connector behavior.

Monitoring

Dashboards, alerts, and a daily check rhythm must be active for the first production week.

Support

The channel to use when internal users or customers are affected must be defined in advance.
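
The five gates above can be treated as a simple all-or-nothing check. The sketch below assumes each gate is tracked as a boolean; the gate identifiers and function name are hypothetical, and the fallback label reflects this checklist's own recommendation for incomplete readiness.

```python
# Gate identifiers are hypothetical names for the five gates in this section.
GATES = ("data_boundary", "permission_boundary", "rollback", "monitoring", "support")


def go_live_decision(status: dict[str, bool]) -> tuple[str, list[str]]:
    """Return the decision and the gates that have not been passed yet."""
    # A gate that is missing from the status map counts as not passed.
    open_gates = [gate for gate in GATES if not status.get(gate, False)]
    decision = "production launch" if not open_gates else "controlled pilot expansion"
    return decision, open_gates
```

Returning the open gates alongside the decision keeps the conversation on what is missing rather than on whether to launch anyway.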

03 Questions to Ask Before Go-Live

Who does what in the first 15 minutes when an error occurs?

How will an incorrect or risky agent action be rolled back?

Which metric will trigger an alert at which threshold, and on which dashboard is it visible?

Which communication channel will be used if customers or internal users are affected?

Are ownership and permission scope clear for every connector that accesses production data?

If a model or prompt change reduces quality, how long does it take to return to the previous version?

04 First Production Week Rhythm

Daily production check

During the first week, error rate, latency, failed jobs, user feedback, and connector health should be reviewed through a short daily check.
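
A short daily check is easier to keep short when it highlights only what got worse. The sketch below compares today's readings with the previous day's for metrics where lower is better (for example error rate, latency, and failed jobs). The metric names and the 20% tolerance are assumptions, not values from this checklist.

```python
def regressions(
    previous: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.20,
) -> list[str]:
    """List lower-is-better metrics that rose by more than `tolerance` since the previous day."""
    worse = []
    for name, before in previous.items():
        after = current.get(name)
        if after is None:
            worse.append(name)  # a missing reading deserves a look as well
        elif before == 0:
            if after > 0:
                worse.append(name)  # any increase from zero is flagged
        elif (after - before) / before > tolerance:
            worse.append(name)
    return worse
```

User feedback and connector health usually need a human read, so this kind of comparison complements the daily review rather than replacing it.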

Risky action tracking

Agent actions that write data, update systems, or call external services should be monitored separately, and unexpected behavior should be isolated quickly.
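
Monitoring risky actions separately starts with separating them from read-only activity. The sketch below assumes agent actions are logged with a `kind` label; the class, the label values, and the function name are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical labels for the three risky categories named above.
RISKY_KINDS = frozenset({"write_data", "update_system", "call_external_service"})


@dataclass(frozen=True)
class AgentAction:
    agent: str
    kind: str
    target: str


def risky_actions(actions: list[AgentAction]) -> list[AgentAction]:
    """Keep only actions that change state or leave the system boundary."""
    return [action for action in actions if action.kind in RISKY_KINDS]
```

The filtered list can feed its own dashboard panel or review queue, so unexpected writes stand out from routine reads.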

Closure review

At the end of the first week, P1/P2 incidents, support requests, quality deviations, and missing runbook steps should be collected into one go-live note.

Going live before this list is complete usually creates operational risk, not just technical debt. If any area remains incomplete, the decision should be treated as a controlled pilot expansion rather than a production launch.