SRE Playbook: How to Discover and Catalog Non‑Human Identities (NHI) in Your Organization

Non-human identities (NHIs), such as service accounts, API keys, OAuth apps, bots, certificates, and automation tokens, often outnumber human users and can quietly become one of the largest sources of security and reliability risk. This guide shows a repeatable, SRE-style approach to discovering and cataloging NHIs across your infrastructure, and to keeping that inventory current.

What counts as a Non‑Human Identity (NHI)?

An NHI is any identity used by software or automation to authenticate and perform actions. Common examples:

Cloud identities: AWS IAM roles/users for workloads, GCP service accounts, Azure managed identities
CI/CD identities: GitHub Actions tokens, GitLab runners, Jenkins credentials, deploy keys
SaaS integrations: Slack bots, Jira/ServiceNow integrations, monitoring webhooks
Secrets & cryptographic material: API keys, database passwords, TLS certificates, SSH keys
Runtime/workload identities: Kubernetes service accounts, SPIFFE/SPIRE IDs, workload identity federation

Step 1: Define scope and success criteria

Start with a narrow but high-impact scope to avoid boiling the ocean. A strong first milestone is: “All production-facing systems have an owner-attributed NHI inventory entry with rotation policy and last-used telemetry.”

Scope choices: production cloud accounts/projects, Kubernetes clusters, CI/CD orgs, critical SaaS
Define what “discovered” means: found in at least one authoritative system (IAM, secret store, repo)
Define what “cataloged” means: recorded with owner, purpose, permissions, and lifecycle metadata
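
The milestone above can be turned into a number you track over time. The following is a minimal sketch, assuming catalog entries are available as simple records; the `Entry` fields and the `prod_coverage` helper are hypothetical names chosen for illustration, not part of any particular tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    # Hypothetical, simplified catalog entry used only for this sketch.
    nhi_id: str
    environment: str                     # e.g. "prod", "stage", "dev"
    owner: Optional[str] = None          # team or escalation contact
    rotation_days: Optional[int] = None  # rotation policy, if one is defined
    last_used: Optional[str] = None      # last-used timestamp, if telemetry exists


def meets_milestone(entry: Entry) -> bool:
    """True when the entry has an owner, a rotation policy, and last-used telemetry."""
    return (
        bool(entry.owner)
        and entry.rotation_days is not None
        and entry.last_used is not None
    )


def prod_coverage(entries: List[Entry]) -> float:
    """Fraction of production entries that satisfy the milestone (0.0 if none exist)."""
    prod = [e for e in entries if e.environment == "prod"]
    if not prod:
        return 0.0
    return sum(meets_milestone(e) for e in prod) / len(prod)
```

Reporting this fraction per account, cluster, or CI/CD org makes it easy to see where the first milestone is still unmet.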

Step 2: Build a discovery map (where NHIs hide)

NHIs are scattered. Create a map of sources you will query regularly:

Identity providers & IAM: cloud IAM, Kubernetes RBAC bindings, SaaS admin consoles
Secret managers: Vault, AWS Secrets Manager, GCP Secret Manager, Azure Key Vault
CI/CD systems: pipeline variables, runners, OIDC integrations, deploy keys
Code & config: Git repos (including IaC), Helm charts, Terraform state, config repos
Observability: audit logs, access logs, API gateway logs, AWS CloudTrail and its equivalents in other clouds

Tip: treat discovery as a data integration problem. You’ll correlate “identity objects” (accounts/roles/apps) with “credentials” (keys/tokens/certs) and with “usage evidence” (logs).
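
To make the correlation idea concrete, here is a rough sketch of the join. It assumes each source has already been exported into plain dictionaries; the field names (`id`, `identity_id`, `credential_id`, `principal`, `timestamp`) are assumptions for illustration, and real exports will need normalizing first.

```python
from collections import defaultdict


def correlate(identities, credentials, usage_events):
    """Join identity objects, credentials, and usage evidence into one view.

    Timestamps are assumed to be ISO 8601 UTC strings in a single consistent
    format, so plain string comparison orders them correctly.
    """
    creds_by_identity = defaultdict(list)
    for cred in credentials:
        creds_by_identity[cred["identity_id"]].append(cred["credential_id"])

    last_used = {}
    for event in usage_events:
        principal = event["principal"]
        if principal not in last_used or event["timestamp"] > last_used[principal]:
            last_used[principal] = event["timestamp"]

    merged = [
        {
            "id": ident["id"],
            "credentials": sorted(creds_by_identity.get(ident["id"], [])),
            "last_used": last_used.get(ident["id"]),
        }
        for ident in identities
    ]

    # Credentials that point at an identity we never enumerated deserve a look.
    known = {ident["id"] for ident in identities}
    orphan_credentials = sorted(
        c["credential_id"] for c in credentials if c["identity_id"] not in known
    )
    return merged, orphan_credentials
```

The second return value is often the more interesting one: credentials that reference an identity missing from every authoritative system.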

Step 3: Use multiple discovery techniques

No single method finds everything. Combine these techniques for better coverage:

3.1 Enumerate from authoritative identity systems

List service accounts/roles/apps from cloud IAM and SaaS admin APIs.
Export Kubernetes service accounts and their role bindings/cluster role bindings.
Collect metadata like creation time, labels/tags, description fields, and attached policies.
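
As one example of the Kubernetes export, the sketch below maps service accounts to the roles bound to them. It assumes the input is JSON shaped like a Kubernetes List of RoleBinding and ClusterRoleBinding objects (for instance, what `kubectl get rolebindings -A -o json` returns); the function name `service_account_roles` is hypothetical.

```python
import json
from collections import defaultdict


def service_account_roles(bindings_json: str) -> dict:
    """Map "namespace/name" of each ServiceAccount subject to its bound roles."""
    doc = json.loads(bindings_json)
    roles = defaultdict(set)
    for item in doc.get("items", []):
        ref = item.get("roleRef", {})
        role = f'{ref.get("kind")}/{ref.get("name")}'
        # "subjects" may be missing or null on a binding with no subjects.
        for subject in item.get("subjects") or []:
            if subject.get("kind") != "ServiceAccount":
                continue  # skip User and Group subjects in this sketch
            key = f'{subject.get("namespace", "")}/{subject.get("name")}'
            roles[key].add(role)
    return {sa: sorted(bound) for sa, bound in roles.items()}
```

Any service account that ends up bound to a broad ClusterRole is a natural candidate for the "permissions footprint" field in the catalog.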

3.2 Enumerate from secret stores

List secrets that look like credentials (naming patterns, secret types, tags).
Capture rotation settings, TTL/expiry, and access policies.
Identify “shared secrets” by counting how many principals can read them.
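
Counting readers is simple once access policies have been flattened. This sketch assumes you have already resolved each secret's policy into a list of principals that can read it; how you obtain that mapping depends on the secret store, and the `shared_secrets` helper and its `threshold` parameter are illustrative.

```python
def shared_secrets(readers_by_secret: dict, threshold: int = 2) -> dict:
    """Return secrets readable by at least `threshold` distinct principals."""
    shared = {}
    for secret, principals in readers_by_secret.items():
        distinct = set(principals)  # de-duplicate repeated grants
        if len(distinct) >= threshold:
            shared[secret] = sorted(distinct)
    return shared
```

Raising the threshold is a cheap way to surface the most widely shared secrets first.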

3.3 Search code and CI/CD configuration (safely)

Scan repos for hard-coded credentials and references to external apps/tokens.
Review CI/CD variables and contexts that inject tokens at build/deploy time.
Parse IaC to detect identities created by Terraform/CloudFormation and map them to workloads.
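
A very small pattern-based scanner illustrates the "safely" part: it reports where a match occurred but redacts the value so findings can be shared in tickets. The two patterns below (the `AKIA` prefix used by AWS access key IDs, and PEM private key headers) are only a starting point, and the rule names and helper functions are hypothetical; a dedicated secret-scanning tool will cover far more cases.

```python
import re

# Illustrative rules only; real scanners ship far larger rule sets.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
}


def redact(value: str, keep: int = 4) -> str:
    """Keep the first few characters and mask the rest."""
    return value[:keep] + "*" * (len(value) - keep)


def scan_text(path: str, text: str) -> list:
    """Return one finding per match, with the matched value redacted."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in PATTERNS.items():
            for match in pattern.finditer(line):
                findings.append({
                    "path": path,
                    "line": lineno,
                    "rule": rule,
                    "match": redact(match.group(0)),
                })
    return findings
```

Each finding can then be linked to a catalog record, or used to create one if the credential was previously unknown.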

3.4 Confirm with usage telemetry

Discovery isn’t complete until you know what is actually used.

Use audit logs to find principals that performed actions within a lookback window (for example, the last 30 or 90 days).
Mark “unused” identities for investigation (but don’t delete without validation).
Identify high-risk patterns: broad permissions + frequent use + no rotation/expiry.
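
The lookback check can be sketched as follows. It assumes audit events have been reduced to a principal and an ISO 8601 timestamp with an explicit UTC offset; the status labels and the `classify_usage` name are assumptions for illustration. Note that "no-evidence" is kept separate from "unused", since a missing log source is not proof of inactivity.

```python
from datetime import datetime, timedelta, timezone


def classify_usage(identity_ids, audit_events, now, window_days=90):
    """Label each identity as "active", "unused", or "no-evidence"."""
    cutoff = now - timedelta(days=window_days)

    # Track the most recent event seen for each principal.
    last_seen = {}
    for event in audit_events:
        seen_at = datetime.fromisoformat(event["timestamp"])
        principal = event["principal"]
        if principal not in last_seen or seen_at > last_seen[principal]:
            last_seen[principal] = seen_at

    report = {}
    for identity in identity_ids:
        seen_at = last_seen.get(identity)
        if seen_at is None:
            report[identity] = "no-evidence"  # investigate; logs may be incomplete
        elif seen_at < cutoff:
            report[identity] = "unused"       # candidate for staged disablement
        else:
            report[identity] = "active"
    return report


if __name__ == "__main__":
    events = [{"principal": "svc-a", "timestamp": "2024-05-20T12:00:00+00:00"}]
    now = datetime(2024, 6, 1, tzinfo=timezone.utc)
    print(classify_usage(["svc-a", "svc-b"], events, now))
```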

Step 4: Design a catalog schema (minimum viable, but useful)

Your catalog can start as a database table, a YAML repo, or a CMDB entry; what matters is consistency and ownership. A practical schema includes:

Unique ID: stable identifier for the NHI record
Type: service account, API key, OAuth app, bot, certificate, SSH key, etc.
System of record: where it is defined (AWS IAM, GitHub, Vault, Kubernetes)
Owner: team and escalation contact (on-call or group email)
Purpose: what workflow/workload uses it (deploy, metrics, billing sync)
Environment: prod/stage/dev and scope (account/project/namespace)
Permissions footprint: roles/policies + a risk rating (low/med/high)
Credential details: where stored, expiry date, rotation period, last rotated
Usage evidence: last used timestamp, source logs, frequency (if available)
Lifecycle state: active, deprecated, pending removal, removed
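
One way to pin the schema down is a typed record plus a small validation function. This is a sketch under the assumption that the catalog is handled in Python; the class name, field names, and validation rules are illustrative choices that mirror the list above, not a standard.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Lifecycle(str, Enum):
    ACTIVE = "active"
    DEPRECATED = "deprecated"
    PENDING_REMOVAL = "pending removal"
    REMOVED = "removed"


@dataclass
class NHIRecord:
    unique_id: str
    type: str                      # service account, API key, OAuth app, ...
    system_of_record: str          # AWS IAM, GitHub, Vault, Kubernetes, ...
    owner: str                     # team and escalation contact
    purpose: str
    environment: str               # prod/stage/dev plus scope
    permissions: List[str] = field(default_factory=list)
    risk_rating: str = "med"       # low, med, or high
    credential_location: Optional[str] = None
    expires_at: Optional[str] = None
    rotation_period_days: Optional[int] = None
    last_rotated: Optional[str] = None
    last_used: Optional[str] = None
    lifecycle_state: Lifecycle = Lifecycle.ACTIVE


REQUIRED = ("unique_id", "type", "system_of_record", "owner", "purpose", "environment")


def validation_errors(record: NHIRecord) -> List[str]:
    """Return human-readable problems; an empty list means the record is valid."""
    errors = [f"{name} is required" for name in REQUIRED
              if not getattr(record, name).strip()]
    if record.risk_rating not in ("low", "med", "high"):
        errors.append("risk_rating must be low, med, or high")
    return errors
```

Running the validation in CI for a YAML-backed catalog, or as a constraint in a database-backed one, keeps incomplete records from slipping in.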

Step 5: Correlate identities to workloads and owners

The hardest part is attribution. Use several signals:

Tags/labels: enforce tags like owner, service, cost-center at creation time.
Workload mapping: map Kubernetes service accounts to Deployments; map CI tokens to repos/pipelines.
Network and audit context: source IPs, user agents, and calling services can identify the client.
“Ask and record” workflow: for unknowns, create a ticket to confirm owner and set a deadline.
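
The signals above can be applied in a fixed order of precedence, falling back to the ticket workflow when nothing matches. The sketch below assumes identities carry a `tags` dictionary and an optional `workload` name, and that a separate workload-to-owner mapping exists; all of these names are hypothetical.

```python
def resolve_owner(identity: dict, workload_owners: dict):
    """Return (owner, evidence) using tags first, then workload mapping."""
    tags = identity.get("tags") or {}
    if tags.get("owner"):
        return tags["owner"], "tag"

    workload = identity.get("workload")
    if workload and workload in workload_owners:
        return workload_owners[workload], "workload-mapping"

    # No reliable signal: hand off to the "ask and record" ticket workflow.
    return None, "needs-ticket"
```

Recording which signal produced the owner helps later reviewers judge how much to trust the attribution.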

Step 6: Add controls that prevent catalog drift

A catalog that was accurate once and then goes stale can be worse than no catalog, because people keep trusting data that no longer matches reality. Add guardrails:

Creation-time policy: require owner tags and purpose annotations via IaC checks or admission policies.
Automated sync: nightly jobs that re-enumerate systems and update the catalog.
Rotation SLAs: enforce max credential age; alert on approaching expiry or overdue rotation.
Least privilege reviews: scheduled reviews of high-privilege NHIs and unused permissions.
Decommission playbook: standard steps to disable, observe impact, then delete and remove secrets.
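
Two of these guardrails, the automated sync and the rotation SLA, reduce to small comparisons that a nightly job could run. This is a minimal sketch assuming the job already has the set of cataloged IDs, the set of freshly enumerated IDs, and a last-rotated date per credential; the function names and the 90-day default are assumptions.

```python
from datetime import date


def detect_drift(catalog_ids, discovered_ids) -> dict:
    """Compare the catalog against a fresh enumeration of source systems."""
    catalog, discovered = set(catalog_ids), set(discovered_ids)
    return {
        # Exists in a source system but has no catalog record yet.
        "uncataloged": sorted(discovered - catalog),
        # Cataloged but no longer found; may be removed or renamed.
        "missing_from_source": sorted(catalog - discovered),
    }


def overdue_rotations(last_rotated: dict, today: date, max_age_days: int = 90) -> list:
    """Return IDs of credentials older than the allowed maximum age."""
    return sorted(
        cred_id
        for cred_id, rotated_on in last_rotated.items()
        if (today - rotated_on).days > max_age_days
    )
```

Both results are good alert inputs: uncataloged identities go to the attribution workflow, and overdue rotations go to the owning team.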

Step 7: Prioritize remediation (what to fix first)

Once you have initial inventory coverage, prioritize by risk and blast radius:

Orphaned credentials (no owner) in production
Long-lived secrets with no rotation/expiry
Over-privileged identities (admin or wildcard permissions)
Shared identities used by many services/teams
Hard-coded secrets in repos or container images
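
A simple additive score is often enough to produce a first remediation queue. In the sketch below, the flag names and weights are placeholder assumptions to be tuned for your environment, and doubling the score for production is just one way to account for blast radius.

```python
# Hypothetical weights; adjust to reflect your own risk model.
WEIGHTS = {
    "orphaned": 5,         # no accountable owner
    "no_rotation": 4,      # long-lived secret with no rotation/expiry
    "over_privileged": 3,  # admin or wildcard permissions
    "shared": 2,           # used by many services/teams
    "hard_coded": 2,       # found in a repo or container image
}


def risk_score(record: dict) -> int:
    """Sum the weights of the flags set on a record; double it for prod."""
    score = sum(weight for flag, weight in WEIGHTS.items() if record.get(flag))
    if record.get("environment") == "prod":
        score *= 2
    return score


def prioritize(records: list) -> list:
    """Highest score first; ties are broken by ID for stable output."""
    return sorted(records, key=lambda r: (-risk_score(r), r["id"]))
```

The exact numbers matter less than applying the same scoring consistently, so that progress can be tracked over time.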

Operational checklist (copy/paste)

Inventory target systems and define “done” for discovery and cataloging
Enumerate identities from IAM/SaaS + credentials from secret stores
Correlate with audit logs to add last-used/usage frequency
Create catalog records with owner, purpose, environment, and permissions
Implement automated sync and alerts for drift/rotation/unknown owners
Remediate highest-risk items first and track progress over time

Common pitfalls to avoid

Relying only on secret scanning: many identities won’t appear in code.
Deleting “unused” identities immediately: confirm with multiple data sources and staged disablement.
Catalog without ownership: an entry without an accountable owner won’t stay accurate.
Ignoring SaaS integrations: third-party apps often have broad access and long-lived tokens.

With a clear scope, multiple discovery paths, a lightweight catalog schema, and automation to prevent drift, NHIs become manageable: you can see what exists, who owns it, what it can do, and how quickly you can revoke or rotate it when incidents happen.