DevSecOps Pipeline Collector — Runbook

How to deploy the GitLab + CI-Pipeline collectors with crons disabled and drive collection via REST endpoints.

This guide shows how to deploy the collector, disable all crons, and trigger every collection step manually via REST endpoints. Use this when you want full control over when data is fetched — e.g. during onboarding, debugging, or when you want to validate what gets written before enabling scheduled collection.

Architecture in one line: one Docker image, two containers. One runs the GitLab-wide scrape (projects, commits, MRs, branches, pipeline list). The other enriches pipeline details and optionally collects Jenkins builds.


1. The two containers

| Container | Responsibility | Port | Mongo collections it writes |
|---|---|---|---|
| gitlab-collector | Projects, commits, MRs, branches, contributors, events, pipeline list (Stage 1), compliance, groups, releases | 8280 | dev_insight_* |
| ci-pipeline-collector | Pipeline detail enrichment (Stage 2), Jenkins pipeline builds | 8281 | dev_insight_pipelines_collection (enriches), dev_insight_jenkins_builds_collection |

Both use the same image: us-central1-docker.pkg.dev/cve-buster/sec1-public-repo/sec1-devsecops-collector:latest.

2. Environment variables — cron-disabled, manual-only

Common env (both containers)

SPRING_PROFILES_ACTIVE=prod
SERVER_PORT=8080
LOG_LEVEL=INFO
APP_LOG_LEVEL=INFO
MONGO_LOG_LEVEL=INFO
REST_LOG_LEVEL=INFO

# Mongo — same DB for both containers
MONGODB_URI=mongodb://admin:changeme@mongo:27017/dev-sec-ops-db?authSource=admin
MONGODB_DATABASE=dev-sec-ops-db

# Sec1 API
SEC1_API_KEY=your-sec1-api-key
SEC1_USER_DETAILS_URL=https://api.sec1.io/foss/get-user-details
SEC1_FETCH_USER_DETAILS_ON_STARTUP=false

# GitLab access
GITLAB_BASE_URL=https://gitlab.com
GITLAB_TOKEN=glpat-your-token-here
GITLAB_COLLECT_ALL_PROJECTS=true
GITLAB_HEALTH_CHECK_ENABLED=false

gitlab-collector — all crons OFF

ci-pipeline-collector — all crons OFF
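As a hedged sketch of what an all-crons-off fragment could contain, the snippet below renders the scheduler flags named in the environment variable reference (section 8) as KEY=value lines. The function name render_cron_off_env is hypothetical, and which flags each container actually reads is an assumption. Because the flags default to false, setting them explicitly only documents intent.

```python
# Scheduler enable flags as listed in the environment variable reference (section 8).
# GITLAB_PIPELINE_CLASSIFICATION_ENABLED is left out because it toggles
# classification during enrichment, not a cron.
SCHEDULER_FLAGS = [
    "GITLAB_SCHEDULER_ENABLED",
    "GITLAB_PIPELINE_DETAIL_ENABLED",
    "GITLAB_PIPELINES_ONLY_ENABLED",
    "JENKINS_PIPELINE_ENABLED",
    "SONARQUBE_SCHEDULER_ENABLED",
    "NEXUSIQ_SCHEDULER_ENABLED",
    "SERVICENOW_SCHEDULER_ENABLED",
    "JIRA_SCHEDULER_ENABLED",
    "JENKINS_SCHEDULER_ENABLED",
]


def render_cron_off_env(flags=SCHEDULER_FLAGS):
    """Return env-file text that pins every given scheduler flag to false."""
    return "\n".join(f"{name}=false" for name in flags) + "\n"


if __name__ == "__main__":
    print(render_cron_off_env(), end="")
```

The output can be appended to the common env shown above for either container.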


3. Docker / Podman commands

GitLab collector

CI pipeline collector
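A minimal sketch of how the two start commands could be assembled, assuming host ports 8280 and 8281 are mapped to the container's SERVER_PORT (8080) and that each container reads its variables from an env file. The env file names and the run_command helper are hypothetical; networking to the mongo host is left out.

```python
import shlex

IMAGE = "us-central1-docker.pkg.dev/cve-buster/sec1-public-repo/sec1-devsecops-collector:latest"
CONTAINER_PORT = 8080  # SERVER_PORT from the common env


def run_command(name, host_port, env_file, runtime="docker"):
    """Build the argv that starts one collector container in the background."""
    return [
        runtime, "run", "-d",
        "--name", name,
        "-p", f"{host_port}:{CONTAINER_PORT}",
        "--env-file", env_file,  # hypothetical file, supplied by the caller
        IMAGE,
    ]


if __name__ == "__main__":
    for name, port in (("gitlab-collector", 8280), ("ci-pipeline-collector", 8281)):
        # shlex.join produces a copy-pasteable shell line from the argv list.
        print(shlex.join(run_command(name, port, f"{name}.env")))
```

Passing runtime="podman" yields the equivalent Podman command, since the flags used here are shared by both tools.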

Health check


4. Manual collection endpoints

All collection triggers are POST requests (clear operations use DELETE), and each returns a JSON response you can log for audit.

Step 1 — Clear old data (optional, for a fresh start)
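A hedged sketch of the clear call, using the clear-all path from the endpoint reference (section 7). The localhost base URL is an assumption; the request is only built here, not sent, so nothing is wiped by running the snippet.

```python
import urllib.request

GITLAB_COLLECTOR = "http://localhost:8280"  # assumed host; port from section 1


def clear_all_request(base_url=GITLAB_COLLECTOR):
    """Build (but do not send) the DELETE request that wipes the GitLab collections."""
    return urllib.request.Request(
        f"{base_url}/api/v1/collector/gitlab/clear-all", method="DELETE"
    )


if __name__ == "__main__":
    req = clear_all_request()
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req, timeout=60)
```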

Response:

Step 2 — Run GitLab full collection (projects, commits, pipelines list, …)

Runs on the gitlab-collector container. This is the heaviest call — takes 3–4 minutes for ~10 projects depending on commit history. Non-blocking: it returns immediately and runs in the background.

Response:

Track progress by tailing the container logs:

You'll see each job finish in order: projects_sync → commits_collector → merge_requests_collector → branches_collector → contributors_collector → events_collector → pipelines_collector → compliance_collector → groups_sync → releases_collector.
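Because the call is non-blocking, progress has to be read from the logs. The sketch below assumes only that each job's name appears somewhere in the log output when it runs; the actual log line format is not specified here, so "mentioned" is used as a stand-in for "finished". The helper names are hypothetical.

```python
# Job names in the order the full cycle runs them.
JOB_ORDER = [
    "projects_sync", "commits_collector", "merge_requests_collector",
    "branches_collector", "contributors_collector", "events_collector",
    "pipelines_collector", "compliance_collector", "groups_sync",
    "releases_collector",
]


def jobs_seen(log_lines):
    """Return the jobs mentioned anywhere in the logs, in cycle order."""
    text = "\n".join(log_lines)
    return [job for job in JOB_ORDER if job in text]


def next_job(log_lines):
    """Return the first job not yet mentioned, or None once all ten have appeared."""
    seen = set(jobs_seen(log_lines))
    for job in JOB_ORDER:
        if job not in seen:
            return job
    return None
```

Feeding it the lines captured from the container logs gives a quick answer to "how far along is the cycle" without reading the whole log.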

Step 2b (optional, faster) — Pipelines-only list sync

If you've already collected projects and only want fresh pipeline list data (no commits/MRs/branches), use this on the gitlab-collector. Takes ~7 seconds instead of 3–4 min.

Step 3 — Enrich pipeline details (Stage 2)

Runs on the ci-pipeline-collector container. Reads the pipeline list from Mongo and enriches each pipeline with jobs, failures, security scan results, stage bottleneck analysis, NCD template classification, branch protection, and governance flags.

Parameters:

| Param | Default | Purpose |
|---|---|---|
| daysBack | 7 | How far back to enrich. Use 90 for initial backfill, 7 for regular runs. |
| force | false | true clears existing enrichment first (re-fetches yaml, re-classifies, re-reads protection). Use when you change classifier config or want truly fresh enrichment. |
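A small sketch of how the two parameters end up in the request URL, using the refresh path from the endpoint reference (section 7). The localhost base URL and the refresh_url helper are assumptions for illustration.

```python
from urllib.parse import urlencode

CI_PIPELINE_COLLECTOR = "http://localhost:8281"  # assumed host; port from section 1


def refresh_url(days_back=7, force=False, base_url=CI_PIPELINE_COLLECTOR):
    """Build the Stage 2 refresh URL with daysBack and force as query parameters."""
    if days_back < 1:
        raise ValueError("days_back must be at least 1")
    # The endpoint takes lowercase true/false, so convert the Python bool explicitly.
    query = urlencode({"daysBack": days_back, "force": str(force).lower()})
    return f"{base_url}/api/v1/collector/gitlab/pipelines/refresh?{query}"


if __name__ == "__main__":
    print(refresh_url())                           # regular run
    print(refresh_url(days_back=90, force=True))   # backfill with fresh enrichment
```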

Response:

Progress log marker:

Step 3b — Enrich specific projects only

If you only want to process certain projects (e.g., during debugging):
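A hedged sketch of the request, based on the enrich endpoint in section 7, which takes a JSON array of repo paths or ids as its body. The project values in the usage lines are made up, and the localhost base URL is an assumption.

```python
import json
import urllib.request


def enrich_request(projects, base_url="http://localhost:8281"):
    """Build (but do not send) a POST whose body is a JSON array of projects."""
    body = json.dumps(list(projects)).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v1/collector/gitlab/pipelines/enrich",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Hypothetical project identifiers: one repo path and one numeric id as a string.
    req = enrich_request(["my-group/my-repo", "12345"])
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```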

Step 4 — Jenkins collection (if enabled)

Runs on the ci-pipeline-collector container.

Refresh Jenkins builds (re-fetch last N days):
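A sketch of the Jenkins URLs, built from the jenkins-pipeline paths in the endpoint reference (section 7). The localhost base URL, the helper names, and the instance name in the usage lines are assumptions.

```python
from urllib.parse import urlencode

JENKINS_BASE = "http://localhost:8281/api/v1/collector/jenkins-pipeline"  # assumed host


def jenkins_collect_url(instance=None):
    """URL for collecting all instances, or a single named instance."""
    if instance is None:
        return f"{JENKINS_BASE}/collect"
    return f"{JENKINS_BASE}/collect-instance?{urlencode({'name': instance})}"


def jenkins_refresh_url(days_back, force):
    """URL for re-fetching the last days_back days of Jenkins builds."""
    query = urlencode({"daysBack": days_back, "force": str(force).lower()})
    return f"{JENKINS_BASE}/refresh?{query}"


if __name__ == "__main__":
    print(jenkins_collect_url())
    print(jenkins_collect_url("ci-main"))  # hypothetical instance name
    print(jenkins_refresh_url(30, True))
```

All three are sent as POST requests to the ci-pipeline-collector container.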


5. Complete "first-run" sequence

Copy-paste this whole block for a clean first collection:
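As a sketch of the order of calls, the snippet below lists the first-run steps as (method, URL) pairs, using the paths from section 7 and the 90-day backfill suggested for daysBack. The localhost base URLs and the first_run_plan helper are assumptions. The full collection returns immediately, so wait for it to finish (see the container logs) before sending the Stage 2 request.

```python
GITLAB = "http://localhost:8280/api/v1/collector/gitlab"  # assumed host
CI = "http://localhost:8281/api/v1/collector"             # assumed host


def first_run_plan(jenkins=False, clear=True):
    """Return the first-run calls in the order they should be issued."""
    steps = []
    if clear:
        steps.append(("DELETE", f"{GITLAB}/clear-all"))        # Step 1 (optional)
    steps.append(("POST", f"{GITLAB}/collect"))                # Step 2, non-blocking
    # Step 3: only after Step 2 has finished in the background.
    steps.append(("POST", f"{CI}/gitlab/pipelines/refresh?daysBack=90&force=false"))
    if jenkins:
        steps.append(("POST", f"{CI}/jenkins-pipeline/collect"))  # Step 4
    return steps


if __name__ == "__main__":
    for method, url in first_run_plan(jenkins=True):
        print(method, url)
```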

Expected total time: 5–7 minutes depending on project count and pipeline volume.


6. Day-to-day manual refresh

Once the first-run completes, you can keep running these regularly (no clear needed):


7. Endpoint reference

On gitlab-collector (:8280)

| Method | Path | Purpose |
|---|---|---|
| POST | /api/v1/collector/gitlab/collect | Full GitLab cycle: projects → commits → MRs → branches → contributors → events → pipelines → compliance → groups → releases |
| POST | /api/v1/collector/gitlab/collect-projects | Only sync projects (fastest, just inventory) |
| POST | /api/v1/collector/gitlab/pipelines/sync-all | Lightweight: only refresh the pipeline list (skips commits/MRs) |
| POST | /api/v1/collector/gitlab/pipelines/collect | Sync pipeline list for specific projects (JSON body: ["repo-path", "id", ...]) |
| DELETE | /api/v1/collector/gitlab/clear-all | Wipe all 14 GitLab collections |
| GET | /api/v1/collector/gitlab/sanity-check | Health report (counts + last-run states + issues found) |

On ci-pipeline-collector (:8281)

| Method | Path | Purpose |
|---|---|---|
| POST | /api/v1/collector/gitlab/pipelines/refresh?daysBack=N&force=BOOL | Stage 2 enrichment: jobs, failures, security, NCD classification, governance |
| POST | /api/v1/collector/gitlab/pipelines/refresh-project?projectId=ID&daysBack=N&force=BOOL | Enrich a single GitLab project |
| POST | /api/v1/collector/gitlab/pipelines/enrich | Enrich specific projects (JSON body: ["repo-path", "id", ...]) |
| POST | /api/v1/collector/gitlab/pipelines/collect-details | Runs the pipeline-detail cycle once (same as scheduled job) |
| POST | /api/v1/collector/jenkins-pipeline/collect | Collect builds from all configured Jenkins instances |
| POST | /api/v1/collector/jenkins-pipeline/collect-instance?name=INSTANCE | Collect from a single Jenkins instance |
| POST | /api/v1/collector/jenkins-pipeline/refresh?daysBack=N&force=BOOL | Refresh Jenkins builds, overwrite existing |
| GET | /api/v1/collector/jenkins-pipeline/instances | List configured instances and their health |
| GET | /api/v1/collector/jenkins-pipeline/sanity-check | Jenkins health report |


8. Environment variable reference

All scheduler enable flags default to false, so you only need to set a flag (to true) when you want that scheduler on.

| Env var | Default | What it controls |
|---|---|---|
| GITLAB_SCHEDULER_ENABLED | false | Full GitLab cycle on cron (GITLAB_COLLECTOR_CRON) |
| GITLAB_PIPELINE_DETAIL_ENABLED | false | Stage 2 pipeline detail on cron (GITLAB_PIPELINE_DETAIL_CRON) |
| GITLAB_PIPELINES_ONLY_ENABLED | false | Lightweight pipeline-list sync on cron |
| GITLAB_PIPELINE_CLASSIFICATION_ENABLED | false | NCD classification during Stage 2 enrichment |
| JENKINS_PIPELINE_ENABLED | false | Jenkins pipeline builds on cron |
| JENKINS_PIPELINE_INSTANCE_1_* | (none) | Per-instance credentials (needed even in manual mode) |
| SONARQUBE_SCHEDULER_ENABLED | false | SonarQube collection |
| NEXUSIQ_SCHEDULER_ENABLED | false | Nexus IQ collection |
| SERVICENOW_SCHEDULER_ENABLED | false | ServiceNow collection |
| JIRA_SCHEDULER_ENABLED | false | Jira collection |
| JENKINS_SCHEDULER_ENABLED | false | Legacy Jenkins builds collector |

Cron expressions (only matter when a scheduler is enabled)

| Env var | Default |
|---|---|
| GITLAB_COLLECTOR_CRON | 0 */10 * * * * (every 10 minutes) |
| GITLAB_PIPELINE_DETAIL_CRON | 0 */5 * * * * (every 5 minutes) |
| GITLAB_PIPELINES_ONLY_CRON | 0 */5 * * * * (every 5 minutes) |
| JENKINS_PIPELINE_CRON | 0 */5 * * * * (every 5 minutes) |
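These defaults use the six-field Spring cron format, where the first field is seconds (unlike the five-field Unix crontab format). The sketch below splits an expression into named fields and extracts the minute interval for the simple "0 */N * * * *" shape used above; it is an illustration with hypothetical helper names, not a general cron parser.

```python
FIELD_NAMES = ("second", "minute", "hour", "day_of_month", "month", "day_of_week")


def split_spring_cron(expr):
    """Split a six-field Spring cron expression into a dict of named fields."""
    parts = expr.split()
    if len(parts) != len(FIELD_NAMES):
        raise ValueError(f"expected 6 fields, got {len(parts)}: {expr!r}")
    return dict(zip(FIELD_NAMES, parts))


def minute_interval(expr):
    """Return N for an expression shaped like '0 */N * * * *', else None."""
    fields = split_spring_cron(expr)
    rest_is_wildcard = all(
        fields[name] == "*" for name in ("hour", "day_of_month", "month", "day_of_week")
    )
    minute = fields["minute"]
    if fields["second"] == "0" and rest_is_wildcard and minute.startswith("*/"):
        step = minute[2:]
        if step.isdigit():
            return int(step)
    return None
```

A five-field crontab expression is rejected by split_spring_cron, which is a cheap way to catch an override written in the wrong format.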


9. Troubleshooting

"Pipelines stuck on running status forever"

Enrichment only finalises when GitLab reports success / failed / canceled. Trigger another refresh?force=true after the pipeline finishes on GitLab to stamp the final status.

"I enabled classification but pipelineCategory is still custom for NCD projects"

Check the include path in your .gitlab-ci.yml matches the configured GITLAB_PIPELINE_CATEGORY_1_INCLUDE_PROJECT exactly (case-insensitive, full namespace required). Then run refresh?force=true to bust the classification cache.
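To make the matching rule concrete, here is a sketch of a comparison that ignores case but still requires the full namespace. It illustrates the rule as stated above and is not the collector's actual classifier code; the paths in the usage lines are made up.

```python
def include_project_matches(configured, include_project):
    """True when both full namespace paths are equal, ignoring case only."""
    return configured.casefold() == include_project.casefold()


if __name__ == "__main__":
    # Hypothetical values: same path with different casing matches,
    # a path missing its top-level namespace does not.
    print(include_project_matches("Platform/NCD/templates", "platform/ncd/templates"))
    print(include_project_matches("platform/ncd/templates", "ncd/templates"))
```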

"renovateEnabled / ncdCliVersion are null"

These only populate when the corresponding NCD variables (RENOVATE_SOFTGATE, CLI_VERISON — note the intentional typo) are set in the pipeline's .gitlab-ci.yml. Not a bug — just missing upstream data.

"Force refresh doesn't seem to pick up my new yaml"

force=true on refresh clears in-memory caches and re-fetches the yaml. If still stale, restart the container — the yaml cache is in-memory with 6-hour TTL.

Container logs worth knowing


10. Switching from manual to scheduled mode

Once manual flow is validated, flip these to true and restart the containers:

Default crons are every 5–10 minutes. Override via GITLAB_COLLECTOR_CRON, GITLAB_PIPELINE_DETAIL_CRON, etc.


11. MongoDB indexes

Spring Data auto-creates indexes declared on the model classes (@Indexed, @CompoundIndex) on first connection. Everything in the "Auto-created from model annotations" table below is built for you on app startup, with no action needed on your side. The recommended additions, covered by the manual creation script further down, are what we suggest you create by hand, either ad hoc as the data grows or via a migration script.

Verify what's actually present: db.<collection>.getIndexes() in mongosh. Add autoCreateIndex=false to the query string of MONGODB_URI only if you're managing indexes outside the app.

Auto-created from model annotations

| Collection | Index | Purpose |
|---|---|---|
| dev_insight_projects_collection | gitlabProjectId (unique) | Per-project upsert lookup |
| | team | Filter projects by team |
| | lastActivityAt (desc) | Delta-cycle skipping (hasActivitySince) + recency sort |
| | applicationName | Cross-collector joins |
| | teamId | Team rollups |
| | healthScore, healthStatus | Health filtering / sort |
| | isMonorepoFlag | Monorepo classification |
| | bus_factor_archived (compound) | Bus factor leaderboards |
| dev_insight_commits_collection | sha (unique) | Commit dedup |
| | committedAt (desc) | Time-window scans |
| | project_committed (compound) | Per-project commit history |
| | author_committed (compound) | Per-author history + first-commit lookup |
| | heatmap_idx (compound) | Hourly commit heatmap aggregation |
| dev_insight_merge_requests_collection | gitlabMrId (unique) | MR upsert |
| | mergedAt (desc) | Throughput windows |
| | project_updated, project_state | Per-project MR lookups |
| | author_created, state_created | Author / state windowed counts |
| dev_insight_branches_collection | project_branch (compound, unique) | Per-project branch upsert |
| | project_status | Stale / orphaned counts |
| | status | Org-wide branch hygiene |
| dev_insight_contributors_collection | email (unique) | Contributor upsert by email |
| | gitlabUserId (unique, sparse) | Contributor upsert by GL user id |
| | totalReviews (desc) | Reviewer leaderboards |
| | isIdle | Idle contributor counts |
| | team_commits (compound) | Team contributor leaderboards |
| dev_insight_events_collection | gitlabEventId (unique) | Event dedup |
| | createdAt (desc) | Activity feed |
| | project_created (compound) | Per-project event feed |
| | collectedAt (TTL 30 d) | Auto-purge |
| dev_insight_pipelines_collection | pipelineId (unique) | Pipeline upsert |
| | applicationName | App-level pipeline filtering |
| | project_created (compound) | Per-project pipeline history |
| | project_ref_status (compound) | Branch + status filtering |
| dev_insight_compliance_collection | projectId (unique) | Per-project compliance upsert |
| | team, complianceLevel | Org compliance reports |
| dev_insight_groups_collection | gitlabGroupId (unique) | Group upsert |
| dev_insight_releases_collection | releasedAt (desc) | Release timeline |
| | tagName | Tag lookups |
| | project_released (compound) | Per-project release history |
| | project_tag (compound, unique) | Release upsert |
| | collectedAt (TTL 1 y) | Auto-purge |
| dev_insight_metrics_daily_collection | date (desc) | Time-series scan |
| | project_date (compound, unique) | Daily per-project upsert |
| | team_date (compound) | Team daily metrics |
| | collectedAt (TTL 1 y) | Auto-purge |
| dev_insight_metrics_org_daily_collection | date (unique) | Org daily upsert |
| | collectedAt (TTL 1 y) | Auto-purge |
| dev_insight_commit_hourly_counts_collection | date_day_hour (compound, unique) | Heatmap upsert |
| | collectedAt (TTL 90 d) | Auto-purge |
| dev_insight_collector_state_collection | jobName (unique) | Job-state lookup (incl. full_cycle cycle pacing) |

Manual creation script (idempotent)

The Spring auto-index-creation: true flag silently skips collections that already exist with documents, so on an enterprise DB you'll find the declared indexes are missing. Run the script below in mongosh once. It is safe to re-run — it skips indexes that already exist with the same spec and reports any conflicts without aborting.

Pre-flight (copy-paste into mongosh)

Duplicate-check before any unique index

Unique-index creation fails if duplicates exist. Run these checks for every unique: true index in the script. If any returns rows, resolve them before creating the index, because the build will fail and be rolled back.

If anything comes back, investigate before continuing. Duplicates usually mean the upsert lookup ran without an index — exactly the bug we're trying to fix. You can collapse them with an aggregation $out to a fresh collection, or delete the older copies, depending on which one is correct.
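A minimal sketch of such a duplicate check, expressed as an aggregation pipeline built in plain Python. The helper name and the 20-row limit are assumptions; the resulting list can be passed to a collection's aggregate call (for example pymongo's Collection.aggregate, or pasted into db.<collection>.aggregate in mongosh).

```python
def duplicate_check_pipeline(*fields, limit=20):
    """Build a pipeline that lists values of `fields` occurring more than once."""
    if not fields:
        raise ValueError("at least one field is required")
    # Group on the candidate unique key; compound keys become a multi-field _id.
    group_id = {field: f"${field}" for field in fields}
    return [
        {"$group": {"_id": group_id, "count": {"$sum": 1}}},
        {"$match": {"count": {"$gt": 1}}},
        {"$limit": limit},
    ]


if __name__ == "__main__":
    # sha is the unique key on the commits collection (see the index table).
    print(duplicate_check_pipeline("sha"))
```

An empty result means the unique index on those fields can be built safely.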

Run

Each call uses an explicit name so re-runs match by name and skip silently. Errors are caught per-index so one failure doesn't abort the script.
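The skip-or-report behaviour described here can be sketched as a pure decision function over the index documents that getIndexes() returns. This illustrates the idea only and is not the script itself; the index names in the usage lines are hypothetical.

```python
def plan_index(existing, name, keys, unique=False):
    """Decide what to do for one wanted index: 'skip', 'conflict' or 'create'.

    existing: index documents as returned by getIndexes(), e.g.
              {"name": "sha_unique", "key": {"sha": 1}, "unique": True}
    keys:     list of (field, direction) pairs, e.g. [("sha", 1)]
    """
    wanted_key = list(keys)
    for idx in existing:
        same_name = idx.get("name") == name
        same_key = list(idx.get("key", {}).items()) == wanted_key
        same_unique = bool(idx.get("unique", False)) == unique
        if same_name and same_key and same_unique:
            return "skip"        # identical spec already present
        if same_name or same_key:
            return "conflict"    # name or key clashes with a different spec
    return "create"


if __name__ == "__main__":
    existing = [{"name": "_id_", "key": {"_id": 1}}]
    print(plan_index(existing, "sha_unique", [("sha", 1)], unique=True))
```

Wrapping each actual createIndex call in its own error handler then keeps one conflict from aborting the rest of the run.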

Notes for an enterprise box

  • Mongo 8.0 builds indexes with the hybrid build process, so reads and writes continue during the build. You don't need { background: true }; that option has been ignored since 4.2.

  • Memory pressure: each large index build uses up to maxIndexBuildMemoryUsageMegabytes (default 200 MB) before spilling to disk. On a 10 M-doc collection a build can take tens of minutes. Schedule the bigger ones (commits, pipelines, events) outside peak collector cycles.

  • Watch progress in another mongosh window: db.currentOp({ "command.createIndexes": { $exists: true } }).

  • One at a time per collection — don't run two createIndex calls on the same collection in parallel; the second one queues anyway and the script is fast enough sequentially.

  • If a unique build fails with DuplicateKey the partial index is automatically rolled back. Fix the dupes (see pre-flight) and re-run the script — the script's named-index match will skip everything that succeeded.

Drop all DevInsight collections (destructive — full reset)

This deletes every DevInsight document AND all indexes on those collections. Use only when you want to start completely from scratch (e.g., re-onboarding, schema migration, or after corrupted state). On a fresh, empty collection Spring's auto-index-creation: true will recreate the declared indexes on next collector boot — but you'll still want to re-run the manual creation script above for the recommended additions.

Notes:

  • Dropping dev_insight_collector_state_collection resets every job's lastRunAt / lastFullSyncAt, so the next cycle is treated as the first run — pulls 90 days of history (whatever gitlab.daysToCollect is set to). Skip it from the list if you only want to reset data, not checkpoints.

  • After dropping, the dev_insight_pipelines_collection enrichment fields (jobs, failure reasons, security data) are gone too. Stage 2 (/pipelines/refresh?force=true) will need to re-fetch everything from GitLab.

  • Drop is instant in Mongo — it removes the collection metadata; storage reclaim happens in the background. You don't need a maintenance window.
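A sketch of how the drop list could be selected, assuming every DevInsight collection carries the dev_insight_ prefix (as in the container table in section 1). The helper name is hypothetical; keep_checkpoints=True leaves the state collection alone so job checkpoints survive the reset.

```python
PREFIX = "dev_insight_"
STATE_COLLECTION = "dev_insight_collector_state_collection"


def collections_to_drop(existing_names, keep_checkpoints=False):
    """Pick the DevInsight collections to drop from a list of collection names."""
    return sorted(
        name
        for name in existing_names
        if name.startswith(PREFIX)
        and not (keep_checkpoints and name == STATE_COLLECTION)
    )


if __name__ == "__main__":
    names = [
        "dev_insight_commits_collection",
        "dev_insight_collector_state_collection",
        "unrelated_collection",
    ]
    print(collections_to_drop(names, keep_checkpoints=True))
```

The input would typically come from listing the collection names in dev-sec-ops-db; anything without the prefix is never selected.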

Verifying / dropping (single index)

When to re-check

  • Any new repository method added with findBy… / existsBy… / countBy… — confirm the queried fields are indexed.

  • Any aggregation pipeline that does $match on a non-indexed field over a collection > 100k docs.

  • After a bulk import — Mongo doesn't auto-add indexes for fields you only ever group/sort by.


Appendix — Collections written

After a successful run you'll see these in Mongo (dev-sec-ops-db):

| Collection | Written by | Key document shape |
|---|---|---|
| dev_insight_projects_collection | gitlab-collector | Project metadata, bus factor, health score |
| dev_insight_commits_collection | gitlab-collector | Commit history |
| dev_insight_merge_requests_collection | gitlab-collector | MR lifecycle |
| dev_insight_branches_collection | gitlab-collector | Branch list |
| dev_insight_contributors_collection | gitlab-collector | Contributor rollup |
| dev_insight_events_collection | gitlab-collector | Project activity events |
| dev_insight_pipelines_collection | gitlab-collector (list) + ci-pipeline-collector (enrichment) | Pipeline runs with 70+ fields |
| dev_insight_compliance_collection | gitlab-collector | Compliance framework data |
| dev_insight_groups_collection | gitlab-collector | Group hierarchy |
| dev_insight_releases_collection | gitlab-collector | Release / tag history |
| dev_insight_metrics_daily_collection | gitlab-collector | Project-level daily metrics |
| dev_insight_metrics_org_daily_collection | gitlab-collector | Org-level daily metrics |
| dev_insight_commit_hourly_count_collection | gitlab-collector | Hourly commit heatmap |
| dev_insight_collector_state_collection | gitlab-collector | Job run state / checkpoints |
| dev_insight_jenkins_builds_collection | ci-pipeline-collector | Jenkins build records |

All collections have a 90-day TTL on collectedAt; old records auto-purge.
