How to deploy the GitLab + CI-Pipeline collectors with crons disabled and drive collection via REST endpoints.
This guide shows how to deploy the collector, disable all crons, and trigger every collection step manually via REST endpoints. Use this when you want full control over when data is fetched — e.g. during onboarding, debugging, or when you want to validate what gets written before enabling scheduled collection.
Architecture in one line: one Docker image, two containers. One runs the GitLab-wide scrape (projects, commits, MRs, branches, pipeline list). The other enriches pipeline details and optionally collects Jenkins builds.
Both use the same image: us-central1-docker.pkg.dev/cve-buster/sec1-public-repo/sec1-devsecops-collector:latest.
Important separation: don't enable the GitLab main scheduler AND the pipeline-detail scheduler in the same container — they'll race on the same Mongo docs. Keep them on separate containers.
SPRING_PROFILES_ACTIVE=prod
SERVER_PORT=8080
LOG_LEVEL=INFO
APP_LOG_LEVEL=INFO
MONGO_LOG_LEVEL=INFO
REST_LOG_LEVEL=INFO
# Mongo — same DB for both containers
MONGODB_URI=mongodb://admin:changeme@mongo:27017/dev-sec-ops-db?authSource=admin
MONGODB_DATABASE=dev-sec-ops-db
# Sec1 API
SEC1_API_KEY=your-sec1-api-key
SEC1_USER_DETAILS_URL=https://api.sec1.io/foss/get-user-details
SEC1_FETCH_USER_DETAILS_ON_STARTUP=false
# GitLab access
GITLAB_BASE_URL=https://gitlab.com
GITLAB_TOKEN=glpat-your-token-here
GITLAB_COLLECT_ALL_PROJECTS=true
GITLAB_HEALTH_CHECK_ENABLED=false
gitlab-collector — all crons OFF
ci-pipeline-collector — all crons OFF
3. Docker / Podman commands
GitLab collector
CI pipeline collector
Podman users: swap docker run for podman run --tty — everything else is identical.
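The run commands can be sketched from values that appear elsewhere in this guide. This is a sketch under stated assumptions, not the project's official command: the env-file names are hypothetical, and the host ports 8280 and 8281 are inferred from the curl examples and mapped to SERVER_PORT=8080. Network settings (for example, how the container reaches the mongo host) are omitted. The helper only prints the command so you can review it before running it.

```shell
#!/usr/bin/env bash
# Sketch only: env-file names and the host-port mapping are assumptions.
IMAGE="us-central1-docker.pkg.dev/cve-buster/sec1-public-repo/sec1-devsecops-collector:latest"

# Print (do not execute) the run command for one collector container.
# $1 = container name, $2 = host port, $3 = env file, $4 = docker (default) or podman
collector_run_cmd() {
  local name="$1" host_port="$2" env_file="$3" runtime="${4:-docker}" tty=""
  # The guide asks Podman users to add --tty.
  if [ "$runtime" = "podman" ]; then tty=" --tty"; fi
  printf '%s run%s -d --name %s --env-file %s -p %s:8080 %s\n' \
    "$runtime" "$tty" "$name" "$env_file" "$host_port" "$IMAGE"
}
```

For example, `collector_run_cmd gitlab-collector 8280 gitlab-collector.env` and `collector_run_cmd ci-pipeline-collector 8281 ci-pipeline-collector.env` print one command per container; pass `podman` as the fourth argument for the Podman variant.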
Health check
4. Manual collection endpoints
All endpoints are POST (or DELETE for clear operations) and return a JSON response you can log for audit.
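Since every call returns JSON worth keeping, one possible way to keep an audit trail is a small wrapper around curl. This is a minimal sketch: the function names and the AUDIT_LOG file name are hypothetical, not part of the collector.

```shell
#!/usr/bin/env bash
# Format one audit line: UTC timestamp, HTTP method, URL, response body.
audit_line() {
  local method="$1" url="$2" body="$3"
  printf '%s %s %s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$method" "$url" "$body"
}

# Call an endpoint, append the response to the audit log, and echo it.
# AUDIT_LOG is a hypothetical file name; override it in the environment.
call_collector() {
  local method="$1" url="$2" body
  body=$(curl -sS -X "$method" "$url") || return 1
  audit_line "$method" "$url" "$body" >> "${AUDIT_LOG:-collector-audit.log}"
  printf '%s\n' "$body"
}
```

Usage: `call_collector POST http://localhost:8280/api/v1/collector/gitlab/collect`.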
Step 1 — Clear old data (optional, for a fresh start)
This deletes every DevInsight record. Run only when you want to start from scratch.
Response:
Step 2 — Run GitLab full collection (projects, commits, pipelines list, …)
Runs on the gitlab-collector container. This is the heaviest call — takes 3–4 minutes for ~10 projects depending on commit history. Non-blocking: it returns immediately and runs in the background.
Response:
Track progress by tailing the container logs:
You'll see each job finish in order: projects_sync → commits_collector → merge_requests_collector → branches_collector → contributors_collector → events_collector → pipelines_collector → compliance_collector → groups_sync → releases_collector.
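To see at a glance which jobs are still outstanding, the job order above can be checked against the log. This sketch assumes every job logs a line of the form `Job [<name>] completed`; that format is confirmed in this guide only for pipelines_collector, so treat it as an assumption for the other jobs.

```shell
#!/usr/bin/env bash
# Job names in the order the guide lists them.
COLLECTOR_JOBS="projects_sync commits_collector merge_requests_collector branches_collector contributors_collector events_collector pipelines_collector compliance_collector groups_sync releases_collector"

# Read log text on stdin and print every job that has no "completed" line yet.
pending_jobs() {
  local log job
  log=$(cat)
  for job in $COLLECTOR_JOBS; do
    # -F treats the brackets literally instead of as a regex character class.
    grep -qF "Job [$job] completed" <<< "$log" || printf '%s\n' "$job"
  done
}
```

Usage: `docker logs gitlab-collector 2>&1 | pending_jobs`. An empty result means all ten jobs have reported completion.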
Step 2b (optional, faster) — Pipelines-only list sync
If you've already collected projects and only want fresh pipeline list data (no commits/MRs/branches), use this on the gitlab-collector. Takes ~7 seconds instead of 3–4 min.
Step 3 — Enrich pipeline details (Stage 2)
Runs on the ci-pipeline-collector container. Reads the pipeline list from Mongo and enriches each pipeline with jobs, failures, security scan results, stage bottleneck analysis, NCD template classification, branch protection, and governance flags.
Parameters:
| Param | Default | Purpose |
| --- | --- | --- |
| daysBack | 7 | How far back to enrich. Use 90 for initial backfill, 7 for regular runs. |
| force | false | true clears existing enrichment first (re-fetches yaml, re-classifies, re-reads protection). Use when you change classifier config or want truly fresh enrichment. |
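A small helper can build the refresh URL from these two parameters and catch typos before the request is sent. This is an illustrative sketch; the function name is hypothetical, and the defaults simply mirror the ones documented above.

```shell
#!/usr/bin/env bash
# Build the Stage 2 refresh URL.
# $1 = base URL of the ci-pipeline-collector, $2 = daysBack (default 7), $3 = force (default false)
refresh_url() {
  local base="$1" days_back="${2:-7}" force="${3:-false}"
  case "$days_back" in
    ''|*[!0-9]*) echo "daysBack must be a non-negative integer" >&2; return 1 ;;
  esac
  case "$force" in
    true|false) ;;
    *) echo "force must be true or false" >&2; return 1 ;;
  esac
  printf '%s/api/v1/collector/gitlab/pipelines/refresh?daysBack=%s&force=%s\n' \
    "$base" "$days_back" "$force"
}
```

Usage: `curl -X POST "$(refresh_url http://localhost:8281 90 true)"` for a forced 90-day backfill.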
Response:
Progress log marker:
Step 3b — Enrich specific projects only
If you only want to process certain projects (e.g., during debugging):
Step 4 — Jenkins collection (if enabled)
Runs on the ci-pipeline-collector container.
Refresh Jenkins builds (re-fetch last N days):
5. Complete "first-run" sequence
Copy-paste this whole block for a clean first collection:
Expected total time: 5–7 minutes depending on project count and pipeline volume.
6. Day-to-day manual refresh
Once the first-run completes, you can keep running these regularly (no clear needed):
All scheduler enable flags default to false. You only ever need to set true to turn a scheduler ON.
| Env var | Default | What it controls |
| --- | --- | --- |
| GITLAB_SCHEDULER_ENABLED | false | Full GitLab cycle on cron (GITLAB_COLLECTOR_CRON) |
| GITLAB_PIPELINE_DETAIL_ENABLED | false | Stage 2 pipeline detail on cron (GITLAB_PIPELINE_DETAIL_CRON) |
| GITLAB_PIPELINES_ONLY_ENABLED | false | Lightweight pipeline-list sync on cron |
| GITLAB_PIPELINE_CLASSIFICATION_ENABLED | false | NCD classification during Stage 2 enrichment |
| JENKINS_PIPELINE_ENABLED | false | Jenkins pipeline builds on cron |
| JENKINS_PIPELINE_INSTANCE_1_* | n/a | Per-instance credentials (needed even in manual mode) |
| SONARQUBE_SCHEDULER_ENABLED | false | SonarQube collection |
| NEXUSIQ_SCHEDULER_ENABLED | false | Nexus IQ collection |
| SERVICENOW_SCHEDULER_ENABLED | false | ServiceNow collection |
| JIRA_SCHEDULER_ENABLED | false | Jira collection |
| JENKINS_SCHEDULER_ENABLED | false | Legacy Jenkins builds collector |
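Before starting a container in manual mode, it can help to confirm that no scheduler flag is switched on in its env file. The sketch below checks only the cron-related flags from the table; GITLAB_PIPELINE_CLASSIFICATION_ENABLED and the JENKINS_PIPELINE_INSTANCE_1_* settings are deliberately excluded because they do not start a cron. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Flags that turn a cron-driven scheduler on.
SCHEDULER_FLAGS="GITLAB_SCHEDULER_ENABLED GITLAB_PIPELINE_DETAIL_ENABLED GITLAB_PIPELINES_ONLY_ENABLED JENKINS_PIPELINE_ENABLED SONARQUBE_SCHEDULER_ENABLED NEXUSIQ_SCHEDULER_ENABLED SERVICENOW_SCHEDULER_ENABLED JIRA_SCHEDULER_ENABLED JENKINS_SCHEDULER_ENABLED"

# Print every scheduler flag set to true in the given env file.
# Returns 0 when nothing is enabled (manual mode), 1 otherwise.
enabled_schedulers() {
  local env_file="$1" flag found=0
  for flag in $SCHEDULER_FLAGS; do
    if grep -Eq "^${flag}=true[[:space:]]*$" "$env_file"; then
      printf '%s\n' "$flag"
      found=1
    fi
  done
  return "$found"
}
```

Usage: `enabled_schedulers gitlab-collector.env && echo "manual mode OK"` (the env-file name is an assumption).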
Cron expressions (only matter when a scheduler is enabled)
| Env var | Default |
| --- | --- |
| GITLAB_COLLECTOR_CRON | 0 */10 * * * * (every 10 minutes) |
| GITLAB_PIPELINE_DETAIL_CRON | 0 */5 * * * * (every 5 minutes) |
| GITLAB_PIPELINES_ONLY_CRON | 0 */5 * * * * (every 5 minutes) |
| JENKINS_PIPELINE_CRON | 0 */5 * * * * (every 5 minutes) |
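The defaults have six fields rather than the five of a classic crontab entry. That matches Spring's cron format, where the first field is seconds; whether the collector uses Spring's parser is an assumption based on SPRING_PROFILES_ACTIVE appearing in the env. A small helper (hypothetical name) can label the fields and reject five-field expressions before you set an override:

```shell
#!/usr/bin/env bash
# Split a six-field cron expression into labelled fields.
# Field order follows Spring's cron format (seconds first).
describe_cron() {
  local sec min hour dom mon dow extra
  # read splits on whitespace without expanding the * characters as globs.
  read -r sec min hour dom mon dow extra <<< "$1"
  if [ -z "$dow" ] || [ -n "$extra" ]; then
    echo "expected 6 fields: second minute hour day-of-month month day-of-week" >&2
    return 1
  fi
  printf 'second=%s minute=%s hour=%s day-of-month=%s month=%s day-of-week=%s\n' \
    "$sec" "$min" "$hour" "$dom" "$mon" "$dow"
}
```

Usage: `describe_cron "0 */10 * * * *"` labels the default GITLAB_COLLECTOR_CRON; a five-field expression such as `*/10 * * * *` is rejected.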
9. Troubleshooting
"Pipelines stuck on running status forever"
Enrichment only finalises a pipeline once GitLab reports success, failed, or canceled. After the pipeline finishes on GitLab, trigger /pipelines/refresh?force=true again to stamp the final status.
"I enabled classification but pipelineCategory is still custom for NCD projects"
Check that the include path in your .gitlab-ci.yml matches the configured GITLAB_PIPELINE_CATEGORY_1_INCLUDE_PROJECT in full: the comparison is case-insensitive, but the complete namespace is required. Then run /pipelines/refresh?force=true to bust the classification cache.
"renovateEnabled / ncdCliVersion are null"
These only populate when the corresponding NCD variables (RENOVATE_SOFTGATE, CLI_VERISON — note the intentional typo) are set in the pipeline's .gitlab-ci.yml. Not a bug — just missing upstream data.
"Force refresh doesn't seem to pick up my new yaml"
force=true on refresh clears in-memory caches and re-fetches the yaml. If still stale, restart the container — the yaml cache is in-memory with 6-hour TTL.
Container logs worth knowing
10. Switching from manual to scheduled mode
Once manual flow is validated, flip these to true and restart the containers:
Default crons are every 5–10 minutes. Override via GITLAB_COLLECTOR_CRON, GITLAB_PIPELINE_DETAIL_CRON, etc.
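Flipping a flag in an env file can be scripted. The sketch below is one way to do it, with a hypothetical function name and hypothetical env-file names. Which flags to flip is inferred from the separation rule earlier in this guide (main scheduler on gitlab-collector, pipeline-detail scheduler on ci-pipeline-collector), so treat that choice as an assumption too.

```shell
#!/usr/bin/env bash
# Set KEY=value in an env file: replace the existing line in place,
# or append the key if it is not present yet.
set_flag() {
  local env_file="$1" key="$2" value="$3" tmp
  tmp=$(mktemp) || return 1
  awk -F= -v k="$key" -v v="$value" '
    $1 == k { print k "=" v; seen = 1; next }
    { print }
    END { if (!seen) print k "=" v }
  ' "$env_file" > "$tmp" || { rm -f "$tmp"; return 1; }
  # Write back through the original path so file permissions are kept.
  cat "$tmp" > "$env_file"
  rm -f "$tmp"
}
```

Usage: `set_flag gitlab-collector.env GITLAB_SCHEDULER_ENABLED true` and `set_flag ci-pipeline-collector.env GITLAB_PIPELINE_DETAIL_ENABLED true`. If the containers were started with `--env-file`, the values are read when the container is created, so remove and re-create the container rather than only restarting it.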
11. MongoDB indexes
Spring Data auto-creates indexes declared on the model classes (@Indexed, @CompoundIndex) on first connection. Everything in the Auto-created column below is built for you on app startup — you don't have to do anything. The Recommended additions section is what we suggest you create by hand, either ad-hoc as the data grows or via a migration script.
Verify what's actually present: db.<collection>.getIndexes() in mongosh. Add ?autoCreateIndex=false to MONGODB_URI only if you're managing indexes outside the app.
Caveat: the Spring auto-index-creation: true flag silently skips collections that already exist with documents, so on an existing enterprise DB you'll find the declared indexes are missing. Run the script below in mongosh once. It is safe to re-run: it skips indexes that already exist with the same spec and reports any conflicts without aborting.
Pre-flight (copy-paste into mongosh)
Duplicate-check before any unique index
Unique-index creation fails if duplicates exist. Run these checks for every unique: true index in the script. If any returns rows, resolve them before creating the index; otherwise the build aborts with a DuplicateKey error.
If anything comes back, investigate before continuing. Duplicates usually mean the upsert lookup ran without an index — exactly the bug we're trying to fix. You can collapse them with an aggregation $out to a fresh collection, or delete the older copies, depending on which one is correct.
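A duplicate check of this kind is typically a $group followed by a $match on the count. The sketch below generates such a mongosh snippet from the shell. The function name is hypothetical, and the collection/field pair in the usage line is only an illustration: substitute the key fields of each unique index you plan to build.

```shell
#!/usr/bin/env bash
# Emit a mongosh snippet that lists values of <field> occurring more than once.
# $1 = collection name, $2 = field that the unique index will cover
dup_check_js() {
  local collection="$1" field="$2"
  cat <<EOF
db.getCollection("$collection").aggregate([
  { \$group: { _id: "\$$field", count: { \$sum: 1 } } },
  { \$match: { count: { \$gt: 1 } } },
  { \$limit: 20 }
], { allowDiskUse: true }).forEach(printjson);
EOF
}
```

Usage: `mongosh "$MONGODB_URI" --quiet --eval "$(dup_check_js dev_insight_projects_collection projectId)"`. No output means no duplicates for that field. For a compound unique index, the `_id` of the `$group` stage would need to be a document containing every key field.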
Run
Each call uses an explicit name so re-runs match by name and skip silently. Errors are caught per-index so one failure doesn't abort the script.
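The pattern described here (explicit name, per-index error handling) can be sketched as a generator for one mongosh statement. This is not the guide's actual script: the function name is hypothetical, and the index in the usage line is an illustration based on the commits query shown in the verification snippet.

```shell
#!/usr/bin/env bash
# Emit a mongosh snippet that creates one named index and reports a failure
# without aborting the rest of the script.
# $1 = collection, $2 = index name, $3 = key spec, $4 = extra options (optional)
create_index_js() {
  local collection="$1" name="$2" keys="$3" options="${4:-}"
  cat <<EOF
try {
  db.getCollection("$collection").createIndex($keys, { name: "$name"${options:+, $options} });
  print("ok      $collection.$name");
} catch (e) {
  print("FAILED  $collection.$name", e.codeName, e.message);
}
EOF
}
```

Usage: `mongosh "$MONGODB_URI" --quiet --eval "$(create_index_js dev_insight_commits_collection projectId_1_committedAt_-1 '{ projectId: 1, committedAt: -1 }')"`. Re-running with the same name and key spec is a no-op on the server; a name or spec conflict is caught and printed.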
Notes for an enterprise box
MongoDB has built indexes with a hybrid process since 4.2, and 8.0 is no different: reads and writes continue during the build. You don't need { background: true }; that option has been deprecated and ignored since 4.2.
Memory pressure: each large index build uses up to maxIndexBuildMemoryUsageMegabytes (default 200 MB) before spilling to disk. On a 10 M-doc collection a build can take tens of minutes. Schedule the bigger ones (commits, pipelines, events) outside peak collector cycles.
Watch progress in another mongosh window: db.currentOp({ "command.createIndexes": { $exists: true } }).
One at a time per collection — don't run two createIndex calls on the same collection in parallel; the second one queues anyway and the script is fast enough sequentially.
If a unique build fails with DuplicateKey the partial index is automatically rolled back. Fix the dupes (see pre-flight) and re-run the script — the script's named-index match will skip everything that succeeded.
Drop all DevInsight collections (destructive — full reset)
This deletes every DevInsight document AND all indexes on those collections. Use only when you want to start completely from scratch (e.g., re-onboarding, schema migration, or after corrupted state). On a fresh, empty collection Spring's auto-index-creation: true will recreate the declared indexes on next collector boot — but you'll still want to re-run the manual creation script above for the recommended additions.
Notes:
Dropping dev_insight_collector_state_collection resets every job's lastRunAt / lastFullSyncAt, so the next cycle is treated as the first run and pulls the full history window: 90 days, or whatever gitlab.daysToCollect is set to. Skip it from the list if you only want to reset data, not checkpoints.
After dropping, the dev_insight_pipelines_collection enrichment fields (jobs, failure reasons, security data) are gone too. Stage 2 (/pipelines/refresh?force=true) will need to re-fetch everything from GitLab.
Drop is instant in Mongo — it removes the collection metadata; storage reclaim happens in the background. You don't need a maintenance window.
Verifying / dropping (single index)
When to re-check
Any new repository method added with findBy… / existsBy… / countBy… — confirm the queried fields are indexed.
Any aggregation pipeline that does $match on a non-indexed field over a collection > 100k docs.
After a bulk import — Mongo doesn't auto-add indexes for fields you only ever group/sort by.
Appendix — Collections written
After a successful run you'll see these in Mongo (dev-sec-ops-db):
# gitlab-collector: DISABLE all schedulers
GITLAB_SCHEDULER_ENABLED=false
GITLAB_PIPELINE_DETAIL_ENABLED=false
GITLAB_PIPELINES_ONLY_ENABLED=false
# Template classification off here (done by CI container)
GITLAB_PIPELINE_CLASSIFICATION_ENABLED=false
# Other collectors — not used here
SONARQUBE_SCHEDULER_ENABLED=false
NEXUSIQ_SCHEDULER_ENABLED=false
SERVICENOW_SCHEDULER_ENABLED=false
JIRA_SCHEDULER_ENABLED=false
JENKINS_SCHEDULER_ENABLED=false
JENKINS_PIPELINE_ENABLED=false
# ci-pipeline-collector: GitLab main scheduler OFF (handled by gitlab-collector)
GITLAB_SCHEDULER_ENABLED=false
# Pipeline enrichment scheduler — DISABLED for manual mode
GITLAB_PIPELINE_DETAIL_ENABLED=false
GITLAB_PIPELINES_ONLY_ENABLED=false
# Template classification — keep ON so manual enrichment tags NCD/custom
GITLAB_PIPELINE_CLASSIFICATION_ENABLED=true
GITLAB_PIPELINE_CATEGORY_1_NAME=NCD
GITLAB_PIPELINE_CATEGORY_1_INCLUDE_PROJECT=gts-cta-strategy-innersource/ncd/pipeline-templates
# Add more categories as env vars — CATEGORY_2_NAME, CATEGORY_2_INCLUDE_PROJECT, etc.
# Jenkins pipeline scheduler — DISABLED for manual mode
JENKINS_PIPELINE_ENABLED=false
# Jenkins instance credentials (needed even in manual mode so the collect endpoint has connection info)
JENKINS_PIPELINE_INSTANCE_1_NAME=production-jenkins
JENKINS_PIPELINE_INSTANCE_1_URL=https://jenkins.example.com
JENKINS_PIPELINE_INSTANCE_1_USERNAME=api-user
JENKINS_PIPELINE_INSTANCE_1_TOKEN=your-jenkins-api-token
JENKINS_PIPELINE_INSTANCE_1_ENV=production
JENKINS_PIPELINE_INSTANCE_1_ENABLED=true
JENKINS_PIPELINE_INSTANCE_1_COLLECT_ALL=true
# Other collectors
SONARQUBE_SCHEDULER_ENABLED=false
NEXUSIQ_SCHEDULER_ENABLED=false
SERVICENOW_SCHEDULER_ENABLED=false
JIRA_SCHEDULER_ENABLED=false
JENKINS_SCHEDULER_ENABLED=false
curl -X POST http://localhost:8280/api/v1/collector/gitlab/pipelines/sync-all
# Re-enrich all pipelines in the last 90 days (non-blocking)
curl -X POST "http://localhost:8281/api/v1/collector/gitlab/pipelines/refresh?daysBack=90&force=true"
{
"success": true,
"message": "Pipeline detail refresh started for last 90 days (FORCE — full refresh) — runs in background",
"data": { "daysBack": 90, "force": true }
}
# Collect from all configured Jenkins instances
curl -X POST http://localhost:8281/api/v1/collector/jenkins-pipeline/collect
# Collect from a single instance
curl -X POST "http://localhost:8281/api/v1/collector/jenkins-pipeline/collect-instance?name=production-jenkins"
curl -X POST "http://localhost:8281/api/v1/collector/jenkins-pipeline/refresh?daysBack=30&force=true"
GITLAB_COLLECTOR=http://localhost:8280
CI_PIPELINE_COLLECTOR=http://localhost:8281
# 1. Wipe any previous data
curl -X DELETE "$GITLAB_COLLECTOR/api/v1/collector/gitlab/clear-all"
# 2. Kick off full GitLab collection (3-4 min)
STARTED_AT=$(date -u +%Y-%m-%dT%H:%M:%SZ)
curl -X POST "$GITLAB_COLLECTOR/api/v1/collector/gitlab/collect"
# Wait until the pipelines_collector job reports completion.
# --since restricts the search to this run, so a completion line from an earlier run is ignored.
echo "Waiting for Stage 1..."
until docker logs --since "$STARTED_AT" gitlab-collector 2>&1 | grep -q "Job \[pipelines_collector\] completed"; do
  sleep 10
done
echo "Stage 1 done."
# 3. Enrich all pipelines in the last 90 days
curl -X POST "$CI_PIPELINE_COLLECTOR/api/v1/collector/gitlab/pipelines/refresh?daysBack=90&force=true"
# 4. (Optional) Collect Jenkins
curl -X POST "$CI_PIPELINE_COLLECTOR/api/v1/collector/jenkins-pipeline/collect"
# Quick GitLab refresh — projects + commits + pipeline list
curl -X POST http://localhost:8280/api/v1/collector/gitlab/collect
# (even faster) pipelines-only list sync
curl -X POST http://localhost:8280/api/v1/collector/gitlab/pipelines/sync-all
# Re-enrich recent pipelines (last 7 days, no force)
curl -X POST "http://localhost:8281/api/v1/collector/gitlab/pipelines/refresh?daysBack=7"
# Jenkins refresh
curl -X POST "http://localhost:8281/api/v1/collector/jenkins-pipeline/refresh?daysBack=7"
use('dev-sec-ops-db');
const COLLECTIONS = [
"dev_insight_projects_collection",
"dev_insight_commits_collection",
"dev_insight_merge_requests_collection",
"dev_insight_branches_collection",
"dev_insight_contributors_collection",
"dev_insight_events_collection",
"dev_insight_pipelines_collection",
"dev_insight_compliance_collection",
"dev_insight_groups_collection",
"dev_insight_releases_collection",
"dev_insight_metrics_daily_collection",
"dev_insight_metrics_org_daily_collection",
"dev_insight_commit_hourly_counts_collection",
"dev_insight_collector_state_collection",
"dev_insight_jenkins_builds_collection"
];
COLLECTIONS.forEach(c => {
const exists = db.getCollectionNames().includes(c);
if (!exists) { print("· skip ", c, "(does not exist)"); return; }
const before = db.getCollection(c).estimatedDocumentCount();
try {
db.getCollection(c).drop();
print("✓ dropped", c.padEnd(48), "(" + before + " docs)");
} catch (e) {
print("✗ FAILED ", c.padEnd(48), e.codeName, e.errmsg);
}
});
print("\nDone. All DevInsight collections + indexes removed. Next collector run will recreate them empty.");
// Show all indexes on a collection
db.dev_insight_projects_collection.getIndexes()
// Drop one index by name
db.dev_insight_projects_collection.dropIndex("visibility_1")
// Quick sanity check that hot queries are using indexes
db.dev_insight_commits_collection.find({ projectId: 12345 })
.sort({ committedAt: -1 }).limit(1).explain("executionStats")
// look for "stage": "IXSCAN" — not "COLLSCAN"