
# DevSecOps Pipeline Collector — Runbook

This guide shows how to **deploy the collector, disable all crons, and trigger every collection step manually via REST endpoints**. Use this when you want full control over when data is fetched — e.g. during onboarding, debugging, or when you want to validate what gets written before enabling scheduled collection.

{% hint style="info" %}
**Architecture in one line:** one Docker image, two containers. One runs the GitLab-wide scrape (projects, commits, MRs, branches, pipeline list). The other enriches pipeline details and optionally collects Jenkins builds.
{% endhint %}

***

## 1. The two containers

| Container                 | Responsibility                                                                                                    | Port   | Mongo collections it writes                                                            |
| ------------------------- | ----------------------------------------------------------------------------------------------------------------- | ------ | -------------------------------------------------------------------------------------- |
| **gitlab-collector**      | Projects, commits, MRs, branches, contributors, events, **pipeline list (Stage 1)**, compliance, groups, releases | `8280` | `dev_insight_*`                                                                        |
| **ci-pipeline-collector** | **Pipeline detail enrichment (Stage 2)**, Jenkins pipeline builds                                                 | `8281` | `dev_insight_pipelines_collection` (enriches), `dev_insight_jenkins_builds_collection` |

Both use the same image: `us-central1-docker.pkg.dev/cve-buster/sec1-public-repo/sec1-devsecops-collector:latest`.

{% hint style="warning" %}
**Important separation:** don't enable the GitLab main scheduler AND the pipeline-detail scheduler in the same container — they'll race on the same Mongo docs. Keep them on separate containers.
{% endhint %}

***

## 2. Environment variables — cron-disabled, manual-only

### Common env (both containers)

```env
SPRING_PROFILES_ACTIVE=prod
SERVER_PORT=8080
LOG_LEVEL=INFO
APP_LOG_LEVEL=INFO
MONGO_LOG_LEVEL=INFO
REST_LOG_LEVEL=INFO

# Mongo — same DB for both containers
MONGODB_URI=mongodb://admin:changeme@mongo:27017/dev-sec-ops-db?authSource=admin
MONGODB_DATABASE=dev-sec-ops-db

# Sec1 API
SEC1_API_KEY=your-sec1-api-key
SEC1_USER_DETAILS_URL=https://api.sec1.io/foss/get-user-details
SEC1_FETCH_USER_DETAILS_ON_STARTUP=false

# GitLab access
GITLAB_BASE_URL=https://gitlab.com
GITLAB_TOKEN=glpat-your-token-here
GITLAB_COLLECT_ALL_PROJECTS=true
GITLAB_HEALTH_CHECK_ENABLED=false
```

### `gitlab-collector` — all crons OFF

```env
# DISABLE all schedulers
GITLAB_SCHEDULER_ENABLED=false
GITLAB_PIPELINE_DETAIL_ENABLED=false
GITLAB_PIPELINES_ONLY_ENABLED=false

# Template classification off here (done by CI container)
GITLAB_PIPELINE_CLASSIFICATION_ENABLED=false

# Other collectors — not used here
SONARQUBE_SCHEDULER_ENABLED=false
NEXUSIQ_SCHEDULER_ENABLED=false
SERVICENOW_SCHEDULER_ENABLED=false
JIRA_SCHEDULER_ENABLED=false
JENKINS_SCHEDULER_ENABLED=false
JENKINS_PIPELINE_ENABLED=false
```

### `ci-pipeline-collector` — all crons OFF

```env
# GitLab main scheduler OFF (handled by gitlab-collector)
GITLAB_SCHEDULER_ENABLED=false

# Pipeline enrichment scheduler — DISABLED for manual mode
GITLAB_PIPELINE_DETAIL_ENABLED=false
GITLAB_PIPELINES_ONLY_ENABLED=false

# Template classification — keep ON so manual enrichment tags NCD/custom
GITLAB_PIPELINE_CLASSIFICATION_ENABLED=true
GITLAB_PIPELINE_CATEGORY_1_NAME=NCD
GITLAB_PIPELINE_CATEGORY_1_INCLUDE_PROJECT=gts-cta-strategy-innersource/ncd/pipeline-templates
# Add more categories as env vars — CATEGORY_2_NAME, CATEGORY_2_INCLUDE_PROJECT, etc.

# Jenkins pipeline scheduler — DISABLED for manual mode
JENKINS_PIPELINE_ENABLED=false

# Jenkins instance credentials (needed even in manual mode so the collect endpoint has connection info)
JENKINS_PIPELINE_INSTANCE_1_NAME=production-jenkins
JENKINS_PIPELINE_INSTANCE_1_URL=https://jenkins.example.com
JENKINS_PIPELINE_INSTANCE_1_USERNAME=api-user
JENKINS_PIPELINE_INSTANCE_1_TOKEN=your-jenkins-api-token
JENKINS_PIPELINE_INSTANCE_1_ENV=production
JENKINS_PIPELINE_INSTANCE_1_ENABLED=true
JENKINS_PIPELINE_INSTANCE_1_COLLECT_ALL=true

# Other collectors
SONARQUBE_SCHEDULER_ENABLED=false
NEXUSIQ_SCHEDULER_ENABLED=false
SERVICENOW_SCHEDULER_ENABLED=false
JIRA_SCHEDULER_ENABLED=false
JENKINS_SCHEDULER_ENABLED=false
```
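
Before starting a container in manual mode, it can help to confirm that no scheduler flag was left on in its env file. The following is a minimal sketch, not part of the collector: `check_manual_mode` is a hypothetical helper name, and it only knows about the scheduler flags listed in the env blocks above.

```bash
# Hypothetical helper: fail if any scheduler flag in an env file is set to true.
# Flag names are taken from the env blocks above; the function name is illustrative.
check_manual_mode() {
  env_file="$1"
  # Scheduler switches only. GITLAB_PIPELINE_CLASSIFICATION_ENABLED and
  # JENKINS_PIPELINE_INSTANCE_1_ENABLED are not schedulers, so they are not checked.
  pattern='^(GITLAB_SCHEDULER_ENABLED|GITLAB_PIPELINE_DETAIL_ENABLED|GITLAB_PIPELINES_ONLY_ENABLED|JENKINS_PIPELINE_ENABLED|JENKINS_SCHEDULER_ENABLED|SONARQUBE_SCHEDULER_ENABLED|NEXUSIQ_SCHEDULER_ENABLED|SERVICENOW_SCHEDULER_ENABLED|JIRA_SCHEDULER_ENABLED)=true[[:space:]]*$'
  # grep prints the offending lines so you can see which flag is still on
  if grep -E "$pattern" "$env_file"; then
    echo "scheduler flags still enabled in $env_file" >&2
    return 1
  fi
  echo "manual mode OK: $env_file"
}

# Example (paths as used in section 3):
# check_manual_mode /opt/sec1/gitlab-collector.env
# check_manual_mode /opt/sec1/ci-pipeline-collector.env
```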

***

## 3. Docker / Podman commands

### GitLab collector

```bash
docker run -d --name gitlab-collector -p 8280:8080 \
  --env-file /opt/sec1/gitlab-collector.env \
  us-central1-docker.pkg.dev/cve-buster/sec1-public-repo/sec1-devsecops-collector:latest
```

### CI pipeline collector

```bash
docker run -d --name ci-pipeline-collector -p 8281:8080 \
  --env-file /opt/sec1/ci-pipeline-collector.env \
  us-central1-docker.pkg.dev/cve-buster/sec1-public-repo/sec1-devsecops-collector:latest
```

{% hint style="success" %}
Podman users: swap `docker run` for `podman run --tty` — everything else is identical.
{% endhint %}

### Health check

```bash
curl http://localhost:8280/actuator/health
curl http://localhost:8281/actuator/health
# Both should return: {"status":"UP",...}
```
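
A container can take a little while to start listening, so a single `curl` right after `docker run` may fail even though nothing is wrong. Here is a minimal polling sketch, assuming the actuator body contains `"status":"UP"` as shown above. `wait_healthy` and its 60-second default are illustrative choices, not something the collector provides.

```bash
# Poll an actuator health URL until it reports UP or the timeout expires.
# wait_healthy is an illustrative name; the 60-second default is an assumption.
wait_healthy() {
  url="$1"
  timeout="${2:-60}"
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # -f makes curl fail on HTTP errors; -sS hides progress output
    if curl -fsS "$url" 2>/dev/null | grep -q '"status":"UP"'; then
      echo "UP: $url"
      return 0
    fi
    sleep 2
    elapsed=$((elapsed + 2))
  done
  echo "timed out after ${timeout}s: $url" >&2
  return 1
}

# Example:
# wait_healthy http://localhost:8280/actuator/health
# wait_healthy http://localhost:8281/actuator/health 120
```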

***

## 4. Manual collection endpoints

All collection endpoints in this section are `POST` (or `DELETE` for clear operations) and return a JSON response you can log for audit. The read-only `GET` endpoints are listed in section 7.

### Step 1 — Clear old data (optional, for a fresh start)

{% hint style="danger" %}
This deletes every record in the 14 GitLab `dev_insight_*` collections listed below. Run only when you want to start from scratch.
{% endhint %}

```bash
# Clears 14 GitLab collections (projects, commits, MRs, branches, contributors,
# events, pipelines, compliance, groups, releases, metrics_daily, metrics_org_daily,
# commit_hourly_counts, collector_state)
curl -X DELETE http://localhost:8280/api/v1/collector/gitlab/clear-all
```

Response:

```json
{
  "success": true,
  "message": "Cleared 2116 documents across 14 GitLab collections",
  "data": {
    "projects": 9,
    "commits": 1687,
    "merge_requests": 34,
    "branches": 168,
    "contributors": 159,
    "events": 8,
    "pipelines": 16,
    "compliance": 9,
    "groups": 2,
    "releases": 3,
    "metrics_daily": 9,
    "metrics_org_daily": 1,
    "commit_hourly_counts": 0,
    "collector_state": 11
  }
}
```

### Step 2 — Run GitLab full collection (projects, commits, pipelines list, …)

Runs on the **gitlab-collector** container. This is the heaviest call — takes 3–4 minutes for \~10 projects depending on commit history. Non-blocking: it returns immediately and runs in the background.

```bash
curl -X POST http://localhost:8280/api/v1/collector/gitlab/collect
```

Response:

```json
{
  "success": true,
  "message": "GitLab collection started",
  "timestamp": "2026-04-24T10:00:00Z"
}
```

Track progress by tailing the container logs:

```bash
docker logs -f gitlab-collector | grep -E "Job \[.*\] completed"
```

You'll see each job finish in order: `projects_sync` → `commits_collector` → `merge_requests_collector` → `branches_collector` → `contributors_collector` → `events_collector` → `pipelines_collector` → `compliance_collector` → `groups_sync` → `releases_collector`.
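
To see at a glance which jobs are still outstanding, you can compare the log against the job order listed above. This sketch assumes the `Job [name] completed` wording that the `grep` above matches; `completed_jobs`, `pending_jobs` and `EXPECTED_JOBS` are hypothetical names.

```bash
# Job names in the order given above.
EXPECTED_JOBS="projects_sync commits_collector merge_requests_collector branches_collector contributors_collector events_collector pipelines_collector compliance_collector groups_sync releases_collector"

# Print the name of every job that has logged completion, one per line.
# Reads log text on stdin.
completed_jobs() {
  sed -n 's/.*Job \[\([^]]*\)] completed.*/\1/p'
}

# Print the expected jobs that have not logged completion yet.
# Reads log text on stdin.
pending_jobs() {
  done_list=$(completed_jobs)
  for job in $EXPECTED_JOBS; do
    # -x matches whole lines only, so "groups" would not match "groups_sync"
    printf '%s\n' "$done_list" | grep -qx "$job" || echo "$job"
  done
}

# Example:
# docker logs gitlab-collector 2>&1 | pending_jobs
```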

### Step 2b (optional, faster) — Pipelines-only list sync

If you've already collected projects and only want fresh pipeline list data (no commits/MRs/branches), use this on the **gitlab-collector**. Takes \~7 seconds instead of 3–4 min.

```bash
curl -X POST http://localhost:8280/api/v1/collector/gitlab/pipelines/sync-all
```

### Step 3 — Enrich pipeline details (Stage 2)

Runs on the **ci-pipeline-collector** container. Reads the pipeline list from Mongo and enriches each pipeline with jobs, failures, security scan results, stage bottleneck analysis, NCD template classification, branch protection, and governance flags.

```bash
# Re-enrich all pipelines in the last 90 days (non-blocking)
curl -X POST "http://localhost:8281/api/v1/collector/gitlab/pipelines/refresh?daysBack=90&force=true"
```

Parameters:

| Param      | Default | Purpose                                                                                                                                                              |
| ---------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `daysBack` | `7`     | How far back to enrich. Use `90` for initial backfill, `7` for regular runs.                                                                                         |
| `force`    | `false` | `true` clears existing enrichment first (re-fetches yaml, re-classifies, re-reads protection). Use when you change classifier config or want truly fresh enrichment. |

Response:

```json
{
  "success": true,
  "message": "Pipeline detail refresh started for last 90 days (FORCE — full refresh) — runs in background",
  "data": { "daysBack": 90, "force": true }
}
```

Progress log marker:

```bash
docker logs -f ci-pipeline-collector | grep -E "Stage 2: enriched"
# -> 2026-04-24 10:05:12 - Stage 2: enriched 33 pipelines across all projects
```

### Step 3b — Enrich specific projects only

If you only want to process certain projects (e.g., during debugging):

```bash
# Accepts numeric IDs, repo paths, or names
curl -X POST "http://localhost:8281/api/v1/collector/gitlab/pipelines/enrich?daysBack=90&force=true" \
  -H "Content-Type: application/json" \
  -d '["sec1-group/ScaleSecVulnado", "80337161", "vulnerable-java-application"]'
```
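
When the project list comes from a script rather than being typed by hand, quoting the JSON body correctly is the easy part to get wrong. Here is a minimal sketch that turns arguments into the array shown above. `project_list_json` is a hypothetical helper; it escapes only backslashes and double quotes, which is assumed to be enough for IDs, paths and names.

```bash
# Turn a list of project identifiers into a JSON array of strings.
# Escapes backslashes and double quotes; control characters are out of scope.
project_list_json() {
  first=1
  printf '['
  for id in "$@"; do
    escaped=$(printf '%s' "$id" | sed 's/\\/\\\\/g; s/"/\\"/g')
    [ "$first" -eq 1 ] || printf ', '
    printf '"%s"' "$escaped"
    first=0
  done
  printf ']\n'
}

# Example:
# curl -X POST "http://localhost:8281/api/v1/collector/gitlab/pipelines/enrich?daysBack=90&force=true" \
#   -H "Content-Type: application/json" \
#   -d "$(project_list_json sec1-group/ScaleSecVulnado 80337161)"
```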

### Step 4 — Jenkins collection (if enabled)

Runs on the **ci-pipeline-collector** container.

```bash
# Collect from all configured Jenkins instances
curl -X POST http://localhost:8281/api/v1/collector/jenkins-pipeline/collect

# Collect from a single instance
curl -X POST "http://localhost:8281/api/v1/collector/jenkins-pipeline/collect-instance?name=production-jenkins"
```

Refresh Jenkins builds (re-fetch last N days):

```bash
curl -X POST "http://localhost:8281/api/v1/collector/jenkins-pipeline/refresh?daysBack=30&force=true"
```

***

## 5. Complete "first-run" sequence

Copy-paste this whole block for a clean first collection:

```bash
GITLAB_COLLECTOR=http://localhost:8280
CI_PIPELINE_COLLECTOR=http://localhost:8281

# 1. Wipe any previous data
curl -X DELETE $GITLAB_COLLECTOR/api/v1/collector/gitlab/clear-all

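# 2. Kick off full GitLab collection (3-4 min)
# Record the start time so the wait below ignores "completed" lines from earlier runs
STARTED_AT=$(date -u +%Y-%m-%dT%H:%M:%SZ)
curl -X POST $GITLAB_COLLECTOR/api/v1/collector/gitlab/collect

# Wait until the pipelines_collector job logs completion for this run
echo "Waiting for Stage 1..."
until docker logs --since "$STARTED_AT" gitlab-collector 2>&1 | grep -q "Job \[pipelines_collector\] completed"; do
  sleep 10
done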
echo "Stage 1 done."

# 3. Enrich all pipelines in the last 90 days
curl -X POST "$CI_PIPELINE_COLLECTOR/api/v1/collector/gitlab/pipelines/refresh?daysBack=90&force=true"

# 4. (Optional) Collect Jenkins
curl -X POST $CI_PIPELINE_COLLECTOR/api/v1/collector/jenkins-pipeline/collect
```
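
The wait loop in the block above has no upper bound, so it spins forever if the collection fails before `pipelines_collector` completes. Here is a bounded variant as a sketch. `wait_for_log`, its 600-second default and its 10-second poll interval are illustrative choices. The `--since` flag of `docker logs` limits the search to lines logged after the given timestamp.

```bash
# Wait until a container log line matching a pattern appears, or give up.
# wait_for_log and its defaults are illustrative, not part of the collector.
wait_for_log() {
  container="$1"
  pattern="$2"
  since="$3"            # RFC 3339 timestamp passed to `docker logs --since`
  timeout="${4:-600}"   # seconds before giving up
  interval="${5:-10}"   # seconds between polls
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    if docker logs --since "$since" "$container" 2>&1 | grep -q "$pattern"; then
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "gave up waiting for '$pattern' in $container after ${timeout}s" >&2
  return 1
}

# Example:
# STARTED_AT=$(date -u +%Y-%m-%dT%H:%M:%SZ)
# curl -X POST http://localhost:8280/api/v1/collector/gitlab/collect
# wait_for_log gitlab-collector 'Job \[pipelines_collector\] completed' "$STARTED_AT" || exit 1
```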

Expected total time: 5–7 minutes depending on project count and pipeline volume.

***

## 6. Day-to-day manual refresh

Once the first run completes, you can keep running these regularly (no clear needed):

```bash
# Full GitLab refresh: projects, commits, MRs, branches, pipeline list and the rest of the cycle
curl -X POST http://localhost:8280/api/v1/collector/gitlab/collect

# Faster alternative: pipelines-only list sync
curl -X POST http://localhost:8280/api/v1/collector/gitlab/pipelines/sync-all

# Re-enrich recent pipelines (last 7 days, no force)
curl -X POST "http://localhost:8281/api/v1/collector/gitlab/pipelines/refresh?daysBack=7"

# Jenkins refresh
curl -X POST "http://localhost:8281/api/v1/collector/jenkins-pipeline/refresh?daysBack=7"
```

***

## 7. Endpoint reference

### On **gitlab-collector** (`:8280`)

| Method   | Path                                          | Purpose                                                                                                                     |
| -------- | --------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------- |
| `POST`   | `/api/v1/collector/gitlab/collect`            | Full GitLab cycle: projects → commits → MRs → branches → contributors → events → pipelines → compliance → groups → releases |
| `POST`   | `/api/v1/collector/gitlab/collect-projects`   | Only sync projects (fastest, just inventory)                                                                                |
| `POST`   | `/api/v1/collector/gitlab/pipelines/sync-all` | Lightweight — only refresh the pipeline list (skips commits/MRs)                                                            |
| `POST`   | `/api/v1/collector/gitlab/pipelines/collect`  | Sync pipeline list for specific projects (JSON body: `["repo-path", "id", ...]`)                                            |
| `DELETE` | `/api/v1/collector/gitlab/clear-all`          | Wipe all 14 GitLab collections                                                                                              |
| `GET`    | `/api/v1/collector/gitlab/sanity-check`       | Health report (counts + last-run states + issues found)                                                                     |

### On **ci-pipeline-collector** (`:8281`)

| Method | Path                                                                                    | Purpose                                                                           |
| ------ | --------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------- |
| `POST` | `/api/v1/collector/gitlab/pipelines/refresh?daysBack=N&force=BOOL`                      | **Stage 2 enrichment** — jobs, failures, security, NCD classification, governance |
| `POST` | `/api/v1/collector/gitlab/pipelines/refresh-project?projectId=ID&daysBack=N&force=BOOL` | Enrich a single GitLab project                                                    |
| `POST` | `/api/v1/collector/gitlab/pipelines/enrich`                                             | Enrich specific projects (JSON body: `["repo-path", "id", ...]`)                  |
| `POST` | `/api/v1/collector/gitlab/pipelines/collect-details`                                    | Runs the pipeline-detail cycle once (same as scheduled job)                       |
| `POST` | `/api/v1/collector/jenkins-pipeline/collect`                                            | Collect builds from **all** configured Jenkins instances                          |
| `POST` | `/api/v1/collector/jenkins-pipeline/collect-instance?name=INSTANCE`                     | Collect from a single Jenkins instance                                            |
| `POST` | `/api/v1/collector/jenkins-pipeline/refresh?daysBack=N&force=BOOL`                      | Refresh Jenkins builds, overwrite existing                                        |
| `GET`  | `/api/v1/collector/jenkins-pipeline/instances`                                          | List configured instances and their health                                        |
| `GET`  | `/api/v1/collector/jenkins-pipeline/sanity-check`                                       | Jenkins health report                                                             |

***

## 8. Environment variable reference

All scheduler enable flags default to `false`. You only ever need to set `true` to turn a scheduler ON.

| Env var                                  | Default | What it controls                                                |
| ---------------------------------------- | ------- | --------------------------------------------------------------- |
| `GITLAB_SCHEDULER_ENABLED`               | `false` | Full GitLab cycle on cron (`GITLAB_COLLECTOR_CRON`)             |
| `GITLAB_PIPELINE_DETAIL_ENABLED`         | `false` | Stage 2 pipeline detail on cron (`GITLAB_PIPELINE_DETAIL_CRON`) |
| `GITLAB_PIPELINES_ONLY_ENABLED`          | `false` | Lightweight pipeline-list sync on cron                          |
| `GITLAB_PIPELINE_CLASSIFICATION_ENABLED` | `false` | NCD classification during Stage 2 enrichment                    |
| `JENKINS_PIPELINE_ENABLED`               | `false` | Jenkins pipeline builds on cron                                 |
| `JENKINS_PIPELINE_INSTANCE_1_*`          | —       | Per-instance credentials (needed even in manual mode)           |
| `SONARQUBE_SCHEDULER_ENABLED`            | `false` | SonarQube collection                                            |
| `NEXUSIQ_SCHEDULER_ENABLED`              | `false` | Nexus IQ collection                                             |
| `SERVICENOW_SCHEDULER_ENABLED`           | `false` | ServiceNow collection                                           |
| `JIRA_SCHEDULER_ENABLED`                 | `false` | Jira collection                                                 |
| `JENKINS_SCHEDULER_ENABLED`              | `false` | Legacy Jenkins builds collector                                 |

### Cron expressions (only matter when a scheduler is enabled)

| Env var                       | Default                             |
| ----------------------------- | ----------------------------------- |
| `GITLAB_COLLECTOR_CRON`       | `0 */10 * * * *` — every 10 minutes |
| `GITLAB_PIPELINE_DETAIL_CRON` | `0 */5 * * * *` — every 5 minutes   |
| `GITLAB_PIPELINES_ONLY_CRON`  | `0 */5 * * * *` — every 5 minutes   |
| `JENKINS_PIPELINE_CRON`       | `0 */5 * * * *` — every 5 minutes   |
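
The defaults above use six fields (second, minute, hour, day of month, month, day of week), not the five-field crontab form, so `*/10 * * * *` is one field short. Here is a small sanity check you could run on an override before putting it in an env file. This is a sketch only: the helper names are hypothetical, and it counts fields without validating their contents.

```bash
# Count the space-separated fields in a cron expression.
cron_field_count() {
  # set -f stops the shell from expanding * against files in the current directory
  ( set -f; set -- $1; echo "$#" )
}

# Succeed only for six-field expressions, the form used by the defaults above.
is_six_field_cron() {
  [ "$(cron_field_count "$1")" -eq 6 ]
}

# Example:
# is_six_field_cron "0 0 * * * *" && echo "looks like a six-field cron"
```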

***

## 9. Troubleshooting

### "Pipelines stuck on `running` status forever"

Enrichment only finalises when GitLab reports `success` / `failed` / `canceled`. Trigger another `refresh?force=true` after the pipeline finishes on GitLab to stamp the final status.

### "I enabled classification but `pipelineCategory` is still `custom` for NCD projects"

Check that the include path in your `.gitlab-ci.yml` matches the configured `GITLAB_PIPELINE_CATEGORY_1_INCLUDE_PROJECT`: the comparison is case-insensitive, but the full namespace path is required. Then run `refresh?force=true` to bust the classification cache.

### "`renovateEnabled` / `ncdCliVersion` are null"

These only populate when the corresponding NCD variables (`RENOVATE_SOFTGATE`, `CLI_VERISON` — note the intentional typo) are set in the pipeline's `.gitlab-ci.yml`. Not a bug — just missing upstream data.

### "Force refresh doesn't seem to pick up my new yaml"

`force=true` on `refresh` clears in-memory caches and re-fetches the yaml. If the result is still stale, restart the container: the yaml cache lives in memory with a 6-hour TTL, so a restart empties it.

### Container logs worth knowing

```bash
# Collection start / end
docker logs gitlab-collector | grep -E "Starting|completed"

# Pipeline enrichment progress
docker logs ci-pipeline-collector | grep -E "Stage 2|Refreshing"

# GitLab API rate-limit headers
docker logs gitlab-collector | grep -i ratelimit | tail -5
```

***

## 10. Switching from manual to scheduled mode

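Once the manual flow is validated, set these to `true` in the env files and recreate the containers. `docker restart` reuses the environment captured when the container was created, so a change to an `--env-file` only takes effect after `docker rm -f <name>` and a fresh `docker run`: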

```env
# gitlab-collector
GITLAB_SCHEDULER_ENABLED=true

# ci-pipeline-collector
GITLAB_PIPELINE_DETAIL_ENABLED=true
JENKINS_PIPELINE_ENABLED=true
```

Default crons are every 5–10 minutes. Override via `GITLAB_COLLECTOR_CRON`, `GITLAB_PIPELINE_DETAIL_CRON`, etc.
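
Flipping a flag can be scripted so the same change is applied to each env file consistently. Here is a minimal sketch. `set_env_flag` is a hypothetical helper intended for simple values such as `true` or `false`, not for values containing `|` or `&`.

```bash
# Set KEY=value in an env file: replace the existing line, or append if absent.
# set_env_flag is an illustrative name, not part of the collector.
set_env_flag() {
  file="$1"
  key="$2"
  value="$3"
  tmp=$(mktemp)
  if grep -q "^${key}=" "$file"; then
    sed "s|^${key}=.*|${key}=${value}|" "$file" > "$tmp"
  else
    cat "$file" > "$tmp"
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
  fi
  # Write back in place so the original file keeps its permissions
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Example (env vars are read at container creation, so recreate afterwards):
# set_env_flag /opt/sec1/gitlab-collector.env GITLAB_SCHEDULER_ENABLED true
# docker rm -f gitlab-collector
# docker run -d --name gitlab-collector -p 8280:8080 \
#   --env-file /opt/sec1/gitlab-collector.env \
#   us-central1-docker.pkg.dev/cve-buster/sec1-public-repo/sec1-devsecops-collector:latest
```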

***

## 11. MongoDB indexes

Spring Data auto-creates indexes declared on the model classes (`@Indexed`, `@CompoundIndex`) on first connection. Everything in the **Auto-created from model annotations** table below is built for you on app startup, so on a fresh database you don't have to do anything. The lines marked `// recommended additions` in the manual creation script are indexes we suggest you create by hand, either ad hoc as the data grows or via a migration script.

> **Verify what's actually present:** `db.<collection>.getIndexes()` in `mongosh`. Add `?autoCreateIndex=false` to `MONGODB_URI` only if you're managing indexes outside the app.

### Auto-created from model annotations

| Collection                                    | Index                               | Purpose                                                  |
| --------------------------------------------- | ----------------------------------- | -------------------------------------------------------- |
| `dev_insight_projects_collection`             | `gitlabProjectId` (unique)          | Per-project upsert lookup                                |
|                                               | `team`                              | Filter projects by team                                  |
|                                               | `lastActivityAt` (desc)             | Delta-cycle skipping (`hasActivitySince`) + recency sort |
|                                               | `applicationName`                   | Cross-collector joins                                    |
|                                               | `teamId`                            | Team rollups                                             |
|                                               | `healthScore`, `healthStatus`       | Health filtering / sort                                  |
|                                               | `isMonorepoFlag`                    | Monorepo classification                                  |
|                                               | `bus_factor_archived` (compound)    | Bus factor leaderboards                                  |
| `dev_insight_commits_collection`              | `sha` (unique)                      | Commit dedup                                             |
|                                               | `committedAt` (desc)                | Time-window scans                                        |
|                                               | `project_committed` (compound)      | Per-project commit history                               |
|                                               | `author_committed` (compound)       | Per-author history + first-commit lookup                 |
|                                               | `heatmap_idx` (compound)            | Hourly commit heatmap aggregation                        |
| `dev_insight_merge_requests_collection`       | `gitlabMrId` (unique)               | MR upsert                                                |
|                                               | `mergedAt` (desc)                   | Throughput windows                                       |
|                                               | `project_updated`, `project_state`  | Per-project MR lookups                                   |
|                                               | `author_created`, `state_created`   | Author / state windowed counts                           |
| `dev_insight_branches_collection`             | `project_branch` (compound, unique) | Per-project branch upsert                                |
|                                               | `project_status`                    | Stale / orphaned counts                                  |
|                                               | `status`                            | Org-wide branch hygiene                                  |
| `dev_insight_contributors_collection`         | `email` (unique)                    | Contributor upsert by email                              |
|                                               | `gitlabUserId` (unique, sparse)     | Contributor upsert by GL user id                         |
|                                               | `totalReviews` (desc)               | Reviewer leaderboards                                    |
|                                               | `isIdle`                            | Idle contributor counts                                  |
|                                               | `team_commits` (compound)           | Team contributor leaderboards                            |
| `dev_insight_events_collection`               | `gitlabEventId` (unique)            | Event dedup                                              |
|                                               | `createdAt` (desc)                  | Activity feed                                            |
|                                               | `project_created` (compound)        | Per-project event feed                                   |
|                                               | `collectedAt` (TTL 30 d)            | Auto-purge                                               |
| `dev_insight_pipelines_collection`            | `pipelineId` (unique)               | Pipeline upsert                                          |
|                                               | `applicationName`                   | App-level pipeline filtering                             |
|                                               | `project_created` (compound)        | Per-project pipeline history                             |
|                                               | `project_ref_status` (compound)     | Branch + status filtering                                |
| `dev_insight_compliance_collection`           | `projectId` (unique)                | Per-project compliance upsert                            |
|                                               | `team`, `complianceLevel`           | Org compliance reports                                   |
| `dev_insight_groups_collection`               | `gitlabGroupId` (unique)            | Group upsert                                             |
| `dev_insight_releases_collection`             | `releasedAt` (desc)                 | Release timeline                                         |
|                                               | `tagName`                           | Tag lookups                                              |
|                                               | `project_released` (compound)       | Per-project release history                              |
|                                               | `project_tag` (compound, unique)    | Release upsert                                           |
|                                               | `collectedAt` (TTL 1 y)             | Auto-purge                                               |
| `dev_insight_metrics_daily_collection`        | `date` (desc)                       | Time-series scan                                         |
|                                               | `project_date` (compound, unique)   | Daily per-project upsert                                 |
|                                               | `team_date` (compound)              | Team daily metrics                                       |
|                                               | `collectedAt` (TTL 1 y)             | Auto-purge                                               |
| `dev_insight_metrics_org_daily_collection`    | `date` (unique)                     | Org daily upsert                                         |
|                                               | `collectedAt` (TTL 1 y)             | Auto-purge                                               |
| `dev_insight_commit_hourly_counts_collection` | `date_day_hour` (compound, unique)  | Heatmap upsert                                           |
|                                               | `collectedAt` (TTL 90 d)            | Auto-purge                                               |
| `dev_insight_collector_state_collection`      | `jobName` (unique)                  | Job-state lookup (incl. `full_cycle` cycle pacing)       |

### Manual creation script (idempotent)

The Spring `auto-index-creation: true` flag silently skips collections that already exist with documents, so on an enterprise DB you'll find the declared indexes are missing. Run the script below in `mongosh` once. It is **safe to re-run** — it skips indexes that already exist with the same spec and reports any conflicts without aborting.

#### Pre-flight (copy-paste into mongosh)

```js
use('dev-sec-ops-db');

// 1. Confirm version + topology. MongoDB 4.2 and later (including 8.x) use the
//    hybrid build process for every index, so reads/writes continue during the
//    build and the { background: true } option is ignored.
print("Mongo version:", db.version());
print("Topology:", db.hello().setName ? "replica set" : "standalone");

// 2. Snapshot what's already there so you know what'll be created vs skipped
const COLLECTIONS = [
  "dev_insight_projects_collection",
  "dev_insight_commits_collection",
  "dev_insight_merge_requests_collection",
  "dev_insight_branches_collection",
  "dev_insight_contributors_collection",
  "dev_insight_events_collection",
  "dev_insight_pipelines_collection",
  "dev_insight_compliance_collection",
  "dev_insight_groups_collection",
  "dev_insight_releases_collection",
  "dev_insight_metrics_daily_collection",
  "dev_insight_metrics_org_daily_collection",
  "dev_insight_commit_hourly_counts_collection",
  "dev_insight_collector_state_collection",
  "dev_insight_jenkins_builds_collection"
];
COLLECTIONS.forEach(c => {
  const idx = db.getCollection(c).getIndexes().map(i => i.name);
  print(c.padEnd(48), "→", idx.join(", "));
});

// 3. Doc counts so you know what you're about to index
COLLECTIONS.forEach(c => print(c.padEnd(48), "→", db.getCollection(c).estimatedDocumentCount(), "docs"));
```

#### Duplicate-check before any unique index

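Unique-index creation **fails mid-build** if duplicates exist. Run a check like the ones below for every `unique: true` index in the script. If any returns rows, resolve them before creating the index; otherwise the build aborts with a `DuplicateKey` error. The checks below cover the main upsert keys. Write the same aggregation for the remaining unique indexes in the table above (for example `email` on contributors or `jobName` on collector state).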

```js
// gitlabProjectId on projects
db.dev_insight_projects_collection.aggregate([
  { $group: { _id: "$gitlabProjectId", n: { $sum: 1 } } },
  { $match: { n: { $gt: 1 } } }
]).toArray();

// sha on commits
db.dev_insight_commits_collection.aggregate([
  { $group: { _id: "$sha", n: { $sum: 1 } } },
  { $match: { n: { $gt: 1 } } }
]).toArray();

// gitlabMrId on MRs
db.dev_insight_merge_requests_collection.aggregate([
  { $group: { _id: "$gitlabMrId", n: { $sum: 1 } } },
  { $match: { n: { $gt: 1 } } }
]).toArray();

// pipelineId on pipelines
db.dev_insight_pipelines_collection.aggregate([
  { $group: { _id: "$pipelineId", n: { $sum: 1 } } },
  { $match: { n: { $gt: 1 } } }
]).toArray();

// gitlabEventId on events
db.dev_insight_events_collection.aggregate([
  { $group: { _id: "$gitlabEventId", n: { $sum: 1 } } },
  { $match: { n: { $gt: 1 } } }
]).toArray();

// projectId on compliance (one compliance doc per project)
db.dev_insight_compliance_collection.aggregate([
  { $group: { _id: "$projectId", n: { $sum: 1 } } },
  { $match: { n: { $gt: 1 } } }
]).toArray();

// (projectId, tagName) on releases
db.dev_insight_releases_collection.aggregate([
  { $group: { _id: { p: "$projectId", t: "$tagName" }, n: { $sum: 1 } } },
  { $match: { n: { $gt: 1 } } }
]).toArray();

// (projectId, branchName) on branches
db.dev_insight_branches_collection.aggregate([
  { $group: { _id: { p: "$projectId", b: "$branchName" }, n: { $sum: 1 } } },
  { $match: { n: { $gt: 1 } } }
]).toArray();
```

If anything comes back, **investigate before continuing.** Duplicates usually mean the upsert lookup ran without an index — exactly the bug we're trying to fix. You can collapse them with an aggregation `$out` to a fresh collection, or delete the older copies, depending on which one is correct.

#### Run

Each call uses an explicit `name` so re-runs match by name and skip silently. Errors are caught per-index so one failure doesn't abort the script.

```js
use('dev-sec-ops-db');

function safeCreate(coll, spec, opts) {
  opts = opts || {};
  if (!opts.name) { print("✗ missing name for", coll, JSON.stringify(spec)); return; }
  try {
    db.getCollection(coll).createIndex(spec, opts);
    print("✓", coll.padEnd(48), opts.name);
  } catch (e) {
    if (e.codeName === "IndexOptionsConflict" || e.codeName === "IndexKeySpecsConflict") {
      print("⚠ skipped", coll.padEnd(48), opts.name, "—", e.codeName);
    } else if (e.codeName === "DuplicateKey") {
      print("✗ DUPLICATE DATA", coll.padEnd(48), opts.name, "— resolve dupes then re-run");
    } else {
      print("✗ FAILED", coll.padEnd(48), opts.name, "—", e.codeName, e.errmsg);
    }
  }
}

// ===================== dev_insight_projects_collection =====================
safeCreate("dev_insight_projects_collection", { gitlabProjectId: 1 }, { unique: true, name: "gitlabProjectId_1" });
safeCreate("dev_insight_projects_collection", { team: 1 }, { name: "team_1" });
safeCreate("dev_insight_projects_collection", { lastActivityAt: -1 }, { name: "lastActivityAt_-1" });
safeCreate("dev_insight_projects_collection", { applicationName: 1 }, { name: "applicationName_1" });
safeCreate("dev_insight_projects_collection", { teamId: 1 }, { name: "teamId_1" });
safeCreate("dev_insight_projects_collection", { healthScore: 1 }, { name: "healthScore_1" });
safeCreate("dev_insight_projects_collection", { healthStatus: 1 }, { name: "healthStatus_1" });
safeCreate("dev_insight_projects_collection", { isMonorepoFlag: 1 }, { name: "isMonorepoFlag_1" });
safeCreate("dev_insight_projects_collection", { busFactor: 1, isArchived: 1 }, { name: "bus_factor_archived" });
// recommended additions
safeCreate("dev_insight_projects_collection", { isArchived: 1 }, { name: "isArchived_1" });
safeCreate("dev_insight_projects_collection", { visibility: 1 }, { name: "visibility_1" });
safeCreate("dev_insight_projects_collection", { scmProvider: 1 }, { name: "scmProvider_1" });

// ===================== dev_insight_commits_collection =====================
safeCreate("dev_insight_commits_collection", { sha: 1 }, { unique: true, name: "sha_1" });
safeCreate("dev_insight_commits_collection", { committedAt: -1 }, { name: "committedAt_-1" });
safeCreate("dev_insight_commits_collection", { projectId: 1, committedAt: -1 }, { name: "project_committed" });
safeCreate("dev_insight_commits_collection", { authorEmail: 1, committedAt: -1 }, { name: "author_committed" });
safeCreate("dev_insight_commits_collection", { dayOfWeek: 1, hourOfDay: 1 }, { name: "heatmap_idx" });

// ===================== dev_insight_merge_requests_collection ==============
safeCreate("dev_insight_merge_requests_collection", { gitlabMrId: 1 }, { unique: true, name: "gitlabMrId_1" });
safeCreate("dev_insight_merge_requests_collection", { mergedAt: -1 }, { name: "mergedAt_-1" });
safeCreate("dev_insight_merge_requests_collection", { projectId: 1, updatedAt: -1 }, { name: "project_updated" });
safeCreate("dev_insight_merge_requests_collection", { projectId: 1, state: 1 }, { name: "project_state" });
safeCreate("dev_insight_merge_requests_collection", { authorUsername: 1, createdAt: -1 }, { name: "author_created" });
safeCreate("dev_insight_merge_requests_collection", { state: 1, createdAt: -1 }, { name: "state_created" });
// recommended additions
safeCreate("dev_insight_merge_requests_collection", { mergedWithoutReview: 1, mergedAt: -1 }, { name: "mergedWithoutReview_mergedAt" });

// ===================== dev_insight_branches_collection ====================
safeCreate("dev_insight_branches_collection", { projectId: 1, branchName: 1 }, { unique: true, name: "project_branch" });
safeCreate("dev_insight_branches_collection", { projectId: 1, status: 1 }, { name: "project_status" });
safeCreate("dev_insight_branches_collection", { status: 1 }, { name: "status_1" });

// ===================== dev_insight_contributors_collection ================
safeCreate("dev_insight_contributors_collection", { gitlabUserId: 1 }, { unique: true, sparse: true, name: "gitlabUserId_1" });
safeCreate("dev_insight_contributors_collection", { email: 1 }, { unique: true, name: "email_1" });
safeCreate("dev_insight_contributors_collection", { totalReviews: -1 }, { name: "totalReviews_-1" });
safeCreate("dev_insight_contributors_collection", { isIdle: 1 }, { name: "isIdle_1" });
safeCreate("dev_insight_contributors_collection", { team: 1, totalCommits: -1 }, { name: "team_commits" });

// ===================== dev_insight_events_collection ======================
safeCreate("dev_insight_events_collection", { gitlabEventId: 1 }, { unique: true, name: "gitlabEventId_1" });
safeCreate("dev_insight_events_collection", { createdAt: -1 }, { name: "createdAt_-1" });
safeCreate("dev_insight_events_collection", { projectId: 1, createdAt: -1 }, { name: "project_created" });
safeCreate("dev_insight_events_collection", { collectedAt: 1 }, { expireAfterSeconds: 2592000, name: "collectedAt_ttl" });

// ===================== dev_insight_pipelines_collection ===================
safeCreate("dev_insight_pipelines_collection", { pipelineId: 1 }, { unique: true, name: "pipelineId_1" });
safeCreate("dev_insight_pipelines_collection", { applicationName: 1 }, { name: "applicationName_1" });
safeCreate("dev_insight_pipelines_collection", { projectId: 1, createdAt: -1 }, { name: "project_created" });
safeCreate("dev_insight_pipelines_collection", { projectId: 1, ref: 1, status: 1 }, { name: "project_ref_status" });
// recommended additions
safeCreate("dev_insight_pipelines_collection", { status: 1, createdAt: -1 }, { name: "status_createdAt" });
safeCreate("dev_insight_pipelines_collection", { failureCategory: 1 }, { name: "failureCategory_1" });

// ===================== dev_insight_compliance_collection ==================
safeCreate("dev_insight_compliance_collection", { projectId: 1 }, { unique: true, name: "projectId_1" });
safeCreate("dev_insight_compliance_collection", { team: 1 }, { name: "team_1" });
safeCreate("dev_insight_compliance_collection", { complianceLevel: 1 }, { name: "complianceLevel_1" });

// ===================== dev_insight_groups_collection ======================
safeCreate("dev_insight_groups_collection", { gitlabGroupId: 1 }, { unique: true, name: "gitlabGroupId_1" });

// ===================== dev_insight_releases_collection ====================
safeCreate("dev_insight_releases_collection", { tagName: 1 }, { name: "tagName_1" });
safeCreate("dev_insight_releases_collection", { releasedAt: -1 }, { name: "releasedAt_-1" });
safeCreate("dev_insight_releases_collection", { projectId: 1, releasedAt: -1 }, { name: "project_released" });
safeCreate("dev_insight_releases_collection", { projectId: 1, tagName: 1 }, { unique: true, name: "project_tag" });
safeCreate("dev_insight_releases_collection", { collectedAt: 1 }, { expireAfterSeconds: 31536000, name: "collectedAt_ttl" });

// ===================== dev_insight_metrics_daily_collection ===============
safeCreate("dev_insight_metrics_daily_collection", { date: -1 }, { name: "date_-1" });
safeCreate("dev_insight_metrics_daily_collection", { projectId: 1, date: 1 }, { unique: true, name: "project_date" });
safeCreate("dev_insight_metrics_daily_collection", { team: 1, date: 1 }, { name: "team_date" });
safeCreate("dev_insight_metrics_daily_collection", { collectedAt: 1 }, { expireAfterSeconds: 31536000, name: "collectedAt_ttl" });

// ===================== dev_insight_metrics_org_daily_collection ===========
safeCreate("dev_insight_metrics_org_daily_collection", { date: 1 }, { unique: true, name: "date_1" });
safeCreate("dev_insight_metrics_org_daily_collection", { collectedAt: 1 }, { expireAfterSeconds: 31536000, name: "collectedAt_ttl" });

// ===================== dev_insight_commit_hourly_counts_collection ========
safeCreate("dev_insight_commit_hourly_counts_collection", { date: 1, dayOfWeek: 1, hourOfDay: 1 }, { unique: true, name: "date_day_hour" });
safeCreate("dev_insight_commit_hourly_counts_collection", { collectedAt: 1 }, { expireAfterSeconds: 7776000, name: "collectedAt_ttl" });

// ===================== dev_insight_collector_state_collection =============
safeCreate("dev_insight_collector_state_collection", { jobName: 1 }, { unique: true, name: "jobName_1" });

// ===================== dev_insight_jenkins_builds_collection ==============
safeCreate("dev_insight_jenkins_builds_collection", { instanceName: 1, jobName: 1, buildNumber: 1 }, { unique: true, name: "instance_job_build" });
safeCreate("dev_insight_jenkins_builds_collection", { instanceName: 1, startedAt: -1 }, { name: "instance_started" });
safeCreate("dev_insight_jenkins_builds_collection", { applicationName: 1, startedAt: -1 }, { name: "app_started" });
safeCreate("dev_insight_jenkins_builds_collection", { jobName: 1, buildNumber: -1 }, { name: "job_buildNum" });
safeCreate("dev_insight_jenkins_builds_collection", { result: 1, startedAt: -1 }, { name: "result_started" });

print("\nDone. Re-run the snapshot block above to confirm.");
```

#### Notes for an enterprise box

* **Mongo 8.0 uses hybrid index builds**: reads and writes continue during the build. You don't need `{ background: true }`; that option has been deprecated and ignored since MongoDB 4.2.
* **Memory pressure**: each large index build uses up to `maxIndexBuildMemoryUsageMegabytes` (default 200 MB) before spilling to disk. On a 10 M-doc collection a build can take tens of minutes. Schedule the bigger ones (`commits`, `pipelines`, `events`) outside peak collector cycles.
* **Watch progress** in another `mongosh` window: `db.currentOp({ "command.createIndexes": { $exists: true } })`.
* **One at a time per collection** — don't run two `createIndex` calls on the same collection in parallel; the second one queues anyway and the script is fast enough sequentially.
* **If a unique build fails with `DuplicateKey`** the partial index is automatically rolled back. Fix the dupes (see pre-flight) and re-run the script — the script's named-index match will skip everything that succeeded.
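To make the progress check above easier to read, the `currentOp` result can be reduced to one line per running build. A minimal sketch: `summarizeIndexBuilds` is a hypothetical helper, and it assumes the usual `currentOp` fields (`inprog`, `opid`, `ns`, `secs_running`, `msg`, `progress.done` / `progress.total`). Not every operation reports `progress`, so the percentage falls back to `n/a`.

```js
// Hypothetical helper: turn a db.currentOp() result into one summary line
// per in-progress operation.
function summarizeIndexBuilds(currentOpResult) {
  return (currentOpResult.inprog || []).map(op => {
    const p = op.progress;
    const pct = p && p.total > 0 ? Math.floor((100 * p.done) / p.total) + "%" : "n/a";
    return [op.opid, op.ns, (op.secs_running || 0) + "s", pct, op.msg || ""].join("  ");
  });
}

// Runs only inside mongosh, where `db` is defined.
if (typeof db !== "undefined") {
  const ops = db.currentOp({ "command.createIndexes": { $exists: true } });
  summarizeIndexBuilds(ops).forEach(line => print(line));
}
```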

### Drop all DevInsight collections (destructive — full reset)

> **This deletes every DevInsight document AND all indexes on those collections.** Use only when you want to start completely from scratch (e.g., re-onboarding, schema migration, or after corrupted state). On a fresh, empty collection Spring's `auto-index-creation: true` will recreate the declared indexes on next collector boot — but you'll still want to re-run the manual creation script above for the recommended additions.

```js
use('dev-sec-ops-db');

const COLLECTIONS = [
  "dev_insight_projects_collection",
  "dev_insight_commits_collection",
  "dev_insight_merge_requests_collection",
  "dev_insight_branches_collection",
  "dev_insight_contributors_collection",
  "dev_insight_events_collection",
  "dev_insight_pipelines_collection",
  "dev_insight_compliance_collection",
  "dev_insight_groups_collection",
  "dev_insight_releases_collection",
  "dev_insight_metrics_daily_collection",
  "dev_insight_metrics_org_daily_collection",
  "dev_insight_commit_hourly_counts_collection",
  "dev_insight_collector_state_collection",
  "dev_insight_jenkins_builds_collection"
];

COLLECTIONS.forEach(c => {
  const exists = db.getCollectionNames().includes(c);
  if (!exists) { print("· skip   ", c, "(does not exist)"); return; }
  const before = db.getCollection(c).estimatedDocumentCount();
  try {
    db.getCollection(c).drop();
    print("✓ dropped", c.padEnd(48), "(" + before + " docs)");
  } catch (e) {
    print("✗ FAILED ", c.padEnd(48), e.codeName, e.errmsg);
  }
});

print("\nDone. All DevInsight collections + indexes removed. Next collector run will recreate them empty.");
```

**Notes:**

* Dropping `dev_insight_collector_state_collection` resets every job's `lastRunAt` / `lastFullSyncAt`, so the next cycle is treated as the **first run** and pulls the full history window configured in `gitlab.daysToCollect` (for example, 90 days). Leave it out of the list if you only want to reset data, not checkpoints.
* After dropping, the `dev_insight_pipelines_collection` enrichment fields (jobs, failure reasons, security data) are gone too. Stage 2 (`/pipelines/refresh?force=true`) will need to re-fetch everything from GitLab.
* Drop is **instant** in Mongo — it removes the collection metadata; storage reclaim happens in the background. You don't need a maintenance window.
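The "reset data, keep checkpoints" variant from the first note can be sketched as a filter plus a dry run. `collectionsToDrop` is a hypothetical helper, and selecting collections by the `dev_insight_` prefix is an assumption; the explicit `COLLECTIONS` list from the script works just as well as input.

```js
const STATE_COLLECTION = "dev_insight_collector_state_collection";

// Hypothetical helper: decide which collections a reset should drop.
// keepCheckpoints=true leaves the collector state (lastRunAt / lastFullSyncAt) in place.
function collectionsToDrop(all, keepCheckpoints) {
  return keepCheckpoints ? all.filter(name => name !== STATE_COLLECTION) : all.slice();
}

// Runs only inside mongosh, where `db` is defined. Dry run: nothing is dropped.
if (typeof db !== "undefined") {
  const devDb = db.getSiblingDB("dev-sec-ops-db");
  const existing = devDb.getCollectionNames().filter(n => n.startsWith("dev_insight_"));
  collectionsToDrop(existing, true).forEach(n => print("would drop", n));
}
```

Once the printed list looks right, replace the `print` with `devDb.getCollection(n).drop()`.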

### Verifying / dropping (single index)

```js
// Show all indexes on a collection
db.dev_insight_projects_collection.getIndexes()

// Drop one index by name
db.dev_insight_projects_collection.dropIndex("visibility_1")

// Quick sanity check that hot queries are using indexes
db.dev_insight_commits_collection.find({ projectId: 12345 })
  .sort({ committedAt: -1 }).limit(1).explain("executionStats")
// look for "stage": "IXSCAN" — not "COLLSCAN"
```
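Scanning a long `explain` output for `COLLSCAN` by eye is error-prone, so the check can be scripted. A minimal sketch, assuming an unsharded deployment: `planStages` and `hasCollScan` are hypothetical helpers that walk `queryPlanner.winningPlan` through its `inputStage` / `inputStages` children. Some server versions wrap the plan in a `queryPlan` field, which the walker unwraps.

```js
// Hypothetical helper: collect every stage name in a winning plan tree.
function planStages(plan) {
  if (!plan) return [];
  const node = plan.queryPlan || plan;
  const children = (node.inputStage ? [node.inputStage] : []).concat(node.inputStages || []);
  return [node.stage].concat(...children.map(planStages));
}

// Hypothetical helper: true when the winning plan contains a full collection scan.
function hasCollScan(explainResult) {
  return planStages(explainResult.queryPlanner.winningPlan).includes("COLLSCAN");
}

// Runs only inside mongosh, where `db` is defined.
if (typeof db !== "undefined") {
  const res = db.getSiblingDB("dev-sec-ops-db").dev_insight_commits_collection
    .find({ projectId: 12345 }).sort({ committedAt: -1 }).limit(1).explain("executionStats");
  print(hasCollScan(res) ? "COLLSCAN found: index missing or unused" : "ok: no collection scan");
}
```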

### When to re-check

* Any new repository method added with `findBy…` / `existsBy…` / `countBy…` — confirm the queried fields are indexed.
* Any aggregation pipeline that does `$match` on a non-indexed field over a collection > 100k docs.
* After a bulk import — Mongo doesn't auto-add indexes for fields you only ever group/sort by.
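For the first case above, a quick pre-check is to compare the fields a new repository method filters on against the leading keys of the existing indexes. A minimal sketch: `hasLeadingIndex` is a hypothetical helper that takes a `getIndexes()` result and only covers equality lookups; range and sort clauses still deserve an `explain`.

```js
// Hypothetical helper: true when some index's leading keys are exactly the
// queried fields (in any order), which is what an equality lookup needs.
function hasLeadingIndex(indexes, queriedFields) {
  const wanted = new Set(queriedFields);
  return indexes.some(ix => {
    const leading = Object.keys(ix.key).slice(0, wanted.size);
    return leading.length === wanted.size && leading.every(f => wanted.has(f));
  });
}

// Runs only inside mongosh, where `db` is defined.
if (typeof db !== "undefined") {
  const indexes = db.getSiblingDB("dev-sec-ops-db").dev_insight_merge_requests_collection.getIndexes();
  // Example: fields a method filtering on projectId and state would query.
  print("projectId + state indexed:", hasLeadingIndex(indexes, ["projectId", "state"]));
}
```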

***

## Appendix — Collections written

After a successful run you'll see these in Mongo (`dev-sec-ops-db`):

| Collection                                   | Written by                                                   | Key document shape                         |
| -------------------------------------------- | ------------------------------------------------------------ | ------------------------------------------ |
| `dev_insight_projects_collection`            | gitlab-collector                                             | Project metadata, bus factor, health score |
| `dev_insight_commits_collection`             | gitlab-collector                                             | Commit history                             |
| `dev_insight_merge_requests_collection`      | gitlab-collector                                             | MR lifecycle                               |
| `dev_insight_branches_collection`            | gitlab-collector                                             | Branch list                                |
| `dev_insight_contributors_collection`        | gitlab-collector                                             | Contributor rollup                         |
| `dev_insight_events_collection`              | gitlab-collector                                             | Project activity events                    |
| `dev_insight_pipelines_collection`           | gitlab-collector (list) + ci-pipeline-collector (enrichment) | Pipeline runs with 70+ fields              |
| `dev_insight_compliance_collection`          | gitlab-collector                                             | Compliance framework data                  |
| `dev_insight_groups_collection`              | gitlab-collector                                             | Group hierarchy                            |
| `dev_insight_releases_collection`            | gitlab-collector                                             | Release / tag history                      |
| `dev_insight_metrics_daily_collection`       | gitlab-collector                                             | Project-level daily metrics                |
| `dev_insight_metrics_org_daily_collection`   | gitlab-collector                                             | Org-level daily metrics                    |
| `dev_insight_commit_hourly_counts_collection` | gitlab-collector                                            | Hourly commit heatmap                      |
| `dev_insight_collector_state_collection`     | gitlab-collector                                             | Job run state / checkpoints                |
| `dev_insight_jenkins_builds_collection`      | ci-pipeline-collector                                        | Jenkins build records                      |

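[Config help](/user-docs/9-setup-instructions/jira.md) [Debug](https://github.com/sec0ne/user-docs/blob/main/docs/9-setup-instructions/debug.md)

Retention is not uniform. Only the collections that have a `collectedAt_ttl` index in the index creation script auto-purge old records: after 30 days for events, 90 days for commit hourly counts, and 365 days for releases and both daily metrics collections.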
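To confirm which retention is actually in effect, the TTL can be read back from the indexes themselves. A minimal sketch: `ttlIndexes` is a hypothetical helper that filters a `getIndexes()` result on `expireAfterSeconds` and converts the value to days; selecting collections by the `dev_insight_` prefix is an assumption.

```js
// Hypothetical helper: list TTL indexes from a getIndexes() result, with the
// retention converted from seconds to days.
function ttlIndexes(indexes) {
  return indexes
    .filter(ix => typeof ix.expireAfterSeconds === "number")
    .map(ix => ({
      name: ix.name,
      field: Object.keys(ix.key)[0],
      days: ix.expireAfterSeconds / 86400
    }));
}

// Runs only inside mongosh, where `db` is defined.
if (typeof db !== "undefined") {
  const devDb = db.getSiblingDB("dev-sec-ops-db");
  devDb.getCollectionNames().filter(n => n.startsWith("dev_insight_")).forEach(n => {
    ttlIndexes(devDb.getCollection(n).getIndexes())
      .forEach(t => print(n, t.name, t.field, t.days + " days"));
  });
}
```

A collection that prints nothing has no TTL index, so its records are never purged automatically.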

