# debug

### Reading Logs from Podman

The two collectors run as separate containers:

| Container               | Service                                                       |
| ----------------------- | ------------------------------------------------------------- |
| `gitlab-collector`      | GitLab DevInsight collection (`GitLabCollectorService`)       |
| `ci-pipeline-collector` | Pipeline detail enrichment (`GitLabPipelineCollectorService`) |

Verify they're running:

```bash
podman ps --format "{{.Names}}\t{{.Status}}" | grep -E "gitlab-collector|ci-pipeline-collector"
```

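> **macOS note:** these commands use `mktime`, `gensub`, and the three-argument form of `match`, all of which are GNU awk extensions that the stock macOS `awk` lacks. Install GNU awk with `brew install gawk` and replace `awk` with `gawk`.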

***

#### A. GitLab Full Collection Cycle — Wall-Clock per Run

```bash
podman logs gitlab-collector 2>&1 | awk '
  /Starting GitLab DevInsight data collection cycle/ {
    start_ts = $1 " " $2
    start_epoch = mktime(gensub(/[-:]/," ","g",start_ts))
  }
  /Completed GitLab DevInsight data collection cycle/ && start_epoch {
    end_ts = $1 " " $2
    end_epoch = mktime(gensub(/[-:]/," ","g",end_ts))
    printf "%s -> %s   %d sec\n", start_ts, end_ts, end_epoch - start_epoch
    start_epoch = 0
  }'
```
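
Where GNU awk is not available, the same start/complete pairing can be sketched in Python. This is a minimal sketch, not part of the collector tooling: `parse_ts` and `cycle_durations` are hypothetical helper names, the sample lines are illustrative rather than real output, and it assumes every log line begins with the app's `YYYY-MM-DD HH:MM:SS` prefix, as the awk script does.

```python
from datetime import datetime

START = "Starting GitLab DevInsight data collection cycle"
END = "Completed GitLab DevInsight data collection cycle"

def parse_ts(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS' prefix (assumed log format)."""
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")

def cycle_durations(lines):
    """Return (start, end, seconds) for each Starting/Completed pair."""
    results, start = [], None
    for line in lines:
        if START in line:
            start = parse_ts(line)
        elif END in line and start is not None:
            end = parse_ts(line)
            results.append((start, end, int((end - start).total_seconds())))
            start = None  # wait for the next "Starting" line
    return results

# Illustrative lines only, not real collector output.
sample = [
    "2026-04-24 09:00:00 INFO Starting GitLab DevInsight data collection cycle",
    "2026-04-24 09:12:30 INFO Completed GitLab DevInsight data collection cycle",
]
for start, end, secs in cycle_durations(sample):
    print(f"{start} -> {end}   {secs} sec")
```

To run it on live logs, pass `sys.stdin` as `lines` and pipe `podman logs gitlab-collector 2>&1` into the script.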

#### B. GitLab Per-Job Aggregated Stats (runs / avg / min / max in ms)

```bash
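podman logs gitlab-collector 2>&1 | awk '
/GitLabCollectorService - Job \[/ && match($0, /Job \[([^]]+)\].*in ([0-9]+)ms/, m) {
    # Take the job name from the "Job [...]" bracket, not the first bracket on the line
    job = m[1]; ms = m[2] + 0
    n[job]++; sum[job] += ms
    if (n[job] == 1 || ms < min[job]) min[job] = ms
    if (ms > max[job]) max[job] = ms
}
END {
    printf "%-25s %6s %10s %10s %10s\n","job","runs","avg_ms","min_ms","max_ms"
    PROCINFO["sorted_in"] = "@ind_str_asc"   # gawk: iterate jobs by name so the header stays first
    for (j in n) printf "%-25s %6d %10d %10d %10d\n", j, n[j], sum[j]/n[j], min[j], max[j]
}'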
```
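
The per-job aggregation can also be sketched in Python. This is a sketch under stated assumptions: `job_stats` is a hypothetical helper, the line shape (`... GitLabCollectorService - Job [<name>] ... in <N>ms`) is inferred from the awk pattern above, and the sample lines are illustrative.

```python
import re
from collections import defaultdict

# Assumed line shape, inferred from the awk pattern:
#   "... GitLabCollectorService - Job [<name>] ... in <N>ms"
JOB_RE = re.compile(r"GitLabCollectorService - Job \[([^\]]+)\].*in (\d+)ms")

def job_stats(lines):
    """Return {job: {runs, avg_ms, min_ms, max_ms}} for matching log lines."""
    durations = defaultdict(list)
    for line in lines:
        m = JOB_RE.search(line)
        if m:
            durations[m.group(1)].append(int(m.group(2)))
    return {
        job: {"runs": len(ms), "avg_ms": sum(ms) // len(ms),
              "min_ms": min(ms), "max_ms": max(ms)}
        for job, ms in sorted(durations.items())
    }

# Illustrative lines only, not real collector output.
sample = [
    "2026-04-24 09:00:05 INFO GitLabCollectorService - Job [pipelines_collector] completed successfully in 1200ms",
    "2026-04-24 10:00:05 INFO GitLabCollectorService - Job [pipelines_collector] completed successfully in 800ms",
]
for job, stats in job_stats(sample).items():
    print(job, stats)
```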

#### C. Pipeline Collector — Per-Cycle Wall-Clock

```bash
podman logs ci-pipeline-collector 2>&1 | awk '
  /GitLabPipelineCollectorService - Refreshing ALL pipeline details/ {
    s_ts = $1 " " $2
    s = mktime(gensub(/[-:]/," ","g",s_ts))
  }
  /GitLabPipelineCollectorService - Stage 2: enriched/ && s {
    e_ts = $1 " " $2
    e = mktime(gensub(/[-:]/," ","g",e_ts))
    match($0, /enriched ([0-9]+) pipelines/, m)
    printf "%s -> %s   %4d sec   (%s pipelines)\n", s_ts, e_ts, e-s, m[1]
    s = 0
  }'
```

#### D. Pipeline Collector — Summary Stats

```bash
podman logs ci-pipeline-collector 2>&1 | awk '
  /GitLabPipelineCollectorService - Refreshing ALL pipeline details/ {
    s = mktime(gensub(/[-:]/," ","g",$1" "$2))
  }
  /GitLabPipelineCollectorService - Stage 2: enriched/ && s {
    e = mktime(gensub(/[-:]/," ","g",$1" "$2))
    d = e - s; n++; sum += d
    if (n==1 || d<min) min=d
    if (d>max) max=d
    s = 0
  }
  END {
    if (n) printf "pipeline cycles: %d   avg=%.1fs   min=%ds   max=%ds   total=%ds\n",
                  n, sum/n, min, max, sum
  }'
```

***

### Capturing Errors & Warnings

Run these against whichever container you're inspecting. Replace `<container>` with `gitlab-collector` or `ci-pipeline-collector`.

#### E. All ERROR / WARN lines

```bash
# GitLab collector
podman logs gitlab-collector 2>&1 | grep -E " (ERROR|WARN) "

# Pipeline collector
podman logs ci-pipeline-collector 2>&1 | grep -E " (ERROR|WARN) "
```

#### F. Errors with stack-trace context (15 lines after each ERROR)

```bash
podman logs <container> 2>&1 | grep -A 15 " ERROR "
```

#### G. Distinct errors, deduplicated and counted

```bash
podman logs <container> 2>&1 \
  | grep -E " ERROR " \
  | sed -E 's/^[0-9-]+ [0-9:]+ \[[^]]+\] ERROR //' \
  | sort | uniq -c | sort -rn
```

#### H. HTTP failure status codes from the GitLab API

```bash
podman logs <container> 2>&1 \
  | grep -oE "[0-9]{3} (Not Found|Unauthorized|Forbidden|Bad Request|Too Many Requests|Internal Server Error|Bad Gateway|Service Unavailable|Gateway Timeout)" \
  | sort | uniq -c | sort -rn
```

#### I. Per-project soft failures (404, skipped resources)

```bash
podman logs <container> 2>&1 \
  | grep -oE "Could not fetch [a-z_ ]+ for project [0-9]+: [0-9]+ [A-Za-z ]+" \
  | sort | uniq -c | sort -rn
```

#### J. Rate-limiting, timeouts, connection issues

```bash
podman logs <container> 2>&1 \
  | grep -iE "rate.?limit|429|timeout|timed out|connection reset|connection refused|socket" \
  | grep -vE " DEBUG "
```

#### K. Cycle health summary — errors per GitLab cycle

```bash
podman logs gitlab-collector 2>&1 | awk '
  /Starting GitLab DevInsight data collection cycle/ {
    in_cycle = 1; start = $1 " " $2; errs = 0
  }
  in_cycle && / ERROR / { errs++ }
  /Completed GitLab DevInsight data collection cycle/ && in_cycle {
    printf "%s -> %s   errors=%d\n", start, $1 " " $2, errs
    in_cycle = 0
  }'
```

***

### Tips for Podman Log Capture

* **Limit time range:** `podman logs --since 1h gitlab-collector 2>&1 | …`, or since a timestamp: `--since 2026-04-24T09:00:00`.
* **Tail recent only:** `podman logs --tail 5000 gitlab-collector 2>&1 | …` for quick spot checks.
* **Live monitor errors:** `podman logs -f gitlab-collector 2>&1 | grep --line-buffered -E " (ERROR|WARN) "`.
* **Save a snapshot for offline analysis:**

  ```bash
  podman logs gitlab-collector       > gitlab-collector.log 2>&1
  podman logs ci-pipeline-collector  > ci-pipeline-collector.log 2>&1
  ```

  Then run the same awk/grep commands against the files (drop the `podman logs … |` prefix and pass the file as the last argument).
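* **Why `2>&1`:** a pipe carries only stdout. Spring Boot's default console appender writes to stdout, but a custom logging configuration or JVM-level error output can put lines on stderr; `2>&1` merges both streams so `grep`/`awk` see every line either way.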
* **Podman timestamps vs app timestamps:** the awk scripts above parse the app's own `YYYY-MM-DD HH:MM:SS` prefix from `$1 $2`. Avoid `--timestamps` (it adds an RFC3339 prefix that shifts the columns and breaks the scripts).

## Pipeline Categorization Rules

Runtime-mutable rules stored in MongoDB (`pipeline_category_rules` collection). They're evaluated by the collector during pipeline enrichment and by the `/pipelines/recategorize` endpoint when re-stamping stored pipelines.

* **Editable at runtime** via `POST /api/v1/collector/gitlab/classification/rules` — no container rebuild, no redeploy.
* **Applied to stored data** via `POST /api/v1/collector/gitlab/pipelines/recategorize?daysBack=N` — pure Mongo, no GitLab calls.
* **Evaluated in priority order** (ascending). First matching rule sets `pipelineCategory`. **Every** matching rule contributes labels.
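
The evaluation order in the last bullet can be sketched in Python as follows. This illustrates the described semantics and is not the collector's code: `classify` is a hypothetical helper, rule-level matching is assumed to ignore case, only rule names are added as labels (sub-rule labels are left out), and the `"none"` case for projects without a `.gitlab-ci.yml` is omitted.

```python
import re

def classify(namespace, include_project, rules):
    """Return (category, labels): first match sets the category, all matches add labels."""
    category, labels = None, []
    for rule in sorted(rules, key=lambda r: r["priority"]):  # ascending priority
        if not rule.get("enabled", True):
            continue
        if rule["matchType"] == "namespace":
            # Substring match on the namespace path (case-insensitivity is assumed)
            hit = rule["namespacePattern"].lower() in (namespace or "").lower()
        else:  # "include"
            hit = bool(include_project) and bool(
                re.search(rule["includeProjectPattern"], include_project, re.IGNORECASE))
        if hit:
            if category is None:
                category = rule["name"]   # first matching rule wins the category
            labels.append(rule["name"])   # every matching rule contributes a label
    return category or "custom", labels

rules = [
    {"name": "NCD", "matchType": "include",
     "includeProjectPattern": ".*ncd.*pipeline.*", "priority": 40, "enabled": True},
    {"name": "NPC2", "matchType": "namespace",
     "namespacePattern": "eis-terraform-npc-provisioning", "priority": 10, "enabled": True},
]
print(classify("eis-terraform-npc-provisioning-prod/foo",
               "gts-cta-strategy-innersource/ncd/pipeline-templates", rules))
```

In this example both rules match, so the category comes from NPC2 (priority 10) while both rule names appear in the labels.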

***

### Quick Reference

| Priority | Name         | Match type | Matched against                                                | Sub-label dimensions |
| -------- | ------------ | ---------- | -------------------------------------------------------------- | -------------------- |
| 10       | **NPC2**     | namespace  | `eis-terraform-npc-provisioning` (substring on namespace path) | `env`                |
| 20       | **TOM**      | namespace  | `eis-grafana-tom` (substring)                                  | `subteam`            |
| 30       | **Database** | namespace  | `EIS-DBMW-DBENG` (substring)                                   | —                    |
| 40       | **NCD**      | include    | `.gitlab-ci.yml` `include.project` matches `.*ncd.*pipeline.*` | `subtype`            |

> **Fallthrough:** pipelines matching no rule receive `pipelineCategory = "custom"`. Pipelines with no `.gitlab-ci.yml` and no namespace match receive `pipelineCategory = "none"`.

***

### Rule 1 — NPC2 (priority 10)

Namespace-based. Categorizes any project whose namespace path contains `eis-terraform-npc-provisioning`. Works for both lab (`gitlab.com`) and production (`gitlab.nomura.com`) URLs.

#### JSON

```json
{
  "name": "NPC2",
  "matchType": "namespace",
  "namespacePattern": "eis-terraform-npc-provisioning",
  "priority": 10,
  "enabled": true,
  "description": "NPC2 — env from namespace suffix on first path segment",
  "subRules": [
    { "field": "namespace", "pattern": "-(nonprodtest|nonprod)(/|$)", "label": "nonprod", "key": "env", "enabled": true },
    { "field": "namespace", "pattern": "-(prodtest|prod)(/|$)",       "label": "prod",    "key": "env", "enabled": true },
    { "field": "namespace", "pattern": "-qa(/|$)",                    "label": "qa",      "key": "env", "enabled": true }
  ]
}
```

#### Sub-label table

| Namespace suffix | `env` label |
| ---------------- | ----------- |
| `-prodtest`      | `prod`      |
| `-prod`          | `prod`      |
| `-nonprodtest`   | `nonprod`   |
| `-nonprod`       | `nonprod`   |
| `-qa`            | `qa`        |

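> Sub-rule order: `nonprod` is listed **before** `prod` as a precaution. With the patterns as written, the leading `-` already keeps `-(prodtest|prod)` from matching inside `-nonprod` or `-nonprodtest`; the ordering only starts to matter if the patterns are loosened (for example, if the leading `-` is dropped).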

#### Example outputs

| URL                                                            | `pipelineCategory` | `pipelineLabels`  | `pipelineLabelMap`               |
| -------------------------------------------------------------- | ------------------ | ----------------- | -------------------------------- |
| `gitlab.com/eis-terraform-npc-provisioning-prodtest/foo`       | NPC2               | `[NPC2, prod]`    | `{category: NPC2, env: prod}`    |
| `gitlab.nomura.com/eis-terraform-npc-provisioning-nonprod/bar` | NPC2               | `[NPC2, nonprod]` | `{category: NPC2, env: nonprod}` |
| `gitlab.nomura.com/eis-terraform-npc-provisioning-qa/x`        | NPC2               | `[NPC2, qa]`      | `{category: NPC2, env: qa}`      |
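
The NPC2 sub-rule patterns can be checked against the example namespaces with a few lines of Python. This is a quick sketch: `env_labels` is a hypothetical helper, and Python's `re.search` with `re.IGNORECASE` stands in for the case-insensitive substring matching the collector is described as using.

```python
import re

# Patterns copied from the NPC2 rule JSON, in array order.
SUB_RULES = [
    (r"-(nonprodtest|nonprod)(/|$)", "nonprod"),
    (r"-(prodtest|prod)(/|$)", "prod"),
    (r"-qa(/|$)", "qa"),
]

def env_labels(namespace):
    """Return the labels of every sub-rule whose pattern is found in the namespace."""
    return [label for pattern, label in SUB_RULES
            if re.search(pattern, namespace, re.IGNORECASE)]

for ns in ("eis-terraform-npc-provisioning-prodtest/foo",
           "eis-terraform-npc-provisioning-nonprod/bar",
           "eis-terraform-npc-provisioning-qa/x"):
    print(ns, env_labels(ns))
```

Each example namespace yields exactly one `env` label, matching the table above.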

***

### Rule 2 — TOM (priority 20)

Namespace-based. Matches `eis-grafana-tom` and distinguishes the `rules` vs `eng` sub-teams.

#### JSON

```json
{
  "name": "TOM",
  "matchType": "namespace",
  "namespacePattern": "eis-grafana-tom",
  "priority": 20,
  "enabled": true,
  "description": "TOM grafana/observability — sub-team from first-segment suffix (rules / eng)",
  "subRules": [
    { "field": "namespace", "pattern": "-rules(/|$)", "label": "rules", "key": "subteam", "enabled": true },
    { "field": "namespace", "pattern": "-eng(/|$)",   "label": "eng",   "key": "subteam", "enabled": true }
  ]
}
```

#### Sub-label table

| Namespace suffix | `subteam` label |
| ---------------- | --------------- |
| `-rules`         | `rules`         |
| `-eng`           | `eng`           |

#### Example outputs

| URL                                                               | `pipelineCategory` | `pipelineLabels` | `pipelineLabelMap`                |
| ----------------------------------------------------------------- | ------------------ | ---------------- | --------------------------------- |
| `gitlab.nomura.com/eis-grafana-tom-rules/rules-repo`              | TOM                | `[TOM, rules]`   | `{category: TOM, subteam: rules}` |
| `gitlab.nomura.com/eis-grafana-tom-eng/grafana-ui-automation-job` | TOM                | `[TOM, eng]`     | `{category: TOM, subteam: eng}`   |
| `gitlab.nomura.com/eis-grafana-tom/whatever`                      | TOM                | `[TOM]`          | `{category: TOM}`                 |

***

### Rule 3 — Database (priority 30)

Namespace-based. Catches the DBMW database engineering group. No sub-rules.

#### JSON

```json
{
  "name": "Database",
  "matchType": "namespace",
  "namespacePattern": "EIS-DBMW-DBENG",
  "priority": 30,
  "enabled": true,
  "description": "DBMW database engineering pipelines"
}
```

#### Example outputs

| URL                                                  | `pipelineCategory` | `pipelineLabels` | `pipelineLabelMap`     |
| ---------------------------------------------------- | ------------------ | ---------------- | ---------------------- |
| `gitlab.nomura.com/EIS-DBMW-DBENG/dbmw-housekeeping` | Database           | `[Database]`     | `{category: Database}` |

***

### Rule 4 — NCD (priority 40)

Include-based. Matches any `.gitlab-ci.yml` whose `include.project` value matches the regex `.*ncd.*pipeline.*` (e.g. `gts-cta-strategy-innersource/ncd/pipeline-templates`). Sub-rules examine `include.file` to distinguish Helm CI, Application CI, and Dependency CI.

#### JSON

```json
{
  "name": "NCD",
  "matchType": "include",
  "includeProjectPattern": ".*ncd.*pipeline.*",
  "priority": 40,
  "enabled": true,
  "description": "NCD template family — Helm sub-rule listed first so it outranks Application",
  "subRules": [
    { "field": "templateFile", "pattern": "NCD-Build\\.helm\\.local\\.gitlab-ci\\.yml", "label": "Helm CI",        "key": "subtype", "enabled": true },
    { "field": "templateFile", "pattern": "NCD-Dependency\\.local\\.gitlab-ci\\.yml",   "label": "Dependency CI",  "key": "subtype", "enabled": true },
    { "field": "templateFile", "pattern": "NCD-Build\\.local\\.gitlab-ci\\.yml",        "label": "Application CI", "key": "subtype", "enabled": true }
  ]
}
```

#### Sub-label table

| `include.file`                       | `subtype` label  |
| ------------------------------------ | ---------------- |
| `NCD-Build.helm.local.gitlab-ci.yml` | `Helm CI`        |
| `NCD-Dependency.local.gitlab-ci.yml` | `Dependency CI`  |
| `NCD-Build.local.gitlab-ci.yml`      | `Application CI` |

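> Sub-rule order: `Helm CI` is checked **before** `Application CI` as a precaution. As written, the Application pattern requires `.local` immediately after `NCD-Build`, so it does not match `NCD-Build.helm.local.gitlab-ci.yml`; the ordering only matters if the patterns are loosened.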

#### Example outputs

| `include.project`                                      | `include.file`                       | `pipelineCategory` | `pipelineLabels`        | `pipelineLabelMap`                         |
| ------------------------------------------------------ | ------------------------------------ | ------------------ | ----------------------- | ------------------------------------------ |
| `gts-cta-strategy-innersource/ncd/pipeline-templates`  | `NCD-Build.helm.local.gitlab-ci.yml` | NCD                | `[NCD, Helm CI]`        | `{category: NCD, subtype: Helm CI}`        |
| `gts-cta-strategy-innersource/ncd/pipeline-templates`  | `NCD-Build.local.gitlab-ci.yml`      | NCD                | `[NCD, Application CI]` | `{category: NCD, subtype: Application CI}` |
| `gts-cta-strategy-innersource/ncd/pipeline-dependency` | `NCD-Dependency.local.gitlab-ci.yml` | NCD                | `[NCD, Dependency CI]`  | `{category: NCD, subtype: Dependency CI}`  |
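
The doubled backslashes in the rule JSON are JSON escaping: after parsing, `\\.` becomes the regex `\.` (a literal dot). A short Python sketch shows the decoded patterns applied to the three template files; `subtype_labels` is a hypothetical helper, and `re.search` with `re.IGNORECASE` is assumed to approximate the collector's matching.

```python
import json
import re

# Sub-rules copied from the NCD rule; the raw string keeps the JSON escapes intact.
rule_json = r'''
{"subRules": [
  {"pattern": "NCD-Build\\.helm\\.local\\.gitlab-ci\\.yml", "label": "Helm CI"},
  {"pattern": "NCD-Dependency\\.local\\.gitlab-ci\\.yml",   "label": "Dependency CI"},
  {"pattern": "NCD-Build\\.local\\.gitlab-ci\\.yml",        "label": "Application CI"}
]}
'''
sub_rules = json.loads(rule_json)["subRules"]

def subtype_labels(include_file):
    """Return the labels of every sub-rule whose pattern is found in include_file."""
    return [s["label"] for s in sub_rules
            if re.search(s["pattern"], include_file, re.IGNORECASE)]

print(sub_rules[0]["pattern"])  # decoded regex, single backslashes
for f in ("NCD-Build.helm.local.gitlab-ci.yml",
          "NCD-Build.local.gitlab-ci.yml",
          "NCD-Dependency.local.gitlab-ci.yml"):
    print(f, subtype_labels(f))
```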

***

### Installation — POST all rules in one go

```bash
COLLECTOR=http://localhost:8088

# NPC2
curl -sX POST "$COLLECTOR/api/v1/collector/gitlab/classification/rules" \
  -H 'Content-Type: application/json' -d '{
  "name": "NPC2",
  "matchType": "namespace",
  "namespacePattern": "eis-terraform-npc-provisioning",
  "priority": 10,
  "enabled": true,
  "description": "NPC2 — env from namespace suffix on first path segment",
  "subRules": [
    { "field": "namespace", "pattern": "-(nonprodtest|nonprod)(/|$)", "label": "nonprod", "key": "env", "enabled": true },
    { "field": "namespace", "pattern": "-(prodtest|prod)(/|$)",       "label": "prod",    "key": "env", "enabled": true },
    { "field": "namespace", "pattern": "-qa(/|$)",                    "label": "qa",      "key": "env", "enabled": true }
  ]
}'

# TOM
curl -sX POST "$COLLECTOR/api/v1/collector/gitlab/classification/rules" \
  -H 'Content-Type: application/json' -d '{
  "name": "TOM",
  "matchType": "namespace",
  "namespacePattern": "eis-grafana-tom",
  "priority": 20,
  "enabled": true,
  "description": "TOM grafana/observability — sub-team from first-segment suffix (rules / eng)",
  "subRules": [
    { "field": "namespace", "pattern": "-rules(/|$)", "label": "rules", "key": "subteam", "enabled": true },
    { "field": "namespace", "pattern": "-eng(/|$)",   "label": "eng",   "key": "subteam", "enabled": true }
  ]
}'

# Database
curl -sX POST "$COLLECTOR/api/v1/collector/gitlab/classification/rules" \
  -H 'Content-Type: application/json' -d '{
  "name": "Database",
  "matchType": "namespace",
  "namespacePattern": "EIS-DBMW-DBENG",
  "priority": 30,
  "enabled": true,
  "description": "DBMW database engineering pipelines"
}'

# NCD
curl -sX POST "$COLLECTOR/api/v1/collector/gitlab/classification/rules" \
  -H 'Content-Type: application/json' -d '{
  "name": "NCD",
  "matchType": "include",
  "includeProjectPattern": ".*ncd.*pipeline.*",
  "priority": 40,
  "enabled": true,
  "description": "NCD template family — Helm sub-rule listed first so it outranks Application",
  "subRules": [
    { "field": "templateFile", "pattern": "NCD-Build\\.helm\\.local\\.gitlab-ci\\.yml", "label": "Helm CI",        "key": "subtype", "enabled": true },
    { "field": "templateFile", "pattern": "NCD-Dependency\\.local\\.gitlab-ci\\.yml",   "label": "Dependency CI",  "key": "subtype", "enabled": true },
    { "field": "templateFile", "pattern": "NCD-Build\\.local\\.gitlab-ci\\.yml",        "label": "Application CI", "key": "subtype", "enabled": true }
  ]
}'
```

#### Verify

```bash
curl -s "$COLLECTOR/api/v1/collector/gitlab/classification/rules" | python3 -c "
import json, sys
d = json.load(sys.stdin)['data']['activeRules']
print(f\"source={d['source']}  count={d['count']}\")
for r in d['rules']:
    print(f\"  {r['priority']:>3}  {r['name']:<10}  matchType={r['matchType']}  subRules={len(r.get('subRules') or [])}\")
"
```

Expected:

```
source=mongo  count=4
   10  NPC2        matchType=namespace  subRules=3
   20  TOM         matchType=namespace  subRules=2
   30  Database    matchType=namespace  subRules=0
   40  NCD         matchType=include    subRules=3
```

#### Apply to stored data

```bash
curl -sX POST "$COLLECTOR/api/v1/collector/gitlab/pipelines/recategorize?daysBack=90"
```

Runs in seconds — pure Mongo + config, no GitLab calls.

***

### Sub-rule field reference

The `field` attribute on a sub-rule decides which raw fact the pattern is matched against:

| `field` value     | Source fact                             | Notes                                                |
| ----------------- | --------------------------------------- | ---------------------------------------------------- |
| `templateProject` | `include.project` from `.gitlab-ci.yml` | The template repo path                               |
| `templateRef`     | `include.ref`                           | The branch / tag of the included template            |
| `templateFile`    | `include.file`                          | The specific file inside the template repo           |
| `namespace`       | parsed from `repoUrl`                   | E.g. `eis-terraform-npc-provisioning-prod/some-repo` |
| `repoUrl`         | full project URL                        | Useful for rare URL-based regexes                    |

#### Sub-rule semantics

* All sub-rule patterns are **case-insensitive** and use `Matcher.find()` (substring match).
* Sub-rules are evaluated in array order. List the **most specific** patterns first when they can overlap (e.g. Helm before Application; nonprod before prod).
* Each matching sub-rule contributes its `label` to `pipelineLabels`. If `key` is set, it also lands in `pipelineLabelMap[key]` (last writer wins on key collisions).
* Set `enabled: false` to keep a sub-rule on record without applying it.
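
The semantics above can be sketched in Python. This is an illustration, not the collector's implementation: `apply_sub_rules` is a hypothetical helper, `re.search` with `re.IGNORECASE` stands in for a case-insensitive `Matcher.find()`, a missing `enabled` flag is assumed to mean enabled, and the namespace and patterns in the example are invented to force a key collision.

```python
import re

def apply_sub_rules(facts, sub_rules):
    """Evaluate sub-rules in array order; return (labels, label_map)."""
    labels, label_map = [], {}
    for sub in sub_rules:
        if not sub.get("enabled", True):      # disabled sub-rules stay on record only
            continue
        value = facts.get(sub["field"])       # raw fact selected by `field`
        if value and re.search(sub["pattern"], value, re.IGNORECASE):
            labels.append(sub["label"])       # every match contributes a label
            if sub.get("key"):
                label_map[sub["key"]] = sub["label"]  # last writer wins per key
    return labels, label_map

# Hypothetical facts and patterns, chosen so two sub-rules write the same key.
facts = {"namespace": "team-prod-eu/app"}
sub_rules = [
    {"field": "namespace", "pattern": "-prod", "label": "prod", "key": "env"},
    {"field": "namespace", "pattern": "-EU(/|$)", "label": "eu", "key": "env"},
    {"field": "namespace", "pattern": "-prod", "label": "off", "key": "env", "enabled": False},
]
print(apply_sub_rules(facts, sub_rules))
```

Both enabled sub-rules add a label, but `label_map["env"]` holds only the later one, which is worth keeping in mind when two overlapping sub-rules share a `key`.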

***

### Operational notes

* The classifier caches rules in-process for 60 s. Mutating endpoints invalidate the cache automatically.
* **Namespace rules require `pipeline.repoUrl`** to be populated. That field is stamped during enrichment from the project record — which means projects must already exist in `dev_insight_projects_collection` for namespace rules to fire. Run `/gitlab/collect` once to seed projects, then `/pipelines/refresh` for enrichment.
* For include-based rules, only `pipelineTemplateProject` / `pipelineTemplateRef` / `pipelineTemplateFile` are needed — these are captured directly during pipeline enrichment from the parsed `.gitlab-ci.yml`.
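
The 60 s rule cache with invalidation on mutation, mentioned in the first note, follows a common time-to-live pattern. A minimal Python sketch, assuming a hypothetical `RuleCache` class and `loader` callback (the collector's real class and method names are not documented here):

```python
import time

class RuleCache:
    """Cache loaded rules for `ttl` seconds; invalidate() forces a reload."""

    def __init__(self, loader, ttl=60.0, clock=time.monotonic):
        self._loader = loader        # callable that reads rules from storage
        self._ttl = ttl
        self._clock = clock          # injectable so the expiry can be tested
        self._rules = None
        self._loaded_at = None

    def get(self):
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._rules = self._loader()
            self._loaded_at = now
        return self._rules

    def invalidate(self):
        """Call after any rule mutation so the next get() reloads."""
        self._loaded_at = None

cache = RuleCache(loader=lambda: ["NPC2", "TOM", "Database", "NCD"])
print(cache.get())
```

With this shape, a rule edited directly in MongoDB (bypassing the mutating endpoints) would take up to the TTL to become visible.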

## Investigation — Pipeline categorization showing 81% "custom"

**Reporter:** ops, 2026-05-21

**Symptom:** On the Nomura GitLab-Demo dataset (4288 pipelines, 30-day window), the collector and dashboard report:

* NCD = 342 (8%)
* NPC2 = 393 (9%)
* TOM = 80 (2%)
* Database = 1 (0%)
* **custom = 3472 (81%)**
* none = 0

Spot-check: `GM-EUC-QIS-Structuring/renovate-config` shows `.gitlab-ci.yml` with `include.project = gts-cta-strategy-innersource/ncd/pipeline-dependency` and `include.file = NCD-Dependency.local.gitlab-ci.yml`. This **should** classify as **NCD / Dependency CI** per the active rules. Two rows of `renovate-config` in the dashboard are tagged `NCD / Dependency CI` correctly; **three other `renovate-config` rows are tagged `custom`** — same YAML shape, different fork instances.

**Additional context:**

* 30-day enrichment cycle on this dataset took \~11 hours.
* Running `/pipelines/recategorize` after collection only changed a handful of records; most stayed `custom`.

**Conclusion:** there's a real bug. This document narrows down which of three hypotheses is responsible before any code change.

***

### Hypotheses

#### A. Per-project cache contamination across pipelines

`PipelineCategoryClassifier.CACHE` is keyed by `projectId` alone. The cache stores a `Classification` object that carries the **first include's `templateProject` / `templateRef` / `templateFile`**. When a project has multiple pipelines on different refs (e.g. `main`, `release-2.x`, feature branches), they all share the same cached `Classification` from whichever ref was fetched first — even if their actual `.gitlab-ci.yml` could differ on different refs.

If this is the cause:

* A subset of pipelines would have **wrong** template values stamped (matching the first-cached ref, not their own).
* Affected pipelines: those in projects with diverse refs.
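
The suspected mechanism can be reproduced in a few lines of Python. This is a toy model of the hypothesis, not the classifier's code: `stamp_templates`, the pipeline records, and the per-ref YAML contents are all invented for illustration.

```python
def stamp_templates(pipelines, fetch_include, key_fn):
    """Stamp each pipeline with the include fact cached under key_fn(pipeline)."""
    cache, stamped = {}, {}
    for p in pipelines:
        key = key_fn(p)
        if key not in cache:                      # first pipeline seen fills the cache
            cache[key] = fetch_include(p["projectId"], p["ref"])
        stamped[p["pipelineId"]] = cache[key]
    return stamped

# Hypothetical project 7: `main` includes an NCD template, `legacy` includes nothing.
INCLUDE_BY_REF = {"main": "gts-cta-strategy-innersource/ncd/pipeline-dependency",
                  "legacy": None}
pipelines = [
    {"pipelineId": 1, "projectId": 7, "ref": "legacy"},
    {"pipelineId": 2, "projectId": 7, "ref": "main"},
]

def fetch(project_id, ref):
    return INCLUDE_BY_REF[ref]

by_project = stamp_templates(pipelines, fetch, lambda p: p["projectId"])
by_project_ref = stamp_templates(pipelines, fetch, lambda p: (p["projectId"], p["ref"]))
print(by_project)       # pipeline 2 inherits the `legacy` result
print(by_project_ref)   # pipeline 2 gets its own ref's include
```

Keying by `projectId` alone leaves pipeline 2 without its template fact because `legacy` was cached first; keying by `(projectId, ref)` stamps each pipeline from its own ref.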

#### B. Stale enrichment from before classification was wired up

Stage 1 (`pipelines_collector`) creates pipeline docs with **no template fields**. Stage 2 enrichment runs only when explicitly triggered (`/pipelines/refresh`). If `force=false`, terminal pipelines hit the `isFullyEnrichedTerminal` short-circuit and are skipped. Pipelines enriched once before the categorization rules existed (or before the multi-include/namespace fix) therefore keep `pipelineCategory = "custom"` indefinitely.

`/pipelines/recategorize` works off **stored** `pipelineTemplateProject` / `pipelineTemplateRef` / `pipelineTemplateFile`. If those fields are null (because enrichment never captured them), recategorize has nothing to derive from and the doc remains `custom`.

If this is the cause:

* Many "custom" pipelines would have **null** `pipelineTemplate*` fields.
* Recategorize wouldn't help — only a `force=true` re-enrichment would, because that re-fetches the YAML.
* Distribution by `collectedAt` would show "custom" docs clustered in older cycles.

#### C. YAML fetch failing for many projects

`fetchGitLabCiYaml` returns `null` on 404, network error, or any non-2xx. When YAML is null, classifier returns `noYamlCategory` (default `"none"`). The dashboard shows 0 "none", so this isn't a primary cause — but partial failures (e.g. some refs return 200, others 404) could still contribute via the cache.

***

### Investigation queries

Run these against the Mongo backing the affected environment. **Do not commit the output to git** — it may contain repo names/IDs. Paste back as plain text in the investigation thread.

#### Query 1 — How many "custom" pipelines actually have template facts captured?

```js
db.dev_insight_pipelines_collection.aggregate([
  { $match: { pipelineCategory: "custom" } },
  { $group: {
      _id: {
        hasProject: { $cond: [{ $ifNull: ["$pipelineTemplateProject", false] }, "yes", "no"] },
        hasFile:    { $cond: [{ $ifNull: ["$pipelineTemplateFile",    false] }, "yes", "no"] }
      },
      count: { $sum: 1 }
  }},
  { $sort: { count: -1 } }
])
```

**Interpretation:**

* `hasProject=no, hasFile=no` dominant → **Hypothesis B** (template facts never captured). `/pipelines/recategorize` cannot help; a full `/pipelines/refresh?force=true&daysBack=30` is needed.
* `hasProject=yes, hasFile=yes` dominant → **Hypothesis A or C** (facts captured but rules didn't fire — either wrong values or fetch returned partial data).

#### Query 2 — Sample a "custom" pipeline that should be NCD

```js
db.dev_insight_pipelines_collection.findOne(
  { pipelineCategory: "custom",
    repoUrl: { $regex: "renovate-config", $options: "i" } },
  { _id: 0, pipelineId: 1, repoUrl: 1, ref: 1, status: 1,
    pipelineCategory: 1, pipelineLabels: 1, pipelineLabelMap: 1,
    pipelineTemplateProject: 1, pipelineTemplateRef: 1, pipelineTemplateFile: 1,
    collectedAt: 1, updatedAt: 1, projectId: 1 }
)
```

**Interpretation:**

* All `pipelineTemplate*` fields **null** → enrichment never captured the YAML for this pipeline. Confirms Hypothesis B.
* `pipelineTemplateProject` set to the **wrong** project (not the one in the actual `.gitlab-ci.yml`) → cache contamination, Hypothesis A.
* Fields set correctly but `pipelineCategory = custom` → rule isn't matching. Check active rules via `GET /api/v1/collector/gitlab/classification/rules`.

#### Query 3 — Is there a YAML snapshot for that project?

Take `projectId` from Query 2's result, then:

```js
db.gitlab_ci_yaml.find(
  { gitlabProjectId: <projectId> },
  { _id: 0, ref: 1, includeProject: 1, includeFile: 1, fetchedAt: 1, contentHash: 1 }
).sort({ fetchedAt: -1 })
```

**Interpretation:**

* No snapshot → the collector never successfully fetched `.gitlab-ci.yml` for this project. Network issue, 404, or wrong path. Hypothesis C variant.
* One snapshot per ref → check whether `includeProject` matches what the dashboard shows.
* Multiple snapshots with different `includeProject` values → confirms that refs diverge and cache-by-projectId (Hypothesis A) would cause wrong stamping.

#### Query 4 — When were the "custom" pipelines categorized?

```js
db.dev_insight_pipelines_collection.aggregate([
  { $match: { pipelineCategory: "custom" } },
  { $project: { day: { $dateToString: { format: "%Y-%m-%d", date: "$collectedAt" } } } },
  { $group: { _id: "$day", n: { $sum: 1 } } },
  { $sort: { _id: -1 } },
  { $limit: 15 }
])
```

**Interpretation:**

* All `custom` docs clustered in dates **before** the categorization rules were uploaded → Hypothesis B is definitive.
* `custom` docs spread across recent days, including post-rule-upload → enrichment is producing `custom` even with current rules. Hypothesis A or C.

#### Query 5 — Distribution of `pipelineTemplateFile` values among "custom"

```js
db.dev_insight_pipelines_collection.aggregate([
  { $match: { pipelineCategory: "custom", pipelineTemplateFile: { $ne: null } } },
  { $group: { _id: "$pipelineTemplateFile", n: { $sum: 1 } } },
  { $sort: { n: -1 } },
  { $limit: 20 }
])
```

**Interpretation:**

* If any of these `_id` values match patterns from the NCD/NPC2/TOM/Database rules → confirms rule didn't fire even when facts were captured. Cache contamination (A) or rules mismatch.
* If all values are unfamiliar → those projects use different templates not yet in the rule set.

#### Query 6 — Current active rules (sanity check)

```bash
curl -s "http://<collector-host>:8080/api/v1/collector/gitlab/classification/rules" | jq
```

Confirm:

* `data.activeRules.source` is `"mongo"` (not `"yaml-config"` fallback).
* `data.activeRules.count` is 4.
* Each rule has the expected `matchType`, regex, and sub-rules.

Two related spot checks. In the Mongo shell, count how many pipelines collected on a given day (adjust the date range) carry template facts and job data:

```js
db.dev_insight_pipelines_collection.aggregate([
  { $match: {
      collectedAt: { $gte: ISODate("2026-05-26T00:00:00Z"), $lt: ISODate("2026-05-27T00:00:00Z") }
  }},
  { $group: {
      _id: {
        hasProject: { $cond: [{ $ifNull: ["$pipelineTemplateProject", false] }, "yes", "no"] },
        hasJobs:    { $cond: [{ $gt: [{ $size: { $ifNull: ["$jobs", []] } }, 0] }, "yes", "no"] }
      },
      count: { $sum: 1 }
  }}
])
```

In the collector logs, list the first ten full-refresh starts:

```bash
podman logs ci-pipeline-collector 2>&1 | grep -E "Refreshing ALL pipeline details" | head -10
```

***

### Decision tree

| Query 1 says                 | Query 4 says                    | Likely cause                    | Fix                                                                                                                         |
| ---------------------------- | ------------------------------- | ------------------------------- | --------------------------------------------------------------------------------------------------------------------------- |
| `hasProject=no` dominant     | `custom` clustered in old dates | **B — stale enrichment**        | Run `POST /api/v1/collector/gitlab/pipelines/refresh?daysBack=30&force=true` — full re-enrich captures fresh template facts |
| `hasProject=yes` dominant    | `custom` includes recent dates  | **A — cache contamination**     | Code change: key classifier cache by `(projectId, ref)` instead of `projectId` alone                                        |
| Query 3 shows no snapshot    | —                               | **C — YAML fetch failing**      | Check collector logs for "Could not fetch .gitlab-ci.yml" WARNs; inspect for 401/403/404 patterns                           |
| Query 5 shows known patterns | —                               | Rule isn't firing despite facts | Check Query 6 — verify rules loaded and sub-rule order                                                                      |
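
The table can be read as a small lookup. The Python sketch below encodes it; `likely_cause` and its parameter names are hypothetical, and the precedence between rows (missing snapshot first, then B, A, and the rule check) is an assumption, since the table does not rank them.

```python
def likely_cause(has_project_dominant, custom_in_recent_dates,
                 snapshot_found=True, known_patterns_in_custom=False):
    """Map query findings to a decision-tree row (row precedence is assumed)."""
    if not snapshot_found:                       # Query 3
        return "C: YAML fetch failing"
    if not has_project_dominant and not custom_in_recent_dates:   # Queries 1 + 4
        return "B: stale enrichment"
    if has_project_dominant and custom_in_recent_dates:           # Queries 1 + 4
        return "A: cache contamination"
    if known_patterns_in_custom:                 # Query 5
        return "Rule isn't firing despite facts"
    return "inconclusive: run the follow-up queries"

print(likely_cause(has_project_dominant=False, custom_in_recent_dates=False))
```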

***

### Side observation — 11h cycle time

30 days × \~4000 pipelines took \~11 hours. For reference, the previous baseline was 1h 44m for 19k pipelines on gitlab.com. The discrepancy suggests:

* Stage 1 (`pipelines_collector`) was the bottleneck — paginated pipeline-list fetch for \~4000 active projects, serially.
* Or Nomura's internal GitLab has higher per-request latency than gitlab.com.
* Or rate-limiting was triggering global pauses.
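
To put the gap in numbers, the two runs can be compared as throughput. The figures below are the approximate ones quoted in this document (19k pipelines in 1h 44m, 4288 pipelines in about 11 hours), so the result is a rough ratio rather than a measurement.

```python
# Approximate figures quoted in this document.
baseline_pipelines, baseline_minutes = 19_000, 1 * 60 + 44   # gitlab.com baseline
observed_pipelines, observed_minutes = 4_288, 11 * 60        # this dataset, ~11 h

baseline_rate = baseline_pipelines / baseline_minutes   # pipelines per minute
observed_rate = observed_pipelines / observed_minutes
slowdown = baseline_rate / observed_rate

print(f"baseline: {baseline_rate:.0f} pipelines/min")
print(f"observed: {observed_rate:.1f} pipelines/min")
print(f"slowdown: {slowdown:.0f}x")
```

The per-pipeline rate is lower by more than an order of magnitude, which is too large to explain by dataset size alone.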

To diagnose:

```bash
# Per-job timings from collector logs
podman logs gitlab-collector 2>&1 | grep -E "completed successfully in [0-9]+ms" | tail -30

# Look for rate-limit pauses
podman logs gitlab-collector 2>&1 | grep -E "rate limit|429|Pausing"
```

If `pipelines_collector` (Stage 1) is the longest job, the project-level parallelism added in this release will help once `GITLAB_COLLECTOR_PROJECT_PARALLELISM=4` is set. **But categorization correctness comes first — don't bump parallelism until the categorization bug is understood.**

***

### What to send back

When investigating, paste the output of:

1. Query 1 (counts by `hasProject`/`hasFile`)
2. Query 4 (counts by day)
3. Query 6 (active rules listing)

That's enough to identify the hypothesis. Queries 2/3/5 are follow-ups depending on what 1+4 show.

**Don't change code yet.** The fix for B (full re-enrich) doesn't need code changes — it's an operational command. The fix for A is a small code change; the fix for C is environment troubleshooting. Picking the wrong fix risks burning another 11-hour cycle.

