debug

Reading Logs from Podman

The two collectors run as separate containers:

| Container | Service |
|---|---|
| gitlab-collector | GitLab DevInsight collection (GitLabCollectorService) |
| ci-pipeline-collector | Pipeline detail enrichment (GitLabPipelineCollectorService) |

Verify they're running:

podman ps --format "{{.Names}}\t{{.Status}}" | grep -E "gitlab-collector|ci-pipeline-collector"

macOS note: these commands use mktime and gensub, both GNU awk extensions that the stock macOS awk lacks. Install GNU awk with brew install gawk and replace awk with gawk.


A. GitLab Full Collection Cycle — Wall-Clock per Run

podman logs gitlab-collector 2>&1 | awk '
  /Starting GitLab DevInsight data collection cycle/ {
    start_ts = $1 " " $2
    start_epoch = mktime(gensub(/[-:]/," ","g",start_ts))
  }
  /Completed GitLab DevInsight data collection cycle/ && start_epoch {
    end_ts = $1 " " $2
    end_epoch = mktime(gensub(/[-:]/," ","g",end_ts))
    printf "%s -> %s   %d sec\n", start_ts, end_ts, end_epoch - start_epoch
    start_epoch = 0
  }'

B. GitLab Per-Job Aggregated Stats (runs / avg / min / max in ms)

C. Pipeline Collector — Per-Cycle Wall-Clock

D. Pipeline Collector — Summary Stats


Capturing Errors & Warnings

Run these against whichever container you're inspecting. Replace <container> with gitlab-collector or ci-pipeline-collector.

E. All ERROR / WARN lines
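A minimal sketch for this step, assuming the log level appears as a space-delimited token after the timestamp (the same pattern the live-monitor tip on this page uses). The function name filter_errors is hypothetical; it reads a log stream on stdin, so it works on live podman output or on a saved file.

```shell
# Keep only ERROR and WARN lines from a log stream on stdin.
# Assumes lines look like "2026-04-24 09:00:01 ERROR ...".
filter_errors() {
  grep -E " (ERROR|WARN) "
}

# Usage:
#   podman logs gitlab-collector 2>&1 | filter_errors
```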

F. Errors with stack-trace context (15 lines after each ERROR)
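One way to get the 15 lines of context is grep's -A option (supported by both GNU and BSD grep). This is a sketch under the same log-format assumption as above; errors_with_context is a hypothetical name.

```shell
# Print each ERROR line followed by up to 15 lines of trailing context
# (typically the Java stack trace). Reads the log stream on stdin.
errors_with_context() {
  grep -A 15 -E " ERROR "
}

# Usage:
#   podman logs ci-pipeline-collector 2>&1 | errors_with_context
```

grep prints a "--" separator between non-adjacent matches, which helps tell separate stack traces apart.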

G. Distinct errors, deduplicated and counted
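A possible sketch: drop the first two fields (the app's YYYY-MM-DD HH:MM:SS prefix) so identical messages collapse, then count them. The name distinct_errors is hypothetical, and messages that embed IDs or durations will not deduplicate without further normalization.

```shell
# Count distinct ERROR messages, most frequent first.
# Fields 1-2 are assumed to be the date and time and are stripped
# before counting. Reads the log stream on stdin.
distinct_errors() {
  grep -E " ERROR " | cut -d ' ' -f 3- | sort | uniq -c | sort -rn
}

# Usage:
#   podman logs gitlab-collector 2>&1 | distinct_errors | head -n 20
```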

H. HTTP failure status codes from the GitLab API

I. Per-project soft failures (404, skipped resources)

J. Rate-limiting, timeouts, connection issues

K. Cycle health summary — errors per GitLab cycle
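A sketch of a per-cycle summary that reuses the start and completion markers from script A. It needs no GNU awk extensions. The name errors_per_cycle is hypothetical, and a cycle that never logs its completion line is not reported.

```shell
# For each GitLab collection cycle, print its start and end timestamps
# and the number of ERROR / WARN lines logged in between.
# Reads the log stream on stdin.
errors_per_cycle() {
  awk '
    /Starting GitLab DevInsight data collection cycle/ {
      start_ts = $1 " " $2; errors = 0; warns = 0; in_cycle = 1; next
    }
    in_cycle && / ERROR / { errors++ }
    in_cycle && / WARN /  { warns++ }
    /Completed GitLab DevInsight data collection cycle/ && in_cycle {
      printf "%s -> %s   errors=%d warns=%d\n", start_ts, $1 " " $2, errors, warns
      in_cycle = 0
    }'
}

# Usage:
#   podman logs gitlab-collector 2>&1 | errors_per_cycle
```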


Tips for Podman Log Capture

  • Limit time range: podman logs --since 1h gitlab-collector 2>&1 | …, or since a timestamp: --since 2026-04-24T09:00:00.

  • Tail recent only: podman logs --tail 5000 gitlab-collector 2>&1 | … for quick spot checks.

  • Live monitor errors: podman logs -f gitlab-collector 2>&1 | grep --line-buffered -E " (ERROR|WARN) ".

  • Save a snapshot for offline analysis:

    Then run the same awk/grep commands against the files (drop the podman logs … | prefix and pass the file as the last argument).

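  • Why 2>&1: podman logs replays the container's stderr on its own stderr, so without 2>&1, grep/awk only see what the app wrote to stdout. Spring Boot's default console appender writes to stdout, but JVM warnings and anything else sent to stderr would otherwise bypass the pipe.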

  • Podman timestamps vs app timestamps: the awk scripts above parse the app's own YYYY-MM-DD HH:MM:SS prefix from $1 $2. Avoid --timestamps (it adds an RFC3339 prefix that shifts the columns and breaks the scripts).
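The snapshot tip above can be sketched as a small helper. The name snapshot_logs and the file naming scheme are hypothetical; the only podman call used is podman logs.

```shell
# Save one container's full log (stdout and stderr) to a timestamped file
# and print the file name.
#   $1 = container name, $2 = output directory (defaults to the current one)
snapshot_logs() {
  container="$1"
  out_dir="${2:-.}"
  out_file="${out_dir}/${container}-$(date +%Y%m%dT%H%M%S).log"
  podman logs "$container" > "$out_file" 2>&1 && printf '%s\n' "$out_file"
}

# Usage:
#   f=$(snapshot_logs gitlab-collector /tmp)
#   grep -E " (ERROR|WARN) " "$f"
```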

Pipeline Categorization Rules

Categorization rules are runtime-mutable and stored in MongoDB (pipeline_category_rules collection). The collector evaluates them during pipeline enrichment, and the /pipelines/recategorize endpoint evaluates them when re-stamping stored pipelines.

  • Editable at runtime via POST /api/v1/collector/gitlab/classification/rules — no container rebuild, no redeploy.

  • Applied to stored data via POST /api/v1/collector/gitlab/pipelines/recategorize?daysBack=N — pure Mongo, no GitLab calls.

  • Evaluated in priority order (ascending). First matching rule sets pipelineCategory. Every matching rule contributes labels.
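The evaluation order described in the last bullet can be illustrated with a deliberately simplified sketch. It is not the collector's code: it treats every rule as one case-insensitive regex over a single string, ignores match types and sub-rules, and the classify function and its tab-separated rule format are hypothetical.

```shell
# Simplified model of rule evaluation.
#   $1    = value to match (for example a namespace path)
#   stdin = rules, one per line: "priority<TAB>name<TAB>regex"
# Rules are sorted by ascending priority; the first match sets the category,
# every match contributes a label, and no match falls through to "custom".
classify() {
  sort -n | awk -F '\t' -v target="$1" '
    tolower(target) ~ tolower($3) {
      if (category == "") category = $2
      labels = labels (labels == "" ? "" : ", ") $2
    }
    END {
      if (category == "") category = "custom"
      printf "%s [%s]\n", category, labels
    }'
}

# Usage:
#   printf '20\tTOM\teis-grafana-tom\n10\tNPC2\teis-terraform-npc-provisioning\n' |
#     classify 'eis-grafana-tom-eng/grafana-ui-automation-job'
```

Because the input is sorted before matching, the order in which rules are stored does not matter; only their priority values do.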


Quick Reference

| Priority | Name | Match type | Matched against | Sub-label dimensions |
|---|---|---|---|---|
| 10 | NPC2 | namespace | eis-terraform-npc-provisioning (substring on namespace path) | env |
| 20 | TOM | namespace | eis-grafana-tom (substring) | subteam |
| 30 | Database | namespace | EIS-DBMW-DBENG (substring) | (none) |
| 40 | NCD | include | .gitlab-ci.yml include.project matches .*ncd.*pipeline.* | subtype |

Fallthrough: pipelines matching no rule receive pipelineCategory = "custom". Pipelines with no .gitlab-ci.yml and no namespace match receive pipelineCategory = "none".


Rule 1 — NPC2 (priority 10)

Namespace-based. Categorizes any project whose namespace path contains eis-terraform-npc-provisioning. Works for both lab (gitlab.com) and production (gitlab.nomura.com) URLs.

JSON

Sub-label table

| Namespace suffix | env label |
|---|---|
| -prodtest | prod |
| -prod | prod |
| -nonprodtest | nonprod |
| -nonprod | nonprod |
| -qa | qa |

Sub-rule order matters: nonprod is listed before prod because -prod would otherwise substring-match -nonprodtest.

Example outputs

| URL | pipelineCategory | pipelineLabels | pipelineLabelMap |
|---|---|---|---|
| gitlab.com/eis-terraform-npc-provisioning-prodtest/foo | NPC2 | [NPC2, prod] | {category: NPC2, env: prod} |
| gitlab.nomura.com/eis-terraform-npc-provisioning-nonprod/bar | NPC2 | [NPC2, nonprod] | {category: NPC2, env: nonprod} |
| gitlab.nomura.com/eis-terraform-npc-provisioning-qa/x | NPC2 | [NPC2, qa] | {category: NPC2, env: qa} |


Rule 2 — TOM (priority 20)

Namespace-based. Matches eis-grafana-tom and distinguishes the rules vs eng sub-teams.

JSON

Sub-label table

| Namespace suffix | subteam label |
|---|---|
| -rules | rules |
| -eng | eng |

Example outputs

| URL | pipelineCategory | pipelineLabels | pipelineLabelMap |
|---|---|---|---|
| gitlab.nomura.com/eis-grafana-tom-rules/rules-repo | TOM | [TOM, rules] | {category: TOM, subteam: rules} |
| gitlab.nomura.com/eis-grafana-tom-eng/grafana-ui-automation-job | TOM | [TOM, eng] | {category: TOM, subteam: eng} |
| gitlab.nomura.com/eis-grafana-tom/whatever | TOM | [TOM] | {category: TOM} |


Rule 3 — Database (priority 30)

Namespace-based. Catches the DBMW database engineering group. No sub-rules.

JSON

Example outputs

| URL | pipelineCategory | pipelineLabels | pipelineLabelMap |
|---|---|---|---|
| gitlab.nomura.com/EIS-DBMW-DBENG/dbmw-housekeeping | Database | [Database] | {category: Database} |


Rule 4 — NCD (priority 40)

Include-based. Matches any .gitlab-ci.yml whose include.project value contains ncd.*pipeline.* (e.g. gts-cta-strategy-innersource/ncd/pipeline-templates). Sub-rules examine include.file to identify Helm CI / Application CI / Dependency CI.

JSON

Sub-label table

| include.file | subtype label |
|---|---|
| NCD-Build.helm.local.gitlab-ci.yml | Helm CI |
| NCD-Dependency.local.gitlab-ci.yml | Dependency CI |
| NCD-Build.local.gitlab-ci.yml | Application CI |

Sub-rule order matters: Helm CI is checked before Application CI because NCD-Build.helm.local.gitlab-ci.yml would otherwise substring-match the Application pattern.

Example outputs

| include.project | include.file | pipelineCategory | pipelineLabels | pipelineLabelMap |
|---|---|---|---|---|
| gts-cta-strategy-innersource/ncd/pipeline-templates | NCD-Build.helm.local.gitlab-ci.yml | NCD | [NCD, Helm CI] | {category: NCD, subtype: Helm CI} |
| gts-cta-strategy-innersource/ncd/pipeline-templates | NCD-Build.local.gitlab-ci.yml | NCD | [NCD, Application CI] | {category: NCD, subtype: Application CI} |
| gts-cta-strategy-innersource/ncd/pipeline-dependency | NCD-Dependency.local.gitlab-ci.yml | NCD | [NCD, Dependency CI] | {category: NCD, subtype: Dependency CI} |


Installation — POST all rules in one go

Verify

Expected:

Apply to stored data

Runs in seconds — pure Mongo + config, no GitLab calls.
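A sketch of the recategorize call, using the endpoint path given earlier on this page. The base URL in COLLECTOR_URL is a placeholder, the recategorize function name is hypothetical, and no authentication header is shown because the page does not say whether one is required.

```shell
# Hypothetical base URL: adjust host and port to your deployment.
COLLECTOR_URL="${COLLECTOR_URL:-http://localhost:8080}"

# Re-stamp stored pipelines with the current rules.
#   $1 = daysBack window (defaults to 30)
recategorize() {
  curl -sS -X POST \
    "${COLLECTOR_URL}/api/v1/collector/gitlab/pipelines/recategorize?daysBack=${1:-30}"
}

# Usage:
#   recategorize 30
```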


Sub-rule field reference

The field attribute on a sub-rule decides which raw fact the pattern is matched against:

| field value | Source fact | Notes |
|---|---|---|
| templateProject | include.project from .gitlab-ci.yml | The template repo path |
| templateRef | include.ref | The branch / tag of the included template |
| templateFile | include.file | The specific file inside the template repo |
| namespace | parsed from repoUrl | E.g. eis-terraform-npc-provisioning-prod/some-repo |
| repoUrl | full project URL | Useful for rare URL-based regexes |

Sub-rule semantics

  • All sub-rule patterns are case-insensitive and use Matcher.find() (substring match).

  • Sub-rules are evaluated in array order. List the most specific patterns first when they can overlap (e.g. Helm before Application; nonprod before prod).

  • Each matching sub-rule contributes its label to pipelineLabels. If key is set, it also lands in pipelineLabelMap[key] (last writer wins on key collisions).

  • Set enabled: false to keep a sub-rule on record without applying it.


Operational notes

  • The classifier caches rules in-process for 60 s. Mutating endpoints invalidate the cache automatically.

  • Namespace rules require pipeline.repoUrl to be populated. That field is stamped during enrichment from the project record — which means projects must already exist in dev_insight_projects_collection for namespace rules to fire. Run /gitlab/collect once to seed projects, then /pipelines/refresh for enrichment.

  • For include-based rules, only pipelineTemplateProject / pipelineTemplateRef / pipelineTemplateFile are needed — these are captured directly during pipeline enrichment from the parsed .gitlab-ci.yml.

Investigation — Pipeline categorization showing 81% "custom"

Reporter: ops, 2026-05-21

Symptom: On the Nomura GitLab-Demo dataset (4288 pipelines, 30-day window), the collector + dashboard reports:

  • NCD = 342 (8%)

  • NPC2 = 393 (9%)

  • TOM = 80 (2%)

  • Database = 1 (0%)

  • custom = 3472 (81%)

  • none = 0

Spot-check: GM-EUC-QIS-Structuring/renovate-config shows .gitlab-ci.yml with include.project = gts-cta-strategy-innersource/ncd/pipeline-dependency and include.file = NCD-Dependency.local.gitlab-ci.yml. This should classify as NCD / Dependency CI per the active rules. Two rows of renovate-config in the dashboard are tagged NCD / Dependency CI correctly; three other renovate-config rows are tagged custom — same YAML shape, different fork instances.

Additional context:

  • 30-day enrichment cycle on this dataset took ~11 hours.

  • Running /pipelines/recategorize after collection only changed a handful of records; most stayed custom.

Conclusion: there's a real bug. This document narrows down which of three hypotheses is responsible before any code change.


Hypotheses

A. Per-project cache contamination across pipelines

PipelineCategoryClassifier.CACHE is keyed by projectId alone. The cache stores a Classification object that carries the first include's templateProject / templateRef / templateFile. When a project has multiple pipelines on different refs (e.g. main, release-2.x, feature branches), they all share the same cached Classification from whichever ref was fetched first — even if their actual .gitlab-ci.yml could differ on different refs.

If this is the cause:

  • A subset of pipelines would have wrong template values stamped (matching the first-cached ref, not their own).

  • Affected pipelines: those in projects with diverse refs.
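The contamination mechanism can be reproduced in a few lines of awk. This is a toy model of the hypothesis, not the classifier's code: stamp_with_cache and its input format are hypothetical, and each input line stands for one pipeline and the template file its own ref actually declares.

```shell
# Toy model of the classifier cache.
#   $1    = "project" to key the cache by projectId alone,
#           "project+ref" to key it by (projectId, ref)
#   stdin = one pipeline per line: "projectId ref actualTemplateFile"
# Prints what would be stamped on each pipeline next to the actual value.
stamp_with_cache() {
  awk -v mode="$1" '
    {
      key = (mode == "project") ? $1 : $1 SUBSEP $2
      if (!(key in cache)) cache[key] = $3   # first fetch populates the cache
      printf "%s %s stamped=%s actual=%s\n", $1, $2, cache[key], $3
    }'
}

# Usage:
#   printf '%s\n' '42 main NCD-Build.local.gitlab-ci.yml' \
#                 '42 release-2.x NCD-Build.helm.local.gitlab-ci.yml' |
#     stamp_with_cache project
```

With the project key, the release-2.x pipeline is stamped with the file cached from main; with the project+ref key, each pipeline keeps its own value.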

B. Stale enrichment from before classification was wired up

Stage 1 (pipelines_collector) creates pipeline docs with no template fields. Stage 2 enrichment runs only when explicitly triggered (/pipelines/refresh). If force=false, terminal pipelines hit the isFullyEnrichedTerminal short-circuit and are skipped. Pipelines that were enriched once before the categorization rules existed (or before the multi-include/namespace fix) therefore carry pipelineCategory = "custom" indefinitely.

/pipelines/recategorize works off stored pipelineTemplateProject / pipelineTemplateRef / pipelineTemplateFile. If those fields are null (because enrichment never captured them), recategorize has nothing to derive from and the doc remains custom.

If this is the cause:

  • Many "custom" pipelines would have null pipelineTemplate* fields.

  • Recategorize wouldn't help — only a force=true re-enrichment would, because that re-fetches the YAML.

  • Distribution by collectedAt would show "custom" docs clustered in older cycles.

C. YAML fetch failing for many projects

fetchGitLabCiYaml returns null on 404, network error, or any non-2xx. When YAML is null, classifier returns noYamlCategory (default "none"). The dashboard shows 0 "none", so this isn't a primary cause — but partial failures (e.g. some refs return 200, others 404) could still contribute via the cache.


Investigation queries

Run these against the Mongo backing the affected environment. Do not commit the output to git — it may contain repo names/IDs. Paste back as plain text in the investigation thread.

Query 1 — How many "custom" pipelines actually have template facts captured?

Interpretation:

  • hasProject=no, hasFile=no dominant → Hypothesis B (template facts never captured). /pipelines/recategorize cannot help; a full /pipelines/refresh?force=true&daysBack=30 is needed.

  • hasProject=yes, hasFile=yes dominant → Hypothesis A or C (facts captured but rules didn't fire — either wrong values or fetch returned partial data).

Query 2 — Sample a "custom" pipeline that should be NCD

Interpretation:

  • All pipelineTemplate* fields null → enrichment never captured the YAML for this pipeline. Confirms Hypothesis B.

  • pipelineTemplateProject set to the wrong project (not the one in the actual .gitlab-ci.yml) → cache contamination, Hypothesis A.

  • Fields set correctly but pipelineCategory = custom → rule isn't matching. Check active rules via GET /api/v1/collector/gitlab/classification/rules.

Query 3 — Is there a YAML snapshot for that project?

Take projectId from Query 2's result, then:

Interpretation:

  • No snapshot → the collector never successfully fetched .gitlab-ci.yml for this project. Network issue, 404, or wrong path. Hypothesis C variant.

  • One snapshot per ref → check whether includeProject matches what the dashboard shows.

  • Multiple snapshots with different includeProject values → confirms that refs diverge and cache-by-projectId (Hypothesis A) would cause wrong stamping.

Query 4 — When were the "custom" pipelines categorized?

Interpretation:

  • All custom docs clustered in dates before the categorization rules were uploaded → Hypothesis B is definitive.

  • custom docs spread across recent days, including post-rule-upload → enrichment is producing custom even with current rules. Hypothesis A or C.

Query 5 — Distribution of pipelineTemplateFile values among "custom"

Interpretation:

  • If any of these _id values match patterns from the NCD/NPC2/TOM/Database rules → confirms rule didn't fire even when facts were captured. Cache contamination (A) or rules mismatch.

  • If all values are unfamiliar → those projects use different templates not yet in the rule set.

Query 6 — Current active rules (sanity check)

Confirm:

  • data.activeRules.source is "mongo" (not "yaml-config" fallback).

  • data.activeRules.count is 4.

  • Each rule has the expected matchType, regex, and sub-rules.


Decision tree

| Query 1 says | Query 4 says | Likely cause | Fix |
|---|---|---|---|
| hasProject=no dominant | custom clustered in old dates | B: stale enrichment | Run POST /api/v1/collector/gitlab/pipelines/refresh?daysBack=30&force=true; a full re-enrich captures fresh template facts |
| hasProject=yes dominant | custom includes recent dates | A: cache contamination | Code change: key classifier cache by (projectId, ref) instead of projectId alone |
| Query 3 shows no snapshot | | C: YAML fetch failing | Check collector logs for "Could not fetch .gitlab-ci.yml" WARNs; inspect for 401/403/404 patterns |
| Query 5 shows known patterns | | Rule isn't firing despite facts | Check Query 6: verify rules loaded and sub-rule order |
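For the YAML-fetch row, a rough tally of status codes in the fetch warnings might look like the sketch below. It assumes the WARN line contains the HTTP status as a standalone number, which this page does not confirm; other standalone numbers in the message (for example a project ID of 404) would be miscounted. The function name is hypothetical.

```shell
# Count 401 / 403 / 404 occurrences on "Could not fetch .gitlab-ci.yml" lines.
# Reads the log stream on stdin.
fetch_failures_by_status() {
  grep -F "Could not fetch .gitlab-ci.yml" | grep -owE '401|403|404' | sort | uniq -c | sort -rn
}

# Usage:
#   podman logs ci-pipeline-collector 2>&1 | fetch_failures_by_status
```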


Side observation — 11h cycle time

30 days × ~4000 pipelines took ~11 hours. For reference, the previous baseline was 1h 44m for 19k pipelines on gitlab.com. The discrepancy suggests:

  • Stage 1 (pipelines_collector) was the bottleneck — paginated pipeline-list fetch for ~4000 active projects, serially.

  • Or Nomura's internal GitLab has higher per-request latency than gitlab.com.

  • Or rate-limiting was triggering global pauses.
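To put the gap in numbers, the two runs can be compared as pipelines per minute. The inputs are the rounded figures quoted above (4288 pipelines in about 11 hours; about 19,000 pipelines in 1h 44m), so the result is an order-of-magnitude estimate only. The function name is hypothetical.

```shell
# Compare throughput of the slow cycle with the earlier baseline.
throughput_gap() {
  awk 'BEGIN {
    current  = 4288 / (11 * 60)   # pipelines per minute over ~11 h
    baseline = 19000 / 104        # pipelines per minute over 1 h 44 m
    printf "this cycle: %.1f pipelines/min\n", current
    printf "baseline:   %.1f pipelines/min\n", baseline
    printf "slowdown:   ~%.0fx\n", baseline / current
  }'
}

# Usage:
#   throughput_gap
```

This works out to roughly 6.5 versus 182.7 pipelines per minute, a gap of about 28x, which is too large to attribute to dataset size alone.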

To diagnose:

If pipelines_collector (Stage 1) is the longest job, the project-level parallelism added in this release will help once GITLAB_COLLECTOR_PROJECT_PARALLELISM=4 is set. But categorization correctness comes first — don't bump parallelism until the categorization bug is understood.


What to send back

When investigating, paste the output of:

  1. Query 1 (counts by hasProject/hasFile)

  2. Query 4 (counts by day)

  3. Query 6 (active rules listing)

That's enough to identify the hypothesis. Queries 2, 3, and 5 are follow-ups, depending on what Queries 1 and 4 show.

Don't change code yet. The fix for B (full re-enrich) doesn't need code changes — it's an operational command. The fix for A is a small code change; the fix for C is environment troubleshooting. Picking the wrong fix risks burning another 11-hour cycle.
