Parser statistics

The email parser logs every parse attempt to ParseTrainingLog — which parser fired, what was extracted, what was missing. From this, TravStats builds a Parser Stats overview that shows how reliably the parser performs per airline and where a custom template would pay off.

The Parser → Parse Logs tab is admin-only. The overview shows:

  • Total number of parse attempts
  • Overall hit rate — percentage of parses where a template (built-in or user) fired (rather than only the Ollama fallback or the generic regex extractor)
  • Per-airline table with:
    • Number of attempts
    • Hit rate (built-in / user template match rate)
    • Top-5 missing fields (what the templates fail to extract most often — e.g. aircraft, gate, seat)

The table is sorted descending by attempt count — airlines you’ve parsed most often appear at the top.
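The aggregation behind this table can be sketched roughly as follows — a minimal illustration, not the actual TravStats code. The field names (`airline`, `templateHit`, `missingFields`) are assumptions, not the real ParseTrainingLog schema:

```typescript
// Hypothetical shape of one parse-log row (assumed, not the real schema)
type ParseLog = { airline: string; templateHit: boolean; missingFields: string[] };

type AirlineStats = {
  airline: string;
  total: number;
  hits: number;
  hitRate: number;       // fraction of attempts where a template fired
  topMissing: string[];  // the top-5 most frequently missing fields
};

function aggregate(logs: ParseLog[]): AirlineStats[] {
  // Group logs by airline
  const byAirline = new Map<string, ParseLog[]>();
  for (const log of logs) {
    const bucket = byAirline.get(log.airline) ?? [];
    bucket.push(log);
    byAirline.set(log.airline, bucket);
  }

  const stats: AirlineStats[] = [];
  for (const [airline, entries] of byAirline) {
    const hits = entries.filter((e) => e.templateHit).length;

    // Count how often each field was missing
    const missCounts = new Map<string, number>();
    for (const e of entries) {
      for (const f of e.missingFields) missCounts.set(f, (missCounts.get(f) ?? 0) + 1);
    }
    const topMissing = [...missCounts.entries()]
      .sort((a, b) => b[1] - a[1])
      .slice(0, 5) // keep only the top-5
      .map(([field]) => field);

    stats.push({ airline, total: entries.length, hits, hitRate: hits / entries.length, topMissing });
  }

  // Sorted descending by attempt count, as in the UI
  return stats.sort((a, b) => b.total - a.total);
}
```

Everything else in the overview (total attempts, overall hit rate) falls out of the same grouping.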

| Hit rate | Meaning |
| --- | --- |
| 100 % | The airline has a perfectly working built-in or user template — no action needed |
| 70–95 % | The template fires most of the time, but the airline occasionally tweaks its format. A user-template refresh or a built-in patch is due |
| 30–70 % | The template fires unreliably — either the airline has multiple email formats (old / new) and only one is covered, or the template is stale |
| 0–30 % | Practically only the Ollama / regex fallback fires. A user template would pay off |
| Unknown row | Mails where airline detection didn’t fire at all — sender domain unknown, subject pattern doesn’t match |

Hit rate is not the parse success rate — a mail with hit rate 0 will still be parsed by Ollama or the generic regex layer and typically still lands on the review screen with correct data. Hit rate measures only “did a deterministic template fire?”.

The commonMissingFields column shows the five most frequently missing fields per airline. Example:

| Airline | Total | Hits | Hit rate | Missing |
| --- | --- | --- | --- | --- |
| Lufthansa | 142 | 138 | 97 % | aircraft, gate, seat |
| Ryanair | 38 | 36 | 95 % | aircraft, seat, terminal, pnr |
| Air Baltic | 12 | 0 | 0 % | flightNumber, dep, arr, departureTime, airline |

Read it as:

  • Lufthansa: template fires reliably, but the aircraft type is usually missing (LH mails rarely include it — Aviationstack fills it in during enrichment)
  • Ryanair: template fires, but PNR / terminal placement varies in their mails
  • Air Baltic: no template (no match), Ollama or regex fallback does the rest. If you fly Air Baltic often, a user template would pay off

For maintainers / bug reports there’s a download:

Parser → Parse Logs → Export JSONL (admin only, rate-limited) or programmatically:

curl -fsS -H "Authorization: Bearer $ADMIN_TOKEN" \
  https://travstats.example.com/api/v1/admin/parse-logs/export \
  --output parse-logs.jsonl

What the export contains (max 50,000 rows):

  • Subject (hashed, not the original)
  • Detected airline / template, hit flag
  • Which fields were found / missed
  • Parser confidence
  • Anonymised body — detected PII (emails, IPs, JWTs, UUIDs) replaced with markers (<redacted:email> etc.)
  • Timestamp
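Marker-based redaction of this kind can be sketched as a chain of pattern substitutions. The regexes below are illustrative assumptions — the patterns TravStats actually applies may differ:

```typescript
// Assumed PII patterns; each match is replaced with a <redacted:kind> marker.
const patterns: Array<[string, RegExp]> = [
  ["email", /[\w.+-]+@[\w-]+\.[\w.]+/g],                                           // user@host
  ["jwt", /\beyJ[\w-]+\.[\w-]+\.[\w-]+/g],                                         // three base64url segments
  ["uuid", /\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b/gi],  // RFC 4122 shape
  ["ip", /\b(?:\d{1,3}\.){3}\d{1,3}\b/g],                                          // IPv4 dotted quad
];

function redact(body: string): string {
  let out = body;
  for (const [kind, re] of patterns) {
    out = out.replace(re, `<redacted:${kind}>`);
  }
  return out;
}
```

The point of the marker form (`<redacted:email>` rather than plain deletion) is that the anonymised body still reveals *where* PII sat in the layout, which matters when debugging a template.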

What the export does not contain:

  • Original email body
  • Passenger names, PNRs, booking refs (stripped before logging)
  • Personally identifying IDs

The format is useful for filing a bug report along the lines of “look, the Lufthansa template started missing fields after date X” — without sharing real booking data.
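To pin down that "date X" for such a report, you can scan the export line by line. A hedged sketch — the row shape (`airline`, `hit`, `missingFields`, `timestamp`) is inferred from the export description above, not a documented format:

```typescript
// Find the first timestamp at which a given airline's template fired
// but failed to extract a given field. Assumed JSONL row shape:
// { airline, hit, missingFields, timestamp }
function firstMiss(jsonl: string, airline: string, field: string): string | null {
  for (const line of jsonl.split("\n").filter(Boolean)) {
    const row = JSON.parse(line);
    if (row.airline === airline && row.hit && row.missingFields?.includes(field)) {
      return row.timestamp; // first log where the template fired but lost the field
    }
  }
  return null;
}
```

Run it against `parse-logs.jsonl` with the airline and field from the commonMissingFields column to get a concrete "started missing after …" timestamp for the bug report.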

Alongside the passive statistics there’s an active recording mode (see User templates):

  • Parser → My Templates → Record New lets you upload an example email (.eml, .txt, .msg) or a boarding-pass image
  • You annotate the fields in the web view, TravStats derives a template from it (deriveTemplateFromAnnotation)
  • The derived template is stored in ParserTemplate and used on the next match against an email from the same sender

Training uploads are stored in the trainingData DB table (20 MB limit per file, accepted formats: .eml, .txt, .msg, .jpg, .png, .pdf). They’re not part of the public statistics — those show only production parse results.
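The core idea of deriving a template from annotations can be sketched as: escape the sample text literally, then swap each annotated value for a named capture group. This is a deliberately naive illustration — the real deriveTemplateFromAnnotation is not public here, and the function name and shapes below are assumptions:

```typescript
// Naive sketch: build a matching regex from one annotated sample email.
// annotations maps field name -> the exact substring the user marked.
function deriveTemplate(sample: string, annotations: Record<string, string>): RegExp {
  // Escape regex metacharacters so the sample matches literally
  let pattern = sample.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  for (const [field, value] of Object.entries(annotations)) {
    const escaped = value.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    // Replace the annotated literal with a named capture group
    pattern = pattern.replace(escaped, `(?<${field}>\\S+)`);
  }
  return new RegExp(pattern);
}
```

A template derived this way then extracts the same fields from future emails that share the sender's layout but carry different values.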

Known limitations:

  • No history over time — the statistics aggregate the last 10,000 logs as a rolling window. If your Lufthansa hit rate drops from 100 % to 60 %, you’ll only notice if you check the table regularly. A trend chart is on the roadmap
  • User templates feed in too — if you record a user template for KL, KL subsequently appears in the table with a high hit rate. That makes the aggregate hard to interpret when you’re the only user on the instance; on multi-user setups it doesn’t matter
  • No per-template diff — the statistics know that “template X fired” but not which field it got wrong. The JSONL export exists for that level of detail