# codex-history-ingest

Installs: 257 · Rank: #7918

## Install

```shell
npx skills add https://github.com/ar9av/obsidian-wiki --skill codex-history-ingest
```
# Codex History Ingest — Conversation Mining

You are extracting knowledge from the user's past Codex sessions and distilling it into the Obsidian wiki. Session logs are rich but noisy: focus on durable knowledge, not operational telemetry.

This skill can be invoked directly or via the `wiki-history-ingest` router (`/wiki-history-ingest codex`).
## Before You Start

- Read `.env` to get `OBSIDIAN_VAULT_PATH` and `CODEX_HISTORY_PATH` (default to `~/.codex` if unset)
- Read `.manifest.json` at the vault root to check what has already been ingested
- Read `index.md` at the vault root to understand what the wiki already contains
## Ingest Modes

### Append Mode (default)

Check `.manifest.json` for each source file. Only process:

- Files not in the manifest (new session rollouts, new index files)
- Files whose modification time is newer than `ingested_at` in the manifest

Use this mode for regular syncs.

### Full Mode

Process everything regardless of the manifest. Use after `wiki-rebuild` or if the user explicitly asks for a full re-ingest.
## Codex Data Layout

Codex stores local artifacts under `~/.codex/`:

```
~/.codex/
├── sessions/                       # Session rollout logs by date
│   └── YYYY/MM/DD/
│       └── rollout--.jsonl
├── archived_sessions/              # Archived rollout logs
├── session_index.jsonl             # Lightweight index of thread id/name/updated_at
├── history.jsonl                   # Local transcript history (if persistence enabled)
├── config.toml                     # User config (contains history settings)
└── state_.sqlite / logs_.sqlite    # Runtime DBs (usually skip)
```

Key data sources, ranked by value:

1. `session_index.jsonl` — best inventory source for IDs, titles, and freshness
2. `sessions/*/rollout-.jsonl` — rich structured transcript events
3. `history.jsonl` — useful fallback/timeline aid if enabled

Avoid ingesting SQLite internals unless the user explicitly asks.
## Step 1: Survey and Compute Delta

Scan `CODEX_HISTORY_PATH` and compare against `.manifest.json`:

- `~/.codex/session_index.jsonl`
- `~/.codex/sessions/*/rollout-.jsonl`
- `~/.codex/archived_sessions/**` (optional; only if user asks for archived history)
- `~/.codex/history.jsonl` (optional fallback)

Classify each file:

- **New** — not in manifest
- **Modified** — in manifest but file is newer than `ingested_at`
- **Unchanged** — already ingested and unchanged

Report a concise delta summary before deep parsing.
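The survey-and-delta step might look like the sketch below. The flat `{path: {"ingested_at": ...}}` manifest shape and epoch-seconds timestamps are assumptions, not the documented `.manifest.json` schema, so adapt the comparison accordingly:

```python
import json
from pathlib import Path

def classify_sources(history_path: str, manifest_path: str) -> dict:
    """Classify source files as new/modified/unchanged against the manifest.

    Assumes a flat manifest of {file_path: {"ingested_at": epoch_seconds}};
    adjust to the real .manifest.json schema.
    """
    root = Path(history_path).expanduser()
    try:
        manifest = json.loads(Path(manifest_path).read_text())
    except FileNotFoundError:
        manifest = {}  # nothing ingested yet: everything is new

    candidates = [root / "session_index.jsonl"]
    candidates += sorted(root.glob("sessions/**/rollout-*.jsonl"))

    delta = {"new": [], "modified": [], "unchanged": []}
    for path in candidates:
        if not path.exists():
            continue
        entry = manifest.get(str(path))
        if entry is None:
            delta["new"].append(str(path))
        elif path.stat().st_mtime > entry.get("ingested_at", 0):
            delta["modified"].append(str(path))
        else:
            delta["unchanged"].append(str(path))
    return delta
```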
## Step 2: Parse Session Index First

`session_index.jsonl` typically has entries like:

```json
{ "id": "...", "thread_name": "...", "updated_at": "..." }
```

Use it to:

- Build a canonical session inventory
- Prioritize recent/high-signal sessions
- Map rollout IDs to human-readable thread names
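A minimal inventory builder over `session_index.jsonl`, tolerating blank or corrupt lines and sorting newest first (this relies on `updated_at` being an ISO-8601 string, which sorts lexicographically):

```python
import json
from pathlib import Path

def load_session_inventory(index_path: str, limit: int = 50) -> list:
    """Build a session inventory from session_index.jsonl, newest first."""
    entries = []
    for line in Path(index_path).expanduser().read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate truncated or corrupt lines
        entries.append({
            "id": rec.get("id"),
            "thread_name": rec.get("thread_name") or "(unnamed)",
            "updated_at": rec.get("updated_at") or "",
        })
    # ISO-8601 timestamps sort correctly as plain strings.
    entries.sort(key=lambda e: e["updated_at"], reverse=True)
    return entries[:limit]
```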
## Step 3: Parse Rollout JSONL Safely

Each `rollout-*.jsonl` line is an event envelope with:

```json
{
  "timestamp": "...",
  "type": "session_meta|turn_context|event_msg|response_item",
  "payload": { ... }
}
```
### Extraction rules

- Prioritize user intent and assistant-visible outputs
- Favor `response_item` records with user/assistant message content
- Use `event_msg` selectively for meaningful milestones; ignore pure telemetry
- Treat `session_meta` as metadata (cwd, model, ids), not user knowledge
### Skip/noise filters

- Token accounting events
- Tool plumbing with no semantic content
- Raw command output unless it contains reusable decisions/patterns
- Repeated plan snapshots unless they add novel decisions
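The extraction and noise rules above can be sketched as a filter over envelope lines. The telemetry marker strings here are hypothetical placeholders for whatever token-accounting labels actually appear in your rollouts:

```python
import json

KEEP_TYPES = {"response_item", "event_msg"}
# Hypothetical telemetry labels; replace with the ones seen in real logs.
NOISE_MARKERS = ("token_count", "token_usage")

def extract_signal(rollout_lines):
    """Yield high-signal events from rollout JSONL lines, dropping telemetry."""
    for line in rollout_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip corrupt or truncated lines
        etype = event.get("type")
        payload = event.get("payload", {})
        if etype not in KEEP_TYPES:
            # session_meta / turn_context are handled separately as metadata.
            continue
        if etype == "event_msg" and any(m in json.dumps(payload) for m in NOISE_MARKERS):
            continue  # token accounting and similar plumbing
        yield {"timestamp": event.get("timestamp"), "type": etype, "payload": payload}
```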
### Critical privacy filter

Rollout logs can include injected instructions, tool payloads, and sensitive text. Do not ingest verbatim system/developer prompts or secrets.

- Remove API keys, tokens, passwords, credentials
- Redact private identifiers unless relevant and approved
- Summarize instead of quoting raw transcripts
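A minimal redaction pass, assuming a few common credential shapes; the patterns are illustrative, not exhaustive, and should be tuned to the secrets that actually show up in your transcripts:

```python
import re

# Illustrative patterns only; extend for your environment.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),              # OpenAI-style API keys
    re.compile(r"ghp_[A-Za-z0-9]{36}"),              # GitHub personal access tokens
    re.compile(r"(?i)(password|token|secret)\s*[:=]\s*\S+"),  # KEY: value leaks
]

def redact(text: str) -> str:
    """Replace likely credentials with a placeholder before distilling."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```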
## Step 4: Cluster by Topic

Do not create one wiki page per session.

- Group by stable topics across many sessions
- Split mixed sessions into separate themes
- Merge recurring concepts across dates/projects
- Use `cwd` from metadata to infer project scope
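The last rule can be sketched as below; using the final path component of `cwd` as the project key is a simplifying assumption, not a rule the skill itself states:

```python
from collections import defaultdict
from pathlib import PurePosixPath

def group_by_project(sessions):
    """Group session summaries by inferred project (last component of cwd)."""
    clusters = defaultdict(list)
    for sess in sessions:
        cwd = sess.get("cwd") or ""
        # Empty or missing cwd falls into an "unscoped" bucket.
        project = PurePosixPath(cwd).name or "unscoped"
        clusters[project].append(sess)
    return dict(clusters)
```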
## Step 5: Distill into Wiki Pages

Route extracted knowledge using existing wiki conventions:

- Project-specific architecture/process -> `projects//...`
- General concepts -> `concepts/`
- Recurring techniques/debug playbooks -> `skills/`
- Tools/services -> `entities/`
- Cross-session patterns -> `synthesis/`

For each impacted project, create/update `projects//.md` (project name as filename, never `_project.md`).
### Writing rules

- Distill knowledge, not chronology
- Avoid "on date X we discussed..." unless date context is essential
- Add `summary:` frontmatter on each new/updated page (1-2 sentences, <= 200 chars)
- Add provenance markers:
  - `^[extracted]` when directly grounded in explicit session content
  - `^[inferred]` when synthesizing patterns across events/sessions
  - `^[ambiguous]` when sessions conflict
- Add/update `provenance:` frontmatter mix for each changed page
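A page touched by this step might carry frontmatter like the following; recording the provenance mix as per-marker counts is one possible reading of the rule above, not a format the skill prescribes:

```yaml
---
summary: Debug playbook for flaky integration tests, distilled from Codex sessions.
provenance:
  extracted: 6   # claims grounded directly in session content
  inferred: 2    # patterns synthesized across sessions
  ambiguous: 1   # points where sessions conflicted
---
```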
## Step 6: Update Manifest, Log, and Index

### Update `.manifest.json`

For each processed source file, record:

- `ingested_at`, `size_bytes`, `modified_at`
- `source_type`: `codex_rollout` | `codex_index` | `codex_history`
- `project` — inferred project name (when applicable)
- `pages_created`, `pages_updated`

Add/update a top-level project/session summary block:

```json
{
  "project-name": {
    "source_path": "~/.codex/sessions/...",
    "last_ingested": "TIMESTAMP",
    "sessions_ingested": 12,
    "sessions_total": 40,
    "index_updated_at": "TIMESTAMP"
  }
}
```

### Update special files

Update `index.md` and `log.md`:

```
- [TIMESTAMP] CODEX_HISTORY_INGEST sessions=N pages_updated=X pages_created=Y mode=append|full
```

## Privacy and Compliance

- Distill and synthesize; avoid raw transcript dumps
- Default to redaction for anything that looks sensitive
- Ask the user before storing personal/sensitive details
- Keep references to other people minimal and purpose-bound

## Reference

See `references/codex-data-format.md` for field-level parsing notes and extraction guidance.