Reflect Memory

A Lightweight Graph Memory System for AI Agents

Whitepaper · Published 2026-04-23

Session-initialization briefings, SQLite-native graph projection, and scenario-based behavioural validation. Cross-references prior work on GraphRAG, LightRAG, Graphiti, Mem0, MemGPT, and PKM tools.

Abstract

Most AI memory systems are structured around retrieval: given a query, return the most relevant memories. This is necessary but insufficient. A model that connects to a memory system without prior context does not know what topics exist, which conversational threads are open, or what conventions the user has established. It writes duplicates, fragments threads, and applies inconsistent tags — failure modes that no retrieval API can prevent because they happen before the model thinks to retrieve.

This paper presents Reflect Memory, an AI memory system organised around a different principle: hand the model a structured map of the corpus at session start, then let it navigate using that map. The system has three load-bearing components — (1) a Model Context Protocol (MCP) [1] initialization briefing containing identity, named topic clusters, open threads, and behavioural guidance; (2) a SQLite-native graph projection that exposes implicit edges (threading, tag co-occurrence, content references) without requiring an external graph database; and (3) Louvain-detected [2] topic clusters with LLM-generated names that act as the briefing's table of contents.

We validate the system with a scenario-based evaluation harness combining deterministic hard assertions and an LLM-as-judge rubric, run with N=3 stochastic repetitions across ten task scenarios. Behavioural correctness improves from a 60% baseline (with a flat tag-list briefing) to 96.7% (with the full graph layer), without changes to retrieval quality or model parameters — the gain comes entirely from changing what context the model receives on connect.


1. Introduction

1.1 The orientation problem

The substantial body of work on AI memory systems — surveyed in §2 — concentrates on retrieval quality: precision and recall metrics on benchmarks like LongMemEval [3]. This focus assumes the model already knows when and how to retrieve. In practice, observation of production memory systems suggests this assumption fails routinely. Models default to writing whatever the prompt asks for, without first checking what is already recorded. The dominant failure modes observed are:

  • Duplicate creation. A model receives a prompt that matches an existing memory closely or exactly. Without explicit context indicating prior coverage, the model writes a fresh memory, fragmenting the corpus.
  • Thread fragmentation. A user makes a follow-up comment on a topic that is already a parent memory with prior replies. Without visibility of open threads, the model creates a new top-level memory instead of replying to the existing thread.
  • Tag inconsistency. A user has established conventions (priority labels, area tags, status markers). A model lacking awareness of these conventions invents new tags on each write, eroding the corpus's clusterability over time.

These are orientation failures, not retrieval failures. A retrieval-only architecture cannot prevent them because it requires the model to know to retrieve — and that knowledge is exactly what is missing on first contact.

1.2 Contribution

This work argues that the missing primitive is a session-initialization briefing: a structured snapshot of the user's memory state delivered to the model on connect, before any prompt is processed. We demonstrate that:

  1. The MCP initialize handshake provides a natural protocol-level surface for delivering such a briefing without requiring changes to clients.
  2. A briefing constructed from existing memory metadata (identity, threading structure, tag co-occurrence, recency) — augmented with Louvain-clustered named topics — fits comfortably within typical context budgets (~5-7 KB / ~1,500 tokens for a corpus of several hundred memories).
  3. Scenario-based behavioural evaluation, combining hard assertions and LLM-as-judge rubric grading, can measure orientation quality directly and drive iterative improvement of briefing content.

The contribution is architectural rather than algorithmic: every component is constructible from prior art, but the combination — and specifically the use of initialize.instructions as a structured map surface — is, to the authors' knowledge, novel in shipped agent-memory systems.


2. Related work

We position Reflect Memory against six categories of relevant systems.

2.1 Graph RAG (document → graph → answer)

Microsoft GraphRAG [4], LightRAG [5], nano-graphrag [6], and Cognee occupy this category. They batch-index static document corpora, extract entity-relation triplets via LLM, build hierarchical communities (typically via Leiden), and answer queries via subgraph traversal plus LLM synthesis. The architectural assumption is a static corpus indexed once and queried many times. Reported per-query token costs are substantial — GraphRAG's published cost is approximately 610,000 tokens per query at retrieval time [5].

Reflect Memory is not in this category. Memories are written incrementally, often singly, and we do not perform LLM-based entity extraction at write time.

2.2 Agent memory layers

Mem0 [7], Letta (formerly MemGPT) [8], Hindsight, and OMEGA provide persistent memory for AI agents across sessions. Most expose add/search APIs and rely on the model to invoke them as needed. Mem0 additionally performs LLM extraction on every conversation turn to identify candidate memories.

Reflect Memory shares this category most closely. The architectural difference is the briefing surface — none of the systems surveyed deliver a structured map of the corpus to the model on connect.

2.3 Temporal context graphs

Graphiti / Zep [9], Memento, Hydra DB, and WorldDB [10] explicitly model the temporal dimension of facts: every edge carries valid_at (when a fact became true) and invalid_at (when it stopped being true), plus separate transaction-time fields. This enables queries like "what did the system know about X on date Y" without losing history when facts change.

Reflect Memory does not yet implement bi-temporal edges. We discuss this as the highest-priority future work in §9.

2.4 Personal Knowledge Management (PKM)

Obsidian, Logseq, Tana, and Roam Research [11] provide bidirectional [[wiki-style]] linking with automatic backlinks. Their assumption is that humans curate the link graph manually.

Reflect Memory is not a PKM tool — its memories are written predominantly by AI agents through the MCP interface, not by humans typing in a UI. Link discovery must therefore be automatic. The system implements this via three mechanisms (parent-child threading, content-based UUID scanning, and a legacy tag-reference convention), all surfaced through a single backlink query (§3.2).

2.5 SQLite-native graph implementations

Several open-source libraries implement graph storage and traversal directly on SQLite, including simple-graph [12], sqlite-graph [13] (notable for bi-temporal edges and recursive-CTE traversal), and sqlite-knowledge-graph [14] (a Rust library implementing PageRank, Louvain, and other graph algorithms over SQLite). These demonstrate that graph operations at small-to-medium scale do not require a dedicated graph database.

Reflect Memory follows this approach. Graph traversal uses SQLite recursive Common Table Expressions over an existing parent_memory_id column. Tag co-occurrence uses self-joins over json_each on the tags column. Clustering runs in-process. No second data store is introduced.

2.6 Cutting-edge research

WorldDB [10] introduces three notable ideas: nested "world" nodes (recursive sub-graphs), content-addressed immutability (Merkle-tree audit trail), and edges as programs — every edge type ships handlers (on_insert, on_delete, on_query_rewrite) that encode behaviour at the schema level. Reported results on LongMemEval-s reach 96.4%, surpassing the previous Hydra DB baseline.

Reflect Memory has independently implemented the third idea (typed cascade handlers — see §3.5) but did not adopt nested worlds or content-addressed immutability, which we judged premature for our scale.


3. System architecture

The system has six components, described below in approximately one paragraph each.

3.1 The initialization briefing

When a Model Context Protocol [1] client invokes the initialize handshake, the server populates the response's instructions field with a markdown-formatted snapshot of the connecting user's memory state. The briefing contains: user identity and team membership; total memory counts (personal, team-shared, team pool); auto-detected tagging conventions; a "before you write" guidance block; a topic map of named clusters with member tag lists; flat tag indexes for backward compatibility; recent-activity tags; and the open threads with full UUIDs.

The briefing is sized to fit comfortably within typical context budgets — a corpus of 50-500 memories produces a briefing in the range of 5-7 KB or approximately 1,500 tokens. The briefing is also exposed as a regular MCP tool (get_memory_briefing) and an HTTP endpoint for mid-session refresh.

3.2 Graph projection over existing data

A small library of SQL-backed helper functions exposes the implicit graph already encoded in the memory schema, without introducing new tables:

  • getBacklinks(userId, memoryId) — returns memories that reference the target. Three sources are unioned: child memories via parent_memory_id; memories whose content text contains the target's UUID verbatim; and memories carrying a legacy ref_<8-char-id> tag pattern.
  • getGraphAround(userId, memoryId, opts) — returns the local subgraph: parent, children, siblings (other children of the same parent), top-K tag-similar memories (memories sharing at least K tags), and bidirectional content references.
  • getTagCooccurrence(opts) — emits unordered tag pairs with co-occurrence counts, scoped to either a single user's pool or a team-shared pool. This drives the clustering step (§3.3).

All queries respect a visibility model that scopes results to memories the calling user owns or that are shared with the user's team. Queries are implemented as SQLite recursive CTEs over parent_memory_id and json_each self-joins over the tags column, producing results in milliseconds at the scales tested (hundreds to low-thousands of memories per user).
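The tag co-occurrence projection can be sketched as a single json_each self-join. The following is a hypothetical sketch only: the `memories` table name is assumed, and the shipped query additionally applies the team visibility model described above.

```typescript
// Hypothetical sketch of the SQL behind getTagCooccurrence (§3.2).
// Assumes a `memories` table with a JSON-array `tags` column; the real
// query also scopes results by the visibility model.
function tagCooccurrenceSql(): string {
  return `
    SELECT t1.value AS tag_a, t2.value AS tag_b, COUNT(*) AS weight
    FROM memories m, json_each(m.tags) t1, json_each(m.tags) t2
    WHERE m.user_id = ?
      AND t1.value < t2.value  -- unordered pairs, counted once per memory
    GROUP BY tag_a, tag_b
    ORDER BY weight DESC`;
}

console.log(tagCooccurrenceSql().includes("json_each")); // true
```

Each result row is one weighted edge in the tag graph that the Louvain pass in §3.3 consumes; the `t1.value < t2.value` filter is what makes the pairs unordered.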

3.3 Louvain tag clustering

The system applies the Louvain community detection algorithm [2] to the tag co-occurrence graph. Louvain maximises modularity — each detected cluster has more internal edges than would be expected by chance under a null model — and runs in approximately O(E log V) time. Implementations exist as small libraries with no native bindings (we use graphology and graphology-communities-louvain).

A deliberate simplification: the system clusters tags, not memories. A memory's cluster is then derived from the cluster of its tags. This keeps the algorithm fast (typically 50-200 nodes regardless of corpus size) and means that adding a single new memory rarely changes the cluster structure, which simplifies caching (§3.4). Trivial clusters of fewer than three tags are dropped; remaining clusters are sorted by combined size and intra-cluster edge weight.
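Deriving a memory's cluster from its tags' clusters can be illustrated with a small pure function. This is an illustrative sketch, not the shipped code: the tag-to-cluster map would come from the Louvain pass (graphology), and here ties simply fall to the first cluster encountered.

```typescript
// Sketch: a memory's cluster is the majority cluster of its tags (§3.3).
// The tagClusters map is hard-coded sample data standing in for Louvain
// output; tie-breaking and weighting are simplified for illustration.
type ClusterId = number;

function memoryCluster(
  memoryTags: string[],
  tagClusters: Map<string, ClusterId>,
): ClusterId | null {
  const votes = new Map<ClusterId, number>();
  for (const tag of memoryTags) {
    const c = tagClusters.get(tag);
    if (c === undefined) continue; // unclustered tag casts no vote
    votes.set(c, (votes.get(c) ?? 0) + 1);
  }
  let best: ClusterId | null = null;
  let bestCount = 0;
  for (const [c, n] of votes) {
    if (n > bestCount) { best = c; bestCount = n; }
  }
  return best; // majority cluster, or null if no tag is clustered
}

// Sample data: "auth" and "api" sit in cluster 0, "dashboard" in cluster 1.
const clusters = new Map([["auth", 0], ["api", 0], ["dashboard", 1]]);
console.log(memoryCluster(["auth", "api", "dashboard"], clusters)); // 0
```

Because only tags are clustered, adding one memory changes this derivation without changing the underlying cluster structure — which is exactly what makes the 24-hour name cache in §3.4 viable.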

3.4 LLM-named topic clusters

Each non-trivial cluster receives a 1-3 word Title-Case name and a one-line description from a single LLM call (we use a small/fast model — Anthropic Haiku-class — for cost reasons). The result is cached in a dedicated table keyed on (user_id, scope, cluster_hash), where cluster_hash is the SHA-256 digest of the sorted member-tag list. The cache has a 24-hour TTL; entries that have drifted (cluster membership has changed) generate a new hash and force a fresh name.

Failure handling: if an LLM call fails (network error, rate limit, malformed output), the system falls back to a placeholder name constructed from the top three tags (e.g., auth/api/mcp). Critically, the fallback is not cached, so the next session retries naming naturally. This prevents a transient burst of failures from locking in poor-quality names for 24 hours.
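The cache key and the fallback name can both be sketched in a few lines. The exact key layout is an assumption; what matters is that the hash is stable under tag ordering and that the fallback is derived without an LLM call.

```typescript
import { createHash } from "node:crypto";

// Sketch of the cluster-name cache key and LLM-failure fallback (§3.4).
// cluster_hash is the SHA-256 of the sorted member-tag list, so renaming
// only triggers when membership actually drifts.
function clusterHash(memberTags: string[]): string {
  const sorted = [...memberTags].sort();
  return createHash("sha256").update(sorted.join("\n")).digest("hex");
}

function fallbackName(memberTags: string[]): string {
  // Placeholder used when the naming call fails; deliberately NOT cached,
  // so the next session retries naming naturally.
  return memberTags.slice(0, 3).join("/");
}

console.log(clusterHash(["api", "auth"]) === clusterHash(["auth", "api"])); // true
console.log(fallbackName(["auth", "api", "mcp", "jwt"])); // auth/api/mcp
```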

Total cost is approximately one Haiku-class LLM call per cluster per 24 hours, amortised across all sessions for the user. For a typical user with 5-10 clusters, the steady-state cost is on the order of cents per month.

3.5 Typed cascade handlers (edges as programs)

The system follows WorldDB's "edges as programs" pattern [10]: behaviour for typed edges is encoded as handler functions invoked when an edge is inserted, removed, or queried.

Three edge types are currently materialised this way:

  • SHARED_WITH_TEAM — when set on a parent memory, a cascadeShare handler propagates the team scope to all children regardless of author. When cleared, cascadeUnshare performs the inverse. The whole conversational thread's visibility flips together.
  • PARENT_OF — soft-delete of a parent triggers cascadeSoftDelete scoped to the caller's own children only (preserving teammates' replies). Hard-delete triggers cascadeHardDelete, which orphans teammates' children (sets parent_memory_id = NULL) so the foreign-key constraint on the parent's removal is satisfied without destroying others' work.
  • AUTHORED_BY — implicit in the user_id column; used by visibility filters across the system.

The handlers are currently implemented as plain functions rather than as registered methods on an "edge type" abstraction. Formalising the abstraction would enable adding edge types like SUPERSEDES, CONTRADICTS, and SAME_AS (proposed in [10]) at lower marginal cost. We discuss this as future work in §9.
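A minimal in-memory sketch of the cascade pattern, assuming a flat store and one level of threading; the store shape and handler signature are invented for illustration and are not the shipped code.

```typescript
// Sketch of "edges as programs" (§3.5): behaviour for a typed edge is a
// handler run when the edge changes. Illustrative shapes only.
interface Memory {
  id: string;
  parentId: string | null;
  authorId: string;
  scope: "personal" | "team";
}

type EdgeHandler = (store: Memory[], targetId: string) => void;

const onInsert: Record<string, EdgeHandler> = {
  // SHARED_WITH_TEAM set on a parent: propagate team scope to the parent
  // and all its children, regardless of who authored them.
  SHARED_WITH_TEAM: (store, parentId) => {
    for (const m of store) {
      if (m.id === parentId || m.parentId === parentId) m.scope = "team";
    }
  },
};

const store: Memory[] = [
  { id: "p1", parentId: null, authorId: "alice", scope: "personal" },
  { id: "c1", parentId: "p1", authorId: "bob", scope: "personal" },
  { id: "x1", parentId: null, authorId: "alice", scope: "personal" },
];

onInsert.SHARED_WITH_TEAM(store, "p1");
console.log(store.map((m) => m.scope).join(",")); // team,team,personal
```

Note that Bob's reply `c1` flips scope along with Alice's parent — the whole thread's visibility moves together, as the bullet above describes.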

3.6 Graph-aware tools and visualisation

Two new MCP tools expose the graph layer to LLMs: get_graph_around(memory_id) returns the local subgraph in one call (avoiding manual chaining of read-thread / search / browse tools), and get_topic_cluster(tag, limit) returns recent memories from any cluster identified by one of its tags. Existing tool descriptions for write_memory and update_memory were also tightened to direct the model toward checking the briefing's open threads and topic clusters before writing, and to discourage update_memory operations on memories authored by other users (which would destroy the original author's text).

A web-based visualiser presents the same graph data in a force-directed layout, with nodes coloured by topic cluster (using the same names the LLM sees in its briefing). The visualiser serves two purposes: it lets human users inspect the corpus structure that AI agents are seeing, and it surfaces any divergence between the model's interpretation and the user's mental model.


4. The initialization briefing as architectural primitive

We claim the briefing is the system's central architectural contribution; this section examines it more closely.

4.1 What the briefing carries

A representative briefing contains five informational sections in addition to a header:

  1. Identity and totals. The connecting user's account, team membership, and counts of personal vs. team-shared memories. Establishes the access scope the model is operating in.
  2. Auto-detected conventions. Patterns inferred from the user's actual tag distribution — for example, the presence of multiple priority tags (p0, p1, p2, p3) combined with an eng tag triggers the convention "engineering tickets are tagged with eng plus a priority plus an area." These conventions are not hand-curated; they are derived from observed tag co-occurrence patterns.
  3. Behavioural guidance. Three explicit rules: read before writing when the prompt overlaps with existing topics; link explicitly when extending or contradicting prior memories; match the existing tag vocabulary when writing fresh.
  4. Topic map. The Louvain-detected named clusters (§3.3-3.4), each with its member tags, member count, and a one-line description. This is the table of contents.
  5. Open threads. Parent memories with at least one reply, surfaced with their full UUIDs (not truncated identifiers), grouped by visibility scope.
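Assembling those five sections into the instructions markdown could look like the following sketch. The markup, field names, and ordering are assumptions — the paper specifies the briefing's content, not its exact layout.

```typescript
// Hypothetical renderer for the briefing described in §4.1.
interface BriefingInput {
  user: string;
  team: string;
  totals: { personal: number; teamShared: number };
  conventions: string[];
  clusters: { name: string; tags: string[]; count: number }[];
  openThreads: { id: string; title: string }[]; // full UUIDs, not prefixes
}

function renderBriefing(b: BriefingInput): string {
  return [
    `# Memory briefing for ${b.user} (team: ${b.team})`,
    `Totals: ${b.totals.personal} personal, ${b.totals.teamShared} team-shared`,
    `## Conventions`,
    ...b.conventions.map((c) => `- ${c}`),
    `## Topic map`,
    ...b.clusters.map((c) => `- ${c.name} (${c.count}): ${c.tags.join(", ")}`),
    `## Open threads`,
    ...b.openThreads.map((t) => `- ${t.title} (${t.id})`),
  ].join("\n");
}

const text = renderBriefing({
  user: "demo",
  team: "core",
  totals: { personal: 3, teamShared: 1 },
  conventions: ["engineering tickets: eng + priority + area"],
  clusters: [{ name: "Auth & API", tags: ["auth", "api"], count: 12 }],
  // Placeholder UUID -- full, never truncated (see §4.2).
  openThreads: [{ id: "00000000-0000-0000-0000-000000000001", title: "Login bug" }],
});
console.log(text.includes("## Open threads")); // true
```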

4.2 Why each section matters

In ablation testing during development, removing or degrading individual sections produced characteristic failure modes:

| Section degraded | Resulting failure mode |
|---|---|
| Truncated memory IDs (8-char prefixes instead of full UUIDs) | Models attempted thread replies with the truncated IDs, received "not found" errors, then either abandoned the write or fell back to updating an unrelated memory |
| Conventions section moved to bottom of briefing | Models stopped applying the umbrella eng tag for engineering work, instead following narrower per-cluster vocabulary |
| Open threads scoped to caller's-own only | Team members could not discover threads colleagues had started, leading to fragmented parallel writes on the same topic |
| Topic map omitted | Models invented ad-hoc tags rather than reusing the user's established vocabulary |
| Behavioural guidance omitted | Models defaulted to writing without reading first, even on prompts that obviously overlapped existing memories |

These observations motivated the iterative changes documented in §6.

4.3 Why MCP initialize is the right surface

The Model Context Protocol [1] specifies an instructions field in the initialize handshake response, intended for the server to send free-form context to the client. The protocol does not constrain content. Most surveyed MCP servers leave it empty.

This surface has three desirable properties:

  • Once-per-session cost. The briefing is computed once at connect, not per query. Cluster naming hits the LLM only on cache miss; subsequent sessions are essentially free.
  • No client changes required. Any compliant MCP client receives the briefing automatically. We have observed it being processed correctly by Anthropic Claude Desktop, Cursor's MCP V2 client, and custom clients built on the official @modelcontextprotocol/sdk.
  • Architecturally distinct from tools. The briefing is context, not a capability. The model receives it without having to choose to invoke it. This matters for orientation specifically: if the model had to call a get_orientation tool, it would need to know to call it, which is the problem we are solving.
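Concretely, the briefing rides in the standard initialize result. A sketch of the response shape — field names follow the MCP specification, while the values here are placeholders:

```typescript
// Sketch of an MCP initialize result carrying the briefing (§4.3).
// protocolVersion/capabilities/serverInfo/instructions are spec fields;
// all values below are illustrative placeholders.
const initializeResult = {
  protocolVersion: "2025-03-26",
  capabilities: { tools: {} },
  serverInfo: { name: "reflect-memory", version: "1.0.0" },
  // Free-form server context, delivered once per session before any prompt:
  instructions: "# Memory briefing\n\n## Topic map\n- Auth & API (12 tags)",
};

console.log(typeof initializeResult.instructions); // string
```

Because `instructions` is part of the handshake itself, any compliant client forwards it to the model with no client-side changes — the property the bullet list above relies on.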

5. Methodology: scenario-based evaluation harness

The hardest part of building a memory system is not the storage layer; it is proving that LLMs interact with it correctly. Stochastic outputs, context drift, and the prompt-engineering surface area combine to make any single behavioural test misleading. We address this with a scenario-based harness.

5.1 Architecture

The harness has six layers:

  • Setup. Provisions dedicated test users on a dedicated test team, generates fresh API keys per run, and is idempotent.
  • Fixture corpus. A library of approximately 45 realistic memory templates organised into five categories (engineering tickets, architectural decisions, operational runbooks, session summaries, and orthogonal "noise") with realistic threading and tag distributions. Designed to produce identifiable Louvain clusters under the algorithm in §3.3.
  • Seeding. Wipes test-user state and re-applies the corpus before each scenario rep, guaranteeing consistent initial conditions.
  • Driver. Connects to the system's MCP server using the official SDK, captures the briefing from the initialize response, translates the available MCP tools into the LLM provider's tool format, and drives a multi-turn conversation. Records every event (system, user, assistant, tool use, tool result) as a structured transcript.
  • Scenarios. A set of ten task scenarios, each defining a prompt, deterministic hard assertions over the captured transcript (TypeScript predicates with access to a fixture-ref → memory-ID resolution map), and rubric questions for the judge.
  • Judge. An LLM-as-judge stage using a strong model (Anthropic Opus-class) with a strict-JSON output schema, returning per-question answers and a composite 0-10 score. Includes a malformed-JSON repair fallback for the rare format failures.

A runner orchestrates scenarios × repetitions, applies assertions, optionally invokes the judge, and appends each run's scoreboard to a persistent results log for trend tracking across iterations.

5.2 Scenario set

| # | Scenario | Tests for |
|---|---|---|
| 1 | Reply to existing thread | Model uses thread-reply primitive when prompt maps to open thread |
| 2 | Create new top-level memory | Model creates fresh memory when prompt is unrelated to open threads |
| 3 | Cluster recall | Navigation question is answered using search/list tools |
| 4 | Multi-author thread reply | Model can reply to a thread started by a different user |
| 5 | Tag-convention compliance | Model follows auto-detected conventions when filing new content |
| 6 | Avoid duplication | Near-duplicate prompt triggers read-before-write rather than fresh duplicate |
| 7 | Cross-reference | Prompt mentioning another memory produces explicit linkage |
| 8 | Supersession | Update to stale fact preserves prior memory and links to it |
| 9 | Briefing-only navigation | "What topics?" answered from briefing alone, zero tool calls |
| 10 | Cluster vocabulary reuse | Note matching a named cluster uses the cluster's existing tags |

Each scenario runs with N=3 stochastic repetitions to estimate behavioural stability.

5.3 Why hard assertions plus rubric

Hard assertions catch specification violations deterministically (e.g., "any write_child_memory call's parent_memory_id matches the seeded ID of the auth-bug-root fixture"). They are necessary because LLM-as-judge tends to be too forgiving on structural correctness — judges consistently rated scenarios well even when the model's actions destroyed information by overwriting a teammate's memory rather than replying to it.
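The assertion quoted above reduces to a deterministic predicate over the captured transcript. A sketch, with the event and fixture shapes invented for illustration:

```typescript
// Sketch of a hard assertion over a transcript (§5.3). Event shape and
// fixture-ref names are assumptions standing in for the harness's types.
interface ToolUseEvent {
  type: "tool_use";
  tool: string;
  args: Record<string, unknown>;
}
type Event = ToolUseEvent | { type: "user" | "assistant"; text: string };

function assertRepliesToFixture(
  transcript: Event[],
  fixtures: Map<string, string>, // fixture-ref -> seeded memory UUID
): boolean {
  const expected = fixtures.get("auth-bug-root");
  const calls = transcript.filter(
    (e): e is ToolUseEvent => e.type === "tool_use" && e.tool === "write_child_memory",
  );
  // Require at least one reply, and every reply must target the fixture.
  return calls.length > 0 && calls.every((e) => e.args["parent_memory_id"] === expected);
}

const fixtures = new Map([["auth-bug-root", "11111111-1111-1111-1111-111111111111"]]);
const transcript: Event[] = [
  { type: "user", text: "Add a note to the auth bug thread" },
  { type: "tool_use", tool: "write_child_memory",
    args: { parent_memory_id: "11111111-1111-1111-1111-111111111111" } },
];
console.log(assertRepliesToFixture(transcript, fixtures)); // true
```

The predicate either passes or fails with no judgment involved — which is precisely why it catches the overwrite-a-teammate's-memory failure that judges consistently excused.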

LLM-as-judge rubric grading catches qualitative properties that assertions cannot: naturalness of tone, whether the model qualitatively recognised the right cluster, whether the answer accurately summarised retrieved content. Judges are also forgiving of defensible-but-different behaviour, which is appropriate when measuring product quality rather than spec compliance.

The two together let the harness measure both specification compliance (hard assertions) and behavioural quality (rubric). When they diverge, the divergence itself is informative — it usually points either at an over-strict assertion that should be relaxed, or at a real product question about which behaviour is desired.


6. Results

The harness was used to drive seven iterations of refinement to the briefing and tool descriptions, with no changes to retrieval algorithms or model parameters. Results are summarised below; "5-scen" denotes the original scenario set, "10-scen" includes the additional five scenarios introduced at iteration 6.

| Iteration | Change | Hard-assertion pass | Rubric (0-10) |
|---|---|---|---|
| 0 (baseline) | Original flat-tag-list briefing | 60% (5-scen) | 8.76 |
| 1 | Full UUIDs in briefing's open-threads section | 60% | 7.79 |
| 2 | Strengthened convention text; explicit "do not overwrite teammate's memory" guidance | 73% | 8.84 |
| 3 | Team-shared threads visible to all team members | 100% (5-scen) | 8.88 |
| 4 | Graph layer: backlinks, getGraphAround, Louvain clusters, LLM-named topics, two new MCP tools | 86.7% | 8.93 |
| 5 | Conventions section promoted to top of briefing | 93.3% | 8.82 |
| 6 | "Before you write" guidance block; five additional scenarios | 96.7% (10-scen) | 8.47 |

Several observations are worth highlighting:

Each addition surfaces unintended consequences. Iteration 1's UUID fix slightly degraded the rubric (7.79) because the model's recovery loop on related tasks became less efficient; iteration 4's topic map pushed the conventions section down and degraded convention compliance (86.7%). The harness caught both regressions in the next iteration, where they were addressed by rebalancing the briefing layout. Without the harness, these regressions would likely have shipped to production unnoticed.

The single largest improvement was a four-line SQL change. Iteration 3 modified the briefing's "open threads" query to include team-shared threads regardless of author. The multi-author thread scenario went from 0% pass to 100% pass with the model needing only one tool call. This was not an algorithmic improvement — it was a visibility-model fix in a single function. The harness identified it as a high-priority intervention based on iteration 0's failure pattern.

Rubric scores remained high even when assertions failed. In several iterations the rubric stayed above 8.5 while the hard assertion pass rate moved substantially. This consistently indicated that the model's behaviour was defensible-but-different rather than incorrect. Resolution typically involved either relaxing an over-strict assertion (the cross-reference scenario originally required a write to a specific parent ID; the model legitimately wrote to a child of that parent, which the spec should accept) or refining the briefing guidance to push toward the desired behaviour.

Final iteration 7 added a graph endpoint and a force-directed visualiser consuming the same data structures. This did not change the LLM's behaviour (no briefing or tool changes) but provides a human-facing window onto the same map the model sees.


7. Design choices and trade-offs

This section makes explicit several decisions where alternative paths exist in the literature.

7.1 SQLite-native, no external graph database

Several systems in §2 — Graphiti [9], Mem0 in its paid tier [7], WorldDB [10] — require a dedicated graph database (Neo4j, FalkorDB, Memgraph). This adds operational complexity for the host and for any self-hosting customer.

We use SQLite recursive CTEs for traversal, json_each for tag projection, and an in-process Louvain implementation for clustering. None of this requires a second data store. The trade-off is that traversal performance does not scale arbitrarily — at corpus sizes well above current targets (10,000+ memories per user), a materialised edge table or genuine graph database would become attractive.
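Thread traversal over parent_memory_id reduces to one recursive CTE. A sketch of what such a query might look like; column names beyond parent_memory_id are assumptions:

```typescript
// Hypothetical recursive-CTE traversal of a thread (§7.1): start from a
// root memory and walk down through parent_memory_id links.
const threadDescendantsSql = `
  WITH RECURSIVE thread(id) AS (
    SELECT id FROM memories WHERE id = ?          -- the thread root
    UNION ALL
    SELECT m.id
    FROM memories m
    JOIN thread t ON m.parent_memory_id = t.id    -- walk down the replies
  )
  SELECT id FROM thread`;

console.log(threadDescendantsSql.includes("WITH RECURSIVE")); // true
```

At hundreds to low-thousands of memories per user this runs in milliseconds, which is the empirical basis for deferring a materialised edge table.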

7.2 No LLM extraction at write time

GraphRAG-family systems [4,5,6] and Mem0 [7] perform LLM-based entity and relation extraction on every memory write. The cost is substantial: latency increases from milliseconds to seconds per write, and the extracted relations are baked permanently into the graph (extraction errors persist).

We extract zero relations at write time. The user's existing tags and threading structure already encode most of the relations an extractor would discover; we expose those relations via the graph projection layer (§3.2) instead of re-extracting them. The single LLM cost is cluster naming, which is once-per-cluster (not once-per-write) and cached for 24 hours. Steady-state cost is approximately two orders of magnitude lower.

7.3 No [[wiki-style]] linking syntax

PKM tools [11] expect humans to type explicit references. In an AI-driven memory system most memories are written by agents through APIs. Asking AI agents to type a manual reference syntax would be brittle and would require teaching the syntax to every connecting tool.

The system instead discovers links automatically via three mechanisms (parent-child threading, content-based UUID scanning, legacy tag-reference convention), all surfaced through a single backlink query.

7.4 Graph is complementary to vector search, not a substitute

We have not implemented vector embedding search. The two retrieval modes serve different purposes: vector finds semantically similar memories ("memories like this one"); graph finds structurally related memories ("memories in the same thread", "memories tagged with the same cluster", "memories that reference this one"). Neither subsumes the other. The decision to defer vector implementation is based on harness-measured failure modes — current bottleneck is navigation, not retrieval recall — rather than a principled claim that vector is unnecessary. We expect to add vector search when usage patterns demonstrate it as a bottleneck.

7.5 No bi-temporal edges (yet)

Graphiti's bi-temporal model [9] represents the highest-ROI deferred feature. When facts change today, the system either updates the existing memory (preserving history through a version table) or writes a new memory referencing the prior one. Both work but neither supports queries of the form "what did the system know about X at time Y" — a query that bi-temporality enables natively. Adding bi-temporality requires a memory_edges table with the four temporal columns Graphiti uses and supersession-cascade handlers analogous to those in §3.5.
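A possible memory_edges schema following Graphiti's four temporal columns, as a sketch of this deferred work; all names and types here are assumptions, not a shipped migration.

```typescript
// Hypothetical DDL for the deferred bi-temporal edge table (§7.5).
const memoryEdgesDdl = `
  CREATE TABLE memory_edges (
    src_id     TEXT NOT NULL,
    dst_id     TEXT NOT NULL,
    edge_type  TEXT NOT NULL,   -- e.g. SUPERSEDES, CONTRADICTS, SAME_AS
    valid_at   TEXT,            -- valid time: when the fact became true
    invalid_at TEXT,            -- valid time: when it stopped being true
    created_at TEXT NOT NULL,   -- transaction time: when the edge was recorded
    expired_at TEXT             -- transaction time: when the edge was retracted
  )`;

console.log(memoryEdgesDdl.includes("invalid_at")); // true
```

With both axes present, "what did the system know about X at time Y" becomes a range filter on the transaction-time columns rather than a reconstruction from version history.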


8. Limitations

Five limitations are known and worth documenting.

Per-tag clustering, not per-memory. A memory whose tags fall into multiple clusters is implicitly assigned to the cluster of one of its tags; for example, a memory tagged for engineering work on a dashboard may surface only under the engineering cluster, with its relevance to a "Dashboard" cluster invisible. Per-memory cluster assignment using soft membership would address this but adds significant complexity.

Cluster-name stochastic variance. The same cluster can receive slightly different names across LLM cache misses (e.g., "Engineering Issues" vs. "Critical Engineering Issues"). The cache pins names for 24 hours so users see consistent names within a session, but cross-day naming drift is possible. Internal references use the stable cluster hash; the human-facing name is for reading only.

Avoid-duplication scenario remains the lowest-passing. Even with explicit "before you write" briefing guidance, the model occasionally writes a fresh memory without first checking. The remaining failure mode is when the prompt presents as a clean factual capture; the model takes it at face value. Stronger nudges in the write_memory tool description are a candidate intervention.

Briefing size scales with cluster count. A user with hundreds of memories and dozens of clusters could push the briefing past 8 KB / 2,000 tokens, at which point context budget becomes a constraint. The system has a soft cap but no aggressive truncation strategy. This will need attention as corpora scale.

Force-directed visualisation degrades badly on small screens. The force-directed layout used by the dashboard visualiser (react-force-graph-2d) is unusable on mobile. The current version detects small viewports and presents a graceful "view on desktop" message; a mobile-friendly graph view (perhaps cluster-and-list rather than node-link) is future work.


9. Future work

In approximate priority order:

  1. Bi-temporal edges (Graphiti pattern). Add a memory_edges table with valid_at / invalid_at / created_at / expired_at columns. Enables history-aware queries and explicit SUPERSEDES / CONTRADICTS / SAME_AS edge types.
  2. Cross-author tag clustering on the team scope. Currently personal and team clusters are computed independently. A memory shared with the team should participate in a single team-wide cluster vocabulary.
  3. Vector embeddings for content-based similarity edges. Complement to existing tag-based similarity. Enables semantic search and cross-vocabulary clustering.
  4. Backlinks panel in the human-facing UI. The data already exists in the getBacklinks function; only the rendering is missing.
  5. Formalised edge-type abstraction. Convert the existing cascade handlers into registered methods on edge type definitions, enabling new edge types to be added at lower marginal cost.

10. Conclusion

We have presented Reflect Memory, an AI memory system organised around the principle that models need a map of the corpus on first contact — not just a tool for retrieving individual memories. The core architectural commitment is the use of the Model Context Protocol's initialize handshake to deliver a structured briefing containing user identity, named topic clusters, open threads, conventions, and behavioural guidance.

The system uses Louvain community detection over tag co-occurrence to extract topic structure from existing user-curated tags, with LLM-generated cluster names cached for cost efficiency. Graph operations use SQLite recursive CTEs and json_each projections, avoiding the operational complexity of an external graph database. All these choices follow patterns established in prior art [2,5,12,13]; the contribution is the combination, and specifically the briefing surface.

Behavioural validation uses a scenario-based harness combining hard assertions and LLM-as-judge rubric grading. Across seven iterations, hard-assertion correctness improved from 60% to 96.7%, with each iteration's improvements driven by harness-surfaced failure modes rather than ad-hoc redesign.

The work suggests that, for retrieval-augmented systems whose dominant failure modes are orientation rather than recall, investment in session-initialization context delivers larger returns than further refinement of retrieval algorithms.


References

[1] Anthropic. Model Context Protocol Specification. 2024-2026. https://modelcontextprotocol.io

[2] Blondel, V. D., Guillaume, J.-L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), P10008. https://doi.org/10.1088/1742-5468/2008/10/P10008

[3] Wu, D., Wang, H., Yu, W., et al. (2025). LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. International Conference on Learning Representations (ICLR).

[4] Edge, D., Trinh, H., Cheng, N., et al. (2024). From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv:2404.16130. https://microsoft.github.io/graphrag/

[5] Guo, Z., Xia, L., Yu, Y., Ao, T., & Huang, C. (2025). LightRAG: Simple and Fast Retrieval-Augmented Generation. Empirical Methods in Natural Language Processing (EMNLP). https://github.com/HKUDS/LightRAG

[6] Gusye et al. nano-graphrag: A simple, easy-to-hack GraphRAG implementation. Software repository. https://github.com/gusye1234/nano-graphrag

[7] Khanna, R., Mukherjee, A., Singh, A., et al. (2025). Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv:2504.19413. https://github.com/mem0ai/mem0

[8] Packer, C., Wooders, S., Lin, K., et al. (2023). MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560. https://github.com/letta-ai/letta

[9] Rasmussen, P., Paliychuk, P., Beauvais, T., Ryan, J., & Chalef, D. (2025). Zep: A Temporal Knowledge Graph Architecture for Agent Memory. arXiv:2501.13956. https://github.com/getzep/graphiti

[10] Anonymous. (2026). WorldDB: Edge-Programmable Memory with Recursive Worlds and Content-Addressed Immutability. arXiv:2604.18478.

[11] Roam Research, Obsidian, Logseq, Tana. Personal Knowledge Management tools with bidirectional linking. Various dates 2017-2024.

[12] Papathanasiou, D. simple-graph: A simple graph database in SQLite. Software repository. https://github.com/dpapathanasiou/simple-graph

[13] Sharma, R. sqlite-graph: Bi-temporal graph storage on SQLite with FTS5 integration. Software repository. https://github.com/rohansx/sqlite-graph

[14] Wong, H. sqlite-knowledge-graph: Rust library implementing PageRank, Louvain, BFS/DFS, and shortest-path algorithms over SQLite. Software repository. https://github.com/hiyenwong/sqlite-knowledge-graph


Reflect Memory is developed by the Reflect Memory team. For more information, visit https://reflectmemory.com.
