WHEN DOES GRAPH-STRUCTURED MEMORY HELP MULTI-SESSION LLM AGENTS? AN EMPIRICAL STUDY OF HYGRAM, A HYBRID GRAPH–VECTOR MEMORY ARCHITECTURE
Keywords:
LLM agents; long-term memory; knowledge graphs; retrieval-augmented generation; hybrid retrieval; temporal knowledge graphs; multi-session dialogue; hallucination mitigation; persistent contextual reasoning.
Abstract
Large language model (LLM) agents are fundamentally stateless: when a session terminates, the agent loses all episodic context, and subsequent interactions begin from a blank slate. The prevailing remedy stores prior dialogue in a vector database and retrieves semantically similar text chunks at query time. This flat retrieval paradigm discards the relational structure that binds facts together, provides no principled mechanism for distinguishing stale from currently valid information, and treats memory as a passive evidence store rather than an active, structured representation. A natural hypothesis is that representing memory as a knowledge graph and retrieving it through traversal will improve relationship- and time-dependent reasoning. This paper presents HyGRAM (Hybrid Graph Retrieval-Augmented Memory), which extracts timestamped subject–relation–object triples from each session into a temporally aware graph, retrieves by seeding with dense vector similarity and expanding through bounded multi-hop traversal, and consolidates the graph so that new evidence can invalidate prior beliefs. We test the hypothesis empirically against no-memory, flat vector, and graph-only baselines on the LoCoMo benchmark, using entirely free and open tooling and a commodity open model. The principal result is negative and instructive: in this regime the hybrid graph memory did not outperform flat vector retrieval (vector-only achieved the highest accuracy, and the gap persisted on the multi-hop questions that graph traversal was expected to favour), and explicit temporal consolidation produced no measurable change in accuracy or contradiction rate. The finding is robust across two extractor scales: replicating with a 7B model raised every condition's accuracy but widened the vector baseline's lead (26.0% versus HyGRAM's 8.0%) rather than closing it, and on the two adequately sampled question categories, multi-hop and temporal, vector retrieval led decisively.
We trace the outcome to extraction: converting dialogue into triples is lossy, and a more capable model exploits the verbatim text retained by a flat store more effectively than it exploits the graph. Extraction quality appears necessary but not sufficient for graph memory to pay off at this scale. The contributions are therefore a reproducible architecture and pipeline, and a controlled, honestly reported study that delimits when graph-structured agent memory is likely to help and identifies extraction as the first-order lever.
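
To make the retrieval and consolidation steps named in the abstract concrete, the following is a minimal illustrative sketch, not the HyGRAM implementation. Every name in it (Triple, GraphMemory, retrieve, valid_from, valid_to) is hypothetical. It also assumes two details the abstract does not specify: that consolidation closes an older triple when a newer one gives a different object for the same subject and relation, and that seeding compares the query against per-entity embeddings supplied by the caller.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, List, Optional
import math


@dataclass
class Triple:
    subject: str
    relation: str
    obj: str
    valid_from: int                  # session index at which the fact was asserted
    valid_to: Optional[int] = None   # set once a later fact supersedes this one


class GraphMemory:
    """Toy temporally aware triple store (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.triples: List[Triple] = []

    def add(self, new: Triple) -> None:
        # Assumed consolidation rule: a newer object for the same
        # (subject, relation) pair invalidates the older, still-open triple.
        for old in self.triples:
            if (old.valid_to is None and old.subject == new.subject
                    and old.relation == new.relation and old.obj != new.obj):
                old.valid_to = new.valid_from
        self.triples.append(new)

    def neighbours(self, entity: str) -> List[Triple]:
        # Only currently valid triples that touch the entity.
        return [t for t in self.triples
                if t.valid_to is None and entity in (t.subject, t.obj)]


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(memory: GraphMemory, entity_vecs: Dict[str, List[float]],
             query_vec: List[float], k: int = 1, max_hops: int = 2) -> List[Triple]:
    # Seed: the k entities whose (assumed) embeddings are closest to the query.
    seeds = sorted(entity_vecs,
                   key=lambda e: cosine(entity_vecs[e], query_vec),
                   reverse=True)[:k]
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    found: List[Triple] = []
    # Expand: breadth-first traversal, bounded to max_hops edges from a seed.
    while frontier:
        entity, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for t in memory.neighbours(entity):
            if t not in found:
                found.append(t)
            other = t.obj if t.subject == entity else t.subject
            if other not in seen:
                seen.add(other)
                frontier.append((other, depth + 1))
    return found
```

In this sketch the hop bound keeps the retrieved context small, and filtering on valid_to is what stops a superseded fact from reaching the prompt. A flat vector baseline would instead return the top-k raw text chunks with no expansion step and no validity filter.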












