From Remembering to Understanding: Designing a Personal Memory System for AI
Building a memory architecture that transforms raw browser activity into a structured, evolving model of the user.
AI assistants have memory now. ChatGPT remembers your preferences, Claude recalls past conversations, Gemini knows your Google account. But that memory is shallow -- it's limited to what you explicitly tell them in chat, or what a platform already has on file. They remember what you said, not what you did. They have no idea you spent three hours last Tuesday comparing cloud providers, that you've been quietly learning Rust every evening, or that you switched jobs two weeks ago -- unless you stop and narrate your own life to them. The context they can build is only as rich as the conversations you have with them.
We're building something different. Our AI agent lives in your browser -- the place where your digital life actually happens -- and it watches, remembers, and over time, understands who you are. Not because you filled out a profile or narrated your life in a chat window, but because it was there when you compared laptops on Amazon, debugged a Python script with Stack Overflow open in the next tab, booked a flight to Tokyo, and reviewed a pull request before your morning coffee. It builds a persistent, structured model of the person behind the browser -- your interests, your work, your goals, your relationships -- and it carries that understanding into every conversation.
The goal is to make the AI genuinely personal, not just personable. An assistant that knows you shouldn't ask what programming language you use. It should know you write Python at work, have been picking up Rust on the side, and switched from VS Code to Cursor two months ago -- because it was there for all of it.
But building this is hard. The core insight that shaped our work is that remembering is easy; understanding is hard. Logging what someone does in their browser is a solved problem. Turning that stream of clicks, page visits, and screenshots into a coherent model of a person is not. And keeping that model accurate over time, as the person changes and the evidence accumulates, adds yet another layer of difficulty.
This post walks through the architecture we built to bridge that gap: an observation pipeline that captures and structures browser activity (the "remembering" part), and an egocentric knowledge graph (a graph where every node's meaning is defined relative to the user at its center) we call the Memory Graph that synthesizes those observations into understanding. The rest of this post covers both halves -- and the hard parts that connect them.
The observation pipeline: remembering
The first problem is capture. You need to observe what the user does in their browser without being invasive, without drowning in noise, and without confusing raw DOM events with actual user intent. We approached this as a multi-stage pipeline, each stage refining the signal before passing it downstream.
Observing intent, not events
The foundation is a DOM observer that runs in every open tab. It hooks into clicks, inputs, scrolls, and mutations -- but raw DOM output is overwhelmingly noisy. Most events are meaningless for understanding a person.
The observer filters aggressively. Clicks only count if they land on actual interactive elements -- buttons, links, form fields -- verified by geometric and DOM containment checks. Interactions are identified through accessibility APIs, giving us semantic labels rather than brittle CSS selectors. For high-traffic sites like YouTube, Twitter, and ChatGPT, dedicated observers extract user intent that the generic observer can't reliably infer.
The key insight: raw DOM events are not user intent. The observer's job is to convert them into semantic interaction records with natural-language descriptions and screenshots. The downstream pipeline never sees "click at coordinates (412, 307)." It sees "clicked the 'Add to Cart' button on the product page."
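Concretely, the record that crosses this boundary can be sketched as a small structure. The names and fields below are illustrative, not our exact schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InteractionRecord:
    """A semantic interaction, as passed downstream -- never raw DOM events."""
    action: str          # e.g. "click", "input", "scroll"
    description: str     # natural-language label from accessibility metadata
    role: str            # accessibility role of the target, e.g. "button"
    url: str
    screenshot_id: Optional[str] = None  # reference to a captured screenshot

def describe(record: InteractionRecord) -> str:
    # The downstream pipeline consumes sentences like this,
    # not coordinates or CSS selectors.
    return f"{record.description} on {record.url}"

rec = InteractionRecord(
    action="click",
    description="clicked the 'Add to Cart' button on the product page",
    role="button",
    url="amazon.com/product/123",
)
```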
From events to summaries
Semantic events are better than raw DOM events, but they're still just a list of discrete actions -- "clicked a filter," "typed a search query," "scrolled down." A stream of what happened, with no sense of why. So we compress them. Events are batched per tab and fed -- along with screenshots -- to an LLM that maintains a running natural-language summary of what the user is doing. Each new batch merges into the previous summary, producing a living document per tab: not "the user clicked 7 buttons and scrolled 3 times," but "the user is comparing laptop prices on Amazon, has narrowed their search to the MacBook Air and ThinkPad X1, and added the MacBook Air to their cart."
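The merge loop can be sketched as follows, with the LLM call stubbed out as a plain function. The `summarize` parameter and the prompt shape are assumptions for illustration:

```python
def update_tab_summary(previous: str, events: list[str], summarize) -> str:
    """Merge a new batch of semantic events into a tab's running summary.

    `summarize` stands in for the LLM call; the real system also passes
    screenshots alongside the text. Each call folds the new evidence into
    the prior summary, keeping one living document per tab.
    """
    prompt = (
        f"Previous summary of this tab:\n{previous or '(none)'}\n\n"
        "New events:\n" + "\n".join(f"- {e}" for e in events) + "\n\n"
        "Rewrite the summary to incorporate the new events."
    )
    return summarize(prompt)
```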
Browsing intents
Tab summaries are useful, but they're still just descriptions of activity. The next stage organizes them into browsing intents -- high-level goals paired with the specific activities that serve them.
When a new tab summary arrives, an LLM decides how it relates to the existing intents. The decision is one of three actions:
| Action | Meaning |
|---|---|
| Merge | This tab is continuing an activity already tracked -- update its description with new evidence |
| Add activity | This tab is doing something new that serves an existing intent -- add it as a new activity |
| New intent | This represents a wholly new goal -- create a new intent |
This structure captures something that flat logs miss: the relationship between activities and goals. Visiting three different laptop review sites isn't three unrelated events -- it's one intent ("researching a new laptop") with three supporting activities.
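A minimal sketch of how those three decisions route a new tab summary, assuming the LLM's choice has already been parsed into an enum (all names here are hypothetical):

```python
from dataclasses import dataclass, field
from enum import Enum

class IntentAction(Enum):
    MERGE = "merge"                 # tab continues a tracked activity
    ADD_ACTIVITY = "add_activity"   # new activity under an existing intent
    NEW_INTENT = "new_intent"       # a wholly new goal

@dataclass
class Intent:
    goal: str                                   # e.g. "researching a new laptop"
    activities: list[str] = field(default_factory=list)

def apply_decision(intents: list[Intent], action: IntentAction,
                   target: int, summary: str) -> None:
    """Route a new tab summary into the intent list per the LLM's decision."""
    if action is IntentAction.MERGE:
        # Simplification: assume the matched activity is the most recent one.
        intents[target].activities[-1] = summary
    elif action is IntentAction.ADD_ACTIVITY:
        intents[target].activities.append(summary)
    else:
        intents.append(Intent(goal=summary, activities=[summary]))
```

Three laptop review sites end up as one intent with three activities, rather than three unrelated log entries.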
The intent classification LLM also serves as a personal information detector. As a side channel, it reports any clues that the page contains identity information about the signed-in user. When such clues are present and a screenshot is available, the system triggers a separate extraction step. This dual-purpose design avoids the cost of running a second model over every page just to check for personal information -- the intent classifier, which already reads the full page context, handles both jobs in one pass.
Extracting personal facts
This is where remembering starts to shade into understanding. When the browsing intents flag a page as likely containing personal information, a vision LLM examines the screenshot and page context to extract structured observations -- factual statements about who the user is, each with a category, confidence score, and provenance.
The extractor is designed to under-extract rather than over-extract. Only facts about you are extracted -- the system carefully distinguishes what belongs to you from what belongs to others on the page. Candidates below a confidence threshold are discarded. And when revisiting a site, the extractor reconciles new facts against existing ones for that host, doing a full replacement rather than accumulating potentially contradictory observations.
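The under-extraction and per-host replacement logic might look roughly like this. The `Observation` fields and the 0.7 threshold are illustrative, not our production values:

```python
from dataclasses import dataclass

@dataclass
class Observation:
    statement: str      # e.g. "works as a software engineer"
    category: str       # e.g. "employment"
    confidence: float   # extractor's confidence, 0..1
    host: str           # site the fact was observed on
    screenshot_id: str  # provenance link

CONFIDENCE_THRESHOLD = 0.7  # illustrative value

def reconcile(store: list[Observation], host: str,
              candidates: list[Observation]) -> list[Observation]:
    """Under-extract: drop low-confidence candidates, then fully replace
    this host's prior observations rather than accumulating contradictions."""
    kept = [c for c in candidates if c.confidence >= CONFIDENCE_THRESHOLD]
    return [o for o in store if o.host != host] + kept
```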
At this point in the pipeline, we have a ledger of structured, confidence-scored facts about the user, each traced back to a specific screenshot and page. This is remembering done well -- but it is still just remembering. A list of 200 observations, no matter how well-structured, is not understanding. To get there, we need to connect the dots.
The Memory Graph: understanding
The Memory Graph is our answer to a question that nagged at us throughout the development of the observation pipeline: what do all these facts mean, taken together?
A user who visits Python documentation, takes a Django course on Udemy, browses job listings at startups, and reads Paul Graham essays is telling a coherent story about themselves. But that story only emerges when you connect those facts through relationships. The observations tell you what happened. The graph tells you who someone is.
Why an egocentric graph?
A person is the sum of their relationships with the world -- the tools they use, the topics they care about, the people they know, the goals they pursue. Understanding someone means mapping those relationships, not just cataloging what they touched. We needed a structure where the same entity could mean different things depending on the user's relationship to it. That led us to a 4-layer egocentric graph -- a structure where every node's meaning is defined by its connection to the user at the center:
| Layer | Contents | Role |
|---|---|---|
| 0 | Self | Fixed hub node -- the user at the center |
| 1 | Themes | How the user relates to broad areas |
| 2 | Entities | Factual nodes: sites, topics, people, products, skills, etc. |
| 3 | Intents | Browsing goals, ordered chronologically like a clock |
The critical design decision is in the theme layer. Themes are not just categories -- they are relationships to Self. The difference matters. "Python" is a topic. "The user is interested in Python" is a relationship. "The user uses Python for work" is a different relationship. The same entity can appear under multiple themes with different meanings.
We defined eight core themes, each representing a distinct way the user relates to things:
| Theme | Meaning |
|---|---|
| Frequently visits | Sites visited often, promoted by frequency data |
| Interested in | Topics of curiosity and fascination |
| Personal info | Identity facts about the user |
| Researches | Deep-dive investigation topics |
| Uses | Tools, products, and platforms |
| Works on | Professional activities and projects |
| Communicates on | Communication and social platforms |
| Learns | Educational pursuits |
This is what makes the graph "understanding" rather than "indexing." An index tells you that "Python" appeared in the data. The graph tells you that the user uses Python at work, researches its async capabilities, and is currently learning its type system. Same entity, three different relationships, three different insights for the AI assistant.
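The multi-relationship idea is easy to sketch: the same entity label can hang off several theme edges, and each edge carries a different meaning. This is a toy model, not our storage format:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryGraph:
    """A minimal egocentric graph: Self -> themes -> entities."""
    themes: dict[str, set[str]] = field(default_factory=dict)  # theme -> entity labels

    def link(self, theme: str, entity: str) -> None:
        self.themes.setdefault(theme, set()).add(entity)

    def relations_for(self, entity: str) -> list[str]:
        """All the distinct ways the user relates to one entity."""
        return sorted(t for t, ents in self.themes.items() if entity in ents)

g = MemoryGraph()
g.link("Uses", "Python")
g.link("Researches", "Python")
g.link("Learns", "Python")
# Same entity, three relationships:
# g.relations_for("Python") -> ["Learns", "Researches", "Uses"]
```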
The system also supports dynamically proposed themes. When the LLM encounters items that don't fit neatly into any core theme, and at least three such items accumulate, it can propose a new theme that emerges from the data. Core themes are fixed; extended themes are organic.

From memory signals to graph operations
The graph isn't built all at once -- it grows incrementally through operations. Every memory signal -- browsing intents, extracted personal facts, and user instructions -- is transformed into a batch of proposed graph operations: add a theme node, create an entity, draw an edge between them. These operations are the only way anything enters the graph.
This operation-based approach serves three purposes. First, traceability: every node and edge can be traced back to the operation that created it, giving users an intuitive view of how their memory evolved over time. Second, conflict resolution: because everything enters the graph as a proposed operation, the system has a natural interception point to detect and resolve contradictions before they corrupt the graph -- rather than patching inconsistencies after the fact. Third, flexibility: the component that produces operations never touches the graph store, so the storage layer can be swapped without touching classification logic.
The discovery pipeline itself is a single LLM call followed by programmatic post-processing. The LLM receives up to 30 memory signals and, in one shot, classifies each into a theme and extracts entities. Those results are then translated programmatically into graph operations: theme nodes, entity nodes, Self-to-theme edges, theme-to-entity edges, and intent-to-entity edges.
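The programmatic translation step can be sketched like this, assuming the LLM returns one classified record per signal (the field names are hypothetical):

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class GraphOp:
    """A proposed operation -- the only way anything enters the graph."""
    kind: str               # "add_theme" | "add_entity" | "add_edge"
    payload: dict[str, Any]
    provenance: str         # which memory signal produced this op

def to_operations(classified: list[dict]) -> list[GraphOp]:
    """Programmatic post-processing: translate one-shot LLM output
    (a theme plus entities per signal) into proposed graph operations."""
    ops: list[GraphOp] = []
    for item in classified:
        theme, signal = item["theme"], item["signal_id"]
        ops.append(GraphOp("add_theme", {"label": theme}, signal))
        for entity in item["entities"]:
            ops.append(GraphOp("add_entity", {"label": entity}, signal))
            ops.append(GraphOp("add_edge",
                               {"source": theme, "target": entity}, signal))
    return ops
```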
Conflict resolution as a design problem
The discoverer produces proposed operations. A separate merger resolves and executes them.
The merger runs a 5-step pipeline for each batch of operations:
| Step | Operation |
|---|---|
| 1 | Classify each proposed op as accumulative or conflict-capable |
| 2 | Resolve conflict-capable ops through matching and rules |
| 3 | Execute all resolved ops against the graph store |
| 4 | Recompute derived data (behavior patterns, frequency promotions) |
| 5 | Append the op batch to the event-sourced log |
Most operations are accumulative -- adding a new topic entity, linking it to a theme. These are resolved mechanically: if the node already exists, union the provenance and update the description. If the edge already exists, bump the weight. No LLM needed.
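The mechanical path looks roughly like this (simplified types; the real merger tracks more state):

```python
from dataclasses import dataclass, field

@dataclass
class EntityNode:
    label: str
    description: str
    provenance: set[str] = field(default_factory=set)

def merge_entity(nodes: dict[str, EntityNode], proposed: EntityNode) -> None:
    """Accumulative resolution, no LLM needed: if the node exists,
    union the provenance and refresh the description; otherwise insert."""
    existing = nodes.get(proposed.label)
    if existing is None:
        nodes[proposed.label] = proposed
    else:
        existing.provenance |= proposed.provenance
        existing.description = proposed.description

def merge_edge(weights: dict[tuple[str, str], float],
               edge: tuple[str, str]) -> None:
    # If the edge exists, bump its weight; otherwise create it.
    weights[edge] = weights.get(edge, 0.0) + 1.0
```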
The interesting case is conflict resolution. Certain entity types under sensitive themes can contradict each other. If the system previously recorded "works at Company A" and now encounters evidence for "works at Company B," that's a conflict that needs resolution, not just accumulation.
The resolution pipeline has three tiers:
Tier 1: Semantic matching. An LLM examines new entities against existing ones and classifies each pair as one of: same entity (duplicate -- merge provenance), newer version (the fact has been updated, e.g., a job change), contradicts (genuine conflict), or unrelated (both facts coexist). This step identifies what kind of conflict exists without yet deciding who wins.
Tier 2: User authority. If one side of a conflict was explicitly stated by the user, it wins automatically. No LLM needed, no ambiguity. The system never argues with the user about who they are. This is a deterministic rule, not a heuristic.
Tier 3: Evidence trustworthiness. When neither side has user authority, the LLM evaluates the provenance of both sides -- the source types, the evidence quality, the recency -- and decides which is more trustworthy. It can also synthesize a compromise when both sides have partial truth (e.g., combining a job title from one source with an employer name from another).
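Tiers 2 and 3 reduce to a short decision function once tier 1 has flagged a genuine conflict. This sketch collapses the LLM's trustworthiness judgment into a precomputed score and omits compromise synthesis:

```python
from dataclasses import dataclass

@dataclass
class Fact:
    statement: str
    user_stated: bool   # explicitly provided by the user
    trust: float        # evidence trustworthiness, per the LLM judge

def resolve(existing: Fact, incoming: Fact) -> Fact:
    """Resolve a genuine conflict between two facts."""
    # Tier 2: user authority is a deterministic rule, not a heuristic.
    if incoming.user_stated and not existing.user_stated:
        return incoming
    if existing.user_stated and not incoming.user_stated:
        return existing
    # Tier 3: otherwise the more trustworthy evidence wins. (In the real
    # system an LLM weighs source types, evidence quality, and recency,
    # and may synthesize a compromise instead of picking a winner.)
    return incoming if incoming.trust > existing.trust else existing
```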
After conflict resolution, the merger recomputes behavior patterns -- classifying how frequently the user engages with each theme and entity (daily, weekly, sporadic, one-time) -- and promotes domains to "frequently visits" when they cross the programmatic threshold.
Every op batch is appended to an event-sourced log. This enables full graph replay from the most recent checkpoint -- a critical capability for debugging and for the compaction system described next.
Forgetting as a feature
A memory system that never forgets becomes unusable. Over time, entities accumulate, provenance lists grow, and weak edges clutter the graph. The compaction system addresses this through two periodic operations:
Daily graph compaction. Archive stale, low-frequency entities (older than 90 days with fewer than 2 appearances). Trim provenance histories to prevent unbounded growth. Prune old, weak edges. These thresholds are configurable but deliberately conservative -- the system errs toward retaining too much rather than losing something important.
Weekly checkpoint. Snapshot the full graph state. This allows the event-sourced log to be truncated, preventing unbounded growth while preserving the ability to replay from the latest checkpoint.
Like the discoverer, the compactor is a pure operation producer. It reads the current graph state, decides what should be archived or pruned, and emits proposed operations that flow through the merger. This means compaction goes through the same resolution and logging pipeline as everything else -- no special cases, no back doors.
The compaction thresholds encode a philosophical position: relevance decays. A website you visited once three months ago is less important than one you visit every week. A fact that appeared in a single screenshot is less durable than one confirmed across multiple sources over time. The graph should reflect what matters now, not what ever happened.
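The daily archive pass reduces to a pure function over graph state. The thresholds shown are the defaults described above; the `Entity` shape is illustrative:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Entity:
    label: str
    last_seen: datetime
    appearances: int

MAX_AGE = timedelta(days=90)   # configurable, deliberately conservative
MIN_APPEARANCES = 2

def propose_archives(entities: list[Entity], now: datetime) -> list[str]:
    """Pure operation producer: decide what is stale and emit proposals.
    The actual archiving happens in the merger, like everything else."""
    return [e.label for e in entities
            if now - e.last_seen > MAX_AGE and e.appearances < MIN_APPEARANCES]
```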
From graph to profile: the synthesis
The Memory Graph is a structured representation of understanding, but it's not a form that users or AI assistants can easily consume. The final stage synthesizes the graph into a user profile -- narrative prose with citations, like a Wikipedia article about the user.
The profile builder implements a 2-stage pipeline. First, it gathers facts from four pillars: observations (first-person facts from the ledger), themes (behavioral aggregations from the graph), entities (top-scored graph entities with resolved provenance), and recent activities (intent nodes for temporal context). Second, it sends those gathered facts to an LLM that synthesizes cited prose.
The output is a profile with sections of flowing prose and inline citation markers, each mapped back to its provenance source: a specific screenshot and host for observed facts, a browsing intent entry for behavioral memories, or a direct user statement for self-reported information.
The citation system is essential. A profile that says "you work as a software engineer at Acme Corp" is useful. A profile that says "you work as a software engineer at Acme Corp [1]" -- where [1] links back to a screenshot of your company's internal dashboard -- is trustworthy. Understanding without evidence is hallucination.
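The citation mechanics are simple to sketch: each synthesized sentence carries a provenance id, and rendering assigns the inline markers (hypothetical shapes, not our renderer):

```python
def render_with_citations(
    sentences: list[tuple[str, str]],
) -> tuple[str, dict[int, str]]:
    """Attach inline citation markers to synthesized prose.

    Each (sentence, source) pair becomes "sentence [n]", and the returned
    map lets the UI resolve [n] back to a screenshot, intent, or statement.
    """
    text_parts, refs = [], {}
    for i, (sentence, source) in enumerate(sentences, start=1):
        text_parts.append(f"{sentence} [{i}]")
        refs[i] = source
    return " ".join(text_parts), refs
```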
Closing the loop: user corrections
The profile is not write-once. Users can correct or add information through natural language. A user can say "I actually work at Beta Corp now, not Acme" and the system will:
- Normalize the instruction into a structured observation with a user-stated source type
- Append it to the observation store
- Patch the graph through the merger (where the user-authority rule ensures the new fact wins)
- Trigger a full profile rebuild
This closes the loop between the user and the memory system. The system observes, extracts, structures, and synthesizes -- but the user always has the last word.
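Under user authority, a correction is just an append plus an unconditional overwrite. This is a toy sketch; the real path flows through the merger and then triggers the profile rebuild:

```python
from dataclasses import dataclass

@dataclass
class Observation:
    statement: str
    source: str   # "user_stated" | "observed"

def apply_correction(store: list[Observation], facts: dict[str, Observation],
                     key: str, statement: str) -> None:
    """User corrections enter the same pipeline as observed facts,
    but with a source type that wins every conflict."""
    obs = Observation(statement=statement, source="user_stated")
    store.append(obs)   # append to the observation store
    facts[key] = obs    # user authority: the new fact wins outright
```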
Retrieval: finding what matters
The pipeline described so far builds a rich model of the user -- but you can't stuff the entire memory model into an LLM context window. So the system needs a way to surface the right memories for the right moment, from a store that keeps growing.
A single retrieval strategy isn't enough. Vector/BM25 hybrid search understands what the query means -- it can match "that laptop comparison I did last week" to memories about comparing MacBooks and ThinkPads, even if those exact words never appeared. But it has no sense of the user as a person. It doesn't know that "my project" refers to a specific Rust side project, or that "work stuff" means activities linked to a particular employer. It searches by semantic similarity, blind to personal context.
Graph entity matching understands the user -- it walks the egocentric knowledge graph, finds entities whose labels overlap with the query, and follows provenance links back to the memories that produced those entities. When you search for "Rust," it knows you use Rust at work, learn its async model in the evenings, and research its type system -- because those relationships are already in the graph. But graph matching is brittle with open-ended queries. "What was I doing last Tuesday evening?" contains no entity label to match against.
Semantic search captures meaning. Graph matching captures context. Neither is complete without the other.
Two branches, one fusion
Every query runs both strategies in parallel. A vector/BM25 hybrid index produces one ranked list based on semantic meaning. The Memory Graph produces another by matching query terms against entity labels and following provenance links back to memories. The graph branch also expands outward from the top semantic hits, giving a small bonus to memories that share graph nodes with them -- surfacing contextually related results that neither branch would find alone.
The two lists are merged using Reciprocal Rank Fusion (RRF). Memories that rank well in both signals rise to the top; memories from only one list still surface but with lower priority. The result is retrieval that understands both what the query means and what it means to the user.
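RRF itself is only a few lines: each memory's score is the sum of 1/(k + rank) over the lists it appears in, with the conventional k = 60 damping constant (a standard formulation; our exact constants may differ):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank).

    Memories ranked well in several lists rise to the top; single-list
    hits still surface, just with lower priority.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

semantic = ["m1", "m2", "m3"]   # vector/BM25 hybrid branch
graph = ["m2", "m4"]            # graph entity-matching branch
fused = rrf_fuse([semantic, graph])
# "m2" ranks first: it appears in both lists.
```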
What comes next
There's a deeper question about what "understanding" means at the limit. The current graph captures what you do and how you relate to things. It doesn't capture why. The user who researches laptops might be buying a gift, upgrading for a new job, or window-shopping for fun. Those represent very different aspects of who they are, and the system doesn't yet distinguish between them. Bridging that gap -- from behavioral understanding to motivational understanding -- is the frontier we're working toward.
The Memory Graph is not a finished product. It's a design hypothesis: that egocentric graphs with evidence-backed, conflict-resolved, temporally-decaying nodes are a good representation for personal understanding. The architecture is built to be iterated on -- pure operation producers, event-sourced logs, modular pipelines -- precisely because we expect the hypothesis to evolve as we learn more about what it means for an AI to truly know its user.