Teaching AI to Remember: Inside a Java-Based Semantic Memory System
How Quarkus, LangChain4j, and pgvector power long-term memory for intelligent, context-aware conversations
Large Language Models may be brilliant at answering questions, but they’re still notoriously forgetful. This architectural deep dive explores a Java-based system that fixes that. Built on Quarkus and LangChain4j, this AI memory system transforms raw conversations into structured, searchable, and self-organizing memory—paving the way for more intelligent and personalized assistants.
At its core, the project implements a hierarchical, semantic memory system designed to integrate seamlessly with AI chat applications. Instead of simply storing conversation logs, it captures meaning, clusters ideas, abstracts themes, and continuously evolves its memory hierarchy over time.
This tutorial walks you through building a self-organizing AI memory: a pipeline that digests incoming conversation turns, embeds and quantizes them, clusters them into themes, abstracts higher-level summaries, and stores everything in a searchable vector store. It uses Quarkus, LangChain4j, and PostgreSQL with pgvector, and it's built with production-readiness in mind.
Note: While there is a complete GitHub repository to play around with that contains a working example, I want to be clear that this is more a proof-of-concept implementation than a real-world scenario. I basically wanted to explore the art of the possible. I am convinced that there are loopholes and edge cases I have not thought about yet, so use it for educational purposes and feel free to borrow ideas, but do not run this in production without knowing what you want to achieve.
Prerequisites
Java 21 LTS —
sdk install java 21.0.7-tem
(Optional) Quarkus CLI —
brew install quarkusio/tap/quarkus
Maven 3.8+ —
brew install maven
PostgreSQL (optional; Dev Services can start a container automatically)
pgvector extension (documentation); we’ll pull it in as a dependency later on
Local Ollama install (we are going to use a larger model for this)
Project Bootstrap
Let's create the skeleton of our Quarkus application. Open your terminal and run the following command. This command creates a new project directory named ai-memory and adds the necessary extensions.
mvn io.quarkus.platform:quarkus-maven-plugin:create \
-DprojectGroupId=org.acme \
-DprojectArtifactId=ai-memory \
-Dextensions="rest-jackson,hibernate-orm-panache,jdbc-postgresql,quarkus-langchain4j-ollama,langchain4j-pgvector,quarkus-messaging,quarkus-scheduler"
cd ai-memory
The two LangChain4j extensions wire Quarkus to the Ollama LLM and to pgvector as the embedding store. We also need Hibernate ORM with Panache and a little REST with Jackson. Let’s not forget the Quarkus Messaging component and some scheduling for cleanup later.
We will need one additional dependency to be able to work with vectors in Hibernate:
<dependency>
    <groupId>org.hibernate.orm</groupId>
    <artifactId>hibernate-vector</artifactId>
    <version>6.6.17.Final</version>
</dependency>
Database, Messaging & Vector Config
Now we'll configure the database connection and define the structure for storing our memory data.
Quarkus Configuration
Quarkus can automatically manage services like databases during development. We just need to tell it which database to use and enable this feature. While we are here, we are also going to add some configuration for our Embedding Model and LangChain4j.
Add the following lines to the src/main/resources/application.properties file:
quarkus.datasource.db-kind=postgresql
quarkus.hibernate-orm.database.generation=drop-and-create
quarkus.langchain4j.pgvector.dimension=384
quarkus.langchain4j.log-requests=false
quarkus.langchain4j.log-responses=false
quarkus.hibernate-orm.log.sql=false
quarkus.hibernate-orm.log.bind-parameters=false
quarkus.langchain4j.ollama.chat-model.model-id=mistral
quarkus.langchain4j.ollama.embedding-model.model-id=all-minilm:l6-v2
quarkus.datasource.jdbc.initial-size=20
quarkus.log.level=INFO
With this configuration, you don't need to install or run Postgres yourself. When you start the application in development mode (quarkus dev), Quarkus will check if a compatible Postgres container is running. If not, it will start one for you.
API Layer: The Interface to Memory
Everything starts at the edge. The system exposes two REST resources: ChatResource and MemoryResource.
The ChatResource handles live conversations. Each session is tied to a unique conversation ID, allowing the system to track and store memory fragments per user or thread. When a user sends a message, this resource kicks off memory retrieval and routes the input to an AI service, returning a context-enriched response.
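A minimal sketch of what this resource can look like; the chat(..) signature on ConversationalAiService is an assumption for this sketch, so check the repository for the exact shape:
package org.acme;

import jakarta.inject.Inject;
import jakarta.ws.rs.Consumes;
import jakarta.ws.rs.POST;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.PathParam;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.core.MediaType;

@Path("/chat")
public class ChatResource {

    // The AI service defined in the next section
    @Inject
    ConversationalAiService aiService;

    @POST
    @Path("/{conversationId}")
    @Consumes(MediaType.TEXT_PLAIN)
    @Produces(MediaType.TEXT_PLAIN)
    public String chat(@PathParam("conversationId") String conversationId, String message) {
        // Memory retrieval and augmentation happen inside the AI service call
        return aiService.chat(conversationId, message);
    }
}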
The MemoryResource, meanwhile, offers administrative tools for inspecting the memory system. You can view cluster states, initiate memory cleanup, and manually trigger processing workflows.
AI Services Layer: Thinking with Memory
At the heart of the system are LangChain4j-powered services that both consume and enhance memory.
The ConversationalAiService injects retrieved memories into each AI response, grounding conversations in past context. The MemoryRetrievalAugmentorSupplier acts as the bridge, providing this memory-to-AI augmentation flow.
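Declaratively, this is just a LangChain4j AI service. A minimal sketch (the system message wording is illustrative) could look like this:
package org.acme;

import dev.langchain4j.service.MemoryId;
import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;

// The supplier plugs the memory retriever into every call to this service
@RegisterAiService(retrievalAugmentor = MemoryRetrievalAugmentorSupplier.class)
public interface ConversationalAiService {

    @SystemMessage("You are a helpful assistant. Use the retrieved memories to ground your answers.")
    String chat(@MemoryId String conversationId, @UserMessage String message);
}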
To fight the growing noise in long-term memory, the AbstractionAiService steps in. It generates high-level summaries that distill related memory fragments into coherent abstractions. This is what allows the system to “think” hierarchically: not just recall past events, but recognize themes, trends, and intentions.
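Again a sketch, with an illustrative prompt; the real prompt in the repository may be more elaborate:
package org.acme;

import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;

@RegisterAiService
public interface AbstractionAiService {

    // Condenses a batch of related fragments into one higher-level summary
    @UserMessage("""
            Summarize the following related memory fragments into a single,
            concise abstraction that captures their common theme:

            {fragments}
            """)
    String summarize(String fragments);
}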
And whenever the AI needs relevant past information, the MemoryContentRetriever fetches it using semantic search backed by vector similarity.
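The retriever implements LangChain4j's ContentRetriever interface; findRelevantMemories(..) is an assumed helper on the retrieval service described further below:
package org.acme;

import java.util.List;

import dev.langchain4j.rag.content.Content;
import dev.langchain4j.rag.content.retriever.ContentRetriever;
import dev.langchain4j.rag.query.Query;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;

@ApplicationScoped
public class MemoryContentRetriever implements ContentRetriever {

    @Inject
    MemoryRetrievalService retrievalService;

    @Override
    public List<Content> retrieve(Query query) {
        // Semantic search over stored fragments; each hit is wrapped
        // as a Content object for the retrieval augmentor
        return retrievalService.findRelevantMemories(query.text()).stream()
                .map(Content::from)
                .toList();
    }
}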
Data Modeling with Hibernate and Vector Types
A MemoryFragment represents a single unit of memory—a message, a summary, a thought. It stores the original text, a high-dimensional embedding, and a quantized version for efficient storage. It also tracks access frequency and abstraction level.
The MemoryFragment entity is the atomic unit of memory in the system.
package org.acme;

import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;

import org.hibernate.annotations.Array;
import org.hibernate.annotations.JdbcTypeCode;
import org.hibernate.type.SqlTypes;

import io.quarkus.hibernate.orm.panache.PanacheEntity;
import jakarta.persistence.*;

@Entity
@Table(name = "memory_fragments")
public class MemoryFragment extends PanacheEntity {

    @Column(name = "original_text", columnDefinition = "TEXT")
    private String originalText;

    // HIGH-PRECISION VECTOR FOR SIMILARITY SEARCH
    @JdbcTypeCode(SqlTypes.VECTOR)
    @Array(length = 384) // Must match the embedding model's dimension
    @Column(name = "embedding")
    private float[] embedding;

    // COMPRESSED VECTOR FOR EFFICIENT STORAGE/ARCHIVAL
    @Column(name = "quantized_embedding")
    private byte[] quantizedEmbedding;

    @Column(name = "abstraction_level")
    private Integer abstractionLevel = 1;

    @Column(name = "importance_score")
    private Double importanceScore;

    @Column(name = "cluster_id")
    private String clusterId;

    @Column(name = "created_at")
    private LocalDateTime createdAt;

    @Column(name = "last_accessed")
    private LocalDateTime lastAccessed;

    @Column(name = "access_count")
    private Integer accessCount = 0;

    @ManyToOne(fetch = FetchType.LAZY)
    @JoinColumn(name = "parent_memory_id")
    private MemoryFragment parentMemory;

    @OneToMany(mappedBy = "parentMemory", cascade = CascadeType.ALL, fetch = FetchType.LAZY)
    private List<MemoryFragment> childMemories = new ArrayList<>();
}
A MemoryCluster is a group of fragments that form a coherent theme. It contains centroid vectors (for fast cluster-level search), member counts, timestamps, and even human-readable topic summaries.
package org.acme;

import java.time.LocalDateTime;

import org.hibernate.annotations.Array;
import org.hibernate.annotations.JdbcTypeCode;
import org.hibernate.type.SqlTypes;

import io.quarkus.hibernate.orm.panache.PanacheEntity;
import jakarta.persistence.*;

@Entity
@Table(name = "memory_clusters")
public class MemoryCluster extends PanacheEntity {

    @Column(name = "cluster_id", unique = true)
    private String clusterId;

    @JdbcTypeCode(SqlTypes.VECTOR)
    @Array(length = 384)
    @Column(name = "prototype_vector")
    private float[] prototypeVector;

    @Column(name = "cluster_theme", columnDefinition = "TEXT")
    private String theme;

    @Column(name = "member_count")
    private Integer memberCount;

    @Column(name = "last_updated")
    private LocalDateTime lastUpdated;
}
To support flexible clustering algorithms, a generic Cluster<T> class provides noise detection, re-clustering, and dynamic membership; it is used heavily by the DBSCAN implementation.
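Its core can be as simple as a typed membership container; this sketch shows only the shape, not the full re-clustering logic:
package org.acme;

import java.util.ArrayList;
import java.util.List;

// A generic, in-memory grouping of points; DBSCAN fills instances of this
// class and decides separately which points count as noise
public class Cluster<T> {

    private final List<T> points = new ArrayList<>();

    public void addPoint(T point) {
        points.add(point);
    }

    public List<T> getPoints() {
        return points;
    }
}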
Repositories abstract away database access, supporting PostgreSQL with pgvector for embedding similarity queries and Hibernate Panache for convenient data access patterns.
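For illustration, a Panache repository for fragments might expose the two queries the system needs most: unclustered fragments for the scheduler, and nearest neighbors for retrieval. The <=> operator is pgvector's cosine-distance operator; the method names here are assumptions for this sketch:
package org.acme;

import java.util.Arrays;
import java.util.List;

import io.quarkus.hibernate.orm.panache.PanacheRepository;
import jakarta.enterprise.context.ApplicationScoped;

@ApplicationScoped
public class MemoryFragmentRepository implements PanacheRepository<MemoryFragment> {

    // Fragments the hourly clustering job has not yet assigned
    public List<MemoryFragment> findUnclustered() {
        return list("clusterId is null");
    }

    // Nearest neighbors by cosine distance, via pgvector's <=> operator
    @SuppressWarnings("unchecked")
    public List<MemoryFragment> findNearest(float[] embedding, int limit) {
        // pgvector accepts the textual form '[x1,x2,...]' as a vector literal
        String literal = Arrays.toString(embedding).replace(" ", "");
        return getEntityManager()
                .createNativeQuery(
                        "SELECT * FROM memory_fragments ORDER BY embedding <=> CAST(?1 AS vector) LIMIT ?2",
                        MemoryFragment.class)
                .setParameter(1, literal)
                .setParameter(2, limit)
                .getResultList();
    }
}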
Reactive Memory Processing: Asynchronous by Design
All memory ingestion flows through a MemoryProcessingPipeline, implemented as a reactive stream. New conversation content is embedded, quantized, and persisted asynchronously, ensuring the system remains responsive even under heavy chat loads.
The MemoryProcessingPipeline class implements a three-stage reactive messaging pipeline:
Raw Text Input → embedText() → Text + Embedding
Text + Embedding → quantizeEmbedding() → Text + Embedding + Quantized
Complete Payload → persistToStore() → Dual Storage Persistence
Pipeline Stage Methods
Stage 1 - embedText(String text)
Converts raw conversational text into high-dimensional embedding vectors using an AI embedding model.
@Incoming("raw-conversation-in")
@Outgoing("embedding-out")
@Blocking
public IngestionPayload embedText(String text)
Stage 2 - quantizeEmbedding(IngestionPayload payload)
Compresses high-precision embeddings into compact byte representations for efficient storage.
@Incoming("embedding-out")
@Outgoing("quantized-out")
public IngestionPayload quantizeEmbedding(IngestionPayload payload)
Stage 3 - persistToStore(IngestionPayload payload)
Persists processed memories to both database and embedding store for dual storage architecture.
@Incoming("quantized-out")
@Blocking
@Transactional
public void persistToStore(IngestionPayload payload)
Data Transfer Object - IngestionPayload Record
Carries data through the pipeline stages, accumulating processed information at each step.
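The exact fields may differ in the repository, but a minimal version of the record could look like this:
package org.acme;

// Immutable carrier that accumulates state as it moves through the stages;
// fields that have not been computed yet are simply null
public record IngestionPayload(String text, float[] embedding, byte[] quantized) {

    public static IngestionPayload of(String text, float[] embedding) {
        return new IngestionPayload(text, embedding, null);
    }

    public IngestionPayload withQuantized(byte[] quantized) {
        return new IngestionPayload(text, embedding, quantized);
    }
}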
Core Services: Intelligence Beneath the Surface
This is where the magic happens.
The MemoryRetrievalService is the intelligence layer that transforms vague user input into highly relevant memory retrievals. Instead of relying on keyword matches, it uses semantic embeddings to understand the meaning of a query, and then ranks results based on multiple signals.
It starts with vector-based similarity search using LangChain4j’s embedding store, filtering out low-relevance memories with a configurable similarity threshold. But retrieval doesn’t stop there.
The service is cluster-aware. It first identifies the most relevant memory clusters using centroid comparisons, then dives into those clusters to extract the most meaningful fragments. This allows it to surface contextually grouped results rather than isolated quotes.
Ranking is where things get interesting. The system scores each memory not just on similarity, but also on recency, access frequency, and an importance metric. That means it prioritizes memories that were recently used, frequently accessed, or marked as especially valuable.
Finally, the service assembles the retrieved fragments into a coherent context string—ordered by relevance and ready for AI consumption. It also tracks memory usage to improve future results, forming a feedback loop that helps the system get smarter over time.
In short, MemoryRetrievalService gives your AI more than memory. It gives it judgment.
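As a rough illustration of the ranking idea (the weights and accessor names are made up for this sketch, not taken from the repository):
// Hypothetical combined ranking: similarity dominates, but recent,
// frequently used, and important memories get a boost
double score(MemoryFragment memory, double similarity) {
    long daysSinceAccess = java.time.Duration
            .between(memory.getLastAccessed(), java.time.LocalDateTime.now())
            .toDays();
    double recency = 1.0 / (1.0 + daysSinceAccess);
    double frequency = Math.log1p(memory.getAccessCount());
    return 0.6 * similarity + 0.2 * recency + 0.1 * frequency + 0.1 * memory.getImportanceScore();
}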
The ClusteringService is the system’s built-in librarian. Every hour, it scans unclustered memory fragments and organizes them into semantically coherent groups using the DBSCAN algorithm, a density-based method that doesn’t assume fixed categories and naturally detects outliers.
The service operates incrementally, processing only new data while preserving existing clusters. It calculates cosine similarity between memory embeddings and groups similar fragments into clusters. Each cluster is summarized with a centroid vector (for fast lookup) and a human-readable theme like “travel, vacation (Rome, Paris)” or “project, deadline, invoice.”
Outliers, memories that don’t fit into any group, are retained for future clustering, ensuring the system stays both flexible and accurate over time.
This structure isn’t just tidy. It powers the retrieval engine, abstraction generation, and cleanup logic. With clusters in place, the system can search faster, summarize better, and manage memory more intelligently.
ClusteringService turns flat memory into a living hierarchy: one that grows, evolves, and adapts to the patterns of conversation.
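The skeleton of the hourly job might look like this; the DBSCAN parameters and the setClusterId accessor are assumptions for the sketch:
package org.acme;

import java.util.List;
import java.util.UUID;

import io.quarkus.scheduler.Scheduled;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
import jakarta.transaction.Transactional;

@ApplicationScoped
public class ClusteringService {

    @Inject
    MemoryFragmentRepository fragments;

    // Incremental: only fragments without a cluster are considered,
    // so existing clusters survive each run
    @Scheduled(every = "1h")
    @Transactional
    void clusterNewFragments() {
        List<MemoryFragment> unclustered = fragments.findUnclustered();
        if (unclustered.isEmpty()) {
            return;
        }
        // Assumed DBSCANClusterer API: epsilon (max cosine distance) and
        // minimum cluster size; outliers keep a null clusterId and are
        // retried on the next run
        List<Cluster<MemoryFragment>> clusters =
                new DBSCANClusterer(0.25, 3).cluster(unclustered);
        for (Cluster<MemoryFragment> cluster : clusters) {
            String clusterId = UUID.randomUUID().toString();
            cluster.getPoints().forEach(f -> f.setClusterId(clusterId));
        }
    }
}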
The HierarchicalAbstractionService then summarizes these clusters into multi-layered abstractions. Each level compresses knowledge further, letting the AI think not only about what was said, but about what it means.
Finally, the MemoryCleanupService manages lifecycle events. It removes low-value, redundant, or stale memory fragments based on configurable policies—while preserving what’s important for future reasoning.
Utility Layer: Compression and Clustering
Two tools make the memory system scalable.
ScalarQuantizer reduces embedding size using vector quantization, converting high-dimensional float vectors into compact byte arrays. This enables affordable storage of long-term memory at scale.
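A minimal scalar quantizer maps each component linearly onto a signed byte, shrinking a float32 vector to a quarter of its size. This sketch omits one practical detail: the per-vector min and max would need to be stored alongside the bytes to reconstruct approximate floats later.
package org.acme;

public final class ScalarQuantizer {

    // Map each float in [min, max] linearly onto the signed byte range
    public static byte[] quantize(float[] vector) {
        float min = Float.MAX_VALUE;
        float max = -Float.MAX_VALUE;
        for (float v : vector) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        float range = Math.max(max - min, 1e-9f); // avoid division by zero
        byte[] quantized = new byte[vector.length];
        for (int i = 0; i < vector.length; i++) {
            // Scale to [0, 255], then shift into [-128, 127]
            int q = Math.round((vector[i] - min) / range * 255f) - 128;
            quantized[i] = (byte) q;
        }
        return quantized;
    }
}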
DBSCANClusterer implements the density-based clustering algorithm. Unlike K-Means, DBSCAN doesn’t require knowing how many clusters you want in advance. It adapts dynamically, detects noise, and works well for natural conversation data, which tends to form irregular and overlapping themes.
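DBSCAN builds on two primitives: a distance function and an epsilon-neighborhood query. A point with at least minPts neighbors is a core point, and anything not density-reachable from a core point is noise. A sketch of those primitives, assuming cosine distance as described above:
package org.acme;

import java.util.ArrayList;
import java.util.List;

public class DBSCANClusterer {

    private final double epsilon; // max cosine distance between neighbors
    private final int minPts;     // minimum neighbors for a core point

    public DBSCANClusterer(double epsilon, int minPts) {
        this.epsilon = epsilon;
        this.minPts = minPts;
    }

    // Cosine distance: 0 for identical directions, up to 2 for opposite ones
    static double cosineDistance(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // All vectors within epsilon of the given point
    List<float[]> neighborsOf(float[] point, List<float[]> all) {
        List<float[]> neighbors = new ArrayList<>();
        for (float[] candidate : all) {
            if (candidate != point && cosineDistance(point, candidate) <= epsilon) {
                neighbors.add(candidate);
            }
        }
        return neighbors;
    }

    boolean isCorePoint(float[] point, List<float[]> all) {
        return neighborsOf(point, all).size() >= minPts;
    }
}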
End-to-End Usage Example
With all the components in place, we can now demonstrate the full end-to-end flow.
Populate the AI's Memory
First, we use the /memory/store endpoint to add some facts about a user named Alex. Feel free to add many more when you play around with it; anything below 100 facts won’t really show what the system can do.
# Store the first memory fragment
curl -X POST -H "Content-Type: text/plain" \
-d "My name is Alex and I live in Berlin." \
http://localhost:8080/memory/store
# Store a second, related memory fragment
curl -X POST -H "Content-Type: text/plain" \
-d "My favorite hobby is hiking in the Alps, which I find very relaxing." \
http://localhost:8080/memory/store
# Store a third, unrelated memory fragment
curl -X POST -H "Content-Type: text/plain" \
-d "I need to remember to buy milk tomorrow." \
http://localhost:8080/memory/store
Chat with the AI
Now, we can start a conversation. We'll generate a unique ID for our conversation and use it in the chat endpoint.
# Generate a unique ID for the conversation
CONVERSATION_ID=$(uuidgen)
# Ask a question that requires memory retrieval
curl -X POST -H "Content-Type: text/plain" \
-d "What is my name and where do I live?" \
http://localhost:8080/chat/$CONVERSATION_ID
Expected Response:
Your name is Alex and you live in Berlin.
The system correctly retrieves the relevant memory fragment and uses it to answer the question. Let's try another one.
# Ask about the user's hobby
curl -X POST -H "Content-Type: text/plain" \
-d "What do I enjoy doing for fun?" \
http://localhost:8080/chat/$CONVERSATION_ID
REST Endpoints
We expose a set of REST endpoints for interaction, diagnostics, and maintenance. These endpoints make it easy to integrate the system into larger applications or manage it during development and testing.
POST /memory/cleanup/manual
Triggers a manual memory cleanup. This is helpful during development or testing when you want to verify that the cleanup logic is working correctly without waiting for the scheduled job. The cleanup process respects the system’s importance scoring, access tracking, and abstraction hierarchy, ensuring that only low-value or redundant memories are purged.
GET /memory/cleanup/stats
Returns detailed statistics about the memory cleanup process. This includes metrics such as the number of deleted fragments, skipped high-value memories, and total memory usage. It's essential for monitoring how the system is managing long-term memory growth and ensuring cleanup strategies are effective.
GET /memory/clusters/status
Provides a snapshot of the current clustering state. You can see how many memory clusters exist, how many fragments each contains, and how much of the memory store is currently unclustered. This is useful for assessing the quality of memory organization and determining whether clustering parameters need tuning.
Also valuable for debugging theme generation or assessing how well semantic grouping is working.
GET /memory/embeddings
A debugging endpoint that lists raw memory embeddings and metadata. This is primarily useful during development or fine-tuning the embedding model and similarity thresholds. It gives you visibility into the vector representations behind semantic search and clustering operations.
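Put together, the administrative resource is plain JAX-RS. This outline assumes the service method names shown here (runCleanup, status); /memory/store simply feeds the ingestion pipeline via an Emitter:
package org.acme;

import org.eclipse.microprofile.reactive.messaging.Channel;
import org.eclipse.microprofile.reactive.messaging.Emitter;

import jakarta.inject.Inject;
import jakarta.ws.rs.Consumes;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.POST;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.core.MediaType;

@Path("/memory")
public class MemoryResource {

    // Feeds the first stage of the reactive pipeline
    @Inject
    @Channel("raw-conversation-in")
    Emitter<String> conversations;

    @Inject
    MemoryCleanupService cleanupService;

    @Inject
    ClusteringService clusteringService;

    @POST
    @Path("/store")
    @Consumes(MediaType.TEXT_PLAIN)
    public void store(String text) {
        conversations.send(text);
    }

    @POST
    @Path("/cleanup/manual")
    @Produces(MediaType.APPLICATION_JSON)
    public Object manualCleanup() {
        return cleanupService.runCleanup(); // assumed method name
    }

    @GET
    @Path("/clusters/status")
    @Produces(MediaType.APPLICATION_JSON)
    public Object clusterStatus() {
        return clusteringService.status(); // assumed method name
    }
}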
Some Final Thoughts
This example delivers a self-organizing knowledge consolidation engine. It continuously processes raw conversational data, discovers thematic patterns through advanced clustering, and synthesizes higher-level abstractions using LLMs. The retrieval mechanism leverages this organized structure to provide multi-layered, context-aware information that moves beyond simple fact lookup to genuine contextual understanding. It should be a good starting point for building sophisticated, next-generation AI memory systems on the powerful and productive combination of Quarkus and LangChain4j.