---
title: Emma Assistant
emoji: 🤖
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.9.1
app_file: Emma/app.py 
pinned: false
---

# EMMA — Empathetic Memory-Augmented Multi-layer Assistant  
*(Research Prototype)*

**Empathetic, privacy-aware memory for psychologically informed conversational agents.**

This repository provides a reference implementation of a mobile-friendly, memory-augmented artificial intelligence assistant inspired by the **EMMA architecture**. The system integrates **session**, **episodic**, and **semantic** memory layers, dynamic query classification, **LlamaIndex-based retrieval**, and a **Gradio** demonstration interface. The codebase includes implementation logic, data processing scripts, memory indexing and retrieval components, query classification, and evaluation tooling used in the research prototype.

> ⚠️ **Research prototype — not a clinical tool.**  
> This system is intended solely for research, experimentation, and controlled simulations. It must not be used for clinical diagnosis or treatment. See **Limitations & Safety** below.

---

##  Table of Contents
- [EMMA — Empathetic Memory-Augmented Multi-layer Assistant](#emma--empathetic-memory-augmented-multi-layer-assistant)
  - [Table of Contents](#table-of-contents)
  - [Key Features](#key-features)
  - [Architecture (High Level)](#architecture-high-level)
  - [Personalized Response Generation Workflow](#personalized-response-generation-workflow)
    - [Step 1 – User Query](#step-1--user-query)
    - [Step 2 – Query Classification](#step-2--query-classification)
    - [Step 3 – Memory Routing](#step-3--memory-routing)
    - [Step 4 – Memory Retrieval](#step-4--memory-retrieval)
    - [Step 5 – Prompt Composition](#step-5--prompt-composition)
    - [Step 6 – Response Generation](#step-6--response-generation)
  - [Evaluation \& Metrics](#evaluation--metrics)
    - [Reproducibility](#reproducibility)
  - [Limitations \& Safety](#limitations--safety)


---

## Key Features

- **Three-tier memory architecture**:  
  Session memory (raw conversational transcripts), episodic memory (session summaries), and semantic memory (long-term traits and values).

- **Dynamic query classifier**:  
  Routes user queries to *Episodic*, *Semantic*, *Hybrid*, or *Unrelated* processing pipelines.

- **Privacy-aware retrieval**:  
  Uses **LlamaIndex** and vector-based indexing for efficient local semantic search.

- **Therapy-aligned prompt templates**:  
  Combines retrieved memory with therapist-inspired prompt scaffolding to ensure emotional alignment.

- **Gradio-based demo interface**:  
  Lightweight chat UI with access to memory summaries and session history.

- **Evaluation tooling**:  
  Scripts supporting quantitative memory retrieval accuracy and qualitative Likert-scale evaluation pipelines.


---

## Architecture (High Level)

- **Indexing**: Episodic and semantic memory items are embedded and stored in vector indexes.
- **Routing**: A classifier determines which memory layer(s) should be queried. Hybrid queries may combine episodic and semantic retrieval.
- **Prompting**: Retrieved memory is merged into therapy-aware prompt templates prior to LLM invocation.

---

## Personalized Response Generation Workflow

The figure illustrates the end-to-end workflow used by **EMMA** for generating personalized and psychologically informed responses. The pipeline consists of six sequential stages:

![Personalized Response Generation Workflow](assets/emma_workflow.jpg)

### Step 1 – User Query
The interaction begins when the user submits a query, which may express a psychological concern, emotional state, or a general question.

### Step 2 – Query Classification
The user query, combined with a task-specific prompt, is passed to a language model (e.g., GPT-3.5) acting as a query recognition mechanism. The query is classified into one of the following categories:

- **Episodic**: Past experiences or events  
- **Semantic**: Stable traits, preferences, or beliefs  
- **Hybrid**: Requires both episodic and semantic context  
- **Unrelated**: No memory retrieval required

### Step 3 – Memory Routing
Based on the predicted memory type, the system determines which memory layer(s) should be accessed and forwards the query together with the memory label to the retrieval module.

### Step 4 – Memory Retrieval
EMMA leverages **LlamaIndex** to retrieve relevant memory chunks from its structured memory store, which includes:

- **Session memory** (short-term conversational context)
- **Episodic memory** (summarized past interactions)
- **Semantic memory** (long-term psychological attributes and behavioral patterns)

### Step 5 – Prompt Composition
Retrieved memory content is merged with the user query using task-specific prompt templates designed to preserve emotional tone, maintain psychological coherence, and align responses with empathic counseling principles.

### Step 6 – Response Generation
The composed prompt is forwarded to the language model (e.g., GPT-3.5), which generates a personalized, memory-informed, and emotionally aligned response.  
Optionally, a post-processing module may refine tone and safety to ensure therapeutic appropriateness.

**Privacy Note:**  
Since psychologically relevant information is abstracted into episodic and semantic memory, raw session transcripts can be periodically discarded. This reduces storage overhead while enhancing user privacy and data security.

---

## Evaluation & Metrics

The original prototype evaluation included:

- **Qualitative evaluation**:  
  90 prompts rated on a 5-point Likert scale across *Personalization*, *Continuity*, and *Empathy* dimensions.

- **Quantitative evaluation**:  
  Memory retrieval accuracy computed as the normalized mean of 5-point Likert scores.

- **Automatic metrics**:  
  Automated rubric-based assessment using a stronger LLM evaluator.

### Reproducibility
Evaluation can be reproduced by preparing:
- A test set of prompts linked to memory entries.
- Scripts comparing generated responses with ground-truth memory and computing Likert-aligned scores.

---

## Limitations & Safety

- This system is **not a clinical or diagnostic tool** and must not replace licensed mental health professionals.
- Automated evaluators and LLM judgments may be noisy or biased; safety-critical use cases require clinician oversight and human-in-the-loop validation.
- The system may occasionally hallucinate memory-grounded facts; retrieval traces should always be logged for auditing and debugging.
- See the associated paper for a detailed discussion of limitations and evaluation methodology.