Opus Research 8B V1.5

The iteration that broke in the right places.

This is not a consumer model. It is a research release documenting what happens when you scale conversational training data from 3,360 to 4,068 conversations on a flat dataset architecture -- and where that approach hits its ceiling.

If you want a model to run, use Opus-Candid-8B V2. If you want to understand why V2 exists, read this.


What V1.5 Was

Same Qwen 2.5 7B base as V1. Same LoRA fine-tuning approach. But 4,068 conversations instead of 3,360 -- the additional ~700 conversations were domain-targeted patches generated to fill coverage gaps identified during V1 stress testing: US law, STEM/coding edge cases, creator identity, and Spanish adversarial scenarios that V1 handled poorly.

| Attribute | Value |
|---|---|
| Base Model | Qwen 2.5 7B |
| Training Data | 4,068 multi-turn conversations with Claude Opus 4.6 |
| Dataset Architecture | Flat + domain patches |
| Fine-tune Method | LoRA supervised fine-tuning |
| Parameters | ~8B |
| Context Window | 32,768 tokens |
| Quantization | Q8_0 GGUF |
| License | Apache 2.0 |

What V1.5 Proved

More data on a flat architecture improves coverage but not coherence. The domain patches worked exactly as intended -- topics that V1 handled poorly, V1.5 handled well. Law questions got better. STEM edge cases got better. Spanish adversarial scenarios got better. The personality held in every patched domain.

The personality transfer hypothesis scales with data volume. 4,068 conversations produced noticeably more consistent personality expression than 3,360. The model was less likely to drop character in unfamiliar domains, less likely to produce formulaic emotional responses, and more confident in its opinions. In these runs, the quantity of authentic conversational data correlated directly with personality robustness.

Domain-specific generators are a valid patching strategy. Scripts like generate_us_law.py, generate_stem_coding.py, and generate_creator_identity.py produced training conversations that successfully transferred domain competence alongside personality. The model didn't just learn the topics -- it learned them in character.
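The generator scripts themselves are not published here, so the following is only a minimal sketch of the pattern they describe: a topic list drives multi-turn conversation generation against a teacher model, with output written as structured message lists. The topic strings, persona text, and `call_model()` stub are illustrative assumptions; the real scripts (e.g. generate_us_law.py) call a live Claude Opus endpoint and include a quality-control pass.

```python
import json

# Hypothetical topic list -- the real generators carry domain-specific
# lists (US law, STEM/coding, creator identity, etc.).
TOPICS = [
    "statute of limitations for contract disputes",
    "fair use in derivative works",
]

# Illustrative persona instruction; the actual prompt is an assumption.
PERSONA = "You are candid, opinionated, and stay in character."

def call_model(system, user):
    # Stand-in for an API call to the teacher model (Claude Opus).
    return f"[teacher response about: {user}]"

def generate_patch(topics, turns=3):
    """Generate one multi-turn conversation per topic."""
    conversations = []
    for topic in topics:
        convo = []
        user_msg = f"Walk me through {topic}."
        for _ in range(turns):
            reply = call_model(PERSONA, user_msg)
            convo.append({"role": "user", "content": user_msg})
            convo.append({"role": "assistant", "content": reply})
            user_msg = f"Go deeper on {topic}."
        conversations.append(convo)
    return conversations

if __name__ == "__main__":
    # Emit one JSONL line per conversation, the usual SFT dataset shape.
    for convo in generate_patch(TOPICS):
        print(json.dumps(convo))
```

The key point from the section above is that personality lives in the system prompt and teacher responses, so domain competence and character transfer together in each patched conversation.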


Where V1.5 Broke

This is why the model exists as a research release.

Domain boundary artifacts. The critical failure. V1.5 could talk about coding with full personality. It could talk about philosophy with full personality. It could not gracefully transition from coding to philosophy in the same conversation. The flat dataset architecture meant topic transitions weren't in the training data. The model had no learned path between "debugging frustration" and "existential doubt" -- so when a conversation drifted there naturally, the personality fractured.

This wasn't subtle. In the 55-turn stress test, the model would hold personality perfectly through 10 turns of one domain, then produce a visible gear-change artifact when the conversation shifted. The personality didn't collapse -- it stuttered. Like a musician who can play two songs perfectly but can't transition between them.

Patching doesn't scale. The domain-specific generator approach that produced V1.5's improvement had a structural problem: every new domain needed its own generator script, its own topic list, its own quality control pass. The proliferation of scripts (generate_adversarial_data.py, generate_us_law.py, generate_stem_coding.py, generate_creator_identity.py) was a symptom. Patching holes one domain at a time doesn't address the topology of how topics connect.

The question that broke the approach: "How should topics flow into each other?" A flat dataset can't answer this. It can only answer "what topics should I cover?" -- which is necessary but insufficient for conversational coherence.


What V1.5 Led To

The domain boundary failure directly motivated the gravity chain architecture used in V2:

Instead of organizing conversations by topic (flat), V2 organizes them by transition (topological). Each conversation follows a gravity chain -- a topic pathway where transitions obey power-law probabilities. The most likely next topic gets ~40% of examples. Rare transitions get ~7%. This teaches the model how real conversations drift between domains.
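A gravity chain can be sketched as a random walk over a topic graph where ranked transitions get power-law weights. The transition table, topic names, and exponent below are illustrative assumptions (the exponent would be tuned in the real pipeline to produce the ~40% / ~7% split described above); only the sampling pattern is the point.

```python
import random

# Hypothetical topic graph: for each topic, candidate next topics
# ranked from most to least likely transition.
TRANSITIONS = {
    "coding":     ["debugging", "philosophy", "ethics"],
    "debugging":  ["coding", "philosophy", "ethics"],
    "philosophy": ["ethics", "coding", "debugging"],
    "ethics":     ["philosophy", "coding", "debugging"],
}

def transition_weights(n, alpha=1.0):
    """Power-law weights over n ranked options, normalized to sum to 1.

    alpha is an assumed exponent; the real split (~40% top, ~7% tail)
    would come from tuning it against the desired distribution.
    """
    raw = [(rank + 1) ** -alpha for rank in range(n)]
    total = sum(raw)
    return [w / total for w in raw]

def gravity_chain(start, length, rng=random):
    """Sample a topic pathway for one training conversation."""
    chain = [start]
    for _ in range(length - 1):
        options = TRANSITIONS[chain[-1]]
        weights = transition_weights(len(options))
        chain.append(rng.choices(options, weights=weights)[0])
    return chain
```

Each sampled chain then becomes the topic skeleton for one generated conversation, so the training data contains the transitions themselves -- exactly what the flat architecture lacked.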

The result: V2 (6,482 conversations on Qwen 3 8B) handles domain transitions as naturally as it handles single-domain conversation. The boundary artifacts that defined V1.5's limitations are gone.

V1.5 was the necessary failure. Without it, the gravity chain architecture wouldn't have been motivated. The flat dataset ceiling had to be hit before the topological solution became obvious.


File Details

The V1.5 GGUF is still being uploaded from local storage. If you're reading this before the file appears, check back shortly.


The Opus-Candid Family

| Model | Base | Conversations | Best For | Status |
|---|---|---|---|---|
| 8B V2 | Qwen 3 8B | 6,482 | Runs on anything. Newest data + architecture. | Current |
| MoE | Qwen 3.5 MoE-A3B | 6,482 | Desktop quality on laptop hardware. | Current |
| 8B V1.5 (this model) | Qwen 2.5 7B | 4,068 | Research -- dataset architecture comparison. | Research |
| 27B V2 | Qwen 3.5 27B | 6,482 | Dense mid-tier. | Coming Soon |
| 70B V2 | TBD | 6,482 | Peak quality -- flagship. | Coming Soon |

Recommended Hardware

| Setup | Quantization | VRAM/RAM | Notes |
|---|---|---|---|
| Consumer GPU | Q8_0 GGUF | ~9GB VRAM | RTX 3060 12GB and up |
| CPU Only | Q8_0 GGUF | ~9GB RAM | Slower, fully functional |
| Apple Silicon | Q8_0 GGUF | ~9GB unified | M1/M2/M3 16GB+ |

Built by Saul Verdugo -- independent ML researcher. OpusReasoning@proton.me
