BrighTO-Voice: AI-Powered Voice Analytics Platform

Overview

BrighTO-Voice is a voice analytics system developed by BrighTO Technology that turns voice interactions into structured, actionable information. Built on a hybrid architecture that combines a WavLM audio encoder with a Qwen2.5 language model, it analyzes voice recordings in near real time and returns a comprehensive speaker profile as structured JSON.

Key Features

BrighTO-Voice goes beyond simple transcription to characterize the speaker and the context of the conversation.

Category	Extracted Attributes
Speaker Profile	Gender, age group (child, teen, young, adult, middle, senior)
Emotional State	Valence, arousal, and top-3 emotions with confidence scores
Attitude Analysis	Cooperative, defensive, frustrated, neutral, and 12+ other attitudes
Voice Characteristics	Pitch, energy, speaking speed, tension, breathiness
QC & Risk Assessment	Teller/agent score and customer risk evaluation
Audio Quality	Background noise, distortion, speech overlap, stuttering

Model Architecture

BrighTO-Voice uses a multimodal architecture designed for production deployment.

Components

Component	Description
Audio Encoder	WavLM-Large (315M params) - Captures rich acoustic and paralinguistic features
Audio Projector	Custom projector with sinusoidal positional encoding (outputs 64 audio tokens)
Language Model	Qwen2.5-1.5B-Instruct - Logical reasoning and structured JSON generation
Adapter	LoRA fine-tuning for efficient domain adaptation

Architecture Highlights

Full WavLM Encoder Training: Rather than freezing the encoder, the entire WavLM encoder is trained for task-specific audio understanding
Modality Dropout (50%): Forces the model to rely on audio features rather than text shortcuts
Memory-Efficient Design: Internal projector dimension of 1024, followed by an output projection into the LLM embedding space
ChatML Format: Proper instruction following with Qwen's native chat template
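
The sinusoidal positional encoding used by the projector can be illustrated with the standard formulation from the Transformer literature. This is a minimal sketch, not BrighTO's implementation: the function name `sinusoidal_positions` is hypothetical, and pairing 64 positions with a width of 1024 assumes the encoding is applied to the 64 audio tokens at the projector's internal dimension.

```python
import math

def sinusoidal_positions(num_positions: int, dim: int) -> list[list[float]]:
    """Return a num_positions x dim table of standard sinusoidal encodings.

    For each even column index i, column i holds sin(pos / 10000^(i/dim))
    and column i + 1 holds the matching cosine.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        table.append(row)
    return table

# Assumed shapes: 64 audio tokens, internal projector width 1024.
encodings = sinusoidal_positions(num_positions=64, dim=1024)
print(len(encodings), len(encodings[0]))
```

Because the table depends only on position and width, it can be computed once and reused for every input.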

Training

Dataset

Attribute	Value
Total Samples	500,000+ diverse audio samples
Languages	Multilingual (optimized for English and Vietnamese)
Duration Range	1-25 seconds per sample
Label Source	Expert annotation + Gemini 2.0 synthesis
Augmentation	MUSAN noise, RIR reverb, codec simulation

Training Configuration

Parameter	Value
Hardware	4× NVIDIA H200 141GB
Precision	BF16 mixed precision
Batch Size	24 per GPU (effective 96)
Epochs	3
WavLM LR	1e-6 (gentle adaptation)
Projector LR	1e-4
LoRA LR	2e-5
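
The per-component learning rates and the batch arithmetic in the table can be collected in a small config object. The sketch below is illustrative only: the `TrainingConfig` class and its field names are hypothetical, and the actual training script is not part of this card.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingConfig:
    """Values copied from the training configuration table."""
    num_gpus: int = 4
    batch_size_per_gpu: int = 24
    epochs: int = 3
    wavlm_lr: float = 1e-6      # audio encoder: smallest learning rate
    projector_lr: float = 1e-4  # audio projector: largest learning rate
    lora_lr: float = 2e-5       # LoRA adapter

    @property
    def effective_batch_size(self) -> int:
        # Assumes plain data parallelism with no gradient accumulation.
        return self.num_gpus * self.batch_size_per_gpu

    def lr_groups(self) -> dict[str, float]:
        """Learning rate per component, e.g. for building optimizer parameter groups."""
        return {
            "wavlm": self.wavlm_lr,
            "projector": self.projector_lr,
            "lora": self.lora_lr,
        }

config = TrainingConfig()
print(config.effective_batch_size)  # 96, matching the table
```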

Performance Metrics

The figures below are self-reported on BrighTO's internal evaluation data.

Metric	Value	Description
Validation Loss	0.0627	Down from an earlier plateau at 0.09
Perplexity	1.06	Low per-token uncertainty in the generated output
Gender Accuracy	100%	All evaluated samples classified correctly
Emotion Top-3	100%	All top-3 emotions correctly identified
JSON Validity	100%	All outputs are valid JSON
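
The validation loss and perplexity in the table are consistent with each other if, as is conventional for language models, perplexity is the exponential of the mean per-token cross-entropy loss in nats. The card does not state this definition, so treat it as an assumption:

```python
import math

# From the metrics table; assumed to be mean cross-entropy in nats.
validation_loss = 0.0627
perplexity = math.exp(validation_loss)
print(f"{perplexity:.2f}")  # 1.06
```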

Usage

Input/Output Specifications

Specification	Value
Supported Formats	WAV, MP3, FLAC, OGG
Sample Rate	16kHz (recommended)
Max Duration	25 seconds (longer files auto-chunked)
Output Format	Structured JSON
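
The card says that files longer than 25 seconds are chunked automatically but does not describe how. One plausible scheme, shown purely as an assumption, is to split the 16 kHz waveform into consecutive non-overlapping windows of at most 25 seconds. The function name `chunk_bounds` is hypothetical.

```python
SAMPLE_RATE = 16_000    # Hz, the recommended input rate
MAX_CHUNK_SECONDS = 25  # maximum duration per model call

def chunk_bounds(num_samples: int,
                 sample_rate: int = SAMPLE_RATE,
                 max_seconds: int = MAX_CHUNK_SECONDS) -> list[tuple[int, int]]:
    """Return (start, end) sample indices of consecutive, non-overlapping chunks."""
    chunk_len = sample_rate * max_seconds
    return [(start, min(start + chunk_len, num_samples))
            for start in range(0, num_samples, chunk_len)]

# A 60-second recording yields two full chunks and one 10-second remainder.
print(chunk_bounds(60 * SAMPLE_RATE))
# [(0, 400000), (400000, 800000), (800000, 960000)]
```

Hard cuts like these can split a word or an emotional cue across two chunks; overlapping windows or silence-based splitting are common alternatives.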

Inference Example

import torch
import torchaudio
from model import VoiceAnalysisLLM  # local module, assumed to come with the repository files

# Load the model and switch it to evaluation mode
model = VoiceAnalysisLLM.from_pretrained(
    "thusinh1969/EMO_BrighTO_V1.0_PROD",
    device="cuda",
)
model.eval()

# Load the audio, down-mix to mono, then resample to 16 kHz if needed
audio, sr = torchaudio.load("customer_call.wav")  # shape: (channels, samples)
audio = audio.mean(dim=0)
if sr != 16000:
    audio = torchaudio.functional.resample(audio, sr, 16000)

# Generate the speaker profile without tracking gradients
prompt = "Analyze the speaker's voice characteristics. Output valid JSON."
with torch.no_grad():
    result = model.generate(
        audio=audio.cuda(),
        prompt=prompt,
        temperature=0.0,  # deterministic decoding
        max_new_tokens=512,
    )

print(result)

Sample Output

{
  "speaker": {
    "gender": "male",
    "age": "adult"
  },
  "emotion": {
    "valence": -0.4,
    "arousal": 0.3,
    "top3": [
      {"e": "sadness", "s": 0.5},
      {"e": "neutral", "s": 0.3},
      {"e": "calm", "s": 0.2}
    ]
  },
  "attitude": {
    "top3": [
      {"a": "neutral", "s": 0.6},
      {"a": "empathetic", "s": 0.2},
      {"a": "cooperative", "s": 0.2}
    ]
  },
  "voice": {
    "pitch": "mid",
    "energy": 0.4,
    "speed": 0.5,
    "tension": 0.3
  },
  "quality": {
    "noise": 0.1,
    "distortion": 0.05,
    "overlap": 0.0
  },
  "notes": "Subdued tone with possible disappointment or resignation."
}
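
Downstream code will typically parse this JSON and pick out a few fields. A minimal sketch using only the standard library follows. It assumes the model's output is available as a JSON string, which the card does not state explicitly, and the helper name `summarize_profile` is hypothetical. The field names follow the sample output.

```python
import json

def summarize_profile(raw: str) -> dict:
    """Parse a profile and return the speaker info plus the top emotion and attitude."""
    profile = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    top_emotion = max(profile["emotion"]["top3"], key=lambda item: item["s"])
    top_attitude = max(profile["attitude"]["top3"], key=lambda item: item["s"])
    return {
        "gender": profile["speaker"]["gender"],
        "age": profile["speaker"]["age"],
        "emotion": top_emotion["e"],
        "attitude": top_attitude["a"],
    }

# Abbreviated version of the sample output.
raw = """
{
  "speaker": {"gender": "male", "age": "adult"},
  "emotion": {"valence": -0.4, "arousal": 0.3,
              "top3": [{"e": "sadness", "s": 0.5},
                       {"e": "neutral", "s": 0.3},
                       {"e": "calm", "s": 0.2}]},
  "attitude": {"top3": [{"a": "neutral", "s": 0.6},
                        {"a": "empathetic", "s": 0.2},
                        {"a": "cooperative", "s": 0.2}]}
}
"""
print(summarize_profile(raw))
# {'gender': 'male', 'age': 'adult', 'emotion': 'sadness', 'attitude': 'neutral'}
```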

Industry Applications

BrighTO-Voice is designed for high-stakes enterprise environments.

Industry	Applications
Banking & Finance	Fraud detection via stress patterns, customer risk assessment, VIP identification
Call Centers	100% automated QA coverage (vs 2-5% manual), agent scoring, escalation prediction
Healthcare	Patient mental state assessment, telemedicine enhancement, elder care monitoring
Insurance	Claims fraud detection, customer satisfaction tracking, credibility evaluation
Media & Entertainment	Content moderation, real-time emotion detection in gaming

Technical Specifications

Specification	Production	Lite/Edge
Latency	~2 seconds	~0.8 seconds
Throughput	1000+ calls/minute (vLLM)	200+ calls/minute
VRAM Requirement	~6GB	~3GB
Model Size	1.5B parameters	0.5B parameters
Deployment	Cloud, On-Premise	Edge, Mobile
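
Taken together, the latency and throughput figures imply how many requests must be in flight at once. By Little's law, average concurrency equals throughput times latency. The sketch below applies that to the table's figures. It assumes the quoted latency is the average end-to-end time per call at the quoted throughput, which the card does not confirm, and the function name `required_concurrency` is hypothetical.

```python
import math

def required_concurrency(calls_per_minute: float, latency_seconds: float) -> int:
    """Little's law: in-flight requests = arrival rate (per second) * time in system."""
    return math.ceil(calls_per_minute / 60 * latency_seconds)

print(required_concurrency(1000, 2.0))  # Production: 34
print(required_concurrency(200, 0.8))   # Lite/Edge: 3
```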

Product Suite

BrighTO offers a suite of voice AI modules for on-premise or API deployment.

Module	Description
ASR (Speech Recognition)	Multilingual recognition optimized for real-world noise
TTS (Text-to-Speech)	Vietnamese standard & advanced (code-switching, cloning, expression)
Speaker Verification	Basic to banking-grade security
Anti-Spoofing	Basic to security-grade deepfake detection
Audio Profiler	This model, in Production and Lite/Edge variants

Commercial & Licensing

License

Commercial / Proprietary. Usage, redistribution, or derivative works require written approval from BrighTO Technology.

Deployment Options

Full License: Complete package licensing
API Access: Pay-per-use API integration
Custom Integration: Deployment, optimization, and quality monitoring support

Authorized Distributor

SphinX Joint Stock Company (sphinxjsc.com) is authorized to package the model, provide API services, and distribute it according to customer requirements.

Contact

Purpose	Contact
Commercial Inquiries	`nguyen@brighto.ai`, `nghia@brighto.ai`
API & Distribution (SphinX)	`duc@sphinxjsc.com`
Technical Support	`nguyen@hatto.ai`

Citation

@misc{brighto-voice-2026,
  title={BrighTO-Voice: AI-Powered Voice Analytics Platform},
  author={BrighTO Technology},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/thusinh1969/EMO_BrighTO_V1.0_PROD}
}

Empowering Businesses Through Voice Intelligence

Model tree for thusinh1969/BrighTO_Audio_Profiler_V1.0_PROD

Base model: Qwen/Qwen2.5-1.5B
Finetuned: Qwen/Qwen2.5-1.5B-Instruct
Finetuned: this model

Evaluation results (self-reported, BrighTO Internal, 500K samples)

Metric	Value
Validation Loss	0.063
Perplexity	1.060
Gender Accuracy	1.000
Emotion Top-3 Accuracy	1.000