SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization

Anonymous Authors
SAC Model Architecture

Overview of SAC: Semantic and speaker feature supervision are applied only during codec training, with their respective encoders kept frozen to preserve the integrity of extracted features.

Abstract

Speech codecs that convert continuous speech signals into discrete tokens have become essential for speech language models (SLMs). However, existing codecs struggle to balance high-quality reconstruction with semantically rich representations, limiting their effectiveness in both generative and understanding tasks. In this work, we propose SAC, a neural speech codec with semantic–acoustic dual-stream quantization. By disentangling semantic and acoustic modeling into two dedicated streams, SAC enables each to be optimized for its respective role. To further enhance timbre modeling, we introduce explicit speaker feature supervision during codec training. Comprehensive evaluations show that SAC achieves strong reconstruction performance across diverse bitrates under both clean and noisy conditions, with particularly high scores on UTMOS and WER, demonstrating superior perceptual quality and intelligibility. Moreover, SAC substantially outperforms state-of-the-art codecs in semantic representation, achieving a level comparable to that of self-supervised learning (SSL) continuous embeddings. Finally, our analysis of speech disentanglement highlights the effectiveness of the dual-stream design, offering new potential for controllable speech applications.
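To make the dual-stream idea concrete, below is a minimal, illustrative sketch of semantic-acoustic dual-stream quantization: two parallel streams of continuous features are each discretized against their own codebook via nearest-neighbor vector quantization, and the decoder consumes both quantized streams. This is a toy numpy sketch under our own assumptions (random features and codebooks, simple concatenation before decoding), not the SAC implementation; the actual model learns the encoders, codebooks, and decoder jointly with semantic and speaker supervision.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(frames, codebook):
    """Nearest-neighbor vector quantization: map each frame to its
    closest codebook entry (L2 distance); return token indices and
    the corresponding quantized vectors."""
    # frames: (T, D), codebook: (K, D)
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = dists.argmin(axis=1)          # (T,) discrete tokens
    return idx, codebook[idx]           # quantized vectors: (T, D)

# Toy continuous features for one utterance: T frames, D dims per stream.
T, D, K = 50, 16, 256
semantic_feats = rng.normal(size=(T, D))   # stand-in for semantically rich features
acoustic_feats = rng.normal(size=(T, D))   # stand-in for acoustic-detail features

# Each stream gets its own codebook, so each can specialize in its role.
semantic_codebook = rng.normal(size=(K, D))
acoustic_codebook = rng.normal(size=(K, D))

sem_tokens, sem_vecs = quantize(semantic_feats, semantic_codebook)
aco_tokens, aco_vecs = quantize(acoustic_feats, acoustic_codebook)

# A decoder would reconstruct speech from both quantized streams;
# here we simply concatenate them as the decoder input.
decoder_input = np.concatenate([sem_vecs, aco_vecs], axis=-1)
print(sem_tokens.shape, aco_tokens.shape, decoder_input.shape)
```

The two token sequences (`sem_tokens`, `aco_tokens`) illustrate the codec's discrete output: the semantic stream can be consumed alone for understanding tasks, while both streams together support full-quality reconstruction.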

High-Bitrate Speech Codec Samples

Table 1. Speech reconstruction samples from high-bitrate codec models. For models available at multiple bitrates, the high-bitrate version is used. Samples 1-2 are clean, Samples 3-4 are noisy, and Samples 5-6 are Chinese utterances.
Ground Truth SAC X-Codec2 MagiCodec WavTokenizer X-codec XY-Tokenizer BigCodec SemantiCodec
sample1
sample2
sample3
sample4
sample5
sample6

Low-Bitrate Speech Codec Samples

Table 2. Speech reconstruction examples from low-bitrate codec models. For models available at multiple bitrates, the low-bitrate version is used.
Ground Truth SAC WavTokenizer XCodec SpeechTokenizer (RVQ-1) SemantiCodec
sample1
sample2
sample3
sample4
sample5
sample6

Speech Disentanglement Examples

Table 3. Speech disentanglement examples from different speech codecs. “x-F” denotes full reconstruction using codec x, while “x-S” denotes semantic-only reconstruction using codec x.
Ground Truth SAC-F SAC-S SemantiCodec-F SemantiCodec-S SpeechTokenizer (RVQ-1:8) SpeechTokenizer (RVQ-1)
sample1
sample2
sample3
sample4