Separate and Reconstruct:
Asymmetric Encoder-Decoder for Speech Separation

Demo

WSJ0_MIX

[Audio demo: mixed input, reference audio for Speaker 1 and Speaker 2, and separated outputs for Speaker 1 and Speaker 2, each with its spectrogram.]

WHAM

[Audio demo: mixed input, reference audio for Speaker 1 and Speaker 2, and separated outputs for Speaker 1 and Speaker 2, each with its spectrogram.]

WHAMR

[Audio demo: mixed input, reference audio for Speaker 1 and Speaker 2, and separated outputs for Speaker 1 and Speaker 2, each with its spectrogram.]

Abstract

Since the recent success of time-domain speech separation, further improvements have been made by expanding the length and channel dimensions of the feature sequence, at the cost of increased computation. When the feature sequence is temporally long, most speech separation studies segment it into chunks and process it as a dual-path model. In particular, the process that separates the features of each speaker is commonly placed at the final stage of the network. However, it is more advantageous and intuitive to expand the feature sequence proactively, adding the number of speakers as an extra dimension. In this paper, we present an asymmetric strategy in which the encoder and decoder are partitioned to perform distinct roles in the separation task. The encoder analyzes the mixture features, and its output is split according to the number of speakers to be separated. The split sequences are then reconstructed by a weight-shared decoder, operating as a Siamese network, together with cross-speaker processing. By using the Siamese decoder without explicit speaker information, the network learns to discriminate the features directly from the separation objective. Through a common split layer, the intermediate encoder features used for skip connections are also split for the reconstruction decoder, following the U-Net structure. In addition, instead of segmenting the feature sequence into chunks and processing it as a dual path, we design global and local Transformer blocks that directly process long sequences. The experimental results demonstrate that this separation-and-reconstruction framework is effective, and that the combination of the proposed global and local Transformer blocks can sufficiently replace the inter- and intra-chunk processing of the dual-path structure. Finally, the presented model, incorporating both components, achieved state-of-the-art performance on various benchmark datasets with far less computation than previous models.

Summary

Introduction

The paper presents a novel approach to speech separation using an asymmetric encoder-decoder network named SepReformer. The goal is to improve the efficiency and performance of separating mixed speech signals into individual components. Traditional methods, like the dual-path model, segment long sequences into chunks, which increases computational load and complexity. The proposed SepReformer addresses these issues with a more direct and efficient approach.

Key Methods

1. Asymmetric Encoder-Decoder Structure

  • Encoder: Processes the input signal to generate feature representations.
  • Early Split: The encoded feature sequence is split early in the network, corresponding to the number of speakers, which allows the network to handle each speaker's features separately from an early stage.
  • Decoder: Reconstructs the separated speech signals from the encoded features. Uses a shared decoder structure inspired by Siamese networks to handle the split sequences.
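The flow above can be sketched at the shape level. This is a minimal illustration, not the actual SepReformer layers: the "encoder" and "decoder" below are trivial stand-ins (the real model uses Transformer blocks), and the names are hypothetical. The point is where the split happens and that one decoder parameter set serves every speaker branch.

```python
# Shape-level sketch of the early-split, weight-shared decoder idea.
# Features are plain nested lists of shape (T, C): T time steps, C channels.

def encoder(mixture_feats):
    """Stand-in encoder: identity over a (T, C) feature list."""
    return mixture_feats

def split_layer(feats, num_spk=2):
    """Early split: add a speaker dimension -> shape (num_spk, T, C).
    A real model would use learned per-speaker projections; here we
    simply duplicate the features to show the resulting shape."""
    return [list(feats) for _ in range(num_spk)]

class SharedDecoder:
    """Weight-shared (Siamese) decoder: ONE parameter set applied to
    every speaker branch."""
    def __init__(self, scale=1.0):
        self.scale = scale  # the single shared 'weight'

    def __call__(self, branch):
        return [[self.scale * x for x in frame] for frame in branch]

T, C = 5, 4
mix = [[0.1] * C for _ in range(T)]
enc = encoder(mix)
branches = split_layer(enc, num_spk=2)  # (2, T, C): split early, not at the end
dec = SharedDecoder()
outs = [dec(b) for b in branches]       # same decoder weights reused per branch
print(len(branches), len(outs[0]), len(outs[0][0]))  # 2 5 4
```

Note that the separation happens right after the encoder, whereas conventional separators place the speaker split at the final stage of the network.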

2. Weight-Sharing Siamese Network

The decoder uses a weight-sharing mechanism across multiple branches, similar to a Siamese network, to process the separated features. Because the branches share a single set of parameters, no speaker-specific weights or speaker information are needed; the network instead learns to discriminate between speakers' features directly from the separation objective.

3. Transformer Blocks

  • Global Transformer: Captures long-range dependencies in the sequence.
  • Local Transformer: Focuses on fine-grained, short-term dependencies.
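The difference between the two blocks can be shown through their attention scope. The sketch below builds only the attention *masks* (not the full layers), and the window size is a made-up example value, not the paper's setting.

```python
# Illustrative attention-scope masks for global vs. local Transformer blocks.

def global_mask(seq_len):
    """Global block: every position may attend to every other position,
    capturing long-range dependencies (the role of inter-chunk processing
    in dual-path models)."""
    return [[True] * seq_len for _ in range(seq_len)]

def local_mask(seq_len, window=4):
    """Local block: each position attends only within a local window,
    capturing fine-grained short-term structure (the role of intra-chunk
    processing in dual-path models)."""
    return [[abs(i - j) < window for j in range(seq_len)]
            for i in range(seq_len)]

g = global_mask(8)
loc = local_mask(8, window=2)
print(sum(g[0]))    # 8 -> position 0 sees the whole sequence
print(sum(loc[0]))  # 2 -> position 0 sees only itself and position 1
```

Together, the two blocks process the long feature sequence directly, without segmenting it into chunks.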

4. Cross-Speaker (CS) Module

This module within the decoder allows interaction between features of different speakers, enhancing the network's ability to separate and reconstruct overlapping speech elements.
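A toy version of such an exchange is sketched below. The mean-residual update is a deliberately simple stand-in for the paper's cross-speaker processing; the mixing coefficient of 0.1 is arbitrary.

```python
# Toy cross-speaker (CS) step: each speaker branch is updated with a
# residual contribution from the other branches, so the branches can
# coordinate on overlapping speech (scalar features for brevity).

def cross_speaker(branches, alpha=0.1):
    num_spk = len(branches)
    num_frames = len(branches[0])
    out = []
    for s in range(num_spk):
        updated = []
        for t in range(num_frames):
            # context = mean of the OTHER speakers' features at this frame
            others = sum(branches[k][t] for k in range(num_spk) if k != s)
            context = others / (num_spk - 1)
            updated.append(branches[s][t] + alpha * context)
        out.append(updated)
    return out

b = [[1.0, 1.0], [3.0, 5.0]]  # 2 speakers, 2 time steps
print(cross_speaker(b))        # each branch now carries the other's context
```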

5. Multi-Loss Training

The network is trained with a multi-loss objective, where intermediate layers are optimized to progressively refine the separated outputs. This progressive reconstruction approach helps in achieving better performance by guiding the network to focus on discriminative learning at multiple stages.
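A common objective for this kind of training is SI-SNR, and a multi-loss version sums it over stage outputs. The sketch below is a generic illustration of that idea; the stage weights and signals are invented examples, not the paper's configuration.

```python
import math

# SI-SNR and a multi-stage ("multi-loss") objective: each intermediate
# stage estimate is scored against the reference, so earlier layers are
# pushed toward progressive reconstruction.

def si_snr(est, ref):
    """Scale-invariant SNR in dB between an estimate and a reference."""
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    target = [dot / ref_energy * r for r in ref]      # projection onto ref
    noise = [e - t for e, t in zip(est, target)]
    t_pow = sum(x * x for x in target)
    n_pow = sum(x * x for x in noise)
    return 10 * math.log10(t_pow / n_pow)

def multi_loss(stage_outputs, ref, weights):
    # negative SI-SNR so that minimizing the loss maximizes SI-SNR
    return sum(w * -si_snr(out, ref)
               for w, out in zip(weights, stage_outputs))

ref = [1.0, -1.0, 0.5, -0.5]
stages = [[0.8, -0.7, 0.6, -0.3],     # rough intermediate estimate
          [0.99, -1.01, 0.5, -0.49]]  # near-perfect final estimate
loss = multi_loss(stages, ref, weights=[0.3, 1.0])
print(loss < 0)  # both stages recover the signal well, so the loss is negative
```

Weighting intermediate stages less than the final output lets the final estimate dominate while still supervising the progressive reconstruction.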

Results and Evaluation

The SepReformer model demonstrates state-of-the-art performance on various benchmark datasets, including WSJ0-2Mix, WHAM!, WHAMR!, and LibriMix. Key findings include:

  • Significant reduction in computational load compared to traditional dual-path models.
  • Superior performance in scale-invariant signal-to-noise ratio improvement (SI-SNRi) and other metrics.
  • Effectiveness of the early split and shared decoder structure in improving separation quality.

Conclusion

The proposed SepReformer method offers a novel and efficient approach to speech separation by leveraging an asymmetric encoder-decoder structure, weight-sharing Siamese networks, and Transformer-based sequence processing. This results in improved separation performance with reduced computational requirements, making it a promising solution for real-time speech separation applications.

Limitations

Our study focuses on 2-speaker mixtures, assessing our models across various model sizes and on extensive datasets that include noise and reverberation. Consequently, further investigation is needed to validate scenarios with more than two speakers.

Additionally, an important future direction is to separate mixtures for an unknown number of speakers, as it is impractical to assume that the number of speakers to be separated is known in advance. Finally, although we experimentally validated our SepRe method, we believe that further investigation is necessary to understand its underlying mechanisms.

Future Study

  • Validation for More Than 2-Speaker Mixtures: Further investigation is needed to validate the SepReformer for scenarios involving more than two speakers.
  • Unknown Number of Speakers: Developing methods to separate mixtures with an unknown number of speakers, as it is impractical to assume the number of speakers in advance.
  • Understanding Underlying Mechanisms: Further research is required to understand the underlying mechanisms of the SepRe method, despite its experimental validation.

BibTeX


        @article{
          TBD,
          title={TBD},
          author={TBD},
          journal={TBD},
          year={TBD},
          volume={TBD},
          number={TBD},
          pages={TBD},
          doi={TBD}
        }