Time-domain Speech Separation Networks with Graph Encoding Auxiliary
This time-domain speech separation networks video discusses how end-to-end time-domain speech separation with a masking strategy has shown a performance advantage. A 1-D convolutional layer serves as the speech encoder, encoding a sliding window of the waveform into a latent feature representation, i.e., an embedding vector. A large window leads to low resolution in speech processing; a small window offers high resolution but at the expense of high computational cost. In this work, we propose a graph encoding technique to model the fine structural knowledge of speech samples within a window of reasonable size. Specifically, we build a graph representation for each latent representation and encode its structural details with a graph convolutional network encoder. The encoded graph feature representation complements the original latent feature representation and benefits the separation and reconstruction of speech. Experiments on various models and datasets show that the proposed encoding technique significantly improves speech quality over other time-domain speech encoders.
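The pipeline described above can be sketched as follows. This is a minimal, illustrative NumPy sketch, not the paper's implementation: the windowed linear projection stands in for the learned 1-D convolutional encoder, and the cosine-similarity graph construction and single GCN layer are assumed details, since the video does not specify how the graph is built.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_windows(wave, win, hop, W_enc):
    # Slide a window over the waveform and project each frame with a
    # learned basis (stand-in for the 1-D convolutional encoder).
    frames = np.stack([wave[i:i + win]
                       for i in range(0, len(wave) - win + 1, hop)])
    return np.maximum(frames @ W_enc, 0.0)  # (n_frames, n_filters)

def gcn_layer(X, A, W):
    # One graph-convolution step: adjacency with self-loops,
    # symmetric degree normalisation, linear map, then ReLU.
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(d_inv_sqrt @ A_hat @ d_inv_sqrt @ X @ W, 0.0)

# Toy setup (window/filter sizes are arbitrary choices for the sketch).
win, hop, n_filters = 16, 8, 32
wave = rng.standard_normal(256)
W_enc = rng.standard_normal((win, n_filters)) * 0.1

E = encode_windows(wave, win, hop, W_enc)   # latent representations

# Build a graph over the latent vectors: connect frames whose
# cosine similarity exceeds a threshold (an assumed construction).
normed = E / (np.linalg.norm(E, axis=1, keepdims=True) + 1e-8)
A = (normed @ normed.T > 0.5).astype(float)
np.fill_diagonal(A, 0.0)

# Encode structural details with the GCN, then let the graph
# features complement the original latent features.
W_gcn = rng.standard_normal((n_filters, n_filters)) * 0.1
G = gcn_layer(E, A, W_gcn)
fused = np.concatenate([E, G], axis=1)      # fed to the separator
print(fused.shape)
```

In a real separation network the fused representation would feed the masking network, and the encoder/GCN weights would be learned end-to-end rather than drawn at random.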
This time-domain speech separation networks video introduces a graph representation into latent representation extraction. The video confirms that the graph encoder is effective through a number of experiments.