
asteroid.data.avspeech_dataset module

asteroid.data.avspeech_dataset.get_frames(video)[source]

Read the frames of an opened video stream and return them.
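A minimal usage sketch, assuming video is an opened cv2.VideoCapture (inferred from the signature, not stated in the docs); the path below is a placeholder:

    import cv2  # assumption: OpenCV is used to open the clip

    from asteroid.data.avspeech_dataset import get_frames

    capture = cv2.VideoCapture("clips/speaker0.mp4")  # placeholder path
    frames = get_frames(capture)  # decoded frames of the clip
    capture.release()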
class asteroid.data.avspeech_dataset.Signal(video_path: Union[str, pathlib.Path], audio_path: Union[str, pathlib.Path], embed_dir: Union[str, pathlib.Path], sr=16000, video_start_length=0, fps=25, signal_len=3)[source]

Bases: object

This class holds the video frames and the audio signal.

Parameters:
  • video_path (str, Path) – Path to the video file (mp4).
  • audio_path (str, Path) – Path to the audio file (wav).
  • embed_dir (str, Path) – Path to the directory that stores the embeddings.
  • sr (int) – Sampling rate of the audio.
  • video_start_length (int) – Index of the video part to use (see the note below).
  • fps (int) – Frame rate of the video.
  • signal_len (int) – Length of the signal in seconds.

Note

Each video consists of multiple parts, each containing fps * signal_len frames.

get_embed()[source]

Return the face embedding for this part, loaded from embed_dir.

get_audio()[source]

Return the audio samples of this part.
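A minimal usage sketch for Signal (the file paths are placeholders, and the comments on the returned values are assumptions based on the defaults above):

    from asteroid.data.avspeech_dataset import Signal

    # One 3-second part of a clip: 16 kHz audio, 25 fps video.
    signal = Signal(
        video_path="clips/speaker0.mp4",  # placeholder path
        audio_path="clips/speaker0.wav",  # placeholder path
        embed_dir="embeddings/",          # placeholder directory
        sr=16000,
        video_start_length=0,  # first part, i.e. the first fps * signal_len = 75 frames
        fps=25,
        signal_len=3,
    )

    audio = signal.get_audio()  # assumed: sr * signal_len = 48000 samples
    embed = signal.get_embed()  # assumed: pre-computed face embedding for this part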
class asteroid.data.avspeech_dataset.AVSpeechDataset(input_df_path: Union[str, pathlib.Path], embed_dir: Union[str, pathlib.Path], n_src=2)[source]

Bases: torch.utils.data.Dataset

Audio-visual speech separation dataset as described in [1].

Parameters:
  • input_df_path (str, Path) – Path to the CSV file listing the mixture combinations.
  • embed_dir (str, Path) – Path to the directory where the embeddings are stored.
  • n_src (int) – Number of sources.
References
[1] Ariel Ephrat et al., “Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation”, https://arxiv.org/abs/1804.03619
dataset_name = 'AVSpeech'[source]
static encode(x: numpy.ndarray, p=0.3, stft_encoder=None, EPS=1e-08)[source]

Transform a waveform into a power-law compressed time-frequency representation: an STFT is taken (a default encoder is built if stft_encoder is None) and its values are compressed with exponent p, following [1].

static decode(tf_rep: numpy.ndarray, p=0.3, stft_decoder=None, final_len=48000)[source]

Invert encode(): undo the power-law compression, apply the inverse STFT (a default decoder is built if stft_decoder is None), and trim or pad the waveform to final_len samples.
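A minimal end-to-end sketch, assuming a PyTorch DataLoader over the dataset and the default STFT settings in encode/decode; the CSV path, the embedding directory, and the comments on shapes are placeholders/assumptions:

    import numpy as np
    from torch.utils.data import DataLoader

    from asteroid.data.avspeech_dataset import AVSpeechDataset

    dataset = AVSpeechDataset(
        input_df_path="data/train_combinations.csv",  # placeholder path
        embed_dir="embeddings/",                      # placeholder directory
        n_src=2,
    )
    loader = DataLoader(dataset, batch_size=4, shuffle=True)

    # Round-trip a dummy 3-second, 16 kHz waveform through the
    # power-law compressed STFT representation (p=0.3, as in [1]).
    x = np.random.randn(48000).astype("float32")
    tf_rep = AVSpeechDataset.encode(x, p=0.3)
    x_hat = AVSpeechDataset.decode(tf_rep, p=0.3, final_len=48000)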