Exploring CNN-based architectures for Multimodal Salient Event Detection in Videos