Text this: Multimodal Sensor-Input Architecture with Deep Learning for Audio-Visual Speech Recognition in Wild