Separating the “chirp” from the “chat”: self-supervised visual grounding of sound and language