Summary: | Causal inference is an active area of research in computer science and statistics as it is used to understand casual conclusions that traditional statistics cannot. A naive way to conclude the cause of an outcome is by using correlations, but this is not always accurate because there may be other variables that indirectly affect an outcome. Causal inference aims to find the root cause by considering those variables called confounders. Frequently, confounding variables are attributes in existing data, but sometimes they can be missing from the existing data. In those cases, data analysts have to look for confounders from outside sources such as tables, knowledge graphs, and text. Our focus is to look for confounding variables from visual data such as videos and images. Discovering confounders from visual data is a challenge because videos and images are unstructured unlike tables and graphs. Thus, it is difficult to identify features and also extract them from visual data. Additionally, the identified and extracted features must be relevant to the casual question being studied. With the recent advancement in visual language models (VLMs) such as GPT-4V(ision), VLMs can provide a versatile solution to the confounder discovery and feature extraction problem when using visual data. This thesis proposal investigates confounder discovery, feature extraction, and casual inference from visual data by utilizing the power of VLMs.
|