Dynamic Neural Network for Efficient Video Recognition
Main Author: | Pan, Bowen |
---|---|
Other Authors: | Oliva, Aude |
Format: | Thesis |
Published: | Massachusetts Institute of Technology, 2023 |
Online Access: | https://hdl.handle.net/1721.1/151649 |
author | Pan, Bowen |
author2 | Oliva, Aude |
collection | MIT |
description | Recognizing real-world videos is a challenging task that requires the use of deep learning models. These models, however, require extensive computational resources to achieve robust recognition. One of the main challenges when dealing with real-world videos is the high correlation of information across frames. This results in redundancy in either temporal or spatial feature maps of the models, or both. The amount of redundancy largely depends on the dynamics and events captured in the video. For example, static videos typically have more temporal redundancy, while videos focusing on objects tend to have more channel redundancy.
To address this challenge, we propose a novel approach that reduces redundancy by using an input-dependent policy to determine the necessary features along both the temporal and channel dimensions. By doing so, we can identify the most relevant information for each frame, reducing the overall computational load. After computing the necessary features, we reconstruct the remaining, redundant features from the computed ones using cheap linear operations. This not only reduces the computational cost of the model but also keeps the capacity of the original model intact.
Moreover, our proposed approach has the potential to improve the accuracy of real-world video recognition by reducing overfitting caused by the redundancy of information across frames. By focusing on the most relevant information, our model can better capture the unique characteristics of each video, resulting in more accurate predictions. Overall, our approach represents a significant step forward in the field of real-world video recognition and has the potential to enable the development of more efficient and accurate deep learning models for this task. |
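The abstract's scheme — an input-dependent policy that computes only the "necessary" channels exactly and reconstructs the rest with cheap linear operations — can be illustrated with a minimal NumPy sketch. This is not the thesis's actual implementation: the per-frame magnitude policy, the top-K selection, and the random reconstruction matrix `W_cheap` are all simplifying assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C, K = 8, 16, 4                     # frames, channels, channels computed exactly

x = rng.standard_normal((T, C))        # stand-in feature map (spatial dims collapsed)
W_cheap = rng.standard_normal((K, C)) / np.sqrt(K)  # cheap linear reconstruction weights

def dynamic_forward(x):
    """Per frame: keep the K most salient channels, rebuild the rest cheaply."""
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        scores = np.abs(x[t])                 # input-dependent policy (saliency proxy)
        keep = np.argsort(scores)[-K:]        # indices of "necessary" channels
        kept = x[t, keep]                     # these channels are computed exactly
        recon = kept @ W_cheap                # reconstruct all C channels from K kept ones
        recon[keep] = kept                    # necessary channels pass through unchanged
        out[t] = recon
    return out

y = dynamic_forward(x)
```

The key property the sketch preserves is that the policy is input-dependent: a static frame and an object-centric frame would select different channel subsets, matching the abstract's observation that redundancy patterns vary by video content.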
format | Thesis |
id | mit-1721.1/151649 |
institution | Massachusetts Institute of Technology |
publishDate | 2023 |
publisher | Massachusetts Institute of Technology |
department | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science |
degree | S.M. |
dateIssued | 2023-06 |
rights | In Copyright - Educational Use Permitted. Copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/ |
mimetype | application/pdf |
title | Dynamic Neural Network for Efficient Video Recognition |
url | https://hdl.handle.net/1721.1/151649 |