Boston University Libraries OpenBU
    Learning temporal variations for action recognition

    Date Issued: 2020
    Author(s): Zeng, Qili
    Permanent Link: https://hdl.handle.net/2144/41909
    Abstract
    As a core problem in video analysis, action recognition is of great significance for many higher-level tasks, in both research and industrial applications. With more and more video data being produced and shared daily, effective automatic action recognition methods are needed. Although many deep-learning methods have been proposed to solve the problem, recent research reveals that single-stream, RGB-based networks are consistently outperformed by two-stream networks that use both RGB and optical flow as inputs. This dependence on optical flow, which indicates a deficiency in learning motion, is present not only in 2D networks but also in 3D networks. This is somewhat surprising, since 3D networks are explicitly designed for spatio-temporal learning. In this thesis, we hypothesize that this deficiency is caused by difficulties in learning from videos exhibiting strong temporal variations, such as sudden motion, occlusions, acceleration, or deceleration. Temporal variations occur commonly in real-world videos and force a neural network to account for them, yet they are often not useful for recognizing actions at coarse granularity. We propose a Dynamic Equilibrium Module (DEM) for spatio-temporal learning through adaptive Eulerian motion manipulation. The proposed module can be inserted into existing networks with separate spatial and temporal convolutions, such as the R(2+1)D model, to effectively handle temporal video variations and learn more robust spatio-temporal features. We demonstrate performance gains due to the use of DEM in the R(2+1)D model on the miniKinetics, UCF-101, and HMDB-51 datasets.
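
    The abstract notes that DEM plugs into networks that factor a 3D convolution into separate spatial and temporal convolutions, as R(2+1)D does. The following is a minimal PyTorch sketch of such a factored block. The (2+1)D factorization itself is the standard one (a 1x3x3 spatial convolution followed by a 3x1x1 temporal convolution); the temporal_module hook and its placement between the two convolutions are assumptions made for illustration only, since the abstract does not specify where DEM is inserted or how it works internally.

    import torch
    import torch.nn as nn

    class R2Plus1DBlock(nn.Module):
        """Factors a 3D conv into a 2D spatial conv followed by a 1D temporal conv."""
        def __init__(self, in_channels, out_channels, mid_channels=None, temporal_module=None):
            super().__init__()
            # Intermediate channel count; R(2+1)D chooses it to roughly match
            # the parameter count of the full 3D convolution being replaced.
            mid_channels = mid_channels or out_channels
            # Spatial convolution: kernel 1x3x3 over (T, H, W) volumes.
            self.spatial = nn.Conv3d(in_channels, mid_channels,
                                     kernel_size=(1, 3, 3), padding=(0, 1, 1), bias=False)
            self.bn1 = nn.BatchNorm3d(mid_channels)
            # Optional plug-in operating on intermediate features, e.g. a
            # temporal-variation handler such as DEM (hypothetical placement).
            self.temporal_module = temporal_module or nn.Identity()
            # Temporal convolution: kernel 3x1x1.
            self.temporal = nn.Conv3d(mid_channels, out_channels,
                                      kernel_size=(3, 1, 1), padding=(1, 0, 0), bias=False)
            self.bn2 = nn.BatchNorm3d(out_channels)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):  # x: (N, C, T, H, W)
            x = self.relu(self.bn1(self.spatial(x)))
            x = self.temporal_module(x)
            return self.relu(self.bn2(self.temporal(x)))

    # Usage: a batch of two 8-frame clips at 112x112 resolution, 3 channels.
    clip = torch.randn(2, 3, 8, 112, 112)
    block = R2Plus1DBlock(3, 64)
    print(block(clip).shape)  # torch.Size([2, 64, 8, 112, 112])

    Because the spatial and temporal steps are separate layers, a module dropped between them sees features that are spatially processed but not yet temporally mixed, which is a natural point to compensate for temporal variations before the temporal convolution runs.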
    Collections
    • Boston University Theses & Dissertations [6905]

