papers-note's Issues

Actor and Action Video Segmentation from a Sentence [2018-08-02]

https://arxiv.org/pdf/1803.07485.pdf

Textual Encoder

Word2Vec:

uses the model pretrained on GoogleNews

each word = a 300-dimensional vector

each sentence is padded to the same size (e.g. 15x300)

CNN:

  • details:

    temporal filter size = 2x2

    channels = 300 (same as the word2vec representation); a minimal encoding sketch follows the ablation numbers below

  • ablation study:

    51.8 for lstm

    52.1 for bi-lstm

    53.6 for cnn
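
A minimal sketch of the textual encoder described above, assuming the GoogleNews-pretrained word2vec vectors are loaded with gensim; the padding length (15 words), the 300 channels, and the width-2 temporal filter follow the note, while the file name, the pooling, and all variable names are illustrative.

```python
# Sketch of the textual encoder: word2vec embedding + padding + temporal CNN.
import numpy as np
import torch
import torch.nn as nn
from gensim.models import KeyedVectors

MAX_LEN, EMB_DIM = 15, 300   # pad every sentence to 15 words of 300-d vectors

w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def encode_sentence(sentence):
    """Map a sentence to a fixed-size (MAX_LEN x EMB_DIM) matrix, zero-padded."""
    mat = np.zeros((MAX_LEN, EMB_DIM), dtype=np.float32)
    for i, word in enumerate(sentence.lower().split()[:MAX_LEN]):
        if word in w2v:
            mat[i] = w2v[word]
    return torch.from_numpy(mat)

# Temporal convolution over the word dimension (filter spanning 2 adjacent words),
# keeping 300 channels as in the note; max-pooling collapses the words into one vector.
text_cnn = nn.Sequential(
    nn.Conv1d(in_channels=EMB_DIM, out_channels=EMB_DIM, kernel_size=2),
    nn.ReLU(),
    nn.AdaptiveMaxPool1d(1),
)

sent = encode_sentence("a man is riding a bike")      # (15, 300)
feat = text_cnn(sent.t().unsqueeze(0)).squeeze(-1)    # (1, 300) sentence feature
```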

Video Encoder

I3D two-stream

  • details:

I3D last max-pooling layer --> average pooling over the temporal dimension --> L2 norm for each spatial position in the feature map (sketched after the ablation numbers below)

  • ablation study:

    49.5 for flow_only

    53.6 for RGB_only

    55.1 for two-stream
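
A minimal sketch of the feature post-processing in the details bullet above: the I3D feature map is averaged over the temporal dimension and the channel vector at each spatial position is L2-normalized. The tensor shape used here is illustrative.

```python
# Sketch: average the I3D feature map over time, then L2-normalize each spatial position.
import torch
import torch.nn.functional as F

feat = torch.randn(1, 832, 8, 14, 14)    # (batch, channels, T, H, W); shape is illustrative
feat = feat.mean(dim=2)                   # average pooling over the temporal dimension -> (1, 832, 14, 14)
feat = F.normalize(feat, p=2, dim=1)      # L2-normalize the channel vector at every spatial position
```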

tanh() is better

Decoding with dynamic filters


bottom up top down?


Real-time Hand Gesture Detection and Classification Using Convolutional Neural Networks [2019-07-03]

https://arxiv.org/pdf/1901.10323.pdf

Task:

Real-time hand gesture recognition.

Framework:

First, a sliding window (t, t+8) takes 8 frames of the input video as a segment and feeds it to the network to obtain the Detector output (2-dimensional). If the gesture score is above 0.5, the segment contains a gesture, and a 32-frame segment starting at the same time (t, t+32) is fed to the network to obtain the Classifier output (84-dimensional: 83 gesture classes plus None). The maximum of the Classifier output, taken according to the rules below, gives the gesture class of the segment. Post-processing and Single-time Activation are proposed to handle specific failure cases and make the model more accurate.

Sliding window fashion

Video segments are taken with sliding windows:

  • Detector(scale: 8, stride: 1)

  • Classifier (scale: 32, stride: 1)

A Detector + A Classifier

  • Detector (ResNet-10): binary classification (gesture / no gesture); it also acts as the trigger for the Classifier (if a gesture is detected, the segment is passed to the classifier; otherwise the pipeline moves on to the next segment)

  • Classifier (ResNeXt-101): softmax over the predictions decides which gesture class is present

Detector details (Post-processing)

  • Problem: during real-time recognition, a large motion may take the hand out of the frame while the gesture is still in progress, and the Detector's confidence score then drops very low.

  • Solution: Post-processing keeps the confidence scores predicted for the previous segments (4 values in the paper) together with the current confidence score, forming a 5-element array; its median is the final prediction of whether a gesture is present (for the binary classifier, argmax(pre) = 1 means a gesture is present, otherwise not). A minimal sketch of this median filtering follows.
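
A minimal sketch of this median filtering, assuming per-segment gesture confidences arrive one by one; the queue size (4 previous scores) and the 0.5 threshold follow the note, everything else is illustrative.

```python
# Sketch of the detector post-processing: median-filter the raw gesture confidences
# over the last 4 segments plus the current one.
from collections import deque
from statistics import median

class DetectorPostprocessing:
    def __init__(self, history=4):
        self.scores = deque(maxlen=history + 1)   # previous 4 scores + current one

    def update(self, confidence):
        """Return True if a gesture is considered present for the current segment."""
        self.scores.append(confidence)
        return median(self.scores) > 0.5

# usage: feed the detector's per-segment gesture confidence, segment by segment
post = DetectorPostprocessing()
for conf in [0.9, 0.8, 0.2, 0.7, 0.85]:   # illustrative raw confidences
    active = post.update(conf)
```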

Classifier details (Single-time Activation)

  • Problem: a complete gesture consists of three phases (preparation, nucleus, retraction). The Detector already fires during the preparation phase and passes the segment to the Classifier, but many gestures look very similar during the preparation phase, which produces wrong predictions with high confidence scores.

  • Solution: apply different weights to predictions made at different phases.

Formulas:

First define a constant t:

t = μ / (4 · s)   (rounded down, which gives 9 for EgoGesture with s = 1)

where μ is the average ground-truth gesture duration (38 for the EgoGesture dataset) and s is the stride, which determines how far along the gesture the current position is. The paper takes s = 1; the 4 is not explained in the paper, but I think it divides each gesture duration into four parts, with the first quarter corresponding to the preparation phase. s has to match this choice: if s were 2, the 4 should become 2.

Next, the prediction at each step within the currently active gesture is weighted. The weight w_j increases monotonically with j and equals 0.5 at j = t, where j is the time index within a detected gesture (j = 0 when a gesture is first detected, incremented afterwards, and reset to 0 when the gesture ends) and t is the constant defined above (9).

Once the whole segment has been processed, the weights are multiplied with the predictions and averaged. The two largest values of this weighted average are taken; if their difference exceeds a threshold, the class with the higher score is output as the gesture class of the segment (early activation). If this condition is never met while iterating over j, the maximum value is taken at the end; if it is above 0.15, it is output as the final classification result. A simplified sketch of this logic follows.
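
A simplified sketch of the single-time activation logic described above. The weight function here is a generic sigmoid centered at t, because the note only states that the weight increases with j and equals 0.5 at j = t; the early-activation gap threshold is illustrative (the note does not give its value), while t ≈ 9 and the final 0.15 threshold follow the note.

```python
import math
import numpy as np

T = 9                 # constant t = mu / (4*s), about 9 for EgoGesture with s = 1
EARLY_GAP = 0.15      # illustrative gap threshold between the two best classes (value not in the note)
FINAL_THRESH = 0.15   # threshold on the best class at the end of a gesture (from the note)

def weight(j):
    """Monotonically increasing weight, 0.5 at j == T (the exact shape differs in the paper)."""
    return 1.0 / (1.0 + math.exp(-(j - T)))

def single_time_activation(class_probs_per_step):
    """class_probs_per_step: per-step softmax vectors collected while the detector is active."""
    steps = [np.asarray(p, dtype=float) for p in class_probs_per_step]
    weighted_sum = np.zeros_like(steps[0])
    for j, probs in enumerate(steps):
        weighted_sum += weight(j) * probs
        mean = weighted_sum / (j + 1)
        best, second = np.sort(mean)[-2:][::-1]
        if best - second > EARLY_GAP:        # early activation: the winner is clear enough
            return int(np.argmax(mean))
    mean = weighted_sum / len(steps)
    if mean.max() > FINAL_THRESH:            # otherwise take the best class at gesture end
        return int(np.argmax(mean))
    return None                              # no gesture class activated
```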


Experimental results

EgoGesture dataset

train: 1239 videos, 14416 gestures
val: 411 videos, 4768 gestures
test: 431 videos, 4977 gestures


Analysis of the results:

1) The depth modality performs better. Explanation: depth maps filter out background information, so the network can focus more on the hand.

2) The 8-frame detector works best. Explanation: the Detector design is critical in this model; to avoid missing gestures, its window should be as small as possible.

Reproduction details

To be updated

TALL: Temporal activity localization via language query (2018-03-28)

Abstract

This paper proposes a Cross-modal Temporal Regression Localizer (CTRL) that jointly models the text query and video clips, outputting alignment scores and action-boundary regression results for candidate clips. For evaluation, the paper builds Charades-STA on top of the Charades dataset, and also includes more complex sentence queries in Charades-STA for testing.

Model


Visual Encoder

For one video clip, we consider the clip itself (as the central clip) and its surrounding clips (as context clips). We uniformly sample n frames from each clip and use a feature extractor on the central clip; for the context clips, we use a pooling layer to compute a pre-context feature and a post-context feature.

Sentence Encoder

LSTM.

Off-the-shelf Skip-thought

Multi-modal processing module

The input dimension of the FC layer is 2*d and the output dimension is d.

Temporal Localization Regression Networks

The temporal localization regression network takes the multi-modal representation as input and has two sibling output layers: 1) an alignment score between the sentence and the video clip, and 2) clip location regression offsets, either parameterized or non-parameterized (the latter performs better). A minimal sketch of the fused feature and these two heads follows.
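
A minimal sketch of the multi-modal processing module and the two sibling output layers described above, under the assumption that fusion is a simple concatenation followed by the 2*d -> d FC layer; layer sizes, the activation, and all names are illustrative rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CTRLHead(nn.Module):
    def __init__(self, d=1024):
        super().__init__()
        self.fuse = nn.Linear(2 * d, d)       # FC layer: input 2*d, output d (as in the note)
        self.alignment = nn.Linear(d, 1)      # sibling head 1: alignment score
        self.regression = nn.Linear(d, 2)     # sibling head 2: start/end location offsets

    def forward(self, visual_feat, sentence_feat):
        fused = torch.relu(self.fuse(torch.cat([visual_feat, sentence_feat], dim=-1)))
        return self.alignment(fused), self.regression(fused)

# usage with an illustrative batch of 4 clip/sentence pairs
head = CTRLHead(d=1024)
score, offsets = head(torch.randn(4, 1024), torch.randn(4, 1024))
```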

Training

Loss function

We design a multi-task loss L that combines the alignment loss and the location regression loss.


Sampling training examples


We use multi-scale temporal sliding windows with 80% overlap (at test time we only use coarsely sampled clips).
Alignment requirements: 1) IoU 2) nIoL 3) one-to-one

Charades-STA

dataset

  1. split each long sentence into sub-sentences using a set of conjunctions

  2. keyword mapping

  3. human check

Text-to-clip Video Retrieval with Early Fusion and Re-Captioning[2018-08-01]

https://arxiv.org/pdf/1804.05113.pdf

Segment Proposals

Input: video V --> encode all frames in V using C3D --> predict relative proposals R (center, length) --> extract C3D features for each R

loss function:


Example 1: R-C3D

[R-C3D: Region Convolutional 3D Network for Temporal Activity Detection.
https://arxiv.org/pdf/1703.07814.pdf]


Example 2:

[Jointly Localizing and Describing Events for Dense Video Captioning.
https://arxiv.org/pdf/1804.08274.pdf]


Early Fusion

word-by-word fusion: the clip feature is fused with each word and fed through an LSTM, which returns a similarity score (a hedged sketch follows)
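
A hedged sketch of one possible word-by-word early fusion: the clip feature is concatenated with every word embedding, the sequence is run through an LSTM, and the final hidden state is mapped to a similarity score. The dimensions and the concatenation-style fusion are assumptions, not necessarily the paper's exact design.

```python
import torch
import torch.nn as nn

class EarlyFusionSimilarity(nn.Module):
    def __init__(self, word_dim=300, clip_dim=500, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(word_dim + clip_dim, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)

    def forward(self, word_embs, clip_feat):
        # word_embs: (B, T, word_dim), clip_feat: (B, clip_dim)
        clip_rep = clip_feat.unsqueeze(1).expand(-1, word_embs.size(1), -1)
        fused = torch.cat([word_embs, clip_rep], dim=-1)   # fuse the clip feature with each word
        _, (h, _) = self.lstm(fused)
        return self.score(h[-1]).squeeze(-1)               # one similarity score per pair

sim = EarlyFusionSimilarity()(torch.randn(2, 12, 300), torch.randn(2, 500))
```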


Caption loss


Spatial Group-wise Enhance: Improving Semantic Feature Learning in Convolutional Networks [2019-07-05]

Paper: https://arxiv.org/pdf/1905.09646.pdf
Notes from others: https://zhuanlan.zhihu.com/p/66928045
Code:

Task

The paper proposes a lightweight, group-based spatial attention module that assigns a weight to every spatial position within each group.
My understanding: to improve the representational power of the grouped structure, an attention mechanism is added to each group; since its main operations are pooling and a dot product, and it adds only two scale/shift parameters per group, the module is lightweight.

Attention module


Step 1: split the feature map into groups and then by spatial position.

Ignoring the batch dimension, suppose the current feature map is (H, W, C) and is divided into G groups; each group then has shape (H, W, C/G). The feature map of each group can be unfolded along the spatial dimensions as:

X = {x_1, ..., x_m}, x_i in R^(C/G), m = H x W

i.e. it is split into HxW pieces, each piece being the feature of one spatial position with dimension 1x1xC/G.

Step 2: compute a spatial correlation within each group.

The group-level global feature g is the average over all positions:

g = (1/m) Σ_i x_i

g has dimension 1x1xC/G.

The correlation between the global feature and each position is their dot product:

c_i = g · x_i

which can equivalently be computed as:

c_i = ||g|| ||x_i|| cos(θ_i)

Then normalize over the m positions by subtracting the mean and dividing by the standard deviation:

ĉ_i = (c_i − μ_c) / (σ_c + ε)

Step 3: re-weight the original x.

  • A scale and a shift parameter are applied to each group's spatial correlation, giving the attention mask:

    a_i = γ · ĉ_i + β

This could also be written in matrix form, which might be more intuitive. These are the only parameters of the whole attention module, so the total parameter count is 2x the number of groups.

  • The attention mask is passed through a sigmoid to obtain the attention score, which is multiplied onto the original X, assigning a different importance to each spatial position:

    x̂_i = x_i · σ(a_i)

The H x W re-weighted vectors x̂_i together form the spatially weighted group feature of dimension HxWxC/G. A hedged PyTorch sketch of the whole module follows.
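
A PyTorch sketch of the SGE module assembled from the three steps above; this is my reading of the paper, not the authors' released code. The (0, 1) initialization of the two per-group parameters follows the ablation note below, and the group count and input shape in the usage line are illustrative.

```python
import torch
import torch.nn as nn

class SpatialGroupEnhance(nn.Module):
    def __init__(self, groups=64, eps=1e-5):
        super().__init__()
        self.groups, self.eps = groups, eps
        self.gamma = nn.Parameter(torch.zeros(1, groups, 1, 1))  # per-group scale, init 0
        self.beta = nn.Parameter(torch.ones(1, groups, 1, 1))    # per-group shift, init 1

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.size()
        x = x.view(b * self.groups, c // self.groups, h, w)
        g = x.mean(dim=(2, 3), keepdim=True)    # Step 2: group-wise global average pooling
        ci = (x * g).sum(dim=1, keepdim=True)   # dot product g . x_i at every position
        ci = ci.view(b * self.groups, -1)
        ci = (ci - ci.mean(dim=1, keepdim=True)) / (ci.std(dim=1, keepdim=True) + self.eps)
        ci = ci.view(b, self.groups, h, w)
        a = ci * self.gamma + self.beta         # Step 3: per-group scale and shift
        a = a.view(b * self.groups, 1, h, w)
        x = x * torch.sigmoid(a)                # re-weight every spatial position
        return x.view(b, c, h, w)

# usage: insert after a conv block; here applied to a random (2, 256, 14, 14) feature map
out = SpatialGroupEnhance(groups=64)(torch.randn(2, 256, 14, 14))
```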

Experimental settings and results

Ablation

  • G = 64 gives the best result (Figure 5);

  • The normalization is important and cannot be removed (about one point worse without it, see Table 3);

  • Comparison of initial values for the two scale/shift parameters (the (0, 1) initialization works best, Table 2)


Experiments on Object Detection


[S-CNN] Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs (2018-03-29)

Abstract

This paper exploits the effectiveness of deep networks for temporal action localization via three segment-based 3D ConvNets:

(1) a proposal network -- identifies candidate segments in a long video that may contain actions

(2) a classification network (very important for training) -- serves as initialization for the localization network

(3) a localization network -- fine-tunes the learned classification network to localize each action instance

Detailed descriptions of Segment-CNN


Multi-scale segment generation


Each frame is resized to 171 X 128 pixels

For an untrimmed video X, the paper slides temporal windows of varied lengths (16, 32, 64, 128, 256, 512 frames) with 75% overlap, and uniformly samples 16 frames from each window. A small sketch of this window generation follows.
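
A small sketch of the multi-scale window generation just described; the window lengths, the 75% overlap, and the 16 sampled frames follow the note, while the helper name and edge handling are illustrative, not the authors' implementation.

```python
import numpy as np

def generate_segments(num_frames,
                      lengths=(16, 32, 64, 128, 256, 512),
                      overlap=0.75,
                      samples_per_segment=16):
    """Return a list of (start, end, sampled_frame_indices) tuples."""
    segments = []
    for length in lengths:
        if length > num_frames:
            continue                                    # skip scales longer than the video
        stride = max(1, int(length * (1.0 - overlap)))  # 75% overlap -> stride of length / 4
        for start in range(0, num_frames - length + 1, stride):
            end = start + length
            idx = np.linspace(start, end - 1, samples_per_segment).astype(int)
            segments.append((start, end, idx))
    return segments

segs = generate_segments(300)   # e.g. a 300-frame video yields windows at every scale that fits
```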

Network architecture

Their deep networks use C3D as the basic architecture in all stages.

conv1a(64) - pool1(1,1) -

-conv2a(128) - pool2(2,2) -

-conv3a(256) - conv3b(256) - pool3(2,2) -

-conv4a(512) - conv4b(512) - pool4(2,2) -

-conv5a(512) - conv5b(512) - pool5(2,2) -

-fc6(4096) - fc7(4096) - fc8(K+1)

Each input to this deep network is a segment s of dimension 171 X 128 X 16.

Training procedure

Impact of individual networks

Compare S-CNN / S-CNN(w/o proposal) / S-CNN(w/o classification) / S-CNN(w/o localization)

1) The proposal network


label k:{0,1}

For each segment from a trimmed video, set its label as positive.

For candidate segments from an untrimmed video, assign labels by overlap with the ground-truth instances (IoU > 0.7, or the largest-overlap candidate with IoU > 0.5, is positive; IoU < 0.3 is negative); a hedged sketch of this rule follows below.


This reduces the number of operations conducted on background segments.
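
A hedged sketch of the labeling rule above, applied to a single candidate segment. The "largest overlap and IoU > 0.5" clause requires looking at the whole candidate set and is omitted here, and the treatment of segments with intermediate overlap is my assumption.

```python
def temporal_iou(seg, gt):
    """IoU of two temporal intervals given as (start, end)."""
    inter = max(0.0, min(seg[1], gt[1]) - max(seg[0], gt[0]))
    union = (seg[1] - seg[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def label_segment(seg, gt_instances):
    """Return 1 (positive), 0 (background) or None (not used) for one candidate segment."""
    ious = [temporal_iou(seg, gt) for gt in gt_instances]
    best = max(ious, default=0.0)
    if best > 0.7:
        return 1        # high overlap with some ground-truth instance -> positive
    if best < 0.3:
        return 0        # little overlap with every instance -> background
    return None         # ambiguous overlap: not used for training (assumption)
```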

2) The classification network


labels: background plus action classes k: {1, ..., K}

In order to balance the number of training examples for each class, the paper reduces the number of background instances.


better performance

3) The localization network


This localization network is trained with a new loss function, which takes the IoU with the ground-truth instance into consideration.

The new loss function is formed by combining L(softmax) and L(overlap):



better performance

[SSN & TAG] Temporal Action Detection with Structured Segment Networks [2018-08-06]

https://arxiv.org/pdf/1704.06228.pdf

SSN Structured Segment Network


  • Three-Stage Structure

    starting -- course -- ending

  • Structured Temporal Pyramid Pooling

    course : two-level ( split number )

    starting & ending : one-level

  • Two classifiers

    activity

    completeness

  • Location Regression and Multi-Task Loss

    refine the boundary

TAG Temporal Actionness Grouping


  • Actionness probabilities -- binary classifier

  • Complemented actionness -- watershed algorithm (a simplified grouping sketch follows)
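
A simplified sketch in the spirit of this grouping step: the actionness signal is flooded at several thresholds and each run of above-threshold units becomes a candidate proposal. The real TAG watershed procedure also merges regions and removes duplicates; the thresholds here are illustrative.

```python
def group_actionness(scores, thresholds=(0.3, 0.5, 0.7, 0.9)):
    """scores: per-unit actionness probabilities; returns sorted (start, end) unit spans."""
    proposals = set()
    for tau in thresholds:
        start = None
        for i, s in enumerate(scores):
            if s >= tau and start is None:
                start = i                       # a new above-threshold region begins
            elif s < tau and start is not None:
                proposals.add((start, i))       # close the region at the first low unit
                start = None
        if start is not None:
            proposals.add((start, len(scores)))
    return sorted(proposals)

props = group_actionness([0.1, 0.8, 0.9, 0.7, 0.2, 0.6, 0.95, 0.4])
```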

Code :

CTAP: Complementary Temporal Action Proposal Generation [2018-08-07]

Two types of proposal generation (three baselines)


  • Sliding window

    advantage: covers all segments

    drawback: imprecise

  • Action score based

    typical -- TAG

    advantage: precise

    drawback: 1) generates wrong proposals; 2) omits some correct proposals

    how to solve: 1) an offset loss to adjust boundaries; 2) this paper

Complementary Temporal Action Proposal Generator


Initial Proposal Generation

  • Video pre-processing

    two-stream

    a long video --> a sequence of unit-level features

  • Actionness score

    binary classifier --> an actionness score for each unit
