Actor and Action Video Segmentation from a Sentence [2018-08-02]
https://arxiv.org/pdf/1803.07485.pdf
Textual Encoder
Word2Vec :
using the pretrained model on 'GoogleNews'
each word = a 300-dimensional vector
each sentence is padded to the same size (e.g., 15x300)
CNN:
-
details:
temporal filter size = 2x2
channels = 300 (same as the word2vec representation)
-
ablation study:
51.8 for LSTM
52.1 for Bi-LSTM
53.6 for CNN
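A minimal sketch of this textual encoder, assuming the temporal filter slides over the padded 15x300 word2vec matrix (the output dimension and the pooling are my assumptions, not taken from the paper):

```python
import torch
import torch.nn as nn

class TextualEncoder(nn.Module):
    """Padded word2vec matrix (15 x 300) -> temporal Conv1d -> sentence feature."""
    def __init__(self, embed_dim=300, out_dim=512, filter_width=2):
        super().__init__()
        # Conv1d expects (batch, channels, time); channels = word2vec dimension.
        self.conv = nn.Conv1d(embed_dim, out_dim, kernel_size=filter_width)

    def forward(self, sentence):            # sentence: (batch, 15, 300)
        x = sentence.transpose(1, 2)        # -> (batch, 300, 15)
        x = torch.relu(self.conv(x))        # -> (batch, out_dim, 14)
        return x.max(dim=2).values          # max-pool over time -> (batch, out_dim)

encoder = TextualEncoder()
feat = encoder(torch.randn(4, 15, 300))     # a batch of 4 padded sentences
```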
Video Encoder
I3D two-stream
- details:
I3D's last max-pooling layer --> average pooling over the temporal dimension --> L2 normalization at each spatial position of the feature map
-
ablation study:
49.5 for flow only
53.6 for RGB only
55.1 for two-stream
tanh() is better
Decoding with dynamic filters
bottom-up or top-down?
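A hedged sketch of the dynamic-filter decoding (the feature dimensions here are my assumptions): the sentence feature generates a 1x1 filter, tanh is applied to it per the ablation note above, and the filter is dotted with the L2-normalized video feature at every spatial position:

```python
import torch
import torch.nn as nn

class DynamicFilterDecoder(nn.Module):
    """Sentence feature -> 1x1 dynamic filter -> response map over video features."""
    def __init__(self, text_dim=512, visual_dim=832):
        super().__init__()
        self.filter_gen = nn.Linear(text_dim, visual_dim)

    def forward(self, text_feat, video_feat):
        # text_feat: (B, text_dim); video_feat: (B, visual_dim, H, W), L2-normalized
        filters = torch.tanh(self.filter_gen(text_feat))   # tanh per the note above
        filters = filters.unsqueeze(-1).unsqueeze(-1)      # (B, visual_dim, 1, 1)
        # A 1x1 dynamic convolution is just a per-position dot product.
        return (video_feat * filters).sum(dim=1, keepdim=True)  # (B, 1, H, W)
```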
Real-time Hand Gesture Detection and Classification Using Convolutional Neural Networks [2019-07-03]
https://arxiv.org/pdf/1901.10323.pdf
Task:
Real-time hand gesture recognition.
Framework:
A sliding window (t, t+8) first takes 8 frames of the input video as a segment and feeds it to the network to get the Detector's output (2-dimensional); a score above 0.5 means the segment contains a gesture. Then, starting from the same time, 32 frames (t, t+32) are taken as a segment and fed to the network to get the Classifier's output (84-dimensional: 83 gestures plus None). The maximum of the Classifier's output, taken according to the rules below, gives the gesture class of that segment. Post-processing and Single-time Activation are proposed to solve specific problems and make the model more accurate.
Sliding window fashion
Video segments are taken with sliding windows.
-
Detector(scale: 8, stride: 1)
-
Classifier (scale: 32, stride: 1)
A Detector + A Classifier
-
Detector (ResNet-10): binary classification (gesture / no gesture); it also serves as a trigger for the Classifier (when a gesture is detected, the segment enters the classifier; otherwise move on to the next segment)
-
Classifier (ResNeXt-101): softmax(pred) decides which class the detected gesture belongs to
Detector details (Post-processing)
-
Problem: during real-time recognition, a large hand movement may leave the frame while the gesture is still in progress, and the Detector's confidence score becomes very low.
-
Solution: Post-processing keeps the confidence scores predicted for the previous segments (four values in the paper) and combines them with the current confidence score into a five-element array, taking the median as the final prediction of whether a gesture is present (for the binary classification, argmax(pred) = 1 means a gesture is present, otherwise not). See the sketch below.
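A small sketch of this post-processing as I read it (the class and parameter names are mine, not official code):

```python
from collections import deque
from statistics import median

class DetectorPostProcessor:
    def __init__(self, history=4, threshold=0.5):
        self.scores = deque(maxlen=history + 1)  # 4 past values + 1 current = 5
        self.threshold = threshold

    def step(self, confidence):
        """Feed the raw detector confidence for the current 8-frame window."""
        self.scores.append(confidence)
        return median(self.scores) >= self.threshold

# Usage: the brief dip to 0.2 (hand leaving the frame) is smoothed away,
# so every step below stays True.
pp = DetectorPostProcessor()
for c in [0.9, 0.8, 0.2, 0.7, 0.9]:
    print(pp.step(c))
```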
Classifier details (Single-time Activation)
-
Problem: a complete gesture consists of three phases (preparation, nucleus, retraction). The Detector already detects a gesture during the preparation phase and sends the segment to the Classifier, but many gestures are very similar during preparation (see the figure in the paper), which produces wrong predictions with high confidence scores.
-
Solution: apply different weights to predictions from different phases.
Formulas:
First define a constant t = ⌊u / (4s)⌋,
where u is the average ground-truth gesture duration (38 for the EgoGesture dataset) and s is the stride, which steps through the positions. The paper takes s = 1; the 4 is not explained in the paper, but my understanding is that it splits each gesture duration into four parts, with the first quarter being the preparation phase, so s must match it: if s were 2, for example, the 4 should be changed to 2.
Then weight the prediction at the current position of the active state; the weight is the sigmoid w_j = 1 / (1 + e^(-0.2 (j - t))),
where j is the time index within a detected active state: j = 0 when the gesture is first detected, then j increments until the active state ends, at which point it is reset to 0. Here t is the constant defined above (9); the weight w grows with j and equals 0.5 at j = t.
After the whole active state has been processed, multiply the predictions by the weights and average them, then take the two largest values of the mean. If their difference exceeds a threshold, output the higher-scoring class as the gesture class of the segment. If this condition is never satisfied after iterating over all j positions, take the maximum value instead, and if it exceeds 0.15, output it as the final classification result.
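A hedged sketch of this single-time activation (the sigmoid slope 0.2 follows the paper's weighting function; the threshold names are mine, and the early-activation check is applied once over the whole state here rather than online, for brevity):

```python
import numpy as np

def single_time_activation(class_scores, u=38, s=1,
                           diff_thresh=0.15, final_thresh=0.15):
    """class_scores: (J, num_classes) classifier outputs over one active state."""
    t = u // (4 * s)                              # end of the preparation quarter (9)
    j = np.arange(len(class_scores))
    w = 1.0 / (1.0 + np.exp(-0.2 * (j - t)))      # 0.5 at j == t, grows with j
    mean_scores = (w[:, None] * np.asarray(class_scores)).mean(axis=0)
    first, second = np.argsort(mean_scores)[-2:][::-1]
    if mean_scores[first] - mean_scores[second] > diff_thresh:
        return int(first)                          # confident activation
    if mean_scores[first] > final_thresh:
        return int(first)                          # fall back to the maximum
    return None                                    # no activation fired
```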
Experimental results
EgoGesture dataset
train: 1,239 videos, 14,416 gestures
val: 411 videos, 4,768 gestures
test: 431 videos, 4,977 gestures
Analysis:
1) Depth maps work better. Explanation: depth filters out background information, letting the model focus more on the hands.
2) The 8-frame detector works best. Explanation: the Detector design is crucial in this model; to avoid missing gestures, its window should be as small as possible.
Reproduction details
To be updated
TALL: Temporal Activity Localization via Language Query (2018-03-28)
Abstract
This paper proposes a Cross-modal Temporal Regression Localizer (CTRL) that jointly models text queries and video clips, outputting alignment scores and action-boundary regression results for candidate clips. For evaluation, the paper builds Charades-STA on top of the Charades dataset, including more complex sentence queries for testing.
Model
Visual Encoder
For one video clip, we consider the clip itself (as the central clip) and its surrounding clips (as context clips). We uniformly sample n frames from each clip and use a feature extractor on the central clip; for the context clips, we use a pooling layer to compute a pre-context feature and a post-context feature.
Sentence Encoder
LSTM.
Off-the-shelf Skip-thought
Multi-modal processing module
The input dimension of the FC layer is 2*d and the output is d.
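A minimal sketch of this fusion step under my assumptions (d = 1024 is a placeholder; TALL also combines addition and multiplication branches, which are omitted here): visual and sentence features, each of dimension d, are concatenated and projected back to d by the FC layer:

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    def __init__(self, d=1024):
        super().__init__()
        self.fc = nn.Linear(2 * d, d)   # input 2*d, output d, as noted above

    def forward(self, visual, sentence):              # both: (batch, d)
        fused = torch.cat([visual, sentence], dim=1)  # (batch, 2*d)
        return torch.relu(self.fc(fused))             # (batch, d)
```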
Temporal Localization Regression Networks
The temporal localization regression network takes the multi-modal representation as input and has two sibling output layers: 1) an alignment score between the sentence and the video clip; 2) clip-location regression offsets, parameterized or non-parameterized (the latter performs better).
Training
Loss function
We design a multi-task loss L that combines the alignment loss and the location regression loss.
Sampling training examples
We use multi-scale temporal sliding windows of varied frame lengths with 80% overlap (at test time only coarsely sampled clips are used).
alignment requirements: 1) high IoU; 2) low nIoL; 3) one-to-one assignment
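Hedged helpers for these alignment criteria, written from my reading of the note (nIoL is taken here as the fraction of the clip lying outside the ground truth, which is one plausible reading):

```python
def temporal_iou(clip, gt):
    """Intersection over union of two (start, end) intervals."""
    (s1, e1), (s2, e2) = clip, gt
    inter = max(0.0, min(e1, e2) - max(s1, s2))
    union = (e1 - s1) + (e2 - s2) - inter
    return inter / union if union > 0 else 0.0

def temporal_niol(clip, gt):
    """Fraction of the clip that lies outside the ground-truth interval."""
    (s1, e1), (s2, e2) = clip, gt
    inter = max(0.0, min(e1, e2) - max(s1, s2))
    return ((e1 - s1) - inter) / (e1 - s1)

# A clip should have high IoU but low nIoL with its aligned ground truth.
print(temporal_iou((10, 20), (12, 25)), temporal_niol((10, 20), (12, 25)))
```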
Charades-STA
-
split each sentence into sub-sentences using a set of conjunctions
-
keyword mapping
-
human check
Text-to-clip Video Retrieval with Early Fusion and Re-Captioning [2018-08-01]
https://arxiv.org/pdf/1804.05113.pdf
Segment Proposals
Input: video V --> encode all frames in V using C3D --> predict relative proposals R (center, length) --> C3D features for R
loss function:
Example 1: R-C3D
[R-C3D: Region Convolutional 3D Network for Temporal Activity Detection.
https://arxiv.org/pdf/1703.07814.pdf]
Example 2:
[Jointly Localizing and Describing Events for Dense Video Captioning.
https://arxiv.org/pdf/1804.08274.pdf]
Early Fusion
word-by-word fusion: an LSTM returns a similarity score (see the sketch below)
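A sketch of word-by-word early fusion under my assumptions about dimensions: the clip feature is concatenated with every word embedding before the LSTM, whose final hidden state yields the similarity:

```python
import torch
import torch.nn as nn

class EarlyFusionLSTM(nn.Module):
    def __init__(self, word_dim=300, clip_dim=500, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(word_dim + clip_dim, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)

    def forward(self, words, clip_feat):
        # words: (B, T, word_dim); clip_feat: (B, clip_dim)
        clip = clip_feat.unsqueeze(1).expand(-1, words.size(1), -1)
        fused = torch.cat([words, clip], dim=2)    # fuse at every word step
        _, (h, _) = self.lstm(fused)
        return self.score(h[-1]).squeeze(1)        # one similarity per pair
```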
Caption loss
Spatial Group-wise Enhance: Improving Semantic Feature Learning in Convolutional Networks [2019-07-05]
Paper:https://arxiv.org/pdf/1905.09646.pdf
Notes from others: https://zhuanlan.zhihu.com/p/66928045
Code:
Task
This paper proposes a lightweight, group-based spatial attention module that weights every spatial position within every group.
My understanding: to strengthen the representational power of the group structure, an attention is added to each group. Since the attention mainly uses pooling and dot products and adds only two scale/shift parameters per group, it is lightweight.
Attention module
Step 1: split the feature map into groups and unroll by spatial position.
Ignoring the batch dimension, suppose the feature map is (H, W, C) and is divided into G groups, so each group has shape (H, W, C/G). The feature map of each group can be unrolled along the spatial dimensions as x = {x_1, ..., x_m}, m = H x W,
i.e., H x W small segments, each representing the feature at one spatial position with dimension 1x1xC/G.
Step 2: model the spatial correlation within each group.
The group semantic vector is obtained by global average pooling, g = (1/m) Σ_i x_i, with dimension 1x1xC/G.
Each c_i can then be computed as the dot product c_i = g · x_i,
followed by a normalize step over the m positions that subtracts the mean and divides by the standard deviation: ĉ_i = (c_i - μ_c) / (σ_c + ε).
Step 3: reweight the original x.
- Add scale and shift parameters to each group's spatial attention values to obtain the attention mask: a_i = γ ĉ_i + β, and the reweighted feature is x̂_i = x_i · σ(a_i).
This can also be written in matrix form, which may be more intuitive. These scale/shift parameters are the only parameters of the whole attention module, so the total parameter count is 2x the number of groups.
The H x W reweighted x̂_i together form the spatially weighted group feature, with dimension HxWxC/G.
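Putting the three steps together, a sketch of the module in PyTorch (this follows the formulation as I understand it; γ is initialized to 0 and β to 1, matching the (0, 1) result in the ablation below; C must be divisible by the number of groups):

```python
import torch
import torch.nn as nn

class SpatialGroupEnhance(nn.Module):
    def __init__(self, groups=64):
        super().__init__()
        self.groups = groups
        self.gamma = nn.Parameter(torch.zeros(1, groups, 1, 1))  # init 0 (Table 2)
        self.beta = nn.Parameter(torch.ones(1, groups, 1, 1))    # init 1 (Table 2)
        self.avg_pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        x = x.view(b * self.groups, -1, h, w)   # Step 1: split channels into G groups
        g = self.avg_pool(x)                    # Step 2: group semantic vector g
        c_map = (x * g).sum(dim=1, keepdim=True)           # dot product c_i
        mean = c_map.mean(dim=(2, 3), keepdim=True)        # normalize over H x W
        std = c_map.std(dim=(2, 3), keepdim=True) + 1e-5
        c_map = ((c_map - mean) / std).view(b, self.groups, h, w)
        a = self.gamma * c_map + self.beta      # Step 3: per-group scale and shift
        mask = torch.sigmoid(a).view(b * self.groups, 1, h, w)
        return (x * mask).view(b, c, h, w)      # reweighted features

sge = SpatialGroupEnhance(groups=64)
out = sge(torch.randn(2, 256, 14, 14))          # same shape in, same shape out
```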
实验参数及结果
Ablation
-
G = 64 gives the best results (Figure 5);
-
The normalization is essential and cannot be removed (about one point worse without it, see Table 3);
-
Comparing initial values of the two scale/shift parameters, (0, 1) works best (Table 2)
Experiments on Object Detection
[S-CNN] Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs (2018-03-29)
Abstract
This paper exploits the effectiveness of deep networks for temporal action localization via three segment-based 3D ConvNets:
(1) a proposal network -- identifies candidate segments in a long video that may contain actions
(2) a classification network (very important for training) -- serves as initialization for the localization network
(3) a localization network -- fine-tunes the learned classification network to localize each action instance
Detailed descriptions of Segment-CNN
Multi-scale segment generation
Each frame is resized to 171 x 128 pixels.
For an untrimmed video X, the paper runs temporal sliding windows of varied lengths (16, 32, 64, 128, 256, 512 frames) with 75% overlap (each window is also subsampled to 16 frames); see the sketch below.
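For concreteness, a sketch of this multi-scale window generation (75% overlap means the stride is a quarter of the window length):

```python
def generate_segments(num_frames, lengths=(16, 32, 64, 128, 256, 512)):
    """Multi-scale sliding windows with 75% overlap over a num_frames video."""
    segments = []
    for length in lengths:
        stride = length // 4                 # 75% overlap between neighbors
        start = 0
        while start + length <= num_frames:
            segments.append((start, start + length))
            start += stride
    return segments

print(len(generate_segments(1000)))  # candidate segments for a 1000-frame video
```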
Network architecture
Their deep networks use C3D as the basic architecture in all stages:
conv1a(64) - pool1(1,1) -
conv2a(128) - pool2(2,2) -
conv3a(256) - conv3b(256) - pool3(2,2) -
conv4a(512) - conv4b(512) - pool4(2,2) -
conv5a(512) - conv5b(512) - pool5(2,2) -
fc6(4096) - fc7(4096) - fc8(K+1)
Each input to this deep network is a segment s of dimension 171 x 128 x 16.
Training procedure and impact of individual networks
Compare S-CNN / S-CNN(w/o proposal) / S-CNN(w/o classification) / S-CNN(w/o localization)
1) The proposal network
label k ∈ {0, 1}
For each segment of a trimmed video, set its label to positive.
For candidate segments from an untrimmed video, assign labels by overlap with the ground truth (IoU > 0.7, or the segment with the largest IoU when it is > 0.5, is positive; IoU < 0.3 is negative).
This reduces the number of operations performed on background segments.
2) The classification network
labels: background plus actions k ∈ {1, ..., K}
In order to balance the number of training examples per class, the paper reduces the number of background instances accordingly.
better performance
3) The localization network
This localization network is proposed with a new loss function that takes the IoU with the ground-truth instances into consideration.
The new loss function combines L(softmax) and L(overlap), i.e., L = L(softmax) + λ · L(overlap).
better performance
[SSN & TAG] Temporal Action Detection with Structured Segment Networks [2018-08-06]
https://arxiv.org/pdf/1704.06228.pdf
SSN Structured Segment Network
-
Three-Stage Structure
starting -- course -- ending
-
Structured Temporal Pyramid Pooling
course: two-level pyramid (split into parts)
starting & ending: one-level
-
Two classifiers
activity
completeness
-
Location Regression and Multi-Task Loss
refine the boundary
TAG Temporal Actionness Grouping
-
Actionness probabilities -- binary classifier
-
Complemented actionness -- watershed algorithm
Code :
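Since the code link above is empty, here is a toy illustration of the grouping idea only (not the official TAG implementation): contiguous units whose actionness clears a threshold are merged into one proposal, and several thresholds are combined, loosely mimicking watershed flooding levels:

```python
def tag_proposals(actionness, thresholds=(0.3, 0.5, 0.7, 0.9)):
    """Group contiguous high-actionness units into (start, end) proposals."""
    proposals = set()
    for th in thresholds:
        start = None
        for i, a in enumerate(actionness):
            if a >= th and start is None:
                start = i                      # open a new group (basin)
            elif a < th and start is not None:
                proposals.add((start, i))      # close the group
                start = None
        if start is not None:
            proposals.add((start, len(actionness)))
    return sorted(proposals)

print(tag_proposals([0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.8, 0.1]))
```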
CTAP: Complementary Temporal Action Proposal Generation [2018-08-07]
Two types of proposal generation (three baselines)
-
Sliding window
advantage: covers all segments
drawback: imprecise boundaries
-
Action score based
typical -- TAG
advantage: precise
drawback: 1) generates wrong proposals; 2) omits some correct proposals
how to solve: 1) an offset loss to adjust boundaries; 2) this paper
Complementary Temporal Action Proposal Generator
Initial Proposal Generation
-
Video pre-processing
two-stream
a long video --> a sequence of unit-level features
-
Actionness score
binary classifier --> an actionness score for each unit (see the sketch below)
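A minimal sketch of such a unit-level binary classifier (the unit feature dimension is my assumption for a two-stream feature):

```python
import torch
import torch.nn as nn

class ActionnessScorer(nn.Module):
    """Maps each two-stream unit feature to an actionness probability."""
    def __init__(self, unit_dim=2048, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(unit_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, units):                 # units: (num_units, unit_dim)
        return torch.sigmoid(self.net(units)).squeeze(1)  # one score per unit
```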