New papers for 2022-07-26 Tue!

💻 cs

📚 mask (total: 9)

📃 Deep Pneumonia: Attention-Based Contrastive Learning for Class-Imbalanced Pneumonia Lesion Recognition in Chest X-rays

Authors: Xinxu Wei, Haohan Bai, Xianshi Zhang, Yongjie Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2207.11393
Pdf link: https://arxiv.org/pdf/2207.11393
Abstract:
Computer-aided X-ray pneumonia lesion recognition is important for accurate diagnosis of pneumonia. With the emergence of deep learning, the identification accuracy of pneumonia has been greatly improved, but there are still some challenges due to the fuzzy appearance of chest X-rays. In this paper, we propose a deep learning framework named Attention-Based Contrastive Learning for Class-Imbalanced X-Ray Pneumonia Lesion Recognition (denoted as Deep Pneumonia). We adopt self-supervised contrastive learning strategy to pre-train the model without using extra pneumonia data for fully mining the limited available dataset. In order to leverage the location information of the lesion area that the doctor has painstakingly marked, we propose mask-guided hard attention strategy and feature learning with contrastive regulation strategy which are applied on the attention map and the extracted features respectively to guide the model to focus more attention on the lesion area where contains more discriminative features for improving the recognition performance. In addition, we adopt Class-Balanced Loss instead of traditional Cross-Entropy as the loss function of classification to tackle the problem of serious class imbalance between different classes of pneumonia in the dataset. The experimental results show that our proposed framework can be used as a reliable computer-aided pneumonia diagnosis system to assist doctors to better diagnose pneumonia cases accurately.

📃 Progressive Scene Text Erasing with Self-Supervision

Authors: Xiangcheng Du, Zhao Zhou, Yingbin Zheng, Xingjiao Wu, Tianlong Ma, Cheng Jin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2207.11469
Pdf link: https://arxiv.org/pdf/2207.11469
Abstract:
Scene text erasing seeks to erase text contents from scene images and current state-of-the-art text erasing models are trained on large-scale synthetic data. Although data synthetic engines can provide vast amounts of annotated training samples, there are differences between synthetic and real-world data. In this paper, we employ self-supervision for feature representation on unlabeled real-world scene text images. A novel pretext task is designed to keep consistent among text stroke masks of image variants. We design the Progressive Erasing Network in order to remove residual texts. The scene text is erased progressively by leveraging the intermediate generated results which provide the foundation for subsequent higher quality results. Experiments show that our method significantly improves the generalization of the text erasing task and achieves state-of-the-art performance on public benchmarks.

📃 Marior: Margin Removal and Iterative Content Rectification for Document Dewarping in the Wild

Authors: Jiaxin Zhang, Canjie Luo, Lianwen Jin, Fengjun Guo, Kai Ding
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2207.11515
Pdf link: https://arxiv.org/pdf/2207.11515
Abstract:
Camera-captured document images usually suffer from perspective and geometric deformations. It is of great value to rectify them when considering poor visual aesthetics and the deteriorated performance of OCR systems. Recent learning-based methods intensively focus on the accurately cropped document image. However, this might not be sufficient for overcoming practical challenges, including document images either with large marginal regions or without margins. Due to this impracticality, users struggle to crop documents precisely when they encounter large marginal regions. Simultaneously, dewarping images without margins is still an insurmountable problem. To the best of our knowledge, there is still no complete and effective pipeline for rectifying document images in the wild. To address this issue, we propose a novel approach called Marior (Margin Removal and \Iterative Content Rectification). Marior follows a progressive strategy to iteratively improve the dewarping quality and readability in a coarse-to-fine manner. Specifically, we divide the pipeline into two modules: margin removal module (MRM) and iterative content rectification module (ICRM). First, we predict the segmentation mask of the input image to remove the margin, thereby obtaining a preliminary result. Then we refine the image further by producing dense displacement flows to achieve content-aware rectification. We determine the number of refinement iterations adaptively. Experiments demonstrate the state-of-the-art performance of our method on public benchmarks. The resources are available at https://github.com/ZZZHANG-jx/Marior for further comparison.

📃 MAR: Masked Autoencoders for Efficient Action Recognition

Authors: Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Xiang Wang, Yuehuan Wang, Yiliang Lv, Changxin Gao, Nong Sang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2207.11660
Pdf link: https://arxiv.org/pdf/2207.11660
Abstract:
Standard approaches for video recognition usually operate on the full input videos, which is inefficient due to the widely present spatio-temporal redundancy in videos. Recent progress in masked video modelling, i.e., VideoMAE, has shown the ability of vanilla Vision Transformers (ViT) to complement spatio-temporal contexts given only limited visual contents. Inspired by this, we propose propose Masked Action Recognition (MAR), which reduces the redundant computation by discarding a proportion of patches and operating only on a part of the videos. MAR contains the following two indispensable components: cell running masking and bridging classifier. Specifically, to enable the ViT to perceive the details beyond the visible patches easily, cell running masking is presented to preserve the spatio-temporal correlations in videos, which ensures the patches at the same spatial location can be observed in turn for easy reconstructions. Additionally, we notice that, although the partially observed features can reconstruct semantically explicit invisible patches, they fail to achieve accurate classification. To address this, a bridging classifier is proposed to bridge the semantic gap between the ViT encoded features for reconstruction and the features specialized for classification. Our proposed MAR reduces the computational cost of ViT by 53% and extensive experiments show that MAR consistently outperforms existing ViT models with a notable margin. Especially, we found a ViT-Large trained by MAR outperforms the ViT-Huge trained by a standard training scheme by convincing margins on both Kinetics-400 and Something-Something v2 datasets, while our computation overhead of ViT-Large is only 14.5% of ViT-Huge.

📃 Affective Behaviour Analysis Using Pretrained Model with Facial Priori

Authors: Yifan Li, Haomiao Sun, Zhaori Liu, Hu Han
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2207.11679
Pdf link: https://arxiv.org/pdf/2207.11679
Abstract:
Affective behaviour analysis has aroused researchers' attention due to its broad applications. However, it is labor exhaustive to obtain accurate annotations for massive face images. Thus, we propose to utilize the prior facial information via Masked Auto-Encoder (MAE) pretrained on unlabeled face images. Furthermore, we combine MAE pretrained Vision Transformer (ViT) and AffectNet pretrained CNN to perform multi-task emotion recognition. We notice that expression and action unit (AU) scores are pure and intact features for valence-arousal (VA) regression. As a result, we utilize AffectNet pretrained CNN to extract expression scores concatenating with expression and AU scores from ViT to obtain the final VA features. Moreover, we also propose a co-training framework with two parallel MAE pretrained ViT for expression recognition tasks. In order to make the two views independent, we random mask most patches during the training process. Then, JS divergence is performed to make the predictions of the two views as consistent as possible. The results on ABAW4 show that our methods are effective.

📃 Semantic-guided Multi-Mask Image Harmonization

Authors: Xuqian Ren, Yifan Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2207.11722
Pdf link: https://arxiv.org/pdf/2207.11722
Abstract:
Previous harmonization methods focus on adjusting one inharmonious region in an image based on an input mask. They may face problems when dealing with different perturbations on different semantic regions without available input masks. To deal with the problem that one image has been pasted with several foregrounds coming from different images and needs to harmonize them towards different domain directions without any mask as input, we propose a new semantic-guided multi-mask image harmonization task. Different from the previous single-mask image harmonization task, each inharmonious image is perturbed with different methods according to the semantic segmentation masks. Two challenging benchmarks, HScene and HLIP, are constructed based on $150$ and $19$ semantic classes, respectively. Furthermore, previous baselines focus on regressing the exact value for each pixel of the harmonized images. The generated results are in the `black box' and cannot be edited. In this work, we propose a novel way to edit the inharmonious images by predicting a series of operator masks. The masks indicate the level and the position to apply a certain image editing operation, which could be the brightness, the saturation, and the color in a specific dimension. The operator masks provide more flexibility for users to edit the image further. Extensive experiments verify that the operator mask-based network can further improve those state-of-the-art methods which directly regress RGB images when the perturbations are structural. Experiments have been conducted on our constructed benchmarks to verify that our proposed operator mask-based framework can locate and modify the inharmonious regions in more complex scenes. Our code and models are available at https://github.com/XuqianRen/Semantic-guided-Multi-mask-Image-Harmonization.git.

📃 Jigsaw-ViT: Learning Jigsaw Puzzles in Vision Transformer

Authors: Yingyi Chen, Xi Shen, Yahui Liu, Qinghua Tao, Johan A.K. Suykens
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2207.11971
Pdf link: https://arxiv.org/pdf/2207.11971
Abstract:
The success of Vision Transformer (ViT) in various computer vision tasks has promoted the ever-increasing prevalence of this convolution-free network. The fact that ViT works on image patches makes it potentially relevant to the problem of jigsaw puzzle solving, which is a classical self-supervised task aiming at reordering shuffled sequential image patches back to their natural form. Despite its simplicity, solving jigsaw puzzle has been demonstrated to be helpful for diverse tasks using Convolutional Neural Networks (CNNs), such as self-supervised feature representation learning, domain generalization, and fine-grained classification. In this paper, we explore solving jigsaw puzzle as a self-supervised auxiliary loss in ViT for image classification, named Jigsaw-ViT. We show two modifications that can make Jigsaw-ViT superior to standard ViT: discarding positional embeddings and masking patches randomly. Yet simple, we find that Jigsaw-ViT is able to improve both in generalization and robustness over the standard ViT, which is usually rather a trade-off. Experimentally, we show that adding the jigsaw puzzle branch provides better generalization than ViT on large-scale image classification on ImageNet. Moreover, the auxiliary task also improves robustness to noisy labels on Animal-10N, Food-101N, and Clothing1M as well as adversarial examples. Our implementation is available at https://yingyichen-cyy.github.io/Jigsaw-ViT/.

📃 What is Healthy? Generative Counterfactual Diffusion for Lesion Localization

Authors: Pedro Sanchez, Antanas Kascenas, Xiao Liu, Alison Q. O'Neil, Sotirios A. Tsaftaris
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2207.12268
Pdf link: https://arxiv.org/pdf/2207.12268
Abstract:
Reducing the requirement for densely annotated masks in medical image segmentation is important due to cost constraints. In this paper, we consider the problem of inferring pixel-level predictions of brain lesions by only using image-level labels for training. By leveraging recent advances in generative diffusion probabilistic models (DPM), we synthesize counterfactuals of "How would a patient appear if X pathology was not present?". The difference image between the observed patient state and the healthy counterfactual can be used for inferring the location of pathology. We generate counterfactuals that correspond to the minimal change of the input such that it is transformed to healthy domain. This requires training with healthy and unhealthy data in DPMs. We improve on previous counterfactual DPMs by manipulating the generation process with implicit guidance along with attention conditioning instead of using classifiers. Code is available at https://github.com/vios-s/Diff-SCM.

📃 Localization of Coordinated Cyber-Physical Attacks in Power Grids Using Moving Target Defense and Deep Learning

Authors: Yexiang Chen, Subhash Lakshminarayana, Fei Teng
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2207.12339
Pdf link: https://arxiv.org/pdf/2207.12339
Abstract:
As one of the most sophisticated attacks against power grids, coordinated cyber-physical attacks (CCPAs) damage the power grid's physical infrastructure and use a simultaneous cyber attack to mask its effect. This work proposes a novel approach to detect such attacks and identify the location of the line outages (due to the physical attack). The proposed approach consists of three parts. Firstly, moving target defense (MTD) is applied to expose the physical attack by actively perturbing transmission line reactance via distributed flexible AC transmission system (D-FACTS) devices. MTD invalidates the attackers' knowledge required to mask their physical attack. Secondly, convolution neural networks (CNNs) are applied to localize line outage position from the compromised measurements. Finally, model agnostic meta-learning (MAML) is used to accelerate the training speed of CNN following the topology reconfigurations (due to MTD) and reduce the data/retraining time requirements. Simulations are carried out using IEEE test systems. The experimental results demonstrate that the proposed approach can effectively localize line outages in stealthy CCPAs.

📚 multimodal (total: 15)

📃 Multimodal Emotion Recognition with Modality-Pairwise Unsupervised Contrastive Loss

Authors: Riccardo Franceschini, Enrico Fini, Cigdem Beyan, Alessandro Conti, Federica Arrigoni, Elisa Ricci
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/2207.11482
Pdf link: https://arxiv.org/pdf/2207.11482
Abstract:
Emotion recognition is involved in several real-world applications. With an increase in available modalities, automatic understanding of emotions is being performed more accurately. The success in Multimodal Emotion Recognition (MER), primarily relies on the supervised learning paradigm. However, data annotation is expensive, time-consuming, and as emotion expression and perception depends on several factors (e.g., age, gender, culture) obtaining labels with a high reliability is hard. Motivated by these, we focus on unsupervised feature learning for MER. We consider discrete emotions, and as modalities text, audio and vision are used. Our method, as being based on contrastive loss between pairwise modalities, is the first attempt in MER literature. Our end-to-end feature learning approach has several differences (and advantages) compared to existing MER methods: i) it is unsupervised, so the learning is lack of data labelling cost; ii) it does not require data spatial augmentation, modality alignment, large number of batch size or epochs; iii) it applies data fusion only at inference; and iv) it does not require backbones pre-trained on emotion recognition task. The experiments on benchmark datasets show that our method outperforms several baseline approaches and unsupervised learning methods applied in MER. Particularly, it even surpasses a few supervised MER state-of-the-art.

📃 Towards Smart Fake News Detection Through Explainable AI

Authors: Athira A B, S D Madhu Kumar, Anu Mary Chacko
Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2207.11490
Pdf link: https://arxiv.org/pdf/2207.11490
Abstract:
People now see social media sites as their sole source of information due to their popularity. The Majority of people get their news through social media. At the same time, fake news has grown exponentially on social media platforms in recent years. Several artificial intelligence-based solutions for detecting fake news have shown promising results. On the other hand, these detection systems lack explanation capabilities, i.e., the ability to explain why they made a prediction. This paper highlights the current state of the art in explainable fake news detection. We discuss the pitfalls in the current explainable AI-based fake news detection models and present our ongoing research on multi-modal explainable fake news detection model.

📃 A Simplistic and Cost-Effective Design for Real-World Development of an Ambient Assisted Living System for Fall Detection and Indoor Localization: Proof of Concept

Authors: Nirmalya Thakur, Chia Y. Han
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2207.11623
Pdf link: https://arxiv.org/pdf/2207.11623
Abstract:
Falls, highly common in the constantly increasing global aging population, can have a variety of negative effects on their health, well-being, and quality of life, including restricting their capabilities to conduct Activities of Daily Living (ADLs), which are crucial for one's sustenance. Timely assistance during falls is highly necessary, which involves tracking the indoor location of the elderly during their diverse navigational patterns associated with ADLs to detect the precise location of a fall. With the decreasing caregiver population on a global scale, it is important that the future of intelligent living environments can detect falls during ADLs while being able to track the indoor location of the elderly in the real world. To address these challenges, this work proposes a cost-effective and simplistic design paradigm for an Ambient Assisted Living system that can capture multimodal components of user behaviors during ADLs that are necessary for performing fall detection and indoor localization in a simultaneous manner in the real world. Proof of concept results from real-world experiments are presented to uphold the effective working of the system. The findings from two comparison studies with prior works in this field are also presented to uphold the novelty of this work. The first comparison study shows how the proposed system outperforms prior works in the areas of indoor localization and fall detection in terms of the effectiveness of its software design and hardware design. The second comparison study shows that the cost for the development of this system is the least as compared to prior works in these fields, which involved real-world development of the underlining systems, thereby upholding its cost-effective nature.

📃 Explored An Effective Methodology for Fine-Grained Snake Recognition

Authors: Yong Huang, Aderon Huang, Wei Zhu, Yanming Fang, Jinghua Feng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2207.11637
Pdf link: https://arxiv.org/pdf/2207.11637
Abstract:
Fine-Grained Visual Classification (FGVC) is a longstanding and fundamental problem in computer vision and pattern recognition, and underpins a diverse set of real-world applications. This paper describes our contribution at SnakeCLEF2022 with FGVC. Firstly, we design a strong multimodal backbone to utilize various meta-information to assist in fine-grained identification. Secondly, we provide new loss functions to solve the long tail distribution with dataset. Then, in order to take full advantage of unlabeled datasets, we use self-supervised learning and supervised learning joint training to provide pre-trained model. Moreover, some effective data process tricks also are considered in our experiments. Last but not least, fine-tuned in downstream task with hard mining, ensambled kinds of model performance. Extensive experiments demonstrate that our method can effectively improve the performance of fine-grained recognition. Our method can achieve a macro f1 score 92.7% and 89.4% on private and public dataset, respectively, which is the 1st place among the participators on private leaderboard.

📃 Counterfactual Reasoning for Out-of-distribution Multimodal Sentiment Analysis

Authors: Teng Sun, Wenjie Wang, Liqiang Jing, Yiran Cui, Xuemeng Song, Liqiang Nie
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2207.11652
Pdf link: https://arxiv.org/pdf/2207.11652
Abstract:
Existing studies on multimodal sentiment analysis heavily rely on textual modality and unavoidably induce the spurious correlations between textual words and sentiment labels. This greatly hinders the model generalization ability. To address this problem, we define the task of out-of-distribution (OOD) multimodal sentiment analysis. This task aims to estimate and mitigate the bad effect of textual modality for strong OOD generalization. To this end, we embrace causal inference, which inspects the causal relationships via a causal graph. From the graph, we find that the spurious correlations are attributed to the direct effect of textual modality on the model prediction while the indirect one is more reliable by considering multimodal semantics. Inspired by this, we devise a model-agnostic counterfactual framework for multimodal sentiment analysis, which captures the direct effect of textual modality via an extra text model and estimates the indirect one by a multimodal model. During the inference, we first estimate the direct effect by the counterfactual inference, and then subtract it from the total effect of all modalities to obtain the indirect effect for reliable prediction. Extensive experiments show the superior effectiveness and generalization ability of our proposed framework.

📃 Towards Using Fully Observable Policies for POMDPs

Authors: András Attila Sulyok, Kristóf Karacs
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2207.11737
Pdf link: https://arxiv.org/pdf/2207.11737
Abstract:
Partially Observable Markov Decision Process (POMDP) is a framework applicable to many real world problems. In this work, we propose an approach to solve POMDPs with multimodal belief by relying on a policy that solves the fully observable version. By defininig a new, mixture value function based on the value function from the fully observable variant, we can use the corresponding greedy policy to solve the POMDP itself. We develop the mathematical framework necessary for discussion, and introduce a benchmark built on the task of Reconnaissance Blind TicTacToe. On this benchmark, we show that our policy outperforms policies ignoring the existence of multiple modes.

📃 Cross-Modal 3D Shape Generation and Manipulation

Authors: Zezhou Cheng, Menglei Chai, Jian Ren, Hsin-Ying Lee, Kyle Olszewski, Zeng Huang, Subhransu Maji, Sergey Tulyakov
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2207.11795
Pdf link: https://arxiv.org/pdf/2207.11795
Abstract:
Creating and editing the shape and color of 3D objects require tremendous human effort and expertise. Compared to direct manipulation in 3D interfaces, 2D interactions such as sketches and scribbles are usually much more natural and intuitive for the users. In this paper, we propose a generic multi-modal generative model that couples the 2D modalities and implicit 3D representations through shared latent spaces. With the proposed model, versatile 3D generation and manipulation are enabled by simply propagating the editing from a specific 2D controlling modality through the latent spaces. For example, editing the 3D shape by drawing a sketch, re-colorizing the 3D surface via painting color scribbles on the 2D rendering, or generating 3D shapes of a certain category given one or a few reference images. Unlike prior works, our model does not require re-training or fine-tuning per editing task and is also conceptually simple, easy to implement, robust to input domain shifts, and flexible to diverse reconstruction on partial 2D inputs. We evaluate our framework on two representative 2D modalities of grayscale line sketches and rendered color images, and demonstrate that our method enables various shape manipulation and generation tasks with these 2D modalities.

📃 Towards Complex Document Understanding By Discrete Reasoning

Authors: Fengbin Zhu, Wenqiang Lei, Fuli Feng, Chao Wang, Haozhou Zhang, Tat-Seng Chua
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2207.11871
Pdf link: https://arxiv.org/pdf/2207.11871
Abstract:
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language, which is an emerging research topic for both Natural Language Processing and Computer Vision. In this work, we introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages comprising semi-structured table(s) and unstructured text as well as 16,558 question-answer pairs by extending the TAT-QA dataset. These documents are sampled from real-world financial reports and contain lots of numbers, which means discrete reasoning capability is demanded to answer questions on this dataset. Based on TAT-DQA, we further develop a novel model named MHST that takes into account the information in multi-modalities, including text, layout and visual image, to intelligently address different types of questions with corresponding strategies, i.e., extraction or reasoning. Extensive experiments show that the MHST model significantly outperforms the baseline methods, demonstrating its effectiveness. However, the performance still lags far behind that of expert humans. We expect that our new TAT-DQA dataset would facilitate the research on deep understanding of visually-rich documents combining vision and language, especially for scenarios that require discrete reasoning. Also, we hope the proposed model would inspire researchers to design more advanced Document VQA models in future.

📃 GA2MIF: Graph and Attention-based Two-stage Multi-source Information Fusion for Conversational Emotion Detection

Authors: Jiang Li, Xiaoping Wang, Guoqing Lv, Zhigang Zeng
Subjects: Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/2207.11900
Pdf link: https://arxiv.org/pdf/2207.11900
Abstract:
To enable artificial intelligence in providing empathetic services, multimodal Emotion Recognition in Conversation (ERC) plays an influential role in the field of human-computer interaction and conversational robotics. Multimodal data modeling is an up-and-coming research area in recent years, which is inspired by human multi-sensory integration capabilities. Up until now, there are few studies on multimodal-based conversational emotion recognition. Most of existing Multimodal ERC methods do not model cross-modal interactions and are incapable of extracting inter-modal complementary information. Several graph-based approaches claim to capture inter-modal complementary information, but it is difficult to obtain optimal solution using graph-based models due to the heterogeneity of multimodal data. In this work, we introduce a Graph and Attention-based Two-stage Multi-source Information Fusion (GA2MIF) approach for multimodal fusion. Our GA2MIF focuses on contextual modeling and cross-modal modeling leveraging Multi-head Directed Graph ATtention networks (MDGATs) and Multi-head Pairwise Cross-modal ATtention networks (MPCATs), respectively. Extensive experiments on two common datasets show that proposed GA2MIF can effectively capture intra-modal local and long-range contextual information as well as inter-modal complementary information, and outperforms existing State-Of-The-Art (SOTA) baselines by an absolute margin.

📃 Unleashing the potential of price-based congestion management schemes: a unifying approach to compare alternative models under multiple objectives

Authors: Ennio Cascetta, Marcello Montanino
Subjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2207.12041
Pdf link: https://arxiv.org/pdf/2207.12041
Abstract:
A wide range of price-based congestion management schemes were proposed in the literature ranging from marginal cost road pricing to trip based multimodal pricing. The underlying models were formulated under different theoretical assumptions and with varying, and sometimes conflicting objectives. This paper presents a unifying framework under which different approaches can be compared based on their respective assumptions. The unifying modelling framework is referred to as trip pricing model, which extends path-differentiated pricing to multimodal paths, i.e., trips on whatever mode or combinations of modes, and generalizes road pricing schemes (which are shown to be a special case). By setting both positive (tolls) and negative (incentives) prices, revenue-neutral schemes are also shown to be special cases, with no a-priori assumption on which paths/modes to toll/incentivize. For model comparison, the pricing design problem is formulated in a multi-objective optimization framework, which combines traffic efficiency, environmental sustainability, users' acceptance, social and spatial equity, as pricing objectives. The trip pricing scheme is compared with traditional road pricing schemes, and with their revenue-neutral variants. First-best and second-best pricing schemes, designed in single- and multi-objective optimization frameworks, are compared on the Nguyen-Dupuis network. Results suggest a vast potential for multi-objective trip congestion pricing over single-objective road pricing. In addition, the application of both positive and negative prices is shown to significantly increase the expected acceptance of pricing schemes and to preserve efficiency and sustainability objectives. The results should promote the design of more effective pricing policies, i.e., set of pricing rules easier to communicate, based on now available technologies of passengers' tracking and pricing.

📃 Representational Ethical Model Calibration

Authors: Robert Carruthers, Isabel Straw, James K Ruffle, Daniel Herron, Amy Nelson, Danilo Bzdok, Delmiro Fernandez-Reyes, Geraint Rees, Parashkev Nachev
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2207.12043
Pdf link: https://arxiv.org/pdf/2207.12043
Abstract:
Equity is widely held to be fundamental to the ethics of healthcare. In the context of clinical decision-making, it rests on the comparative fidelity of the intelligence -- evidence-based or intuitive -- guiding the management of each individual patient. Though brought to recent attention by the individuating power of contemporary machine learning, such epistemic equity arises in the context of any decision guidance, whether traditional or innovative. Yet no general framework for its quantification, let alone assurance, currently exists. Here we formulate epistemic equity in terms of model fidelity evaluated over learnt multi-dimensional representations of identity crafted to maximise the captured diversity of the population, introducing a comprehensive framework for Representational Ethical Model Calibration. We demonstrate use of the framework on large-scale multimodal data from UK Biobank to derive diverse representations of the population, quantify model performance, and institute responsive remediation. We offer our approach as a principled solution to quantifying and assuring epistemic equity in healthcare, with applications across the research, clinical, and regulatory domains.

📃 Cross-Modal Contrastive Representation Learning for Audio-to-Image Generation

Authors: HaeChun Chung, JooYong Shim, Jong-Kook Kim
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2207.12121
Pdf link: https://arxiv.org/pdf/2207.12121
Abstract:
Multiple modalities for certain information provide a variety of perspectives on that information, which can improve the understanding of the information. Thus, it may be crucial to generate data of different modality from the existing data to enhance the understanding. In this paper, we investigate the cross-modal audio-to-image generation problem and propose Cross-Modal Contrastive Representation Learning (CMCRL) to extract useful features from audios and use it in the generation phase. Experimental results show that CMCRL enhances quality of images generated than previous research.

📃 A Letter on Progress Made on Husky Carbon: A Legged-Aerial, Multi-modal Platform

Authors: Adarsh Salagame, Shoghair Manjikian, Chenghao Wang, Kaushik Venkatesh Krishnamurthy, Shreyansh Pitroda, Bibek Gupta, Tobias Jacob, Benjamin Mottis, Eric Sihite, Milad Ramezani, Alireza Ramezani
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2207.12254
Pdf link: https://arxiv.org/pdf/2207.12254
Abstract:
Animals, such as birds, widely use multi-modal locomotion by combining legged and aerial mobility with dominant inertial effects. The robotic biomimicry of this multi-modal locomotion feat can yield ultra-flexible systems in terms of their ability to negotiate their task spaces. The main objective of this paper is to discuss the challenges in achieving multi-modal locomotion, and to report our progress in developing our quadrupedal robot capable of multi-modal locomotion (legged and aerial locomotion), the Husky Carbon. We report the mechanical and electrical components utilized in our robot, in addition to the simulation and experimentation done to achieve our goal in developing a versatile multi-modal robotic platform.

📃 GraphCFC: A Directed Graph based Cross-modal Feature Complementation Approach for Multimodal Conversational Emotion Recognition

Authors: Jiang Li, Xiaoping Wang, Guoqing Lv, Zhigang Zeng
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/2207.12261
Pdf link: https://arxiv.org/pdf/2207.12261
Abstract:
Emotion Recognition in Conversation (ERC) plays a significant part in Human-Computer Interaction (HCI) systems since it can provide empathetic services. Multimodal ERC can mitigate the drawbacks of uni-modal approaches. Recently, Graph Neural Networks (GNNs) have been widely used in a variety of fields due to their superior performance in relation modeling. In multimodal ERC, GNNs are capable of extracting both long-distance contextual information and inter-modal interactive information. Unfortunately, since existing methods such as MMGCN directly fuse multiple modalities, redundant information may be generated and heterogeneous information may be lost. In this work, we present a directed Graph based Cross-modal Feature Complementation (GraphCFC) module that can efficiently model contextual and interactive information. GraphCFC alleviates the problem of heterogeneity gap in multimodal fusion by utilizing multiple subspace extractors and Pair-wise Cross-modal Complementary (PairCC) strategy. We extract various types of edges from the constructed graph for encoding, thus enabling GNNs to extract crucial contextual and interactive information more accurately when performing message passing. Furthermore, we design a GNN structure called GAT-MLP, which can provide a new unified network framework for multimodal learning. The experimental results on two benchmark datasets show that our GraphCFC outperforms the state-of-the-art (SOTA) approaches.

📃 Continuous ErrP detections during multimodal human-robot interaction

Authors: Su Kyoung Kim, Michael Maurus, Mathias Trampler, Marc Tabie, Elsa Andrea Kirchner
Subjects: Robotics (cs.RO); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2207.12267
Pdf link: https://arxiv.org/pdf/2207.12267
Abstract:
Human-in-the-loop approaches are of great importance for robot applications. In the presented study, we implemented a multimodal human-robot interaction (HRI) scenario, in which a simulated robot communicates with its human partner through speech and gestures. The robot announces its intention verbally and selects the appropriate action using pointing gestures. The human partner, in turn, evaluates whether the robot's verbal announcement (intention) matches the action (pointing gesture) chosen by the robot. For cases where the verbal announcement of the robot does not match the corresponding action choice of the robot, we expect error-related potentials (ErrPs) in the human electroencephalogram (EEG). These intrinsic evaluations of robot actions by humans, evident in the EEG, were recorded in real time, continuously segmented online and classified asynchronously. For feature selection, we propose an approach that allows the combinations of forward and backward sliding windows to train a classifier. We achieved an average classification performance of 91% across 9 subjects. As expected, we also observed a relatively high variability between the subjects. In the future, the proposed feature selection approach will be extended to allow for customization of feature selection. To this end, the best combinations of forward and backward sliding windows will be automatically selected to account for inter-subject variability in classification performance. In addition, we plan to use the intrinsic human error evaluation evident in the error case by the ErrP in interactive reinforcement learning to improve multimodal human-robot interaction.

📚 navigation (total: 6)

📃 A Simplistic and Cost-Effective Design for Real-World Development of an Ambient Assisted Living System for Fall Detection and Indoor Localization: Proof of Concept

Authors: Nirmalya Thakur, Chia Y. Han
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2207.11623
Pdf link: https://arxiv.org/pdf/2207.11623
Abstract:
Falls, highly common in the constantly increasing global aging population, can have a variety of negative effects on their health, well-being, and quality of life, including restricting their capabilities to conduct Activities of Daily Living (ADLs), which are crucial for one's sustenance. Timely assistance during falls is highly necessary, which involves tracking the indoor location of the elderly during their diverse navigational patterns associated with ADLs to detect the precise location of a fall. With the decreasing caregiver population on a global scale, it is important that the future of intelligent living environments can detect falls during ADLs while being able to track the indoor location of the elderly in the real world. To address these challenges, this work proposes a cost-effective and simplistic design paradigm for an Ambient Assisted Living system that can capture multimodal components of user behaviors during ADLs that are necessary for performing fall detection and indoor localization in a simultaneous manner in the real world. Proof of concept results from real-world experiments are presented to uphold the effective working of the system. The findings from two comparison studies with prior works in this field are also presented to uphold the novelty of this work. The first comparison study shows how the proposed system outperforms prior works in the areas of indoor localization and fall detection in terms of the effectiveness of its software design and hardware design. The second comparison study shows that the cost for the development of this system is the least as compared to prior works in these fields, which involved real-world development of the underlining systems, thereby upholding its cost-effective nature.

📃 A Priority Map for Vision-and-Language Navigation with Trajectory Plans and Feature-Location Cues

Authors: Jason Armitage, Leonardo Impett, Rico Sennrich
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2207.11717
Pdf link: https://arxiv.org/pdf/2207.11717
Abstract:
In a busy city street, a pedestrian surrounded by distractions can pick out a single sign if it is relevant to their route. Artificial agents in outdoor Vision-and-Language Navigation (VLN) are also confronted with detecting supervisory signal on environment features and location in inputs. To boost the prominence of relevant features in transformer-based architectures without costly preprocessing and pretraining, we take inspiration from priority maps - a mechanism described in neuropsychological studies. We implement a novel priority map module and pretrain on auxiliary tasks using low-sample datasets with high-level representations of routes and environment-related references to urban features. A hierarchical process of trajectory planning - with subsequent parameterised visual boost filtering on visual inputs and prediction of corresponding textual spans - addresses the core challenges of cross-modal alignment and feature-level localisation. The priority map module is integrated into a feature-location framework that doubles the task completion rates of standalone transformers and attains state-of-the-art performance on the Touchdown benchmark for VLN. Code and data are referenced in Appendix C.

📃 A Closed-Loop Perception, Decision-Making and Reasoning Mechanism for Human-Like Navigation

Authors: Wenqi Zhang, Kai Zhao, Peng Li, Xiao Zhu, Yongliang Shen, Yanna Ma, Yingfeng Chen, Weiming Lu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2207.11901
Pdf link: https://arxiv.org/pdf/2207.11901
Abstract:
Reliable navigation systems have a wide range of applications in robotics and autonomous driving. Current approaches employ an open-loop process that converts sensor inputs directly into actions. However, these open-loop schemes are challenging to handle complex and dynamic real-world scenarios due to their poor generalization. Imitating human navigation, we add a reasoning process to convert actions back to internal latent states, forming a two-stage closed loop of perception, decision-making, and reasoning. Firstly, VAE-Enhanced Demonstration Learning endows the model with the understanding of basic navigation rules. Then, two dual processes in RL-Enhanced Interaction Learning generate reward feedback for each other and collectively enhance obstacle avoidance capability. The reasoning model can substantially promote generalization and robustness, and facilitate the deployment of the algorithm to real-world robots without elaborate transfers. Experiments show our method is more adaptable to novel scenarios compared with state-of-the-art approaches.

📃 Non-cascaded Control Barrier Functions for the Safe Control of Quadrotors

Authors: Weifeng Zeng, Huanhui Cao, Wenjie Lu, Hao Xiong
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2207.12027
Pdf link: https://arxiv.org/pdf/2207.12027
Abstract:
Researchers have developed various cascaded controllers and non-cascaded controllers for the navigation and control of quadrotors in recent years. It is vital to ensure the safety of a quadrotor both in normal state and in abnormal state if a controller tends to make the quadrotor unsafe. To this end, this paper proposes a non-cascaded Control Barrier Function (CBF) for a quadrotor controlled by either cascaded controllers or a non-cascaded controller. Incorporated with a Quadratic Programming (QP), the non-cascaded CBF can simultaneously regulate the magnitude of the total thrust and the torque of the quadrotor determined a controller, so as to ensure the safety of the quadrotor both in normal state and in abnormal state. The non-cascaded CBF establishes a non-conservative forward invariant safe region, in which the controller of a quadrotor is fully or partially effective in the navigation or the pose control of the quadrotor. The non-cascaded CBF is applied to a quadrotor performing trajectory tracking and a quadrotor performing aggressive roll maneuvers in simulations to evaluate the effectiveness of the non-cascaded CBF.

📃 A Hybrid Model and Learning-Based Adaptive Navigation Filter

Authors: Barak Or, Itzik Klein
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2207.12082
Pdf link: https://arxiv.org/pdf/2207.12082
Abstract:
The fusion between an inertial navigation system and global navigation satellite systems is regularly used in many platforms such as drones, land vehicles, and marine vessels. The fusion is commonly carried out in a model-based extended Kalman filter framework. One of the critical parameters of the filter is the process noise covariance. It is responsible for the real-time solution accuracy, as it considers both vehicle dynamics uncertainty and the inertial sensors quality. In most situations, the process noise is covariance assumed to be constant. Yet, due to vehicle dynamics and sensor measurement variations throughout the trajectory, the process noise covariance is subject to change. To cope with such situations, several adaptive model-based Kalman filters were suggested in the literature. In this paper, we propose a hybrid model and learning-based adaptive navigation filter. We rely on the model-based Kalman filter and design a deep neural network model to tune the momentary system noise covariance matrix, based only on the inertial sensor readings. Once the process noise covariance is learned, it is plugged into the well-established, model-based Kalman filter. After deriving the proposed hybrid framework, field experiment results using a quadrotor are presented and a comparison to model-based adaptive approaches is given. We show that the proposed method obtained an improvement of 25% in the position error. Furthermore, the proposed hybrid learning method can be used in any navigation filter and also in any relevant estimation problem.

📃 Resilient Navigation and Path Planning System for High-speed Autonomous Race Car

Authors: Daegyu Lee, Chanyoung Jung, Andrea Finazzi, Hyunki Seong, D.Hyunchul Shim
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2207.12232
Pdf link: https://arxiv.org/pdf/2207.12232
Abstract:
This paper describes resilient navigation and planning algorithm for high-speed autonomous race, Indy Autonomous Challenge (IAC). The IAC is a competition with full-scale autonomous race car that can drive up to 290 km/h(180mph). Due to its high-speed and heavy vibration of the race car, GPS/INS system is prone to be degraded. These degraded GPS measurements can cause a critical localization error leading to a serious crashing accidents. To this end, we propose a robust navigation system to implement multi-sensor fusion Kalman filter. In this study, we present how to identify the degradation of measurement based on probabilistic approaches. Based on this approach, we could compute optimal measurement values for Kalman filter correction step. At the same time, we present the other resilient navigation system so that race car can follow the race track in fatal localization failure situations. In addition, this paper also covers the optimal path planning algorithm for obstacle avoidance. To take account for original optimal racing line, obstacles, vehicle dynamics, we propose a road-graph-based path planning algorithm to guarantee our race car drives in-bounded conditions. In the experiments, we will evaluate our designed localization system can handle the degraded data, and sometimes prevent serious crashing accidents while high-speed driving. In addition, we will describe how we successfully completed the obstacle avoidance challenge.

yhytoto12 / new-arxiv-papers Goto Github PK

new-arxiv-papers's People

Contributors

Watchers

new-arxiv-papers's Issues

💻 cs

📚 mask (total: 9)

📃 Deep Pneumonia: Attention-Based Contrastive Learning for Class-Imbalanced Pneumonia Lesion Recognition in Chest X-rays

📃 Progressive Scene Text Erasing with Self-Supervision

📃 Marior: Margin Removal and Iterative Content Rectification for Document Dewarping in the Wild

📃 MAR: Masked Autoencoders for Efficient Action Recognition

📃 Affective Behaviour Analysis Using Pretrained Model with Facial Priori

📃 Semantic-guided Multi-Mask Image Harmonization

📃 Jigsaw-ViT: Learning Jigsaw Puzzles in Vision Transformer

📃 What is Healthy? Generative Counterfactual Diffusion for Lesion Localization

📃 Localization of Coordinated Cyber-Physical Attacks in Power Grids Using Moving Target Defense and Deep Learning

📚 multimodal (total: 15)

📃 Multimodal Emotion Recognition with Modality-Pairwise Unsupervised Contrastive Loss

📃 Towards Smart Fake News Detection Through Explainable AI

📃 A Simplistic and Cost-Effective Design for Real-World Development of an Ambient Assisted Living System for Fall Detection and Indoor Localization: Proof of Concept

📃 Explored An Effective Methodology for Fine-Grained Snake Recognition

📃 Counterfactual Reasoning for Out-of-distribution Multimodal Sentiment Analysis

📃 Towards Using Fully Observable Policies for POMDPs

📃 Cross-Modal 3D Shape Generation and Manipulation

📃 Towards Complex Document Understanding By Discrete Reasoning

📃 GA2MIF: Graph and Attention-based Two-stage Multi-source Information Fusion for Conversational Emotion Detection

📃 Unleashing the potential of price-based congestion management schemes: a unifying approach to compare alternative models under multiple objectives

📃 Representational Ethical Model Calibration

📃 Cross-Modal Contrastive Representation Learning for Audio-to-Image Generation

📃 A Letter on Progress Made on Husky Carbon: A Legged-Aerial, Multi-modal Platform

📃 GraphCFC: A Directed Graph based Cross-modal Feature Complementation Approach for Multimodal Conversational Emotion Recognition

📃 Continuous ErrP detections during multimodal human-robot interaction

📚 navigation (total: 6)

📃 A Simplistic and Cost-Effective Design for Real-World Development of an Ambient Assisted Living System for Fall Detection and Indoor Localization: Proof of Concept

📃 A Priority Map for Vision-and-Language Navigation with Trajectory Plans and Feature-Location Cues

📃 A Closed-Loop Perception, Decision-Making and Reasoning Mechanism for Human-Like Navigation

📃 Non-cascaded Control Barrier Functions for the Safe Control of Quadrotors

📃 A Hybrid Model and Learning-Based Adaptive Navigation Filter

📃 Resilient Navigation and Path Planning System for High-speed Autonomous Race Car

Recommend Projects

Recommend Topics

Recommend Org

Jobs